Bag-of-Words Quiz

1. What does a “bag-of-words” model do with a collection of text documents?

(A) It treats each document as a mathematical “bag” and counts how often each word appears. (B) It translates every word into emojis. (C) It stores full sentences without breaking them into words. (D) It removes all punctuation and ignores the words.

2. When you call CountVectorizer().fit_transform(documents), what is the type of the returned object?

(A) A dense Python list of lists. (B) A SciPy sparse matrix in Compressed Sparse Row (CSR) format. (C) A pandas DataFrame. (D) A simple integer.

3. In the sparse coordinate output
(0, 3) 1
(1, 2) 1
(2, 3) 1
what do the numbers (0, 3) represent?

(A) Row 0, Column 3 in the matrix, with value 1. (B) Page 0, Paragraph 3 of the document. (C) Index 0 of the word list and index 3 of the document list. (D) The length of the first document and the length of the second.

4. If a sparse matrix has shape (5, 10) and “with 8 stored elements,” how many zeros are implied in the full (dense) array?

(A) 8 zeros. (B) 42 zeros. (C) 50 zeros. (D) 42 nonzero values.
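
Hint: the relationship between shape, stored elements, and implied zeros can be checked on a smaller matrix. The following is a minimal sketch, assuming SciPy and NumPy are installed; the 3 x 4 matrix and its values are made up for illustration and are not the matrix from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 matrix with 5 nonzero entries (values chosen for illustration).
dense = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [4, 0, 0, 5],
])
X = csr_matrix(dense)

total_cells = X.shape[0] * X.shape[1]  # rows * columns
stored = X.nnz                         # number of stored entries (all nonzero here)
implied_zeros = total_cells - stored   # cells the sparse format does not store

print(total_cells, stored, implied_zeros)
```

The same subtraction applies to any shape and stored-element count.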

5. What does the method toarray() do when called on a SciPy sparse matrix X?

(A) Converts it into a list of Python strings. (B) Builds and returns a full 2D NumPy array with zeros and nonzeros. (C) Leaves it unchanged because it is already dense. (D) Deletes all zero entries permanently.

6. Why might you want to convert a sparse matrix to a pandas DataFrame after calling toarray()?

(A) To hide all the zeros in the table. (B) To use labeled columns (words) and labeled rows (document names). (C) To translate words into another language. (D) To make the matrix smaller in memory.

7. After fitting a CountVectorizer on three documents, you see that get_feature_names_out() returns:
['and', 'cat', 'dog', 'hello', 'how', 'sat']
What is special about this list?

(A) The words are sorted by frequency in descending order. (B) The words appear in alphabetical (lexicographical) order. (C) The list shows only the two most common words. (D) The words are translated into Morse code.

8. Suppose X is a CSR sparse matrix with shape (4, 7). Which of these statements is true?

(A) Every entry in X is visible when you print X. (B) When you print X, you only see the nonzero coordinates and values. (C) X.toarray() will show only the nonzero cells. (D) The shape indicates four columns and seven rows.

9. In a DataFrame created as
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
what does the columns= argument supply?

(A) Column numbers automatically assigned from 0 to (n_features – 1). (B) The actual token strings (words) matched to each column index. (C) A random ordering of feature names. (D) The sizes of each document.

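10. Suppose a word is in the vectorizer's vocabulary but appears in none of the documents being transformed (for example, because the vectorizer was fitted on a different set of documents). What will the entire column for that word look like after calling toarray() and converting to a DataFrame?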

(A) All values will be 1. (B) All values will be 0. (C) Some values will be 1 and some will be 0. (D) The column will not exist at all.

11. You have a list sentences = ["Hello world", "Hello"]. After vectorizing, the vocabulary becomes ['hello', 'world']. What will X.toarray() produce?

(A) [[1, 1], [1, 0]] (B) [[0, 0], [0, 0]] (C) [[2, 1], [1, 0]] (D) [[1], [1]]

12. Why does the sparse format keep only the nonzero entries (the “stored elements”) instead of every entry?

(A) Because it randomly discards zeros to speed up processing. (B) Because most entries are zero, so storing only nonzero values saves memory. (C) Because zeros are not allowed in a matrix. (D) Because sparse format only works for square matrices.

13. Which Python library is needed to create a DataFrame from a dense NumPy array?

(A) numpy as np (B) pandas as pd (C) sklearn (D) math

14. How can you find out which column index corresponds to a specific word “apple” after fitting the vectorizer?

(A) By looking it up with vectorizer.vocabulary_["apple"], or equivalently list(vectorizer.get_feature_names_out()).index("apple"). (B) By printing the entire matrix manually. (C) By using df.loc["apple"]. (D) By guessing a random number.

15. If you wanted to display the number of times each word appears in each document in a human-friendly table, which sequence of steps would you use?

(A) CountVectorizer().fit_transform(documents), then toarray(), then pd.DataFrame(...) with labeled columns and rows. (B) Print the sparse matrix directly without conversion. (C) Use toarray() and stop, because printing a NumPy array is already a table with labels. (D) Manually loop over each word and each document, printing counts line by line.
