Bag-of-Words & Sparse Matrix Quiz

1. What does a “bag-of-words” model do with a collection of text documents?

2. When you call CountVectorizer().fit_transform(documents), what is the type of the returned object?

3. In the sparse coordinate output
(0, 3) 1
(1, 2) 1
(2, 3) 1

what does the pair (0, 3) represent?

4. If a sparse matrix has shape (5, 10) and “with 8 stored elements,” how many zeros are implied in the full (dense) array?

5. What does the method toarray() do when called on a SciPy sparse matrix X?

6. Why might you want to convert a sparse matrix to a pandas DataFrame after calling toarray()?

7. After fitting a CountVectorizer on three documents, you see that get_feature_names_out() returns:
['and', 'cat', 'dog', 'hello', 'how', 'sat']
What is special about this list?

8. Suppose X is a CSR sparse matrix with shape (4, 7). Which of these statements is true?

9. In a DataFrame created as
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
what does the columns= argument supply?

10. If a word in the fitted vocabulary never appears in any of the documents being transformed, what will the entire column for that word look like after calling toarray() and converting to a DataFrame?

11. You have a list sentences = ["Hello world", "Hello"]. After vectorizing, the vocabulary becomes ['hello', 'world']. What will X.toarray() produce?

12. Why does the sparse format keep only the nonzero entries (the “stored elements”) instead of all entries?

13. Which Python library is needed to create a DataFrame from a dense NumPy array?

14. How can you find out which column index corresponds to a specific word, such as “apple”, after fitting the vectorizer?

15. If you wanted to display the number of times each word appears in each document in a human-friendly table, which sequence of steps would you use?