1. What does a “bag-of-words” model do with a collection of text documents?
2. When you call CountVectorizer().fit_transform(documents), what is the type of the returned object?
3. In the sparse coordinate output (0, 3) 1, (1, 2) 1, (2, 3) 1, what does the pair (0, 3) represent?
4. If a sparse matrix has shape (5, 10) and “with 8 stored elements,” how many zeros are implied in the full (dense) array?
5. What does the method toarray() do when called on a SciPy sparse matrix X?
6. Why might you want to convert a sparse matrix to a pandas DataFrame after calling toarray()?
7. After fitting a CountVectorizer on three documents, you see that get_feature_names_out() returns ['and', 'cat', 'dog', 'hello', 'how', 'sat']. What is special about this list?
8. Suppose X is a CSR sparse matrix with shape (4, 7). Which of these statements is true?
X.toarray()
9. In a DataFrame created as df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()), what does the columns= argument represent?
10. If a word in the fitted vocabulary does not appear in any of the documents being transformed, what will the entire column for that word look like after calling toarray() and converting to a DataFrame?
11. You have a list sentences = ["Hello world", "Hello"]. After vectorizing, the vocabulary becomes ['hello', 'world']. What will X.toarray() produce?
[[1, 1], [1, 0]]
[[0, 0], [0, 0]]
[[2, 1], [1, 0]]
[[1], [1]]
12. Why does the sparse format store only “stored elements” instead of all entries?
13. Which Python library is needed to create a DataFrame from a dense NumPy array?
numpy as np
pandas as pd
sklearn
math
14. How can you find out which column index corresponds to a specific word “apple” after fitting the vectorizer?
list(vectorizer.get_feature_names_out()).index("apple")
df.loc["apple"]
15. If you wanted to display the number of times each word appears in each document in a human-friendly table, which sequence of steps would you use?
CountVectorizer.fit_transform()
pd.DataFrame(...)
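
To check your answers by experiment, the following is a minimal pure-Python sketch of the bag-of-words idea that the questions above refer to. It is not scikit-learn's implementation: the helper name bag_of_words and the sample documents are hypothetical, and the tokenization (lowercasing, keeping tokens of two or more word characters) only approximates CountVectorizer's default behavior.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Hypothetical helper: return (vocabulary, sparse_entries, dense_rows)."""
    # Lowercase and keep tokens of 2+ word characters. This approximates
    # CountVectorizer's default tokenization and is an assumption made
    # for illustration only.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]

    # The vocabulary is the alphabetically sorted set of distinct tokens.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {word: j for j, word in enumerate(vocabulary)}

    # Sparse form: store only the nonzero counts, keyed by (row, column).
    sparse = {}
    for i, toks in enumerate(tokenized):
        for word, count in Counter(toks).items():
            sparse[(i, column[word])] = count

    # Dense form: every entry is materialized, zeros included.
    dense = [
        [sparse.get((i, j), 0) for j in range(len(vocabulary))]
        for i in range(len(documents))
    ]
    return vocabulary, sparse, dense


# Sample documents (made up for this sketch).
docs = ["The cat sat", "The dog sat and sat"]
vocabulary, sparse, dense = bag_of_words(docs)

print(vocabulary)  # column labels, one per distinct word
print(sparse)      # (document index, word index) -> count
print(dense)       # one row per document, one column per word
```

Each key in the sparse dictionary is a (document index, word index) pair, and the number of stored entries is smaller than rows × columns whenever some word is absent from some document.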