Transforming texts into features using CountVectorizer#
CountVectorizer converts an array of strings into a document-term matrix. For example, the corpus
This is the first document.
This is the second second document.
And the third one.
Is this the first document?

is transformed into the matrix

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         1      0   1    0       2    1      0     1
2    1         0      0   0    1       0    1      1     0
3    0         1      1   1    0       0    1      0     1
Creating the corpus#
[1]:
corpus = [
"This is the first document.",
"This is the second second document.",
"And the third one.",
"Is this the first document?",
]
Creating the transformer#
[2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option("display.notebook_repr_html", False)
vectorizer = CountVectorizer(
# -------------------------------------------------------------------------
# Convert all characters to lowercase before tokenizing.
lowercase=True,
# -------------------------------------------------------------------------
# Override the preprocessing (strip_accents and lowercase) stage
preprocessor=None,
# -------------------------------------------------------------------------
# Override the string tokenization step while preserving the preprocessing
# and n-grams generation steps. Only applies if analyzer == 'word'.
tokenizer=None,
# -------------------------------------------------------------------------
# If ‘english’, a built-in stop word list for English is used.
stop_words=None,
# -------------------------------------------------------------------------
# {‘ascii’, ‘unicode’}
# Remove accents and perform other character normalization during the
# preprocessing step. ‘ascii’ is a fast method that only works on
# characters that have a direct ASCII mapping. ‘unicode’ is a slightly
# slower method that works on any characters.
strip_accents=None,
# -------------------------------------------------------------------------
# {‘word’, ‘char’, ‘char_wb’}
# Whether the features should be made of word n-grams or character n-grams.
analyzer="word",
# -------------------------------------------------------------------------
# Regular expression denoting what constitutes a “token”, only used if
# analyzer == 'word'. The default regexp selects tokens of 2 or more
# alphanumeric characters (punctuation is completely ignored and always
# treated as a token separator).
token_pattern=r"(?u)\b\w\w+\b",
# -------------------------------------------------------------------------
# When building the vocabulary ignore terms that have a document frequency
# strictly higher than the given threshold (corpus-specific stop words). If
# float, the parameter represents a proportion of documents, integer
# absolute counts. This parameter is ignored if vocabulary is not None.
max_df=1.0,
# -------------------------------------------------------------------------
# When building the vocabulary ignore terms that have a document frequency
# strictly lower than the given threshold. This value is also called
# cut-off in the literature. If float, the parameter represents a
# proportion of documents, integer absolute counts.
min_df=1,
# -------------------------------------------------------------------------
# If not None, build a vocabulary that only consider the top max_features
# ordered by term frequency across the corpus.
max_features=None,
# -------------------------------------------------------------------------
# If True, all non zero counts are set to 1. This is useful for discrete
# probabilistic models that model binary events rather than integer counts.
binary=False,
)
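These are scikit-learn's default values. As a rough illustration (not part of the original example), the sketch below shows how two of the parameters documented above change the result: stop_words="english" removes common English words before counting, and binary=True clips all counts to 0 or 1.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document.",
    "This is the second second document.",
]
# stop_words="english": very frequent words such as "this", "is" and "the"
# are dropped from the vocabulary.
print(CountVectorizer(stop_words="english").fit(docs).get_feature_names_out())
# binary=True: the repeated "second" in the second document is counted as 1.
print(CountVectorizer(binary=True).fit_transform(docs).toarray())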
Creating the document-term matrix#
[3]:
vectorizer.fit(corpus)
X = vectorizer.transform(corpus)
pd.DataFrame(
X.toarray(),
columns=vectorizer.get_feature_names_out(),
)
[3]:
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         1      0   1    0       2    1      0     1
2    1         0      0   0    1       0    1      1     0
3    0         1      1   1    0       0    1      0     1
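The fit and transform steps can also be combined. As a minimal sketch, fit_transform learns the vocabulary and builds the matrix in a single call; a fresh vectorizer with default parameters gives the same matrix here, since the values set above are the defaults.
# Sketch: fit and transform in one step; the result equals X built above.
X_alt = CountVectorizer().fit_transform(corpus)
print((X_alt.toarray() == X.toarray()).all())  # True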
Transforming a new text#
[4]:
vectorizer.transform(["Something completely new."]).toarray()
[4]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])
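All counts are zero because none of the tokens in the new text belong to the vocabulary learned during fit; unknown tokens are silently ignored. As a sketch with an assumed example sentence, tokens that do appear in the vocabulary are still counted:
# Only "this", "is" and "document" are in the fitted vocabulary;
# "new", "about" and "documents" are ignored.
vectorizer.transform(["This is a new document about documents."]).toarray()
# array([[0, 1, 0, 1, 0, 0, 0, 0, 1]])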
Building an analyzer#
[5]:
analyze = vectorizer.build_analyzer()
analyze("This is a completely new text document to analyze.")
[5]:
['this', 'is', 'completely', 'new', 'text', 'document', 'to', 'analyze']
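Note that "a" is missing from the result: the default token_pattern only keeps tokens with two or more alphanumeric characters. A sketch of an assumed variation (not part of the original notebook) that also keeps single-character tokens:
# Sketch: a token_pattern that accepts single-character words as well.
analyze_1char = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
analyze_1char("This is a completely new text document to analyze.")
# ['this', 'is', 'a', 'completely', 'new', 'text', 'document', 'to', 'analyze']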
Position (column) of a token in the matrix#
[6]:
vectorizer.vocabulary_.get("document")
[6]:
1
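vocabulary_ holds the complete token-to-column mapping; columns are assigned in alphabetical order of the tokens. A quick way to inspect it (sketch):
# Sketch: list every token with its column index, sorted by position.
sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
# [('and', 0), ('document', 1), ('first', 2), ('is', 3), ('one', 4),
#  ('second', 5), ('the', 6), ('third', 7), ('this', 8)]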
Recognizing bigrams#
[7]:
bigram_vectorizer = CountVectorizer(
ngram_range=(1, 2),
token_pattern=r"\b\w+\b",
min_df=1,
)
analyze = bigram_vectorizer.build_analyzer()
analyze("Bi-grams are cool!")
[7]:
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
[8]:
bigram_vectorizer.fit(corpus)
X_2 = bigram_vectorizer.transform(corpus)
pd.DataFrame(
X_2.toarray(),
columns=bigram_vectorizer.get_feature_names_out(),
)
[8]:
   and  and the  document  first  first document  is  is the  is this  one  \
0    0        0         1      1               1   1       1        0    0
1    0        0         1      0               0   1       1        0    0
2    1        1         0      0               0   0       0        0    1
3    0        0         1      1               1   1       0        1    0

   second  ...  second second  the  the first  the second  the third  third  \
0       0  ...              0    1          1           0          0      0
1       2  ...              1    1          0           1          0      0
2       0  ...              0    1          0           0          1      1
3       0  ...              0    1          1           0          0      0

   third one  this  this is  this the
0          0     1        1         0
1          0     1        1         0
2          1     0        0         0
3          0     1        0         1

[4 rows x 21 columns]
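If only the bigrams are of interest, ngram_range=(2, 2) drops the unigrams. A minimal sketch of this variation (assumed, not part of the original notebook):
# Sketch: keep only 2-grams; unigrams are no longer generated.
only_bigrams = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\b\w+\b")
only_bigrams.fit(corpus)
only_bigrams.get_feature_names_out()
# array(['and the', 'first document', 'is the', 'is this', 'second document',
#        'second second', 'the first', 'the second', 'the third', 'third one',
#        'this is', 'this the'], dtype=object)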
Extracting a column from the document-term matrix#
[9]:
feature_index = bigram_vectorizer.vocabulary_.get("is this")
X_2[:, feature_index].toarray()
[9]:
array([[0],
       [0],
       [0],
       [1]])
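The extracted column can be used, for instance, to compute how many documents contain the bigram. A sketch reusing X_2 and feature_index from the previous cell:
# Document frequency of "is this": number of documents with a non-zero count.
int((X_2[:, feature_index].toarray() > 0).sum())
# 1 -> only the last document, "Is this the first document?", contains it.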