Transforming texts into features using CountVectorizer#
CountVectorizer converts an array of strings into a document-term matrix. For example, the corpus
This is the first document.
This is the second second document.
And the third one.
Is this the first document?

is transformed into the matrix

   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         1      0   1    0       2    1      0     1
2    1         0      0   0    1       0    1      1     0
3    0         1      1   1    0       0    1      0     1
Creating the corpus#
[1]:
corpus = [
"This is the first document.",
"This is the second second document.",
"And the third one.",
"Is this the first document?",
]
Creating the transformer#
[2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
pd.set_option("display.notebook_repr_html", False)
vectorizer = CountVectorizer(
# -------------------------------------------------------------------------
# Convert all characters to lowercase before tokenizing.
lowercase=True,
# -------------------------------------------------------------------------
# Override the preprocessing (strip_accents and lowercase) stage
preprocessor=None,
# -------------------------------------------------------------------------
# Override the string tokenization step while preserving the preprocessing
# and n-grams generation steps. Only applies if analyzer == 'word'.
tokenizer=None,
# -------------------------------------------------------------------------
# If ‘english’, a built-in stop word list for English is used.
stop_words=None,
# -------------------------------------------------------------------------
# {‘ascii’, ‘unicode’}
# Remove accents and perform other character normalization during the
# preprocessing step. ‘ascii’ is a fast method that only works on
# characters that have a direct ASCII mapping. ‘unicode’ is a slightly
# slower method that works on any characters.
strip_accents=None,
# -------------------------------------------------------------------------
# {‘word’, ‘char’, ‘char_wb’}
# Whether the features should be made of word n-grams or character n-grams.
analyzer="word",
# -------------------------------------------------------------------------
# Regular expression denoting what constitutes a “token”, only used if
# analyzer == 'word'. The default regexp selects tokens of 2 or more
# alphanumeric characters (punctuation is completely ignored and always
# treated as a token separator).
token_pattern=r"(?u)\b\w\w+\b",
# -------------------------------------------------------------------------
# When building the vocabulary ignore terms that have a document frequency
# strictly higher than the given threshold (corpus-specific stop words). If
# float, the parameter represents a proportion of documents, integer
# absolute counts. This parameter is ignored if vocabulary is not None.
max_df=1.0,
# -------------------------------------------------------------------------
# When building the vocabulary ignore terms that have a document frequency
# strictly lower than the given threshold. This value is also called
# cut-off in the literature. If float, the parameter represents a
# proportion of documents, integer absolute counts.
min_df=1,
# -------------------------------------------------------------------------
# If not None, build a vocabulary that only consider the top max_features
# ordered by term frequency across the corpus.
max_features=None,
# -------------------------------------------------------------------------
# If True, all non zero counts are set to 1. This is useful for discrete
# probabilistic models that model binary events rather than integer counts.
binary=False,
)
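These are scikit-learn's default values. As a rough illustration (not part of the original example), the sketch below shows how two of the parameters documented above change the result: stop_words="english" removes common English words before counting, and binary=True clips all counts to 0 or 1.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is the first document.",
    "This is the second second document.",
]
# stop_words="english": very frequent words such as "this", "is" and "the"
# are dropped from the vocabulary.
print(CountVectorizer(stop_words="english").fit(docs).get_feature_names_out())
# binary=True: the repeated "second" in the second document is counted as 1.
print(CountVectorizer(binary=True).fit_transform(docs).toarray())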
Creating the document-term matrix#
[3]:
vectorizer.fit(corpus)
X = vectorizer.transform(corpus)
pd.DataFrame(
X.toarray(),
columns=vectorizer.get_feature_names_out(),
)
[3]:
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         1      0   1    0       2    1      0     1
2    1         0      0   0    1       0    1      1     0
3    0         1      1   1    0       0    1      0     1
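The fit and transform steps can also be combined. As a minimal sketch, fit_transform learns the vocabulary and builds the matrix in a single call; a fresh vectorizer with default parameters gives the same matrix here, since the values set above are the defaults.
# Sketch: fit and transform in one step; the result equals X built above.
X_alt = CountVectorizer().fit_transform(corpus)
print((X_alt.toarray() == X.toarray()).all())  # True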
Transforming a new text#
[4]:
vectorizer.transform(["Something completely new."]).toarray()
[4]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])
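All counts are zero because none of the tokens in the new text belong to the vocabulary learned during fit; unknown tokens are silently ignored. As a sketch with an assumed example sentence, tokens that do appear in the vocabulary are still counted:
# Only "this", "is" and "document" are in the fitted vocabulary;
# "new", "about" and "documents" are ignored.
vectorizer.transform(["This is a new document about documents."]).toarray()
# array([[0, 1, 0, 1, 0, 0, 0, 0, 1]])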
Building an analyzer#
[5]:
analyze = vectorizer.build_analyzer()
analyze("This is a completely new text document to analyze.")
[5]:
['this', 'is', 'completely', 'new', 'text', 'document', 'to', 'analyze']
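Note that "a" is missing from the result: the default token_pattern only keeps tokens with two or more alphanumeric characters. A sketch of an assumed variation (not part of the original notebook) that also keeps single-character tokens:
# Sketch: a token_pattern that accepts single-character words as well.
analyze_1char = CountVectorizer(token_pattern=r"(?u)\b\w+\b").build_analyzer()
analyze_1char("This is a completely new text document to analyze.")
# ['this', 'is', 'a', 'completely', 'new', 'text', 'document', 'to', 'analyze']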
Position (column) of a token in the matrix#
[6]:
vectorizer.vocabulary_.get("document")
[6]:
1
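vocabulary_ holds the complete token-to-column mapping; columns are assigned in alphabetical order of the tokens. A quick way to inspect it (sketch):
# Sketch: list every token with its column index, sorted by position.
sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1])
# [('and', 0), ('document', 1), ('first', 2), ('is', 3), ('one', 4),
#  ('second', 5), ('the', 6), ('third', 7), ('this', 8)]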
Recognizing bigrams#
[7]:
bigram_vectorizer = CountVectorizer(
ngram_range=(1, 2),
token_pattern=r"\b\w+\b",
min_df=1,
)
analyze = bigram_vectorizer.build_analyzer()
analyze("Bi-grams are cool!")
[7]:
['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
[8]:
bigram_vectorizer.fit(corpus)
X_2 = bigram_vectorizer.transform(corpus)
pd.DataFrame(
X_2.toarray(),
columns=bigram_vectorizer.get_feature_names_out(),
)
[8]:
   and  and the  document  first  first document  is  is the  is this  one  \
0    0        0         1      1               1   1       1        0    0
1    0        0         1      0               0   1       1        0    0
2    1        1         0      0               0   0       0        0    1
3    0        0         1      1               1   1       0        1    0

   second  ...  second second  the  the first  the second  the third  third  \
0       0  ...              0    1          1           0          0      0
1       2  ...              1    1          0           1          0      0
2       0  ...              0    1          0           0          1      1
3       0  ...              0    1          1           0          0      0

   third one  this  this is  this the
0          0     1        1         0
1          0     1        1         0
2          1     0        0         0
3          0     1        0         1

[4 rows x 21 columns]
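If only the bigrams are of interest, ngram_range=(2, 2) drops the unigrams. A minimal sketch of this variation (assumed, not part of the original notebook):
# Sketch: keep only 2-grams; unigrams are no longer generated.
only_bigrams = CountVectorizer(ngram_range=(2, 2), token_pattern=r"\b\w+\b")
only_bigrams.fit(corpus)
only_bigrams.get_feature_names_out()
# array(['and the', 'first document', 'is the', 'is this', 'second document',
#        'second second', 'the first', 'the second', 'the third', 'third one',
#        'this is', 'this the'], dtype=object)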
Extracting a column from the document-term matrix#
[9]:
feature_index = bigram_vectorizer.vocabulary_.get("is this")
X_2[:, feature_index].toarray()
[9]:
array([[0],
       [0],
       [0],
       [1]])
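The extracted column can be used, for instance, to compute how many documents contain the bigram. A sketch reusing X_2 and feature_index from the previous cell:
# Document frequency of "is this": number of documents with a non-zero count.
int((X_2[:, feature_index].toarray() > 0).sum())
# 1 -> only the last document, "Is this the first document?", contains it.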