Working with text data
Last modified: 2024-01-22
Loading the data
[1]:
from sklearn.datasets import fetch_20newsgroups
categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "sci.med",
]
twenty_train = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
)
[2]:
twenty_train.target_names
[2]:
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
[3]:
len(twenty_train.data)
[3]:
2257
[4]:
twenty_train.filenames[0:10]
[4]:
array(['/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38440',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38479',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20737',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20942',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20487',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20891',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20914',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/sci.med/58110',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/sci.med/58114',
'/Users/jdvelasq/scikit_learn_data/20news_home/20news-bydate-train/sci.med/58838'],
dtype='<U96')
[5]:
len(twenty_train.filenames)
[5]:
2257
[6]:
print("\n".join(twenty_train.data[0].split("\n")[:10]))
From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14
Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format. We would also like to
do the same, converting to HPGL (HP plotter) files.
[7]:
print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics
[8]:
twenty_train.target[:10]
[8]:
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
[9]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
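The loop above maps integer labels to category names one at a time. A minimal sketch of the same lookup with NumPy indexing follows; the label array and the name list are copied by hand from the outputs above so the snippet runs without downloading the dataset, and `first_targets`, `first_names`, and `counts` are illustrative names.

```python
import numpy as np

# Values copied from the outputs shown above (not re-read from the dataset).
target_names = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
first_targets = np.array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

# Indexing an array of names with an array of labels maps every label at once.
first_names = np.array(target_names)[first_targets]
print(first_names)

# Number of documents per category among these ten labels.
counts = np.bincount(first_targets, minlength=len(target_names))
for name, n in zip(target_names, counts):
    print(f"{name}: {n}")
```

The same pattern should apply to the full `twenty_train.target` array to inspect the class balance of the training set.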
Extracting features from text files
[10]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape
[10]:
(2257, 35788)
[11]:
count_vect.vocabulary_.get("algorithm")
[11]:
4690
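To make the role of `vocabulary_` concrete, here is a small sketch on a toy corpus. The two sentences are made up for illustration and are not part of the 20 newsgroups data; `toy_docs`, `toy_vect`, and the other `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not part of the 20 newsgroups data).
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the count matrix;
# the indices follow the alphabetical order of the tokens.
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab.items(), key=lambda kv: kv[1]))

# Dense view of the sparse document-term matrix (one row per document).
toy_dense = toy_counts.toarray()
print(toy_dense)

# A token that was never seen during fitting is absent, so .get returns None.
print(toy_vocab.get("dog"))
```

The lookup `count_vect.vocabulary_.get("algorithm")` in the cell above works the same way: it returns the column that holds the counts for that token.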
[12]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape
[12]:
(2257, 35788)
[13]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
[13]:
(2257, 35788)
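The two transformers above return matrices of the same shape but with different weights. The sketch below reproduces both by hand on a tiny count matrix, assuming the default `TfidfTransformer` settings (`norm="l2"`, `smooth_idf=True`, `sublinear_tf=False`). The count matrix and the `manual_` variables are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 2 documents, 3 terms (not from the dataset).
toy_counts = np.array([[3, 0, 1],
                       [2, 1, 0]])

# use_idf=False: each row is only rescaled to unit Euclidean length.
tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts).toarray()
manual_tf = toy_counts / np.linalg.norm(toy_counts, axis=1, keepdims=True)

# Default settings: counts are multiplied by a smoothed idf, then normalized.
tfidf_model = TfidfTransformer().fit(toy_counts)
tfidf = tfidf_model.transform(toy_counts).toarray()

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)  # documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weighted = toy_counts * manual_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(tf, manual_tf))
print(np.allclose(tfidf, manual_tfidf))
print(np.allclose(tfidf_model.idf_, manual_idf))
```

Terms that appear in every document get the smallest idf, so their relative weight drops once `use_idf` is enabled.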
Training the classifier
[14]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
[15]:
docs_new = ["God is love", "OpenGL on the GPU is fast"]
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print("%r => %s" % (doc, twenty_train.target_names[category]))
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
Building a pipeline
[16]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ]
)
[17]:
text_clf.fit(twenty_train.data, twenty_train.target)
[17]:
Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
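The pipeline bundles the vectorizer, the transformer, and the classifier behind a single `fit`/`predict` interface. The sketch below checks on a toy corpus that both routes give the same predictions. The four sentences, the labels (0 for religion, 1 for graphics), and all `toy_`/`manual_` names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes (0 = religion, 1 = graphics).
toy_docs = [
    "god is love and faith",
    "church and prayer and faith",
    "opengl renders images on the gpu",
    "the gpu draws graphics fast",
]
toy_target = np.array([0, 0, 1, 1])
new_docs = ["faith and prayer", "fast gpu graphics"]

# Manual route: fit and apply each step by hand.
vect = CountVectorizer()
tfidf = TfidfTransformer()
nb = MultinomialNB()
X_manual = tfidf.fit_transform(vect.fit_transform(toy_docs))
nb.fit(X_manual, toy_target)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline route: the same three steps behind one fit/predict call.
toy_pipe = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", MultinomialNB()),
    ]
)
toy_pipe.fit(toy_docs, toy_target)
pipe_pred = toy_pipe.predict(new_docs)

print(manual_pred, pipe_pred)

# Fitted steps stay reachable by the name given in the pipeline definition.
print(len(toy_pipe.named_steps["vect"].vocabulary_))
```

The step names (`vect`, `tfidf`, `clf`) matter later: they are the prefixes used to address parameters in a grid search.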
Evaluating performance on the test set
[18]:
import numpy as np
twenty_test = fetch_20newsgroups(
    subset="test",
    categories=categories,
    shuffle=True,
    random_state=42,
)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
[18]:
0.8348868175765646
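The expression `np.mean(predicted == twenty_test.target)` is the fraction of correct predictions, i.e. the accuracy. The sketch below confirms on two short label arrays that it matches `sklearn.metrics.accuracy_score`; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of the 6 predictions are correct.
y_true = np.array([1, 1, 3, 3, 2, 0])
y_pred = np.array([1, 3, 3, 3, 2, 2])

# The comparison yields booleans; their mean is the share of True values.
manual_accuracy = np.mean(y_pred == y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(f"{manual_accuracy:.4f} {library_accuracy:.4f}")
```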
[19]:
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "clf",
            SGDClassifier(
                loss="hinge",
                penalty="l2",
                alpha=1e-3,
                random_state=42,
                max_iter=5,
                tol=None,
            ),
        ),
    ]
)
text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
[19]:
0.9101198402130493
[20]:
from sklearn import metrics
print(
    metrics.classification_report(
        twenty_test.target, predicted, target_names=twenty_test.target_names
    )
)
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.80      0.87       319
         comp.graphics       0.87      0.98      0.92       389
               sci.med       0.94      0.89      0.91       396
soc.religion.christian       0.90      0.95      0.93       398

              accuracy                           0.91      1502
             macro avg       0.91      0.91      0.91      1502
          weighted avg       0.91      0.91      0.91      1502
[21]:
metrics.confusion_matrix(twenty_test.target, predicted)
[21]:
array([[256,  11,  16,  36],
       [  4, 380,   3,   2],
       [  5,  35, 353,   3],
       [  5,  11,   4, 378]])
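The confusion matrix carries everything needed to recompute the per-class precision and recall and the overall accuracy. In the sketch below the matrix is copied by hand from the output above (rows are true classes, columns are predicted classes, following the scikit-learn convention); the variable names are illustrative.

```python
import numpy as np

target_names = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array(
    [
        [256, 11, 16, 36],
        [4, 380, 3, 2],
        [5, 35, 353, 3],
        [5, 11, 4, 378],
    ]
)

true_pos = np.diag(cm)                  # correctly classified documents per class
recall = true_pos / cm.sum(axis=1)      # divide by the number of true members
precision = true_pos / cm.sum(axis=0)   # divide by the number of predictions
accuracy = true_pos.sum() / cm.sum()

for name, p, r in zip(target_names, precision, recall):
    print(f"{name:>22s}  precision={p:.2f}  recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the low recall of `alt.atheism` comes mostly from the 36 documents in the first row that were assigned to `soc.religion.christian`.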
Parameter tuning using grid search
[22]:
from sklearn.model_selection import GridSearchCV
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}
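Each key in `parameters` has the form `<step name>__<parameter>`, and the grid search tries every combination of the listed values. The sketch below uses `ParameterGrid` to enumerate them. The figure of 5 folds assumes the `cv=5` setting used in the next cell, and `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

# ParameterGrid expands the dictionary into every combination of values.
grid = ParameterGrid(parameters)
n_candidates = len(grid)

# Assumption: 5-fold cross-validation, so each candidate is fitted 5 times
# (the final refit on all the data adds one more fit).
n_fits = n_candidates * 5
print(n_candidates, n_fits)

for candidate in grid:
    print(candidate)
```

The count grows multiplicatively with every parameter added, which is one reason the search below is run on only the first 400 training documents.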
[23]:
gs_clf = GridSearchCV(text_clf, parameters, cv=5, n_jobs=-1)
[24]:
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
[25]:
twenty_train.target_names[gs_clf.predict(["God is love"])[0]]
[25]:
'soc.religion.christian'
[26]:
gs_clf.best_score_
[26]:
0.9175000000000001
[27]:
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)