The 20newsgroups dataset — 5:34 min

  • 5:34 min | Last modified: September 28, 2021 | YouTube

https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

This dataset contains 18,846 newsgroup messages classified into 20 categories. The dataset is split into a training set and a test set; the partition is based solely on the date on which each message was posted.
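The date-based partition is fixed: 11,314 messages in the training subset and 7,532 in the test subset, roughly a 60/40 split. A quick sketch checking these published sizes (the fetch calls are commented out because the first one downloads the data):

```python
# Published subset sizes from the dataset description (DESCR).
n_train, n_test = 11_314, 7_532

print(n_train + n_test)                        # 18846, the full dataset
print(round(n_train / (n_train + n_test), 2))  # 0.6 -> roughly 60/40

# To load each half explicitly (downloads ~14 MB on first use):
# from sklearn.datasets import fetch_20newsgroups
# train = fetch_20newsgroups(subset="train")
# test = fetch_20newsgroups(subset="test")
# assert len(train.data) == n_train and len(test.data) == n_test
```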

[1]:
from sklearn.datasets import fetch_20newsgroups
[2]:
bunch = fetch_20newsgroups(
    # -----------------------------------------------------
    # Specify another download and cache folder for the
    # datasets. By default all scikit-learn data is stored
    # in ‘~/scikit_learn_data’ subfolders.
    data_home=None,
    # -----------------------------------------------------
    # Select the dataset to load: ‘train’ for the training
    # set, ‘test’ for the test set, ‘all’ for both, with
    # shuffled ordering.
    subset="all",
    # -----------------------------------------------------
    # If None (default), load all the categories. If not
    # None, list of category names to load (other
    # categories ignored).
    categories=None,
    # -----------------------------------------------------
    # Whether or not to shuffle the data
    shuffle=False,
    # -----------------------------------------------------
    # Determines random number generation for dataset
    # shuffling.
    random_state=None,
    # -----------------------------------------------------
    # May contain any subset of (‘headers’, ‘footers’,
    # ‘quotes’). Each of these are kinds of text that will
    # be detected and removed from the newsgroup posts,
    # preventing classifiers from overfitting on metadata.
    #
    # ‘headers’ removes newsgroup headers, ‘footers’ removes
    # blocks at the ends of posts that look like signatures,
    # and ‘quotes’ removes lines that appear to be quoting
    # another post.
    #
    # ‘headers’ follows an exact standard; the other filters
    # are not always correct.
    remove=(),
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=False,
)

bunch.keys()
[2]:
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
[3]:
bunch.target_names
[3]:
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
[4]:
bunch.target[:5]
[4]:
array([ 9,  4, 11,  4,  0])
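The integers in `target` are indices into `target_names`. Decoding the five labels above against the category list already shown (hard-coded here so the snippet is self-contained):

```python
# Category list in the fixed order returned by bunch.target_names.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

labels = [9, 4, 11, 4, 0]  # bunch.target[:5] from the cell above
decoded = [target_names[i] for i in labels]
print(decoded)
# ['rec.sport.baseball', 'comp.sys.mac.hardware', 'sci.crypt',
#  'comp.sys.mac.hardware', 'alt.atheism']
```

Note that the first label decodes to `rec.sport.baseball`, which matches the Cubs/Marlins post printed in the next cell.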
[5]:
print(bunch.data[0])
From: cubbie@garnet.berkeley.edu (                               )
Subject: Re: Cubs behind Marlins? How?
Article-I.D.: agate.1pt592$f9a
Organization: University of California, Berkeley
Lines: 12
NNTP-Posting-Host: garnet.berkeley.edu


gajarsky@pilot.njin.net writes:

morgan and guzman will have era's 1 run higher than last year, and
 the cubs will be idiots and not pitch harkey as much as hibbard.
 castillo won't be good (i think he's a stud pitcher)

       This season so far, Morgan and Guzman helped to lead the Cubs
       at top in ERA, even better than THE rotation at Atlanta.
       Cubs ERA at 0.056 while Braves at 0.059. We know it is early
       in the season, we Cubs fans have learned how to enjoy the
       short triumph while it is still there.

[6]:
X, y = fetch_20newsgroups(
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=True,
)

display(
    X[:2],
    y[:2],
)
["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks.\n\nGuy Kuo <guykuo@u.washington.edu>\n"]
array([7, 4])
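The headers visible in the raw posts above (`From:`, `Subject:`, `Organization:`, …) are exactly what `remove=("headers",)` strips out. A rough, hypothetical re-implementation of the idea — drop everything up to the first blank line — is shown below; the real filter is internal to scikit-learn and may differ in edge cases:

```python
def strip_header(post: str) -> str:
    """Drop everything before the first blank line (a sketch of what
    remove=('headers',) does; not scikit-learn's actual code)."""
    _header, sep, body = post.partition("\n\n")
    return body if sep else post

msg = ("From: lerxst@wam.umd.edu (where's my thing)\n"
       "Subject: WHAT car is this!?\n"
       "Lines: 15\n"
       "\n"
       " I was wondering if anyone out there could enlighten me on this car")
print(strip_header(msg))
# " I was wondering if anyone out there could enlighten me on this car"
```

Stripping headers (and footers/quotes) is recommended when benchmarking classifiers, since models otherwise overfit on metadata such as e-mail domains.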
[7]:
from sklearn.datasets import fetch_20newsgroups_vectorized

bunch = fetch_20newsgroups_vectorized(
    # -----------------------------------------------------
    # Specify another download and cache folder for the
    # datasets. By default all scikit-learn data is stored
    # in ‘~/scikit_learn_data’ subfolders.
    data_home=None,
    # -----------------------------------------------------
    # Select the dataset to load: ‘train’ for the training
    # set, ‘test’ for the test set, ‘all’ for both, with
    # shuffled ordering.
    subset="all",
    # -----------------------------------------------------
    # May contain any subset of (‘headers’, ‘footers’,
    # ‘quotes’). Each of these are kinds of text that will
    # be detected and removed from the newsgroup posts,
    # preventing classifiers from overfitting on metadata.
    #
    # ‘headers’ removes newsgroup headers, ‘footers’ removes
    # blocks at the ends of posts that look like signatures,
    # and ‘quotes’ removes lines that appear to be quoting
    # another post.
    #
    # ‘headers’ follows an exact standard; the other filters
    # are not always correct.
    remove=(),
    # -----------------------------------------------------
    # If True, returns (data, target) instead of a Bunch
    # object.
    return_X_y=False,
)
[8]:
bunch.keys()
[8]:
dict_keys(['data', 'target', 'target_names', 'DESCR'])
[9]:
bunch.data.shape
[9]:
(18846, 130107)
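A matrix of 18,846 documents by 130,107 terms would be far too large to hold densely, so `fetch_20newsgroups_vectorized` returns a `scipy.sparse` matrix storing only the non-zero entries. Its density can be computed from the `nnz` attribute; a toy illustration (the real `bunch.data` works the same way):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for bunch.data: mostly zeros, only non-zeros stored.
X = csr_matrix(np.array([[0, 2, 0, 0],
                         [1, 0, 0, 3],
                         [0, 0, 0, 0]]))

density = X.nnz / (X.shape[0] * X.shape[1])
print(X.nnz, density)  # 3 non-zero entries out of 12 -> 0.25
```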
[10]:
bunch.target
[10]:
array([17,  7, 10, ..., 10, 18,  9])