Descubrimiento de reglas de asociación en tags de proyectos de software

  • 60 min | Ultima modificación: Noviembre 26, 2020

Este tutorial esta basado en Mastering Data Mining with Python, Megan Squire, 2016. Packt Publishing.

El archivo project_tags.csv contiene los tags asociados a diferentes proyectos de software por los desarrolladdores. La primera columna corresponde al ID del proyecto; la segunda al tag asignado. Se desean construir reglas que permiten sugerir un tag a partir de dos tags previamente seleccionados por el usuario.

[1]:
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
[31]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv
--2020-11-29 04:47:25--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5418355 (5.2M) [text/plain]
Saving to: ‘project_tags.csv.1’

project_tags.csv.1  100%[===================>]   5.17M  4.52MB/s    in 1.1s

2020-11-29 04:47:27 (4.52 MB/s) - ‘project_tags.csv.1’ saved [5418355/5418355]

Carga de los datos

[2]:
conn.executescript("""
DROP TABLE IF EXISTS project_tags;

CREATE TABLE project_tags
(
    project_id     INT NOT NULL DEFAULT '0',
    tag_name    STRING NOT NULL DEFAULT '0',
    PRIMARY KEY (project_id, tag_name)
);
""")

conn.commit()
[3]:
with open('project_tags.csv', 'rt') as f:
    data = f.readlines()

## Elimina el '\n' al final de la línea
data = [line.replace('\n', '') for line in data]

## Separa los campos por comas
data = [line.split(',') for line in data]

## Convierte la fila en una tupla
data = [tuple(line) for line in data]

## Elimina valores duplicados
data = list(set([tuple(line) for line in data]))


## Imprime los primeros 5 registros para verificar
data[0:5]
[3]:
[('36762', 'Database Engines/Servers'),
 ('14882', 'Systems Administration'),
 ('53184', 'C'),
 ('41895', 'multimedia'),
 ('53266', 'Desktop Environment')]
[4]:
##
## Carga a partir de la lista de tuplas
## contenidas en data
##
cursor.executemany('INSERT INTO project_tags VALUES (?,?)', data)

##
## Verificación
##
cursor.execute("SELECT * FROM project_tags LIMIT 5;").fetchall()
[4]:
[(36762, 'Database Engines/Servers'),
 (14882, 'Systems Administration'),
 (53184, 'C'),
 (41895, 'multimedia'),
 (53266, 'Desktop Environment')]

Información básica

[5]:
##
## Cantidad de registros
##
cursor.execute("SELECT COUNT(*) FROM project_tags;").fetchone()[0]
[5]:
[(353401,)]
[6]:
##
## Cantidad de proyectos
##
cursor.execute("SELECT COUNT(DISTINCT project_id) FROM project_tags;").fetchone()[0]
[6]:
(46511,)
[17]:
##
## Cantidad de proyectos
## Se toma como baskets la cantidad de proyectos en la tabla
##
baskets = cursor.execute("SELECT COUNT(DISTINCT project_id) FROM project_tags;").fetchone()[0]
baskets
[17]:
46511

Número de proyectos por tag y soporte

[9]:
##
## Número de proyectos por tag
##
x = cursor.execute(
"""
    SELECT
        tag_name,
        COUNT(project_id),
        ROUND(
            COUNT(project_id) * 100.0 / (SELECT COUNT(DISTINCT project_id) FROM project_tags),
            2)
    FROM
        project_tags
    GROUP BY
        1
    ORDER BY
        2 DESC
    LIMIT
        35;
""")

##
## Un 5% equivale aprox a 2335 proyectos
##
for tag, value, pct in x.fetchall():
    print("{:23s}  {:6d}   {:6.2f}%".format(tag, value, pct))
GPL                       21176    45.53%
POSIX                     16868    36.27%
Linux                     16284    35.01%
C                         10288    22.12%
OS Independent            10178    21.88%
Software Development       9614    20.67%
Internet                   8097    17.41%
Windows                    7572    16.28%
Java                       6390    13.74%
Web                        6264    13.47%
English                    5997    12.89%
C++                        5891    12.67%
Libraries                  5738    12.34%
PHP                        5448    11.71%
Unix                       5098    10.96%
Mac OS X                   4823    10.37%
multimedia                 4813    10.35%
Communications             4449     9.57%
Perl                       4242     9.12%
Python                     4190     9.01%
LGPL                       3524     7.58%
Utilities                  3297     7.09%
Dynamic Content            3199     6.88%
GPLv3                      2875     6.18%
Networking                 2819     6.06%
Scientific/Engineering     2678     5.76%
Games/Entertainment        2528     5.44%
BSD                        2494     5.36%
Desktop Environment        2335     5.02%
Graphics                   2268     4.88%
Database                   2200     4.73%
GPLv2                      2147     4.62%
Text Processing            2131     4.58%
Sound/Audio                2094     4.50%
Security                   1960     4.21%
[18]:
#
# Soporte mínimo
#
MIN_SUPPORT_PCT = 5

#
# Descarta el porcentaje especificado (MIN_SUPPORT_PCT) de tags menos frecuentes.
# Se require que el tag aparezca en 554 proyectos o mas (de 46511 proyectos existentes)
#
minsupport = baskets * (MIN_SUPPORT_PCT / 100)
print(
    "Minimum support count: {} ({}% of bastkets)".format(minsupport, MIN_SUPPORT_PCT),
)
Minimum support count: 2325.55 (5% of bastkets)

Singletons

[11]:
##
## Descarta los tags menos frecuentes. Singletons es una
## lista de tuplas de la siguiente forma:
##
##    [('Apache 2.0',),
##     ('Application Frameworks',),
##     ('Archiving',),
##     ...
##    ]
##
singletons = cursor.execute(
    """
    SELECT
        DISTINCT tag_name
    FROM
        project_tags
    GROUP BY
        1
    HAVING
        COUNT(project_id) >= {}
    ORDER BY
        tag_name
    """.format(
        minsupport
    )
).fetchall()

##
## Esta variable contiene todos los tags que aparecen
## en, al menos, el 5% de los proyectos
##
allSingletonTags = [x[0] for x in singletons]

allSingletonTags[:5]
[11]:
['Apache 2.0', 'Application Frameworks', 'Archiving', 'Artistic', 'BSD']

Doubletons

[19]:
##
## La siguiente tabla contiene la cantidad de proyectos
## que tienen tag1 y tag2 simultáneamente
##
conn.executescript(
    """
    DROP TABLE IF EXISTS project_tag_pairs;

    CREATE TABLE project_tag_pairs
    (
        tag1      STRING,
        tag2      STRING,
        num_projs INT
    );
    """
)

conn.commit()
[20]:
from itertools import combinations

##
## Uso de itertools.combinations
##
x = [0, 1, 2, 3]
for w in list(combinations(x, 2)):
    print(w)
(0, 1)
(0, 2)
(0, 3)
(1, 2)
(1, 3)
(2, 3)
[21]:
##
## Tags que aparecen unicamente en las
## combinaciones admisibles de dos tags
## diferentes
##
allDoubletonTags = set()

##
## Tuplas unicas formadas por (tag0, tag1)
##
doubletonSet = set()


def findDoubletons():

    ##
    ## INNER JOIN retorna lo registros que aparecen
    ## simultaneamente en las dos tablas (intersección)
    ##
    ## La siguiente consulta retorna cuantos proyectos usan
    ## tag1 y tag2 simultaneamente.
    ##
    ## Si:
    ##
    ##    prj0, tag0
    ##    prj0, tag1
    ##    prj0, tag2
    ##    prj1, tag0
    ##    prj1, tag1
    ##    prj1, tag3
    ##    prj2, tag0
    ##    prj2, tag3
    ##    ...
    ##
    ## El inner join con tag0 y tag1 genera:
    ##
    ##    prj0, tag0, prj0, tag1
    ##    prj1, tag0, prj1, tag1
    ##    ...
    ##
    getDoubletonFrequencyQuery = """
        SELECT
            count(t1.project_id)
        FROM
            project_tags t1
        INNER JOIN
            project_tags t2
        ON
            t1.project_id = t2.project_id
        WHERE
        (
            t1.tag_name = '{}'
            AND t2.tag_name = '{}'
        )
    """

    insertPairQuery = """
        INSERT INTO
            project_tag_pairs (tag1, tag2, num_projs)
        VALUES
            ('{}','{}',{})
    """

    ##
    ## Genera todas las combinaciones de dos tags usando
    ## los tags individuales que cumplen con una ocurrencia
    ## minima
    ##
    doubletonCandidates = list(combinations(allSingletonTags, 2))

    for (index, candidate) in enumerate(doubletonCandidates):

        tag1 = candidate[0]
        tag2 = candidate[1]

        ##
        ## Cuenta la cantidad de proyectos que usan tag1 y tag2 simultaneamente
        ##
        count = cursor.execute(
            getDoubletonFrequencyQuery.format(tag1, tag2)
        ).fetchone()[0]


        if count > minsupport:

            ## Don't panic!: reporta que se esta ejecutando.
            print(".", sep="", end="")


            cursor.execute(insertPairQuery.format(tag1, tag2, count))

            ##
            ## Inserta la tupla (tag1, tag2) en la tabla
            ##
            doubletonSet.add(candidate)

            ##
            ## Agrega los tags a la lista de tags usados
            ##
            allDoubletonTags.add(tag1)
            allDoubletonTags.add(tag2)


findDoubletons()
....................................
[23]:
x = cursor.execute(
"""
    SELECT
        *
    FROM
        project_tag_pairs
    ORDER BY
        1 ASC,
        2 ASC;
""")

##
## Un 5% equivale aprox a 2335 proyectos
##
for tag1, tag2, num_projs in x.fetchall():
    print("{:23s}  {:23s}   {:6d}".format(tag1, tag2, num_projs))
C                        GPL                         5539
C                        Linux                       5648
C                        POSIX                       6952
C++                      GPL                         2911
C++                      Linux                       3425
C++                      POSIX                       3501
Communications           GPL                         2578
Dynamic Content          Internet                    3171
Dynamic Content          Web                         3170
English                  Linux                       2660
GPL                      Internet                    4035
GPL                      Linux                       8036
GPL                      OS Independent              4403
GPL                      PHP                         2372
GPL                      POSIX                      10062
GPL                      Software Development        3318
GPL                      Web                         2899
GPL                      Windows                     2603
GPL                      multimedia                  2879
Internet                 OS Independent              3005
Internet                 POSIX                       2831
Internet                 Web                         5973
Java                     OS Independent              3433
Java                     Software Development        2356
Libraries                Software Development        5633
Linux                    Mac OS X                    2973
Linux                    POSIX                      11896
Linux                    Software Development        2335
Linux                    Unix                        2493
Linux                    Windows                     5279
Mac OS X                 Windows                     3131
OS Independent           Software Development        3564
OS Independent           Web                         2602
POSIX                    Software Development        3501
POSIX                    Windows                     4464
POSIX                    multimedia                  2538

Tripletons

[24]:
##
## La siguiente tabla contiene la cantidad de proyectos
## que tienen tag1, tag2 y tag3 simultáneamente
##
conn.executescript("""
DROP TABLE IF EXISTS project_tag_triples;

CREATE TABLE project_tag_triples
(
    tag1      STRING,
    tag2      STRING,
    tag3      STRING,
    num_projs INT
);
""")

conn.commit()
[26]:
def findTripletons():

    ##
    ## Sigue una lógica similar a la usada anteriormente
    ##
    getTripletonFrequencyQuery = """
        SELECT
            count(t1.project_id)
        FROM
            project_tags t1
        INNER JOIN
                project_tags t2
            ON
                t1.project_id = t2.project_id
        INNER JOIN
                project_tags t3
            ON
                t2.project_id = t3.project_id
        WHERE
        (
            t1.tag_name = '{}'
            AND t2.tag_name = '{}'
            AND t3.tag_name = '{}'
        )
    """

    insertTripletonQuery = """
        INSERT INTO project_tag_triples(tag1, tag2, tag3, num_projs)
        VALUES ('{}','{}','{}',{})
    """

    ##
    ##  Crea tripletas ordenadas con los tags que aparecen en dos proyectos y
    ##  cumplen con el soporte minimo
    ##
    tripletonCandidates = [
        sorted(tc) for tc in list(combinations(allDoubletonTags, 3))
    ]

    for index, candidate in enumerate(tripletonCandidates):

        ##
        ## La tripleta contiene, al menos, una tupla que esta en la
        ## la lista de doubleTons
        ##
        if any(
            [
                tuple_ in doubletonSet
                for tuple_ in list(combinations(candidate, 2))
            ]
        ):

            ##
            ## Computa la frecuencia de la tripleta
            ##
            count = cursor.execute(
                getTripletonFrequencyQuery.format(candidate[0], candidate[1], candidate[2])
            ).fetchone()[0]

            ##
            ## Inserta las tripletas que cumplen con la frecuencia mínima
            ##
            if count > minsupport:

                print(".", sep="", end="")

                cursor.execute(
                    insertTripletonQuery.format(
                        candidate[0], candidate[1], candidate[2], count
                    ),
                )
    print('*')


findTripletons()
.........*
[27]:
x = cursor.execute(
"""
    SELECT
        *
    FROM
        project_tag_triples
    ORDER BY
        1 ASC,
        2 ASC,
        3 ASC;
""")


for tag1, tag2, tag3, num_projs in x.fetchall():
    print("{:23s}  {:23s}  {:23s}  {:6d}".format(tag1, tag2, tag3, num_projs))
C                        GPL                      Linux                      3295
C                        GPL                      POSIX                      4360
C                        Linux                    POSIX                      4625
C++                      Linux                    POSIX                      2621
Dynamic Content          Internet                 Web                        3163
GPL                      Internet                 Web                        2874
GPL                      Linux                    POSIX                      7379
Internet                 OS Independent           Web                        2516
Linux                    POSIX                    Windows                    3312
[28]:
def calcSCAV(tagA, tagB, tagC, ruleSupport, file):
    ##
    ## Support
    ##
    ruleSupportPct = round((ruleSupport / baskets), 2)

    ##
    ## Confidence
    ##
    queryConf = """
        SELECT num_projs
        FROM project_tag_pairs
        WHERE
        (
            (tag1 = '{}' AND tag2 = '{}')
            OR  (tag2 = '{}' AND tag1 = '{}')
        )
    """

    pairSupport = cursor.execute(queryConf.format(tagA, tagB, tagA, tagB)).fetchone()[0]

    confidence = round((ruleSupport / pairSupport), 2)

    ##
    ## Added Value
    ##
    queryAV = """
        SELECT count(*)
        FROM project_tags
        WHERE tag_name= '{}'
    """

    supportTagC = cursor.execute(queryAV.format(tagC)).fetchone()[0]

    supportTagCPct = supportTagC / baskets

    addedValue = round((confidence - supportTagCPct), 2)

    print(
        "{}, {} -> {}  [S={}, C={}, AV={}]".format(
            tagA, tagB, tagC, ruleSupportPct, confidence, addedValue
        ),
        file=file,
    )


def generateRules():

    ##
    ## Consulta para obtiener las tripletas para obtener las reglas
    ##
    getFinalListQuery = """
        SELECT tag1, tag2, tag3, num_projs FROM project_tag_triples
    """

    ##
    ## Obtiene las tripletas
    ##
    triples = cursor.execute(getFinalListQuery).fetchall()

    with open("report.txt", "w") as file:

        for triple in triples:

            tag1 = triple[0]
            tag2 = triple[1]
            tag3 = triple[2]
            ruleSupport = triple[3]

            calcSCAV(tag1, tag2, tag3, ruleSupport, file)
            calcSCAV(tag1, tag3, tag2, ruleSupport, file)
            calcSCAV(tag2, tag3, tag1, ruleSupport, file)
            print("*", file=file)


generateRules()
[29]:
!head -n 48 report.txt
Dynamic Content, Internet -> Web  [S=0.07, C=1.0, AV=0.87]
Dynamic Content, Web -> Internet  [S=0.07, C=1.0, AV=0.83]
Internet, Web -> Dynamic Content  [S=0.07, C=0.53, AV=0.46]
*
Internet, OS Independent -> Web  [S=0.05, C=0.84, AV=0.71]
Internet, Web -> OS Independent  [S=0.05, C=0.42, AV=0.2]
OS Independent, Web -> Internet  [S=0.05, C=0.97, AV=0.8]
*
GPL, Internet -> Web  [S=0.06, C=0.71, AV=0.58]
GPL, Web -> Internet  [S=0.06, C=0.99, AV=0.82]
Internet, Web -> GPL  [S=0.06, C=0.48, AV=0.02]
*
C, Linux -> POSIX  [S=0.1, C=0.82, AV=0.46]
C, POSIX -> Linux  [S=0.1, C=0.67, AV=0.32]
Linux, POSIX -> C  [S=0.1, C=0.39, AV=0.17]
*
C, GPL -> POSIX  [S=0.09, C=0.79, AV=0.43]
C, POSIX -> GPL  [S=0.09, C=0.63, AV=0.17]
GPL, POSIX -> C  [S=0.09, C=0.43, AV=0.21]
*
C, GPL -> Linux  [S=0.07, C=0.59, AV=0.24]
C, Linux -> GPL  [S=0.07, C=0.58, AV=0.12]
GPL, Linux -> C  [S=0.07, C=0.41, AV=0.19]
*
Linux, POSIX -> Windows  [S=0.07, C=0.28, AV=0.12]
Linux, Windows -> POSIX  [S=0.07, C=0.63, AV=0.27]
POSIX, Windows -> Linux  [S=0.07, C=0.74, AV=0.39]
*
C++, Linux -> POSIX  [S=0.06, C=0.77, AV=0.41]
C++, POSIX -> Linux  [S=0.06, C=0.75, AV=0.4]
Linux, POSIX -> C++  [S=0.06, C=0.22, AV=0.09]
*
GPL, Linux -> POSIX  [S=0.16, C=0.92, AV=0.56]
GPL, POSIX -> Linux  [S=0.16, C=0.73, AV=0.38]
Linux, POSIX -> GPL  [S=0.16, C=0.62, AV=0.16]
*
[ ]: