Descubrimiento de reglas de asociación en tags de proyectos de software¶
60 min | Ultima modificación: Noviembre 26, 2020
Este tutorial esta basado en Mastering Data Mining with Python, Megan Squire, 2016. Packt Publishing.
El archivo project_tags.csv
contiene los tags asociados a diferentes proyectos de software por los desarrolladdores. La primera columna corresponde al ID del proyecto; la segunda al tag asignado. Se desean construir reglas que permiten sugerir un tag a partir de dos tags previamente seleccionados por el usuario.
[1]:
import sqlite3
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
[31]:
!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv
--2020-11-29 04:47:25-- https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5418355 (5.2M) [text/plain]
Saving to: ‘project_tags.csv.1’
project_tags.csv.1 100%[===================>] 5.17M 4.52MB/s in 1.1s
2020-11-29 04:47:27 (4.52 MB/s) - ‘project_tags.csv.1’ saved [5418355/5418355]
Carga de los datos¶
[2]:
conn.executescript("""
DROP TABLE IF EXISTS project_tags;
CREATE TABLE project_tags
(
project_id INT NOT NULL DEFAULT '0',
tag_name STRING NOT NULL DEFAULT '0',
PRIMARY KEY (project_id, tag_name)
);
""")
conn.commit()
[3]:
with open('project_tags.csv', 'rt') as f:
data = f.readlines()
## Elimina el '\n' al final de la línea
data = [line.replace('\n', '') for line in data]
## Separa los campos por comas
data = [line.split(',') for line in data]
## Convierte la fila en una tupla
data = [tuple(line) for line in data]
## Elimina valores duplicados
data = list(set([tuple(line) for line in data]))
## Imprime los primeros 5 registros para verificar
data[0:5]
[3]:
[('36762', 'Database Engines/Servers'),
('14882', 'Systems Administration'),
('53184', 'C'),
('41895', 'multimedia'),
('53266', 'Desktop Environment')]
[4]:
##
## Carga a partir de la lista de tuplas
## contenidas en data
##
cursor.executemany('INSERT INTO project_tags VALUES (?,?)', data)
##
## Verificación
##
cursor.execute("SELECT * FROM project_tags LIMIT 5;").fetchall()
[4]:
[(36762, 'Database Engines/Servers'),
(14882, 'Systems Administration'),
(53184, 'C'),
(41895, 'multimedia'),
(53266, 'Desktop Environment')]
Información básica¶
[5]:
##
## Cantidad de registros
##
cursor.execute("SELECT COUNT(*) FROM project_tags;").fetchone()[0]
[5]:
[(353401,)]
[6]:
##
## Cantidad de proyectos
##
cursor.execute("SELECT COUNT(DISTINCT project_id) FROM project_tags;").fetchone()[0]
[6]:
(46511,)
[17]:
##
## Cantidad de proyectos
## Se toma como baskets la cantidad de proyectos en la tabla
##
baskets = cursor.execute("SELECT COUNT(DISTINCT project_id) FROM project_tags;").fetchone()[0]
baskets
[17]:
46511
Número de proyectos por tag y soporte¶
[9]:
##
## Número de proyectos por tag
##
x = cursor.execute(
"""
SELECT
tag_name,
COUNT(project_id),
ROUND(
COUNT(project_id) * 100.0 / (SELECT COUNT(DISTINCT project_id) FROM project_tags),
2)
FROM
project_tags
GROUP BY
1
ORDER BY
2 DESC
LIMIT
35;
""")
##
## Un 5% equivale aprox a 2335 proyectos
##
for tag, value, pct in x.fetchall():
print("{:23s} {:6d} {:6.2f}%".format(tag, value, pct))
GPL 21176 45.53%
POSIX 16868 36.27%
Linux 16284 35.01%
C 10288 22.12%
OS Independent 10178 21.88%
Software Development 9614 20.67%
Internet 8097 17.41%
Windows 7572 16.28%
Java 6390 13.74%
Web 6264 13.47%
English 5997 12.89%
C++ 5891 12.67%
Libraries 5738 12.34%
PHP 5448 11.71%
Unix 5098 10.96%
Mac OS X 4823 10.37%
multimedia 4813 10.35%
Communications 4449 9.57%
Perl 4242 9.12%
Python 4190 9.01%
LGPL 3524 7.58%
Utilities 3297 7.09%
Dynamic Content 3199 6.88%
GPLv3 2875 6.18%
Networking 2819 6.06%
Scientific/Engineering 2678 5.76%
Games/Entertainment 2528 5.44%
BSD 2494 5.36%
Desktop Environment 2335 5.02%
Graphics 2268 4.88%
Database 2200 4.73%
GPLv2 2147 4.62%
Text Processing 2131 4.58%
Sound/Audio 2094 4.50%
Security 1960 4.21%
[18]:
#
# Soporte mínimo
#
MIN_SUPPORT_PCT = 5
#
# Descarta el porcentaje especificado (MIN_SUPPORT_PCT) de tags menos frecuentes.
# Se require que el tag aparezca en 554 proyectos o mas (de 46511 proyectos existentes)
#
minsupport = baskets * (MIN_SUPPORT_PCT / 100)
print(
"Minimum support count: {} ({}% of bastkets)".format(minsupport, MIN_SUPPORT_PCT),
)
Minimum support count: 2325.55 (5% of bastkets)
Singletons¶
[11]:
##
## Descarta los tags menos frecuentes. Singletons es una
## lista de tuplas de la siguiente forma:
##
## [('Apache 2.0',),
## ('Application Frameworks',),
## ('Archiving',),
## ...
## ]
##
singletons = cursor.execute(
"""
SELECT
DISTINCT tag_name
FROM
project_tags
GROUP BY
1
HAVING
COUNT(project_id) >= {}
ORDER BY
tag_name
""".format(
minsupport
)
).fetchall()
##
## Esta variable contiene todos los tags que aparecen
## en, al menos, el 5% de los proyectos
##
allSingletonTags = [x[0] for x in singletons]
allSingletonTags[:5]
[11]:
['Apache 2.0', 'Application Frameworks', 'Archiving', 'Artistic', 'BSD']
Doubletons¶
[19]:
##
## La siguiente tabla contiene la cantidad de proyectos
## que tienen tag1 y tag2 simultáneamente
##
conn.executescript(
"""
DROP TABLE IF EXISTS project_tag_pairs;
CREATE TABLE project_tag_pairs
(
tag1 STRING,
tag2 STRING,
num_projs INT
);
"""
)
conn.commit()
[20]:
from itertools import combinations
##
## Uso de itertools.combinations
##
x = [0, 1, 2, 3]
for w in list(combinations(x, 2)):
print(w)
(0, 1)
(0, 2)
(0, 3)
(1, 2)
(1, 3)
(2, 3)
[21]:
##
## Tags que aparecen unicamente en las
## combinaciones admisibles de dos tags
## diferentes
##
allDoubletonTags = set()
##
## Tuplas unicas formadas por (tag0, tag1)
##
doubletonSet = set()
def findDoubletons():
##
## INNER JOIN retorna lo registros que aparecen
## simultaneamente en las dos tablas (intersección)
##
## La siguiente consulta retorna cuantos proyectos usan
## tag1 y tag2 simultaneamente.
##
## Si:
##
## prj0, tag0
## prj0, tag1
## prj0, tag2
## prj1, tag0
## prj1, tag1
## prj1, tag3
## prj2, tag0
## prj2, tag3
## ...
##
## El inner join con tag0 y tag1 genera:
##
## prj0, tag0, prj0, tag1
## prj1, tag0, prj1, tag1
## ...
##
getDoubletonFrequencyQuery = """
SELECT
count(t1.project_id)
FROM
project_tags t1
INNER JOIN
project_tags t2
ON
t1.project_id = t2.project_id
WHERE
(
t1.tag_name = '{}'
AND t2.tag_name = '{}'
)
"""
insertPairQuery = """
INSERT INTO
project_tag_pairs (tag1, tag2, num_projs)
VALUES
('{}','{}',{})
"""
##
## Genera todas las combinaciones de dos tags usando
## los tags individuales que cumplen con una ocurrencia
## minima
##
doubletonCandidates = list(combinations(allSingletonTags, 2))
for (index, candidate) in enumerate(doubletonCandidates):
tag1 = candidate[0]
tag2 = candidate[1]
##
## Cuenta la cantidad de proyectos que usan tag1 y tag2 simultaneamente
##
count = cursor.execute(
getDoubletonFrequencyQuery.format(tag1, tag2)
).fetchone()[0]
if count > minsupport:
## Don't panic!: reporta que se esta ejecutando.
print(".", sep="", end="")
cursor.execute(insertPairQuery.format(tag1, tag2, count))
##
## Inserta la tupla (tag1, tag2) en la tabla
##
doubletonSet.add(candidate)
##
## Agrega los tags a la lista de tags usados
##
allDoubletonTags.add(tag1)
allDoubletonTags.add(tag2)
findDoubletons()
....................................
[23]:
x = cursor.execute(
"""
SELECT
*
FROM
project_tag_pairs
ORDER BY
1 ASC,
2 ASC;
""")
##
## Un 5% equivale aprox a 2335 proyectos
##
for tag1, tag2, num_projs in x.fetchall():
print("{:23s} {:23s} {:6d}".format(tag1, tag2, num_projs))
C GPL 5539
C Linux 5648
C POSIX 6952
C++ GPL 2911
C++ Linux 3425
C++ POSIX 3501
Communications GPL 2578
Dynamic Content Internet 3171
Dynamic Content Web 3170
English Linux 2660
GPL Internet 4035
GPL Linux 8036
GPL OS Independent 4403
GPL PHP 2372
GPL POSIX 10062
GPL Software Development 3318
GPL Web 2899
GPL Windows 2603
GPL multimedia 2879
Internet OS Independent 3005
Internet POSIX 2831
Internet Web 5973
Java OS Independent 3433
Java Software Development 2356
Libraries Software Development 5633
Linux Mac OS X 2973
Linux POSIX 11896
Linux Software Development 2335
Linux Unix 2493
Linux Windows 5279
Mac OS X Windows 3131
OS Independent Software Development 3564
OS Independent Web 2602
POSIX Software Development 3501
POSIX Windows 4464
POSIX multimedia 2538
Tripletons¶
[24]:
##
## La siguiente tabla contiene la cantidad de proyectos
## que tienen tag1, tag2 y tag3 simultáneamente
##
conn.executescript("""
DROP TABLE IF EXISTS project_tag_triples;
CREATE TABLE project_tag_triples
(
tag1 STRING,
tag2 STRING,
tag3 STRING,
num_projs INT
);
""")
conn.commit()
[26]:
def findTripletons():
##
## Sigue una lógica similar a la usada anteriormente
##
getTripletonFrequencyQuery = """
SELECT
count(t1.project_id)
FROM
project_tags t1
INNER JOIN
project_tags t2
ON
t1.project_id = t2.project_id
INNER JOIN
project_tags t3
ON
t2.project_id = t3.project_id
WHERE
(
t1.tag_name = '{}'
AND t2.tag_name = '{}'
AND t3.tag_name = '{}'
)
"""
insertTripletonQuery = """
INSERT INTO project_tag_triples(tag1, tag2, tag3, num_projs)
VALUES ('{}','{}','{}',{})
"""
##
## Crea tripletas ordenadas con los tags que aparecen en dos proyectos y
## cumplen con el soporte minimo
##
tripletonCandidates = [
sorted(tc) for tc in list(combinations(allDoubletonTags, 3))
]
for index, candidate in enumerate(tripletonCandidates):
##
## La tripleta contiene, al menos, una tupla que esta en la
## la lista de doubleTons
##
if any(
[
tuple_ in doubletonSet
for tuple_ in list(combinations(candidate, 2))
]
):
##
## Computa la frecuencia de la tripleta
##
count = cursor.execute(
getTripletonFrequencyQuery.format(candidate[0], candidate[1], candidate[2])
).fetchone()[0]
##
## Inserta las tripletas que cumplen con la frecuencia mínima
##
if count > minsupport:
print(".", sep="", end="")
cursor.execute(
insertTripletonQuery.format(
candidate[0], candidate[1], candidate[2], count
),
)
print('*')
findTripletons()
.........*
[27]:
x = cursor.execute(
"""
SELECT
*
FROM
project_tag_triples
ORDER BY
1 ASC,
2 ASC,
3 ASC;
""")
for tag1, tag2, tag3, num_projs in x.fetchall():
print("{:23s} {:23s} {:23s} {:6d}".format(tag1, tag2, tag3, num_projs))
C GPL Linux 3295
C GPL POSIX 4360
C Linux POSIX 4625
C++ Linux POSIX 2621
Dynamic Content Internet Web 3163
GPL Internet Web 2874
GPL Linux POSIX 7379
Internet OS Independent Web 2516
Linux POSIX Windows 3312
[28]:
def calcSCAV(tagA, tagB, tagC, ruleSupport, file):
##
## Support
##
ruleSupportPct = round((ruleSupport / baskets), 2)
##
## Confidence
##
queryConf = """
SELECT num_projs
FROM project_tag_pairs
WHERE
(
(tag1 = '{}' AND tag2 = '{}')
OR (tag2 = '{}' AND tag1 = '{}')
)
"""
pairSupport = cursor.execute(queryConf.format(tagA, tagB, tagA, tagB)).fetchone()[0]
confidence = round((ruleSupport / pairSupport), 2)
##
## Added Value
##
queryAV = """
SELECT count(*)
FROM project_tags
WHERE tag_name= '{}'
"""
supportTagC = cursor.execute(queryAV.format(tagC)).fetchone()[0]
supportTagCPct = supportTagC / baskets
addedValue = round((confidence - supportTagCPct), 2)
print(
"{}, {} -> {} [S={}, C={}, AV={}]".format(
tagA, tagB, tagC, ruleSupportPct, confidence, addedValue
),
file=file,
)
def generateRules():
##
## Consulta para obtiener las tripletas para obtener las reglas
##
getFinalListQuery = """
SELECT tag1, tag2, tag3, num_projs FROM project_tag_triples
"""
##
## Obtiene las tripletas
##
triples = cursor.execute(getFinalListQuery).fetchall()
with open("report.txt", "w") as file:
for triple in triples:
tag1 = triple[0]
tag2 = triple[1]
tag3 = triple[2]
ruleSupport = triple[3]
calcSCAV(tag1, tag2, tag3, ruleSupport, file)
calcSCAV(tag1, tag3, tag2, ruleSupport, file)
calcSCAV(tag2, tag3, tag1, ruleSupport, file)
print("*", file=file)
generateRules()
[29]:
!head -n 48 report.txt
Dynamic Content, Internet -> Web [S=0.07, C=1.0, AV=0.87]
Dynamic Content, Web -> Internet [S=0.07, C=1.0, AV=0.83]
Internet, Web -> Dynamic Content [S=0.07, C=0.53, AV=0.46]
*
Internet, OS Independent -> Web [S=0.05, C=0.84, AV=0.71]
Internet, Web -> OS Independent [S=0.05, C=0.42, AV=0.2]
OS Independent, Web -> Internet [S=0.05, C=0.97, AV=0.8]
*
GPL, Internet -> Web [S=0.06, C=0.71, AV=0.58]
GPL, Web -> Internet [S=0.06, C=0.99, AV=0.82]
Internet, Web -> GPL [S=0.06, C=0.48, AV=0.02]
*
C, Linux -> POSIX [S=0.1, C=0.82, AV=0.46]
C, POSIX -> Linux [S=0.1, C=0.67, AV=0.32]
Linux, POSIX -> C [S=0.1, C=0.39, AV=0.17]
*
C, GPL -> POSIX [S=0.09, C=0.79, AV=0.43]
C, POSIX -> GPL [S=0.09, C=0.63, AV=0.17]
GPL, POSIX -> C [S=0.09, C=0.43, AV=0.21]
*
C, GPL -> Linux [S=0.07, C=0.59, AV=0.24]
C, Linux -> GPL [S=0.07, C=0.58, AV=0.12]
GPL, Linux -> C [S=0.07, C=0.41, AV=0.19]
*
Linux, POSIX -> Windows [S=0.07, C=0.28, AV=0.12]
Linux, Windows -> POSIX [S=0.07, C=0.63, AV=0.27]
POSIX, Windows -> Linux [S=0.07, C=0.74, AV=0.39]
*
C++, Linux -> POSIX [S=0.06, C=0.77, AV=0.41]
C++, POSIX -> Linux [S=0.06, C=0.75, AV=0.4]
Linux, POSIX -> C++ [S=0.06, C=0.22, AV=0.09]
*
GPL, Linux -> POSIX [S=0.16, C=0.92, AV=0.56]
GPL, POSIX -> Linux [S=0.16, C=0.73, AV=0.38]
Linux, POSIX -> GPL [S=0.16, C=0.62, AV=0.16]
*
[ ]: