{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Descubrimiento de reglas de asociación en tags de proyectos de software\n",
    "\n",
    "* *60 min* | Ultima modificación: Noviembre 26, 2020"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Este tutorial esta basado en *Mastering Data Mining with Python, Megan Squire, 2016. Packt Publishing*. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "El archivo `project_tags.csv` contiene los tags asociados a diferentes proyectos de software por los desarrolladdores. La primera columna corresponde al ID del proyecto; la segunda al tag asignado. Se desean construir reglas que permiten sugerir un tag a partir de dos tags previamente seleccionados por el usuario."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlite3\n",
    "\n",
    "conn = sqlite3.connect(\":memory:\")\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2020-11-29 04:47:25--  https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.48.133\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.48.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 5418355 (5.2M) [text/plain]\n",
      "Saving to: ‘project_tags.csv.1’\n",
      "\n",
      "project_tags.csv.1  100%[===================>]   5.17M  4.52MB/s    in 1.1s    \n",
      "\n",
      "2020-11-29 04:47:27 (4.52 MB/s) - ‘project_tags.csv.1’ saved [5418355/5418355]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://raw.githubusercontent.com/jdvelasq/datalabs/master/datasets/project_tags.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Carga de los datos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "conn.executescript(\"\"\"\n",
    "DROP TABLE IF EXISTS project_tags;\n",
    "\n",
    "CREATE TABLE project_tags \n",
    "(\n",
    "    project_id     INT NOT NULL DEFAULT '0',\n",
    "    tag_name    STRING NOT NULL DEFAULT '0',\n",
    "    PRIMARY KEY (project_id, tag_name)\n",
    ");\n",
    "\"\"\")\n",
    "\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('36762', 'Database Engines/Servers'),\n",
       " ('14882', 'Systems Administration'),\n",
       " ('53184', 'C'),\n",
       " ('41895', 'multimedia'),\n",
       " ('53266', 'Desktop Environment')]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('project_tags.csv', 'rt') as f:\n",
    "    data = f.readlines()\n",
    "\n",
    "## Elimina el '\\n' al final de la línea\n",
    "data = [line.replace('\\n', '') for line in data]\n",
    "\n",
    "## Separa los campos por comas\n",
    "data = [line.split(',') for line in data]\n",
    "\n",
    "## Convierte la fila en una tupla\n",
    "data = [tuple(line) for line in data]\n",
    "\n",
    "## Elimina valores duplicados\n",
    "data = list(set([tuple(line) for line in data]))\n",
    "\n",
    "\n",
    "## Imprime los primeros 5 registros para verificar\n",
    "data[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(36762, 'Database Engines/Servers'),\n",
       " (14882, 'Systems Administration'),\n",
       " (53184, 'C'),\n",
       " (41895, 'multimedia'),\n",
       " (53266, 'Desktop Environment')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Carga a partir de la lista de tuplas\n",
    "## contenidas en data\n",
    "##\n",
    "cursor.executemany('INSERT INTO project_tags VALUES (?,?)', data)\n",
    "\n",
    "##\n",
    "## Verificación\n",
    "##\n",
    "cursor.execute(\"SELECT * FROM project_tags LIMIT 5;\").fetchall()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Información básica"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(353401,)]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Cantidad de registros\n",
    "## \n",
    "cursor.execute(\"SELECT COUNT(*) FROM project_tags;\").fetchone()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(46511,)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Cantidad de proyectos\n",
    "##\n",
    "cursor.execute(\"SELECT COUNT(DISTINCT project_id) FROM project_tags;\").fetchone()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "46511"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Cantidad de proyectos\n",
    "## Se toma como baskets la cantidad de proyectos en la tabla\n",
    "##\n",
    "baskets = cursor.execute(\"SELECT COUNT(DISTINCT project_id) FROM project_tags;\").fetchone()[0]\n",
    "baskets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Número de proyectos por tag y soporte"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPL                       21176    45.53%\n",
      "POSIX                     16868    36.27%\n",
      "Linux                     16284    35.01%\n",
      "C                         10288    22.12%\n",
      "OS Independent            10178    21.88%\n",
      "Software Development       9614    20.67%\n",
      "Internet                   8097    17.41%\n",
      "Windows                    7572    16.28%\n",
      "Java                       6390    13.74%\n",
      "Web                        6264    13.47%\n",
      "English                    5997    12.89%\n",
      "C++                        5891    12.67%\n",
      "Libraries                  5738    12.34%\n",
      "PHP                        5448    11.71%\n",
      "Unix                       5098    10.96%\n",
      "Mac OS X                   4823    10.37%\n",
      "multimedia                 4813    10.35%\n",
      "Communications             4449     9.57%\n",
      "Perl                       4242     9.12%\n",
      "Python                     4190     9.01%\n",
      "LGPL                       3524     7.58%\n",
      "Utilities                  3297     7.09%\n",
      "Dynamic Content            3199     6.88%\n",
      "GPLv3                      2875     6.18%\n",
      "Networking                 2819     6.06%\n",
      "Scientific/Engineering     2678     5.76%\n",
      "Games/Entertainment        2528     5.44%\n",
      "BSD                        2494     5.36%\n",
      "Desktop Environment        2335     5.02%\n",
      "Graphics                   2268     4.88%\n",
      "Database                   2200     4.73%\n",
      "GPLv2                      2147     4.62%\n",
      "Text Processing            2131     4.58%\n",
      "Sound/Audio                2094     4.50%\n",
      "Security                   1960     4.21%\n"
     ]
    }
   ],
   "source": [
    "##\n",
    "## Número de proyectos por tag\n",
    "##\n",
    "x = cursor.execute(\n",
    "\"\"\"\n",
    "    SELECT \n",
    "        tag_name, \n",
    "        COUNT(project_id), \n",
    "        ROUND(\n",
    "            COUNT(project_id) * 100.0 / (SELECT COUNT(DISTINCT project_id) FROM project_tags), \n",
    "            2)\n",
    "    FROM \n",
    "        project_tags\n",
    "    GROUP BY \n",
    "        1\n",
    "    ORDER BY \n",
    "        2 DESC\n",
    "    LIMIT \n",
    "        35;\n",
    "\"\"\")\n",
    "\n",
    "##\n",
    "## Un 5% equivale aprox a 2335 proyectos\n",
    "##\n",
    "for tag, value, pct in x.fetchall():\n",
    "    print(\"{:23s}  {:6d}   {:6.2f}%\".format(tag, value, pct))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Minimum support count: 2325.55 (5% of bastkets)\n"
     ]
    }
   ],
   "source": [
    "#\n",
    "# Soporte mínimo\n",
    "#\n",
    "MIN_SUPPORT_PCT = 5\n",
    "\n",
    "#\n",
    "# Descarta el porcentaje especificado (MIN_SUPPORT_PCT) de tags menos frecuentes.\n",
    "# Se require que el tag aparezca en 554 proyectos o mas (de 46511 proyectos existentes)\n",
    "#\n",
    "minsupport = baskets * (MIN_SUPPORT_PCT / 100)\n",
    "print(\n",
    "    \"Minimum support count: {} ({}% of bastkets)\".format(minsupport, MIN_SUPPORT_PCT),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Singletons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Apache 2.0', 'Application Frameworks', 'Archiving', 'Artistic', 'BSD']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Descarta los tags menos frecuentes. Singletons es una \n",
    "## lista de tuplas de la siguiente forma:\n",
    "##\n",
    "##    [('Apache 2.0',),\n",
    "##     ('Application Frameworks',),\n",
    "##     ('Archiving',),\n",
    "##     ...\n",
    "##    ]\n",
    "##\n",
    "singletons = cursor.execute(\n",
    "    \"\"\"\n",
    "    SELECT \n",
    "        DISTINCT tag_name\n",
    "    FROM \n",
    "        project_tags\n",
    "    GROUP BY \n",
    "        1 \n",
    "    HAVING \n",
    "        COUNT(project_id) >= {} \n",
    "    ORDER BY \n",
    "        tag_name\n",
    "    \"\"\".format(\n",
    "        minsupport\n",
    "    )\n",
    ").fetchall()\n",
    "\n",
    "##\n",
    "## Esta variable contiene todos los tags que aparecen\n",
    "## en, al menos, el 5% de los proyectos\n",
    "##\n",
    "allSingletonTags = [x[0] for x in singletons]\n",
    "    \n",
    "allSingletonTags[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Doubletons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "##\n",
    "## La siguiente tabla contiene la cantidad de proyectos\n",
    "## que tienen tag1 y tag2 simultáneamente\n",
    "##\n",
    "conn.executescript(\n",
    "    \"\"\"\n",
    "    DROP TABLE IF EXISTS project_tag_pairs;\n",
    "\n",
    "    CREATE TABLE project_tag_pairs \n",
    "    (\n",
    "        tag1      STRING,\n",
    "        tag2      STRING,\n",
    "        num_projs INT\n",
    "    );\n",
    "    \"\"\"\n",
    ")\n",
    "\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0, 1)\n",
      "(0, 2)\n",
      "(0, 3)\n",
      "(1, 2)\n",
      "(1, 3)\n",
      "(2, 3)\n"
     ]
    }
   ],
   "source": [
    "from itertools import combinations\n",
    "\n",
    "##\n",
    "## Uso de itertools.combinations\n",
    "##\n",
    "x = [0, 1, 2, 3]\n",
    "for w in list(combinations(x, 2)):\n",
    "    print(w)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "...................................."
     ]
    }
   ],
   "source": [
    "##\n",
    "## Tags que aparecen unicamente en las\n",
    "## combinaciones admisibles de dos tags\n",
    "## diferentes\n",
    "##\n",
    "allDoubletonTags = set()\n",
    "\n",
    "##\n",
    "## Tuplas unicas formadas por (tag0, tag1)\n",
    "##\n",
    "doubletonSet = set()\n",
    "\n",
    "\n",
    "def findDoubletons():\n",
    "\n",
    "    ##\n",
    "    ## INNER JOIN retorna lo registros que aparecen\n",
    "    ## simultaneamente en las dos tablas (intersección)\n",
    "    ##\n",
    "    ## La siguiente consulta retorna cuantos proyectos usan\n",
    "    ## tag1 y tag2 simultaneamente.\n",
    "    ##\n",
    "    ## Si:\n",
    "    ##\n",
    "    ##    prj0, tag0\n",
    "    ##    prj0, tag1\n",
    "    ##    prj0, tag2\n",
    "    ##    prj1, tag0\n",
    "    ##    prj1, tag1\n",
    "    ##    prj1, tag3\n",
    "    ##    prj2, tag0\n",
    "    ##    prj2, tag3\n",
    "    ##    ...\n",
    "    ##\n",
    "    ## El inner join con tag0 y tag1 genera:\n",
    "    ##\n",
    "    ##    prj0, tag0, prj0, tag1\n",
    "    ##    prj1, tag0, prj1, tag1\n",
    "    ##    ...\n",
    "    ##\n",
    "    getDoubletonFrequencyQuery = \"\"\"\n",
    "        SELECT \n",
    "            count(t1.project_id) \n",
    "        FROM \n",
    "            project_tags t1\n",
    "        INNER JOIN \n",
    "            project_tags t2\n",
    "        ON \n",
    "            t1.project_id = t2.project_id\n",
    "        WHERE \n",
    "        (\n",
    "            t1.tag_name = '{}'\n",
    "            AND t2.tag_name = '{}'\n",
    "        )\n",
    "    \"\"\"\n",
    "\n",
    "    insertPairQuery = \"\"\"\n",
    "        INSERT INTO \n",
    "            project_tag_pairs (tag1, tag2, num_projs)\n",
    "        VALUES \n",
    "            ('{}','{}',{})\n",
    "    \"\"\"\n",
    "\n",
    "    ##\n",
    "    ## Genera todas las combinaciones de dos tags usando\n",
    "    ## los tags individuales que cumplen con una ocurrencia\n",
    "    ## minima\n",
    "    ##\n",
    "    doubletonCandidates = list(combinations(allSingletonTags, 2))\n",
    "\n",
    "    for (index, candidate) in enumerate(doubletonCandidates):\n",
    "\n",
    "        tag1 = candidate[0]\n",
    "        tag2 = candidate[1]\n",
    "\n",
    "        ##\n",
    "        ## Cuenta la cantidad de proyectos que usan tag1 y tag2 simultaneamente\n",
    "        ##\n",
    "        count = cursor.execute(\n",
    "            getDoubletonFrequencyQuery.format(tag1, tag2)\n",
    "        ).fetchone()[0]\n",
    "\n",
    "        \n",
    "        if count > minsupport:\n",
    "            \n",
    "            ## Don't panic!: reporta que se esta ejecutando.\n",
    "            print(\".\", sep=\"\", end=\"\")\n",
    "\n",
    "            \n",
    "            cursor.execute(insertPairQuery.format(tag1, tag2, count))\n",
    "\n",
    "            ##\n",
    "            ## Inserta la tupla (tag1, tag2) en la tabla \n",
    "            ##\n",
    "            doubletonSet.add(candidate)\n",
    "\n",
    "            ##\n",
    "            ## Agrega los tags a la lista de tags usados \n",
    "            ## \n",
    "            allDoubletonTags.add(tag1)\n",
    "            allDoubletonTags.add(tag2)\n",
    "\n",
    "\n",
    "findDoubletons()"
   ]
  },
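  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`findDoubletons` issues one query per candidate pair. As a possible alternative, the sketch below counts every pair in a single self-join with `GROUP BY`. It is only an illustration on toy data: the `demo` connection, the `rows` list and the threshold of 2 are assumptions made for this example, not part of the tutorial's dataset.\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "# Hypothetical toy data: four projects and their tags\n",
    "rows = [\n",
    "    (1, 'GPL'), (1, 'Linux'), (1, 'POSIX'),\n",
    "    (2, 'GPL'), (2, 'Linux'),\n",
    "    (3, 'GPL'), (3, 'POSIX'),\n",
    "    (4, 'Linux'),\n",
    "]\n",
    "\n",
    "demo = sqlite3.connect(':memory:')\n",
    "demo.execute('CREATE TABLE project_tags (project_id INT, tag_name TEXT)')\n",
    "demo.executemany('INSERT INTO project_tags VALUES (?, ?)', rows)\n",
    "\n",
    "# t1.tag_name < t2.tag_name keeps each unordered pair exactly once;\n",
    "# HAVING applies the minimum support count (2 in this toy example)\n",
    "pair_counts = demo.execute(\n",
    "    'SELECT t1.tag_name, t2.tag_name, COUNT(*) '\n",
    "    'FROM project_tags t1 '\n",
    "    'INNER JOIN project_tags t2 '\n",
    "    'ON t1.project_id = t2.project_id AND t1.tag_name < t2.tag_name '\n",
    "    'GROUP BY 1, 2 '\n",
    "    'HAVING COUNT(*) >= ? '\n",
    "    'ORDER BY 1, 2',\n",
    "    (2,),\n",
    ").fetchall()\n",
    "\n",
    "print(pair_counts)\n",
    "```\n",
    "\n",
    "On the toy data only the pairs (GPL, Linux) and (GPL, POSIX) reach the threshold; (Linux, POSIX) occurs in a single project and is filtered out."
   ]
  },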
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C                        GPL                         5539\n",
      "C                        Linux                       5648\n",
      "C                        POSIX                       6952\n",
      "C++                      GPL                         2911\n",
      "C++                      Linux                       3425\n",
      "C++                      POSIX                       3501\n",
      "Communications           GPL                         2578\n",
      "Dynamic Content          Internet                    3171\n",
      "Dynamic Content          Web                         3170\n",
      "English                  Linux                       2660\n",
      "GPL                      Internet                    4035\n",
      "GPL                      Linux                       8036\n",
      "GPL                      OS Independent              4403\n",
      "GPL                      PHP                         2372\n",
      "GPL                      POSIX                      10062\n",
      "GPL                      Software Development        3318\n",
      "GPL                      Web                         2899\n",
      "GPL                      Windows                     2603\n",
      "GPL                      multimedia                  2879\n",
      "Internet                 OS Independent              3005\n",
      "Internet                 POSIX                       2831\n",
      "Internet                 Web                         5973\n",
      "Java                     OS Independent              3433\n",
      "Java                     Software Development        2356\n",
      "Libraries                Software Development        5633\n",
      "Linux                    Mac OS X                    2973\n",
      "Linux                    POSIX                      11896\n",
      "Linux                    Software Development        2335\n",
      "Linux                    Unix                        2493\n",
      "Linux                    Windows                     5279\n",
      "Mac OS X                 Windows                     3131\n",
      "OS Independent           Software Development        3564\n",
      "OS Independent           Web                         2602\n",
      "POSIX                    Software Development        3501\n",
      "POSIX                    Windows                     4464\n",
      "POSIX                    multimedia                  2538\n"
     ]
    }
   ],
   "source": [
    "x = cursor.execute(\n",
    "\"\"\"\n",
    "    SELECT \n",
    "        *\n",
    "    FROM \n",
    "        project_tag_pairs\n",
    "    ORDER BY \n",
    "        1 ASC,\n",
    "        2 ASC;\n",
    "\"\"\")\n",
    "\n",
    "##\n",
    "## Un 5% equivale aprox a 2335 proyectos\n",
    "##\n",
    "for tag1, tag2, num_projs in x.fetchall():\n",
    "    print(\"{:23s}  {:23s}   {:6d}\".format(tag1, tag2, num_projs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tripletons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "##\n",
    "## La siguiente tabla contiene la cantidad de proyectos\n",
    "## que tienen tag1, tag2 y tag3 simultáneamente\n",
    "##\n",
    "conn.executescript(\"\"\"\n",
    "DROP TABLE IF EXISTS project_tag_triples;\n",
    "\n",
    "CREATE TABLE project_tag_triples \n",
    "(\n",
    "    tag1      STRING,\n",
    "    tag2      STRING,\n",
    "    tag3      STRING,\n",
    "    num_projs INT\n",
    ");\n",
    "\"\"\")\n",
    "\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".........*\n"
     ]
    }
   ],
   "source": [
    "def findTripletons():\n",
    "\n",
    "    ##\n",
    "    ## Sigue una lógica similar a la usada anteriormente\n",
    "    ##\n",
    "    getTripletonFrequencyQuery = \"\"\"\n",
    "        SELECT \n",
    "            count(t1.project_id)\n",
    "        FROM \n",
    "            project_tags t1\n",
    "        INNER JOIN \n",
    "                project_tags t2\n",
    "            ON \n",
    "                t1.project_id = t2.project_id\n",
    "        INNER JOIN \n",
    "                project_tags t3\n",
    "            ON \n",
    "                t2.project_id = t3.project_id\n",
    "        WHERE\n",
    "        (\n",
    "            t1.tag_name = '{}'\n",
    "            AND t2.tag_name = '{}'\n",
    "            AND t3.tag_name = '{}'\n",
    "        )\n",
    "    \"\"\"\n",
    "\n",
    "    insertTripletonQuery = \"\"\"\n",
    "        INSERT INTO project_tag_triples(tag1, tag2, tag3, num_projs)\n",
    "        VALUES ('{}','{}','{}',{})\n",
    "    \"\"\"\n",
    "\n",
    "    ##\n",
    "    ##  Crea tripletas ordenadas con los tags que aparecen en dos proyectos y\n",
    "    ##  cumplen con el soporte minimo\n",
    "    ##\n",
    "    tripletonCandidates = [\n",
    "        sorted(tc) for tc in list(combinations(allDoubletonTags, 3))\n",
    "    ]\n",
    "    \n",
    "    for index, candidate in enumerate(tripletonCandidates):\n",
    "\n",
    "        ##\n",
    "        ## La tripleta contiene, al menos, una tupla que esta en la\n",
    "        ## la lista de doubleTons\n",
    "        ##\n",
    "        if any(\n",
    "            [\n",
    "                tuple_ in doubletonSet\n",
    "                for tuple_ in list(combinations(candidate, 2))\n",
    "            ]\n",
    "        ):\n",
    "\n",
    "            ##\n",
    "            ## Computa la frecuencia de la tripleta\n",
    "            ##\n",
    "            count = cursor.execute(\n",
    "                getTripletonFrequencyQuery.format(candidate[0], candidate[1], candidate[2])\n",
    "            ).fetchone()[0]\n",
    "\n",
    "            ##\n",
    "            ## Inserta las tripletas que cumplen con la frecuencia mínima\n",
    "            ## \n",
    "            if count > minsupport:\n",
    "\n",
    "                print(\".\", sep=\"\", end=\"\")\n",
    "\n",
    "                cursor.execute(\n",
    "                    insertTripletonQuery.format(\n",
    "                        candidate[0], candidate[1], candidate[2], count\n",
    "                    ),\n",
    "                )\n",
    "    print('*')\n",
    "\n",
    "\n",
    "findTripletons()"
   ]
  },
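  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`findTripletons` queries every candidate triple that contains at least one frequent pair. A stricter filter, the usual Apriori pruning step, keeps a candidate only when all three of its pairs are frequent; since a triple cannot occur in more projects than any of its pairs, this should skip queries without losing frequent triples. The sketch below shows that filter on its own. The `frequent_pairs` set is a hypothetical stand-in for `doubletonSet`.\n",
    "\n",
    "```python\n",
    "from itertools import combinations\n",
    "\n",
    "# Hypothetical frequent pairs, each stored in sorted order\n",
    "frequent_pairs = {\n",
    "    ('C', 'GPL'),\n",
    "    ('C', 'Linux'),\n",
    "    ('GPL', 'Linux'),\n",
    "    ('GPL', 'POSIX'),\n",
    "}\n",
    "\n",
    "# Every tag that takes part in at least one frequent pair\n",
    "tags = sorted({tag for pair in frequent_pairs for tag in pair})\n",
    "\n",
    "# Keep a triple only if each of its three pairs is frequent\n",
    "candidates = [\n",
    "    triple\n",
    "    for triple in combinations(tags, 3)\n",
    "    if all(pair in frequent_pairs for pair in combinations(triple, 2))\n",
    "]\n",
    "\n",
    "print(candidates)\n",
    "```\n",
    "\n",
    "Here only (C, GPL, Linux) survives: every other triple contains a pair such as (C, POSIX) or (Linux, POSIX) that is not in the frequent set."
   ]
  },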
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "C                        GPL                      Linux                      3295\n",
      "C                        GPL                      POSIX                      4360\n",
      "C                        Linux                    POSIX                      4625\n",
      "C++                      Linux                    POSIX                      2621\n",
      "Dynamic Content          Internet                 Web                        3163\n",
      "GPL                      Internet                 Web                        2874\n",
      "GPL                      Linux                    POSIX                      7379\n",
      "Internet                 OS Independent           Web                        2516\n",
      "Linux                    POSIX                    Windows                    3312\n"
     ]
    }
   ],
   "source": [
    "x = cursor.execute(\n",
    "\"\"\"\n",
    "    SELECT \n",
    "        *\n",
    "    FROM \n",
    "        project_tag_triples\n",
    "    ORDER BY \n",
    "        1 ASC,\n",
    "        2 ASC,\n",
    "        3 ASC;\n",
    "\"\"\")\n",
    "\n",
    "\n",
    "for tag1, tag2, tag3, num_projs in x.fetchall():\n",
    "    print(\"{:23s}  {:23s}  {:23s}  {:6d}\".format(tag1, tag2, tag3, num_projs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calcSCAV(tagA, tagB, tagC, ruleSupport, file):\n",
    "    ##\n",
    "    ## Support\n",
    "    ##\n",
    "    ruleSupportPct = round((ruleSupport / baskets), 2)\n",
    "\n",
    "    ##\n",
    "    ## Confidence\n",
    "    ##\n",
    "    queryConf = \"\"\"\n",
    "        SELECT num_projs\n",
    "        FROM project_tag_pairs \n",
    "        WHERE \n",
    "        (\n",
    "            (tag1 = '{}' AND tag2 = '{}')  \n",
    "            OR  (tag2 = '{}' AND tag1 = '{}')\n",
    "        )\n",
    "    \"\"\"\n",
    "\n",
    "    pairSupport = cursor.execute(queryConf.format(tagA, tagB, tagA, tagB)).fetchone()[0]\n",
    "\n",
    "    confidence = round((ruleSupport / pairSupport), 2)\n",
    "\n",
    "    ## \n",
    "    ## Added Value\n",
    "    ##\n",
    "    queryAV = \"\"\"\n",
    "        SELECT count(*) \n",
    "        FROM project_tags \n",
    "        WHERE tag_name= '{}'\n",
    "    \"\"\"\n",
    "    \n",
    "    supportTagC = cursor.execute(queryAV.format(tagC)).fetchone()[0]\n",
    "    \n",
    "    supportTagCPct = supportTagC / baskets\n",
    "    \n",
    "    addedValue = round((confidence - supportTagCPct), 2)\n",
    "\n",
    "    print(\n",
    "        \"{}, {} -> {}  [S={}, C={}, AV={}]\".format(\n",
    "            tagA, tagB, tagC, ruleSupportPct, confidence, addedValue\n",
    "        ),\n",
    "        file=file,\n",
    "    )\n",
    "\n",
    "\n",
    "def generateRules():\n",
    "\n",
    "    ##\n",
    "    ## Consulta para obtiener las tripletas para obtener las reglas\n",
    "    ##\n",
    "    getFinalListQuery = \"\"\"\n",
    "        SELECT tag1, tag2, tag3, num_projs FROM project_tag_triples\n",
    "    \"\"\"\n",
    "\n",
    "    ##\n",
    "    ## Obtiene las tripletas\n",
    "    ##\n",
    "    triples = cursor.execute(getFinalListQuery).fetchall()\n",
    "\n",
    "    with open(\"report.txt\", \"w\") as file:\n",
    "\n",
    "        for triple in triples:\n",
    "\n",
    "            tag1 = triple[0]\n",
    "            tag2 = triple[1]\n",
    "            tag3 = triple[2]\n",
    "            ruleSupport = triple[3]\n",
    "\n",
    "            calcSCAV(tag1, tag2, tag3, ruleSupport, file)\n",
    "            calcSCAV(tag1, tag3, tag2, ruleSupport, file)\n",
    "            calcSCAV(tag2, tag3, tag1, ruleSupport, file)\n",
    "            print(\"*\", file=file)\n",
    "\n",
    "\n",
    "generateRules()"
   ]
  },
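  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a rule `A, B -> C`, `calcSCAV` computes support as the triple count divided by the number of baskets, confidence as the triple count divided by the count of the pair `(A, B)`, and added value as confidence minus the share of baskets that contain `C`. As a worked check, the snippet below repeats that arithmetic for `Dynamic Content, Internet -> Web`, using counts copied from the tables printed in this notebook. The variable names are illustrative only.\n",
    "\n",
    "```python\n",
    "# Counts copied from the tables printed in this notebook\n",
    "baskets = 46511       # distinct projects\n",
    "rule_support = 3163   # projects tagged Dynamic Content, Internet and Web\n",
    "pair_support = 3171   # projects tagged Dynamic Content and Internet\n",
    "tag_c_count = 6264    # projects tagged Web\n",
    "\n",
    "# Same rounding order as calcSCAV: confidence is rounded\n",
    "# before it is used to compute the added value\n",
    "support = round(rule_support / baskets, 2)\n",
    "confidence = round(rule_support / pair_support, 2)\n",
    "added_value = round(confidence - tag_c_count / baskets, 2)\n",
    "\n",
    "print(support, confidence, added_value)\n",
    "```\n",
    "\n",
    "The three values correspond to the `S`, `C` and `AV` fields of that rule in `report.txt`."
   ]
  },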
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dynamic Content, Internet -> Web  [S=0.07, C=1.0, AV=0.87]\n",
      "Dynamic Content, Web -> Internet  [S=0.07, C=1.0, AV=0.83]\n",
      "Internet, Web -> Dynamic Content  [S=0.07, C=0.53, AV=0.46]\n",
      "*\n",
      "Internet, OS Independent -> Web  [S=0.05, C=0.84, AV=0.71]\n",
      "Internet, Web -> OS Independent  [S=0.05, C=0.42, AV=0.2]\n",
      "OS Independent, Web -> Internet  [S=0.05, C=0.97, AV=0.8]\n",
      "*\n",
      "GPL, Internet -> Web  [S=0.06, C=0.71, AV=0.58]\n",
      "GPL, Web -> Internet  [S=0.06, C=0.99, AV=0.82]\n",
      "Internet, Web -> GPL  [S=0.06, C=0.48, AV=0.02]\n",
      "*\n",
      "C, Linux -> POSIX  [S=0.1, C=0.82, AV=0.46]\n",
      "C, POSIX -> Linux  [S=0.1, C=0.67, AV=0.32]\n",
      "Linux, POSIX -> C  [S=0.1, C=0.39, AV=0.17]\n",
      "*\n",
      "C, GPL -> POSIX  [S=0.09, C=0.79, AV=0.43]\n",
      "C, POSIX -> GPL  [S=0.09, C=0.63, AV=0.17]\n",
      "GPL, POSIX -> C  [S=0.09, C=0.43, AV=0.21]\n",
      "*\n",
      "C, GPL -> Linux  [S=0.07, C=0.59, AV=0.24]\n",
      "C, Linux -> GPL  [S=0.07, C=0.58, AV=0.12]\n",
      "GPL, Linux -> C  [S=0.07, C=0.41, AV=0.19]\n",
      "*\n",
      "Linux, POSIX -> Windows  [S=0.07, C=0.28, AV=0.12]\n",
      "Linux, Windows -> POSIX  [S=0.07, C=0.63, AV=0.27]\n",
      "POSIX, Windows -> Linux  [S=0.07, C=0.74, AV=0.39]\n",
      "*\n",
      "C++, Linux -> POSIX  [S=0.06, C=0.77, AV=0.41]\n",
      "C++, POSIX -> Linux  [S=0.06, C=0.75, AV=0.4]\n",
      "Linux, POSIX -> C++  [S=0.06, C=0.22, AV=0.09]\n",
      "*\n",
      "GPL, Linux -> POSIX  [S=0.16, C=0.92, AV=0.56]\n",
      "GPL, POSIX -> Linux  [S=0.16, C=0.73, AV=0.38]\n",
      "Linux, POSIX -> GPL  [S=0.16, C=0.62, AV=0.16]\n",
      "*\n"
     ]
    }
   ],
   "source": [
    "!head -n 48 report.txt"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}