{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Expresiones regulares en Python\n",
    "\n",
    "* *90 min* | Última modificación: Noviembre 29, 2020"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparación de datos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Se crea el directorio de entrada\n",
    "!rm -rf input output\n",
    "!mkdir input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing input/text0.txt\n"
     ]
    }
   ],
   "source": [
    "%%writefile input/text0.txt\n",
    "Analytics is the discovery, interpretation, and communication of meaningful patterns\n",
    "in data. Especially valuable in areas rich with recorded information, analytics relies\n",
    "on the simultaneous application of statistics, computer programming and operations research\n",
    "to quantify performance.\n",
    "\n",
    "Organizations may apply analytics to business data to describe, predict, and improve business\n",
    "performance. Specifically, areas within analytics include predictive analytics, prescriptive\n",
    "analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big\n",
    "Data Analytics, retail analytics, store assortment and stock-keeping unit optimization,\n",
    "marketing optimization and marketing mix modeling, web analytics, call analytics, speech\n",
    "analytics, sales force sizing and optimization, price and promotion modeling, predictive\n",
    "science, credit risk analysis, and fraud analytics. Since analytics can require extensive\n",
    "computation (see big data), the algorithms and software used for analytics harness the most\n",
    "current methods in computer science, statistics, and mathematics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing input/text1.txt\n"
     ]
    }
   ],
   "source": [
    "%%writefile input/text1.txt\n",
    "The field of data analysis. Analytics often involves studying past historical data to\n",
    "research potential trends, to analyze the effects of certain decisions or events, or to\n",
    "evaluate the performance of a given tool or scenario. The goal of analytics is to improve\n",
    "the business by gaining knowledge which can be used to make improvements or changes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing input/text2.txt\n"
     ]
    }
   ],
   "source": [
    "%%writefile input/text2.txt\n",
    "Data analytics (DA) is the process of examining data sets in order to draw conclusions\n",
    "about the information they contain, increasingly with the aid of specialized systems\n",
    "and software. Data analytics technologies and techniques are widely used in commercial\n",
    "industries to enable organizations to make more-informed business decisions and by\n",
    "scientists and researchers to verify or disprove scientific models, theories and\n",
    "hypotheses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "\n",
    "raw_text = []\n",
    "for filename in glob.glob(\"input/*.txt\"):\n",
    "    with open(filename, \"r\") as f:\n",
    "        raw_text += f.readlines()\n",
    "        \n",
    "raw_text = ' '.join(raw_text)\n",
    "raw_text = raw_text.replace('\\n', '')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Analytics is the discovery, interpretation, and communication of meaningful patterns in data.',\n",
       " 'Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.',\n",
       " 'Organizations may apply analytics to business data to describe, predict, and improve business performance.',\n",
       " 'Specifically, areas within analytics include predictive analytics, prescriptive analytics, enterprise decision management, descriptive analytics, cognitive analytics, Big Data Analytics, retail analytics, store assortment and stock-keeping unit optimization, marketing optimization and marketing mix modeling, web analytics, call analytics, speech analytics, sales force sizing and optimization, price and promotion modeling, predictive science, credit risk analysis, and fraud analytics.',\n",
       " 'Since analytics can require extensive computation (see big data), the algorithms and software used for analytics harness the most current methods in computer science, statistics, and mathematics.',\n",
       " 'The field of data analysis.',\n",
       " 'Analytics often involves studying past historical data to research potential trends, to analyze the effects of certain decisions or events, or to evaluate the performance of a given tool or scenario.',\n",
       " 'The goal of analytics is to improve the business by gaining knowledge which can be used to make improvements or changes.',\n",
       " 'Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialized systems and software.',\n",
       " 'Data analytics technologies and techniques are widely used in commercial industries to enable organizations to make more-informed business decisions and by scientists and researchers to verify or disprove scientific models, theories and hypotheses.',\n",
       " '.']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "phrases = raw_text.split('.')\n",
    "phrases = [t.strip() + '.' for t in phrases]\n",
    "phrases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Analytics',\n",
       " 'is',\n",
       " 'the',\n",
       " 'discovery,',\n",
       " 'interpretation,',\n",
       " 'and',\n",
       " 'communication',\n",
       " 'of',\n",
       " 'meaningful',\n",
       " 'patterns',\n",
       " 'in',\n",
       " 'data.',\n",
       " 'Especially',\n",
       " 'valuable',\n",
       " 'in',\n",
       " 'areas',\n",
       " 'rich',\n",
       " 'with',\n",
       " 'recorded',\n",
       " 'information,']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words = raw_text.split()\n",
    "words[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ejemplos"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Analytics',\n",
       " 'Analytics,',\n",
       " 'Especially',\n",
       " 'Specifically,',\n",
       " 'analysis,',\n",
       " 'analysis.',\n",
       " 'analytics',\n",
       " 'analytics,',\n",
       " 'analytics.',\n",
       " 'analyze',\n",
       " 'apply',\n",
       " 'increasingly',\n",
       " 'widely'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Uso de `in` \n",
    "##\n",
    "set(word for word in words if 'ly' in word)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(3, 5), match='ly'>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Uso de re.search\n",
    "##    Obtiene el primer match con la cadena de busqueda\n",
    "## \n",
    "import re\n",
    "\n",
    "g = re.search('ly', raw_text)\n",
    "g"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ly'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "g.group(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 9), match='Analytics'>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search('analytics', raw_text, re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Analytics'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.search('analytics', raw_text, re.IGNORECASE).group(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Resumen de operadores usados en expresiones regulares**\n",
    "\n",
    "    .          Wildcard, matches any character\n",
    "    ^abc       Matches some pattern abc at the start of a string\n",
    "    abc$       Matches some pattern abc at the end of a string\n",
    "    [abc]      Matches one of a set of characters\n",
    "    [A-Z0-9]   Matches one of a range of characters\n",
    "    ed|ing|s   Matches one of the specified strings (disjunction)\n",
    "    *          Zero or more of previous item, e.g. a*, [a-z]* (also known as Kleene Closure)\n",
    "    +          One or more of previous item, e.g. a+, [a-z]+\n",
    "    ?          Zero or one of the previous item (i.e. optional), e.g. a?, [a-z]?\n",
    "    {n}        Exactly n repeats where n is a non-negative integer\n",
    "    {n,}       At least n repeats\n",
    "    {,n}       No more than n repeats\n",
    "    {m,n}      At least m and no more than n repeats\n",
    "    a(b|c)+    Parentheses that indicate the scope of the operators\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Principales caracteres usados en expresiones regulares**\n",
    "\n",
    "    \\b   Word boundary (zero width)\n",
    "    \\d   Any decimal digit (equivalent to [0-9])\n",
    "    \\D   Any non-digit character (equivalent to [^0-9])\n",
    "    \\s   Any whitespace character (equivalent to [ \\t\\n\\r\\f\\v])\n",
    "    \\S   Any non-whitespace character (equivalent to [^ \\t\\n\\r\\f\\v])\n",
    "    \\w   Any alphanumeric character (equivalent to [a-zA-Z0-9_])\n",
    "    \\W   Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])\n",
    "    \\t   The tab character\n",
    "    \\n   The newline character"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['examining',\n",
       " 'gaining',\n",
       " 'keeping',\n",
       " 'marketing',\n",
       " 'modeling',\n",
       " 'programming',\n",
       " 'sizing',\n",
       " 'studying']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Uso de findall\n",
    "##   Extrae las palabras terminadas en `ing`\n",
    "##   Note la r en la expresión regular\n",
    "##\n",
    "sorted(set(re.findall(r'\\b[a-zA-Z]+ing\\b', raw_text)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['examining',\n",
       " 'gaining',\n",
       " 'marketing',\n",
       " 'modeling',\n",
       " 'programming',\n",
       " 'sizing',\n",
       " 'stock-keeping',\n",
       " 'studying']"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular\n",
    "##\n",
    "sorted(set(re.findall(r'\\b\\S+ing\\b', raw_text)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['examining',\n",
       " 'gaining',\n",
       " 'marketing',\n",
       " 'modeling',\n",
       " 'programming',\n",
       " 'sizing',\n",
       " 'stock-keeping',\n",
       " 'studying']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "sorted(set(re.findall(r'\\b[a-zA-Z\\-]+ing\\b', raw_text)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two three']"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\btwo three\\b', 'one two three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\btwo three\\b', 'one two  three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two three']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\btwo\\Wthree\\b', 'one two three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\btwo\\Wthree\\b', 'one two  three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two  three']"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\btwo\\W+three\\b', 'one two  three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two  three']"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Explique el resultado de esta expresión regular:\n",
    "##\n",
    "re.findall(r'\\bthree\\W+two\\b|\\btwo\\W+three\\b', 'one two  three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two three four']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Palabras cercanas\n",
    "##   Dos palabras separadas por otra cualquiera\n",
    "##\n",
    "re.findall(r'\\btwo\\W+\\w+\\W+four\\b', 'one two three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['two three four five']"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## ?: indica que no se desea extraer la expresión reconocida como tal\n",
    "##\n",
    "## Palabras cercanas\n",
    "##   Dos palabras separadas por una, dos o tres palabras no especificadas\n",
    "##\n",
    "re.findall(r'\\btwo\\W+(?:\\w+\\W+){1,3}five', 'one two three four five six')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Isaac ']"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## (?=...)   Aserción hacia adelante\n",
    "##   Match Isaac si es seguido de Asimov\n",
    "##\n",
    "re.findall(r'\\b\\S+ (?=Asimov)', 'Carl Sagan Isaac Asimov James Bond')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Carl ', 'Sagan ', 'Asimov ', 'James ']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## (?!...)   Aserción negativa hacia adelante\n",
    "##   Match si la cadena no es seguida de Asimov\n",
    "##\n",
    "re.findall(r'\\b\\S+ (?!Asimov)', 'Carl Sagan Isaac Asimov James Bond')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['hand', 'hand']"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## (?<=...)   Precedencia\n",
    "##\n",
    "re.findall(r'(?<=-)\\w+', 'sun right-hand moon left-hand')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['sun', 'right', 'and', 'moon', 'left', 'and']"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## (?<!...)   Precedencia\n",
    "##\n",
    "re.findall(r'(?<!-)\\w+', 'sun right-hand moon left-hand')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('c',)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Manejo de grupos\n",
    "## \n",
    "m = re.match(\"([abc])+\", \"abc\")\n",
    "m.groups()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'abcd'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re_ = re.compile('(a(b)c)d')\n",
    "m = re_.match('abcd')\n",
    "m.group(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'abc'"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'b'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('b', 'abc', 'abcd')"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(2, 1, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Lots'"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Se asigna el nombre de <word> a la parte reconocida\n",
    "##\n",
    "p = re.compile(r'(?P<word>\\b\\w+\\b)')\n",
    "m = p.search( '(((( Lots of punctuation )))' )\n",
    "m.group('word')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Lots'"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'first': 'Jane', 'last': 'Doe'}"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Extracción de los grupos reconocidos mediante un diccionario\n",
    "##\n",
    "m = re.match(r'(?P<first>\\w+) (?P<last>\\w+)', 'Jane Doe')\n",
    "m.groupdict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(228, 239), match='programming'>"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Compilación de expresiones regulares\n",
    "##\n",
    "re_ = re.compile(r'\\b\\S+ing\\b')\n",
    "re_.search(raw_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'programming'"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m = re_.search(raw_text)\n",
    "m.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<_sre.SRE_Match object; span=(0, 9), match='Analytics'>"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Búsqueda de palabras que comiencen por `ana`\n",
    "##   match() determina si la expresión regular está al \n",
    "##   principio de la cadena\n",
    "##\n",
    "re_ = re.compile(r'\\b^Ana\\S+')\n",
    "re_.match(raw_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Analytics'"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re_.match(raw_text).group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['12', '11', '10']"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## findall\n",
    "##\n",
    "re_ = re.compile(r'\\d+')\n",
    "re_.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<callable_iterator at 0x7fd4237d80f0>"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## finditer\n",
    "##\n",
    "iterator = re_.finditer('12 drummers drumming, 11 10 ...')\n",
    "iterator "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(0, 2)\n",
      "(22, 24)\n",
      "(25, 27)\n"
     ]
    }
   ],
   "source": [
    "for match in iterator:\n",
    "    print(match.span())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Lots'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Se asigna el nombre de <word> a la parte reconocida\n",
    "##\n",
    "p = re.compile(r'(?P<word>\\b\\w+\\b)')\n",
    "m = p.search( '(((( Lots of punctuation )))' )\n",
    "m.group('word')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Lots'"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "m.group(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'first': 'Jane', 'last': 'Doe'}"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Extracción de los grupos reconocidos mediante un diccionario\n",
    "##\n",
    "m = re.match(r'(?P<first>\\w+) (?P<last>\\w+)', 'Jane Doe')\n",
    "m.groupdict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'comput'"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "##\n",
    "## Stemmer \n",
    "##\n",
    "def stemmer(word):\n",
    "    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:\n",
    "        if word.endswith(suffix):\n",
    "            return word[:-len(suffix)]\n",
    "    return word\n",
    "\n",
    "stemmer('computing')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['ing']"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'computing')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['computing']"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'computing')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('processe', 's')]"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('process', 'es')]"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "##\n",
    "## Tokenización usando re\n",
    "##\n",
    "raw = \"\"\"'When I'M a Duchess,' she said to herself, (not in a very hopeful tone\n",
    "though), 'I won't have any pepper in my kitchen AT ALL. Soup does very\n",
    "well without--Maybe it's always pepper that makes people hot-tempered,'...\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"'When\",\n",
       " \"I'M\",\n",
       " 'a',\n",
       " \"Duchess,'\",\n",
       " 'she',\n",
       " 'said',\n",
       " 'to',\n",
       " 'herself,',\n",
       " '(not',\n",
       " 'in',\n",
       " 'a',\n",
       " 'very',\n",
       " 'hopeful',\n",
       " 'tone\\nthough),',\n",
       " \"'I\",\n",
       " \"won't\",\n",
       " 'have',\n",
       " 'any',\n",
       " 'pepper',\n",
       " 'in',\n",
       " 'my',\n",
       " 'kitchen',\n",
       " 'AT',\n",
       " 'ALL.',\n",
       " 'Soup',\n",
       " 'does',\n",
       " 'very\\nwell',\n",
       " 'without--Maybe',\n",
       " \"it's\",\n",
       " 'always',\n",
       " 'pepper',\n",
       " 'that',\n",
       " 'makes',\n",
       " 'people',\n",
       " \"hot-tempered,'...\"]"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(r' ', raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['',\n",
       " 'When',\n",
       " 'I',\n",
       " 'M',\n",
       " 'a',\n",
       " 'Duchess',\n",
       " 'she',\n",
       " 'said',\n",
       " 'to',\n",
       " 'herself',\n",
       " 'not',\n",
       " 'in',\n",
       " 'a',\n",
       " 'very',\n",
       " 'hopeful',\n",
       " 'tone',\n",
       " 'though',\n",
       " 'I',\n",
       " 'won',\n",
       " 't',\n",
       " 'have',\n",
       " 'any',\n",
       " 'pepper',\n",
       " 'in',\n",
       " 'my',\n",
       " 'kitchen',\n",
       " 'AT',\n",
       " 'ALL',\n",
       " 'Soup',\n",
       " 'does',\n",
       " 'very',\n",
       " 'well',\n",
       " 'without',\n",
       " 'Maybe',\n",
       " 'it',\n",
       " 's',\n",
       " 'always',\n",
       " 'pepper',\n",
       " 'that',\n",
       " 'makes',\n",
       " 'people',\n",
       " 'hot',\n",
       " 'tempered',\n",
       " '']"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(r'\\W+', raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Baby names exercise from Google Education"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Descargue el archivo https://developers.google.com/edu/python/google-python-exercises.zip y descomprimalo.\n",
    "\n",
    "* Los archivos baby1990.html, baby1991.html contienen el código HTML de la página web que publica los nombres más populares para bebes nacidos en el correspondiente año.\n",
    "\n",
    "* Escriba una función que retorne una lista simple que contiene el año, y posteriormente el nombre y su posición. La lista solicitada debe presentar los nombres en orden alfabético y debe considerar simultaneamente los nombres de niños y niñas. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}