Inconsistent categories
Last updated: Mar 6, 2023 | YouTube
[1]:
import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)
1.23.5
1.5.2
[2]:
%%writefile /tmp/data.csv
personId,eventType
1,AA
2,A
3,AZ
4,AB
5,ZB
6,ZZ
7,BA
8,BB
Overwriting /tmp/data.csv
[3]:
valid_eventType = {"AA", "AB", "BA", "BB"}
df = pd.read_csv("/tmp/data.csv")
#
# Inconsistent categories: values that are not in the valid set
#
set(df.eventType).difference(valid_eventType)
[3]:
{'A', 'AZ', 'ZB', 'ZZ'}
[4]:
#
# Records with inconsistent categories
#
df[~df.eventType.isin(valid_eventType)]
[4]:
| | personId | eventType |
|---|---|---|
| 1 | 2 | A |
| 2 | 3 | AZ |
| 4 | 5 | ZB |
| 5 | 6 | ZZ |
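Before choosing a fix, it can help to know how many records are affected and how often each invalid value appears. The following is a minimal sketch of one way to do that, not part of the original notebook: it inlines the same rows as `/tmp/data.csv` so it runs on its own, and the names `df_check`, `as_category`, `n_invalid`, and `invalid_counts` are illustrative. It relies on the fact that `pd.Categorical` turns any value outside the declared categories into a missing value.

```python
import io

import pandas as pd

# Same rows as /tmp/data.csv, inlined so the sketch is self-contained.
csv_text = """personId,eventType
1,AA
2,A
3,AZ
4,AB
5,ZB
6,ZZ
7,BA
8,BB
"""
valid_eventType = ["AA", "AB", "BA", "BB"]

df_check = pd.read_csv(io.StringIO(csv_text))

# Values outside the declared categories become missing in the categorical.
as_category = pd.Categorical(df_check["eventType"], categories=valid_eventType)
is_invalid = pd.isna(as_category)
n_invalid = int(is_invalid.sum())

# Frequency of each inconsistent value, useful when deciding how to fix them.
invalid_counts = df_check.loc[is_invalid, "eventType"].value_counts().to_dict()

print(n_invalid)
print(invalid_counts)
```

The `isin` filter used above is simpler when only the offending rows are needed. The categorical version may be convenient when the column will be stored as a category afterwards.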
Possible solutions:
Delete the record.
Replace the inconsistent categories.
Infer the category from other fields.
[5]:
#
# Delete the records with inconsistent categories
#
df = df[df.eventType.isin(valid_eventType)]
df
[5]:
| | personId | eventType |
|---|---|---|
| 0 | 1 | AA |
| 3 | 4 | AB |
| 6 | 7 | BA |
| 7 | 8 | BB |
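The other two options, replacement and inference, could be sketched as follows. This is only an illustration under stated assumptions: the `replacements` mapping, the `channel` column, and the `default_by_channel` lookup are hypothetical and do not exist in the original data. In practice, such rules come from domain knowledge.

```python
import io

import pandas as pd

# Same rows as /tmp/data.csv, inlined so the sketch is self-contained.
csv_text = """personId,eventType
1,AA
2,A
3,AZ
4,AB
5,ZB
6,ZZ
7,BA
8,BB
"""
valid_eventType = {"AA", "AB", "BA", "BB"}

df_fix = pd.read_csv(io.StringIO(csv_text))

# Option 2: replacement. Hypothetical corrections for known bad values.
replacements = {"A": "AA", "AZ": "AA", "ZB": "BB"}
df_fix["eventType"] = df_fix["eventType"].replace(replacements)

# Anything still outside the valid set (here "ZZ") is marked as missing
# instead of being silently kept.
df_fix["eventType"] = df_fix["eventType"].where(
    df_fix["eventType"].isin(valid_eventType)
)
n_missing = int(df_fix["eventType"].isna().sum())

# Option 3: inference. Hypothetical extra field and lookup table:
# assume each record also has a channel with a default event type.
df_fix["channel"] = ["web", "web", "app", "web", "app", "app", "web", "app"]
default_by_channel = {"web": "AA", "app": "BB"}
df_fix["eventType"] = df_fix["eventType"].fillna(
    df_fix["channel"].map(default_by_channel)
)

print(n_missing)
print(df_fix)
```

Unlike deletion, both approaches keep all eight records, at the cost of depending on correction rules that have to be justified separately.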