Regresión Lineal

Ecuación de Regresión: \begin{equation} Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \end{equation}

Ecuación de la Pendiente: \begin{equation} \hat{\beta}_1 = \frac{(X_i - \bar{X})} {(Y_i - \bar{Y})} \end{equation}

Este ejercicio se a adaptado de “Linear Regression in Julia” por Silaparasetty, V.

Descargar una muestra de los precios de acciones New York Stock Exchange

El conjunto completo de datos de precios

Otro Ejemplo Recomendado de Regresión Lineal

[3]:
# Import Packages
using Pkg  # Package to install new packages

# Install packages
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("CSVFiles")
Pkg.add("Plots")
Pkg.add("Lathe")
Pkg.add("GLM")
Pkg.add("StatsPlots")
Pkg.add("MLBase")
Pkg.add("Missings")
Pkg.add("Statistics")
Pkg.add("Plots")
Pkg.build("CodecZlib")
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
  Resolving package versions...
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Project.toml`
No Changes to `C:\Users\Fernan\.julia\environments\v1.5\Manifest.toml`
   Building CodecZlib → `C:\Users\Fernan\.julia\packages\CodecZlib\5t9zO\deps\build.log`
[1]:
# Cargar los paquetes instalados
using DataFrames
using CSV
using CSVFiles
using Plots
using Lathe
using GLM
using Statistics
using StatsPlots
using MLBase
using Missings
[2]:
# Carga el archivo CSV en un DataFrame
# para más detalles consultar -> https://juliapackages.com/p/csvfiles

using CSVFiles, DataFrames

df = DataFrame(load("./Downloads/nystocks.csv"))

println(first(df,5))
5×7 DataFrame
│ Row │ date       │ symbol │ open    │ close   │ low     │ high    │ volume    │
│     │ StringStringFloat64Float64Float64Float64Int64     │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│ 1   │ 04-01-2010 │ A      │ 31.39   │ 31.3    │ 31.13   │ 31.63   │ 3815500   │
│ 2   │ 04-01-2010 │ AAP    │ 40.7    │ 40.38   │ 40.36   │ 41.04   │ 1701700   │
│ 3   │ 04-01-2010 │ AAPL   │ 213.43  │ 214.01  │ 212.38  │ 214.5   │ 123432400 │
│ 4   │ 04-01-2010 │ ABC    │ 26.29   │ 26.63   │ 26.14   │ 26.69   │ 2455900   │
│ 5   │ 04-01-2010 │ ABT    │ 54.19   │ 54.46   │ 53.92   │ 54.56   │ 10829000  │

Exploración de los Datos

[3]:
# Variables Disponibles
names(df)
[3]:
7-element Array{String,1}:
 "date"
 "symbol"
 "open"
 "close"
 "low"
 "high"
 "volume"
[4]:
# Presentar las primeras 5 filas
first(df,5)
[4]:

5 rows × 7 columns

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
104-01-2010A31.3931.331.1331.633815500
204-01-2010AAP40.740.3840.3641.041701700
304-01-2010AAPL213.43214.01212.38214.5123432400
404-01-2010ABC26.2926.6326.1426.692455900
504-01-2010ABT54.1954.4653.9254.5610829000
[5]:
#Las últimas 5 filas
last(df,5)
[5]:

5 rows × 7 columns

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
106-01-2010BMY25.1725.2225.0725.2915528900
206-01-2010BSX9.079.168.999.2812923000
306-01-2010BWA35.3936.6935.336.784171000
406-01-2010BXP68.2368.4468.0368.941814900
506-01-2010C3.563.643.513.6867433800
[6]:
# Algunos Indicadores Estadísticos
describe(df)
[6]:

7 rows × 8 columns

variablemeanminmedianmaxnuniquenmissingeltype
SymbolUnion…AnyUnion…AnyUnion…NothingDataType
1date04-01-201006-01-20105String
2symbolAZION467String
3open46.90741.5337.07627.181Float64
4close47.04071.6137.25626.751Float64
5low46.44531.5136.74624.241Float64
6high47.41971.6137.76629.511Float64
7volume7.01361e6100003.0912e6215620200Int64
[7]:
# Separar por grupos
agrupar = groupby(df, :symbol)
[7]:

GroupedDataFrame with 467 groups based on key: symbol

First Group (3 rows): symbol = "A"

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
104-01-2010A31.3931.331.1331.633815500
205-01-2010A31.2130.9630.7631.224186000
306-01-2010A30.8530.8530.7631.03243700

Last Group (1 row): symbol = "CHTR"

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
105-01-2010CHTR35.035.035.035.010000
[8]:
#Obtener un grupo
losBXP = get(agrupar, (symbol=:"BMY",), nothing)
println(typeof(losBXP))
losBXP

SubDataFrame{DataFrame,DataFrames.Index,Array{Int64,1}}
[8]:

3 rows × 7 columns

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
104-01-2010BMY25.4125.6325.325.714376100
205-01-2010BMY25.5125.2325.0125.5516973600
306-01-2010BMY25.1725.2225.0725.2915528900
[9]:
#Obtener un grupo
losBXP = get(agrupar, (symbol=:"BXP",), nothing)
println(typeof(losBXP))
losBXP
SubDataFrame{DataFrame,DataFrames.Index,Array{Int64,1}}
[9]:

3 rows × 7 columns

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
104-01-2010BXP67.5967.166.5368.331511500
205-01-2010BXP67.2468.1266.4568.22173700
306-01-2010BXP68.2368.4468.0368.941814900
[10]:
losBMY = agrupar[(symbol= "BMY",)]
println(typeof(losBMY))
losBMY
SubDataFrame{DataFrame,DataFrames.Index,Array{Int64,1}}
[10]:

3 rows × 7 columns

datesymbolopencloselowhighvolume
StringStringFloat64Float64Float64Float64Int64
104-01-2010BMY25.4125.6325.325.714376100
205-01-2010BMY25.5125.2325.0125.5516973600
306-01-2010BMY25.1725.2225.0725.2915528900

Relación entre los precios de apertura vs los de cierre

[11]:
# Relación de los Precios de Apertura vs los de Cierre
scatter(df.open, df.close, xlabel="Precios de Apertura", ylabel="Precios de Cierre")
[11]:
../../_images/notebooks_julia_5-01-05_14_0.svg
[12]:
# División del Conjunto de Entrenamiento y Prueba
# Esquema 1: Conservando el orden de los datos
# Tamaño del DataFrame
filas, columnas = size(df)

#Se toma el 80% para Entrenamiento
nEntrenamiento = filas * 0.8
nValidacion = filas - nEntrenamiento

train = first(df, Int(nEntrenamiento))
test = last(df, Int(nValidacion))
println(size(train))
println(size(test))
(800, 7)
(200, 7)
[13]:
# División del Conjunto de Entrenamiento y Prueba
# Esquema 2: Aleatoriamente
# Aproximadamente el 80% para entrenamiento
using Lathe.preprocess: TrainTestSplit
train, test = TrainTestSplit(df, .8)
[13]:
(814×7 DataFrame. Omitted printing of 1 columns
│ Row │ date       │ symbol │ open    │ close   │ low     │ high    │
│     │ String     │ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 04-01-2010 │ A      │ 31.39   │ 31.3    │ 31.13   │ 31.63   │
│ 2   │ 04-01-2010 │ AAP    │ 40.7    │ 40.38   │ 40.36   │ 41.04   │
│ 3   │ 04-01-2010 │ AAPL   │ 213.43  │ 214.01  │ 212.38  │ 214.5   │
│ 4   │ 04-01-2010 │ ABC    │ 26.29   │ 26.63   │ 26.14   │ 26.69   │
│ 5   │ 04-01-2010 │ ACN    │ 41.52   │ 42.07   │ 41.5    │ 42.2    │
│ 6   │ 04-01-2010 │ ADP    │ 43.54   │ 42.83   │ 42.7    │ 43.54   │
│ 7   │ 04-01-2010 │ ADS    │ 65.0    │ 65.89   │ 64.96   │ 66.0    │
│ 8   │ 04-01-2010 │ ADSK   │ 25.61   │ 25.67   │ 25.61   │ 25.83   │
│ 9   │ 04-01-2010 │ AEP    │ 35.1    │ 34.94   │ 34.8    │ 36.0    │
│ 10  │ 04-01-2010 │ AET    │ 32.06   │ 33.0    │ 31.87   │ 33.08   │
⋮
│ 804 │ 06-01-2010 │ BBBY   │ 39.13   │ 39.23   │ 38.47   │ 39.67   │
│ 805 │ 06-01-2010 │ BDX    │ 77.99   │ 77.66   │ 77.28   │ 78.08   │
│ 806 │ 06-01-2010 │ BEN    │ 108.57  │ 108.92  │ 108.32  │ 109.64  │
│ 807 │ 06-01-2010 │ BHI    │ 43.37   │ 45.84   │ 43.37   │ 46.03   │
│ 808 │ 06-01-2010 │ BIIB   │ 53.1    │ 53.43   │ 52.8    │ 53.7    │
│ 809 │ 06-01-2010 │ BLL    │ 51.92   │ 52.0    │ 51.64   │ 52.12   │
│ 810 │ 06-01-2010 │ BMY    │ 25.17   │ 25.22   │ 25.07   │ 25.29   │
│ 811 │ 06-01-2010 │ BSX    │ 9.07    │ 9.16    │ 8.99    │ 9.28    │
│ 812 │ 06-01-2010 │ BWA    │ 35.39   │ 36.69   │ 35.3    │ 36.78   │
│ 813 │ 06-01-2010 │ BXP    │ 68.23   │ 68.44   │ 68.03   │ 68.94   │
│ 814 │ 06-01-2010 │ C      │ 3.56    │ 3.64    │ 3.51    │ 3.68    │, 186×7 DataFrame. Omitted printing of 1 columns
│ Row │ date       │ symbol │ open    │ close   │ low     │ high    │
│     │ String     │ String │ Float64 │ Float64 │ Float64 │ Float64 │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┤
│ 1   │ 04-01-2010 │ ABT    │ 54.19   │ 54.46   │ 53.92   │ 54.56   │
│ 2   │ 04-01-2010 │ ADBE   │ 36.65   │ 37.09   │ 36.65   │ 37.3    │
│ 3   │ 04-01-2010 │ ADI    │ 31.79   │ 31.67   │ 31.61   │ 32.19   │
│ 4   │ 04-01-2010 │ ADM    │ 31.48   │ 31.47   │ 31.33   │ 31.84   │
│ 5   │ 04-01-2010 │ AEE    │ 28.03   │ 27.76   │ 27.69   │ 28.27   │
│ 6   │ 04-01-2010 │ AES    │ 13.38   │ 13.67   │ 13.38   │ 13.7    │
│ 7   │ 04-01-2010 │ AGN    │ 39.7    │ 40.29   │ 39.7    │ 40.46   │
│ 8   │ 04-01-2010 │ AIG    │ 30.53   │ 29.89   │ 29.41   │ 30.54   │
│ 9   │ 04-01-2010 │ AKAM   │ 25.63   │ 25.92   │ 25.53   │ 26.06   │
│ 10  │ 04-01-2010 │ ALL    │ 30.36   │ 30.41   │ 30.09   │ 30.51   │
⋮
│ 176 │ 06-01-2010 │ AMAT   │ 14.23   │ 14.16   │ 14.1    │ 14.4    │
│ 177 │ 06-01-2010 │ AMG    │ 69.21   │ 70.79   │ 69.21   │ 71.47   │
│ 178 │ 06-01-2010 │ AMP    │ 41.37   │ 41.38   │ 40.86   │ 41.55   │
│ 179 │ 06-01-2010 │ ARNC   │ 16.31   │ 16.97   │ 16.26   │ 17.06   │
│ 180 │ 06-01-2010 │ ATVI   │ 11.26   │ 11.26   │ 11.21   │ 11.38   │
│ 181 │ 06-01-2010 │ BAC    │ 16.21   │ 16.39   │ 16.03   │ 16.54   │
│ 182 │ 06-01-2010 │ BBT    │ 25.95   │ 26.58   │ 25.95   │ 26.91   │
│ 183 │ 06-01-2010 │ BBY    │ 41.21   │ 40.89   │ 40.66   │ 41.34   │
│ 184 │ 06-01-2010 │ BCR    │ 79.37   │ 79.01   │ 78.68   │ 79.37   │
│ 185 │ 06-01-2010 │ BK     │ 28.48   │ 28.16   │ 28.09   │ 28.51   │
│ 186 │ 06-01-2010 │ BLK    │ 238.51  │ 234.67  │ 234.06  │ 238.65  │)

El Modelo de Regresión

[14]:
using GLM
modelo = @formula(open~close)
linreg = lm(modelo, train)
[14]:
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

open ~ 1 + close

Coefficients:
────────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error        t  Pr(>|t|)  Lower 95%   Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)  -0.161929  0.0336811     -4.81    <1e-5   -0.228041  -0.0958166
close         1.00038   0.00047325  2113.84    <1e-99   0.999448   1.00131
────────────────────────────────────────────────────────────────────────────
[15]:
# Verificamos el valor del R cuadrado
r2(linreg)
[15]:
0.9998183099258227

Predicción

[16]:
test_pred = predict(linreg, test);
train_pred = predict(linreg, train);
[17]:
perf_test = df_original = DataFrame(y_original = test[!, :open], y_pred = test_pred)
perf_test.error = perf_test[!,:y_original] - perf_test[!,:y_pred]
perf_test.error_sq = perf_test.error.*perf_test.error;
[18]:
perf_train = df_original = DataFrame(y_original = train[!, :open], y_pred = train_pred)
perf_train.error = perf_train[!,:y_original] - perf_train[!,:y_pred]
perf_train.error_sq = perf_train.error .* perf_train.error;

Funciones de Pérdida

[19]:
# Calcular la Función de Pérdida o Riesgo Total

# Función de MAPE
function mape(perf_df)
    mape = mean(abs.(perf_df.error ./ perf_df.y_original))
    return mape
end

#Funcion RMSE
function rmse(perf_df)
    rmse = sqrt(mean(perf_df.error .* perf_df.error))
    return rmse
end

[19]:
rmse (generic function with 1 method)
[20]:
println("Mean Absolute test error: ",mean(abs.(perf_test.error)))
println("Mean Absolute Percentage test error: ", mape(perf_test))
println("Root Mean Square Test Error: ", rmse(perf_test))
println("Mean Square Test Error: ",mean(perf_test.error_sq))
Mean Absolute test error: 0.5017023151586166
Mean Absolute Percentage test error: 0.011701124353951658
Root Mean Square Test Error: 0.7499470152740347
Mean Square Test Error: 0.5624205257184333
[21]:
println("Mean Absolute train error: ",mean(abs.(perf_train.error)))
println("Mean Absolute Percentage train error: ", mape(perf_train))
println("Root Mean Square train Error: ", rmse(perf_train))
println("Mean Square train Error: ",mean(perf_train.error_sq))
Mean Absolute train error: 0.4845467192962157
Mean Absolute Percentage train error: 0.012227805323710275
Root Mean Square train Error: 0.7201515657678202
Mean Square train Error: 0.5186182776778432