Practical Session - Linear Regression Models
Introduction
This study focuses on the Mexico electricity dataset with meteorological and temporal variables. The goal is to predict whether daily electricity generation is above or below the median using logistic regression and regularized methods (Ridge, Lasso). We’ll explore multicollinearity, model selection, and performance comparison.
Data Description
The dataset contains daily electricity generation data with the following variables:
- RH: Relative humidity \((\%)\)
- SSRD: Surface solar radiation downward \((J.m^{-2})\)
- STRD: Surface thermal radiation downward \((J.m^{-2})\)
- T2M: Average daily temperature at \(2m\) \((^\circ C)\)
- T2Mmax: Maximum daily temperature at \(2m\) \((^\circ C)\)
- T2Mmin: Minimum daily temperature at \(2m\) \((^\circ C)\)
- Covid: COVID-19 stringency index
- Holidays: Holidays or public holidays, \(1\) = holidays, \(0\) otherwise
- DOW: Day of week, \(0\) for Monday, \(1\) for Tuesday, \(\cdots\)
- TOY: Day of the year (from \(1\) to \(366\))
- Total: Electricity generated daily in the territory \((GWh)\) - target variable
Data Loading and Exploration
# Packages (assumed to have been loaded in a setup chunk not shown in this extract)
library(ggplot2)     # ggplot(), geom_tile()
library(ggcorrplot)  # ggcorrplot()
library(car)         # vif()
library(pROC)        # roc(), auc()
library(caret)       # createFolds()
library(glmnet)      # glmnet(), cv.glmnet()
# Load data
data <- read.csv("Mexico_data.csv")
# Basic structure
cat("Dataset dimensions:", dim(data), "\n")
## Dataset dimensions: 1461 12
cat("Number of observations:", nrow(data), "\n")
## Number of observations: 1461
cat("Number of variables:", ncol(data), "\n\n")
## Number of variables: 12
# Summary
summary(data)
## X0 RH SSRD STRD
## Length:1461 Min. :28.37 Min. : 407045 Min. :1019376
## Class :character 1st Qu.:50.12 1st Qu.: 719323 1st Qu.:1156350
## Mode :character Median :57.66 Median : 870071 Median :1235306
## Mean :57.55 Mean : 869055 Mean :1244835
## 3rd Qu.:64.55 3rd Qu.:1022083 3rd Qu.:1353442
## Max. :80.54 Max. :1217895 Max. :1431876
## T2M T2Mmax T2Mmin Covid
## Min. :11.23 Min. :17.82 Min. : 5.536 Min. : 0.00
## 1st Qu.:18.09 1st Qu.:25.27 1st Qu.:12.064 1st Qu.: 0.00
## Median :22.32 Median :28.72 Median :16.220 Median :33.33
## Mean :21.27 Mean :27.93 Mean :15.623 Mean :33.00
## 3rd Qu.:24.63 3rd Qu.:30.83 3rd Qu.:19.747 3rd Qu.:63.89
## Max. :27.39 Max. :33.83 Max. :21.841 Max. :82.41
## Holidays DOW TOY Total
## Min. :0.00000 Min. :0 Min. : 1.0 Min. : 578.5
## 1st Qu.:0.00000 1st Qu.:1 1st Qu.: 92.0 1st Qu.: 816.0
## Median :0.00000 Median :3 Median :183.0 Median : 871.4
## Mean :0.03012 Mean :3 Mean :183.1 Mean : 880.7
## 3rd Qu.:0.00000 3rd Qu.:5 3rd Qu.:274.0 3rd Qu.: 958.1
## Max. :1.00000 Max. :6 Max. :366.0 Max. :1107.4
# Check missing values
cat("\nMissing values per column:\n")
##
## Missing values per column:
print(colSums(is.na(data)))
## X0 RH SSRD STRD T2M T2Mmax T2Mmin Covid
## 0 0 0 0 0 0 0 0
## Holidays DOW TOY Total
## 0 0 0 0
# Data types
cat("\nNumerical columns:", names(data)[sapply(data, is.numeric)], "\n")
##
## Numerical columns: RH SSRD STRD T2M T2Mmax T2Mmin Covid Holidays DOW TOY Total
cat("Categorical columns:", names(data)[sapply(data, is.character)], "\n")
## Categorical columns: X0
Creating Binary Target Variable
We create a binary variable Y indicating whether electricity generation is above (1) or below (0) the median.
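Formally, with \(m\) the sample median of Total,
\[
Y_i = \mathbf{1}\{\mathrm{Total}_i > m\} =
\begin{cases} 1 & \text{if } \mathrm{Total}_i > m \\ 0 & \text{otherwise.} \end{cases}
\]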
# Create binary target
median_energy <- median(data$Total)
data$Y <- ifelse(data$Total > median_energy, 1, 0)
cat("Median electricity generation:", round(median_energy, 2), "GWh\n")
## Median electricity generation: 871.43 GWh
cat("Class distribution:\n")
## Class distribution:
print(table(data$Y))
##
## 0 1
## 731 730
cat("Proportion: ", prop.table(table(data$Y)), "\n")
## Proportion: 0.5003422 0.4996578
Exploratory Data Analysis
Distribution Analysis
par(mfrow=c(2,2), bg="white")
# Class colors used throughout (assumed palette; their definitions were missing from this extract)
col_low  <- "#e63946"  # below-median days
col_high <- "#06d6a0"  # above-median days
# Histogram with median line
hist(data$Total
, breaks=30
, col="#8ecae6"
, main="Distribution of Total Electricity",
, xlab="Total Electricity (GWh)",
, border="white"
, col.main="#023047"
, col.lab="#023047"
)
abline(v=median_energy, col="#fb8500", lwd=3, lty=2)
legend("topright", legend="Median", col="#fb8500", lty=2, lwd=3, bty="n")
# Boxplot by class
boxplot(Total ~ Y
, data=data
, col=c(col_low, col_high)
, names=c("Below Median", "Above Median")
, main="Electricity by Class"
, ylab="Total Electricity (GWh)"
, border=c("#b91c1c", "#059669")
, col.main="#023047", col.lab="#023047", lwd=1.5)
# Density plot
plot(density(data$Total)
, main="Density Plot of Electricity"
, xlab="Total Electricity (GWh)", lwd=3, col="#1982c4"
, col.main="#023047"
, col.lab="#023047"
)
polygon(density(data$Total), col="#8ecae680", border="#1982c4", lwd=2)
abline(v=median_energy, col="#fb8500", lwd=3, lty=2)
# QQ plot to check normality
qqnorm(data$Total
, main="Q-Q Plot"
, pch=19
, col="#8ecae6"
, col.main="#023047"
, col.lab="#023047"
)
qqline(data$Total, col="#fb8500", lwd=3)
The distribution shows reasonable separation between low and high electricity classes, and the data are fairly balanced around the median. Electricity generation shows a slightly skewed, mildly bimodal distribution with distinct differences between below-median and above-median days. The Q-Q plot indicates non-normality driven by tail behavior, which aligns with known seasonal fluctuations in electricity demand.
Temporal Patterns
Next, we examine the temporal patterns to see how the data behave.
par(mfrow=c(1,2))
# Electricity by DOW
dow_colors <- colorRampPalette(c("#1982c4", "#8ac926", "#ffca3a"))(7)
boxplot(Total ~ DOW
, data=data
, col=dow_colors
, border="#023047"
, main="Electricity by Day of Week"
, xlab="Day (0=Mon, 6=Sun)"
, ylab="Total (GWh)"
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
# Electricity over a year
plot(data$TOY
, data$Total
, col=ifelse(data$Y==1, col_high, col_low)
, pch=16
, cex=0.8
, main="Electricity Generation Throughout Year"
, xlab="Day of Year"
, ylab="Total (GWh)"
, col.main="#023047"
, col.lab="#023047"
)
abline(h=median_energy, col="#023047", lwd=2.5, lty=2)
legend("topright", legend=c("Above Median", "Below Median"),
col=c(col_high, col_low), pch=16, bty="n", cex=0.9)
The plots show clear seasonality. The left plot shows that electricity demand is significantly higher during the work week and drops at the weekend, especially on Sunday, consistent with industrial and commercial activity being high on workdays and lower at the weekend. The right plot shows that generation is low early in the year (roughly days 0-100) and again after day 250, peaking roughly over days 200-250, which reflects the summer demand spike.
Correlation Analysis
num_data <- data[, sapply(data, is.numeric)]
corr_matrix <- cor(num_data, use="complete.obs")
# Correlation
ggcorrplot(
corr_matrix
, hc.order = TRUE
, lab = TRUE
, lab_size = 2.5
, method = "circle"
, type = "lower"
, colors = c("#4361ee", "#f8f9fa", "#e63946")
, title = "Correlation Matrix - Mexico Electricity Dataset"
, ggtheme = theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "#023047"),
axis.text = element_text(color = "#023047"))
)
From these correlations we can state:
| Variable | \(Corr(Total, X_{ij})\) | Interpretation |
|---|---|---|
| STRD | 0.72 | On clear days the conditions are optimal and more electricity is produced, driving up the total. |
| SSRD | 0.53 | More direct sunlight improves photovoltaic (solar) generation. |
| T2M | 0.76 | Strong positive correlation: hotter days correspond to more energy generated. |
| T2Mmax | 0.71 | Days with a hotter peak temperature generate more electricity. |
| T2Mmin | 0.77 | Days with warmer nights (higher minimum temperature) also tend to generate more electricity. |
| RH | 0.27 | The weak correlation suggests that humidity has little effect on total generation. |
| COVID-19 | 0.01 | Essentially no linear relationship with generation; this variable is a candidate for removal from the model. |
| Holidays | -0.23 | Generation is lower on holidays than on regular days, plausibly reflecting reduced industrial activity. |
| DOW | -0.27 | The negative correlation suggests that generation declines toward the weekend relative to weekdays. |
Checking Variance Inflation Factor (VIF)
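For reference, the VIF of predictor \(X_j\) is computed from the \(R^2\) of the regression of \(X_j\) on all the other predictors:
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]
so a VIF of 10 corresponds to \(R_j^2 = 0.9\), i.e. 90% of the variance of \(X_j\) is explained by the other predictors.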
# Full model with all predictors
full_model_vif <- glm(
Y ~ RH + SSRD + STRD + T2M + T2Mmax + T2Mmin + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Calculate VIF
vif_values <- vif(full_model_vif)
cat("VIF Values (Full Model):\n")
## VIF Values (Full Model):
print(vif_values)
## RH SSRD STRD T2M T2Mmax T2Mmin Covid
## 18.088931 11.192520 45.455237 301.728881 78.222212 146.634867 1.347205
## Holidays DOW TOY
## 1.021233 1.281240 3.564355
# VIF > 10 indicates problematic multicollinearity
cat("\n\nVariables with VIF > 10:\n")
##
##
## Variables with VIF > 10:
print(vif_values[vif_values > 10])
## RH SSRD STRD T2M T2Mmax T2Mmin
## 18.08893 11.19252 45.45524 301.72888 78.22221 146.63487
In the full model, several predictors show severe multicollinearity, especially the temperature variables (T2M, T2Mmax, T2Mmin), with VIFs ranging from about 78 to over 300. STRD and SSRD also have VIF > 10, indicating strong correlation with each other and with temperature.
From this we conclude:
- Temperature variables (T2M, T2Mmax, T2Mmin) are highly correlated (> 0.95), indicating multicollinearity
- SSRD and STRD show strong positive correlation with Total electricity
- T2M and temperature-related variables have strong relationship with electricity generation
- To avoid multicollinearity issues, we’ll remove T2Mmax and T2Mmin from the model, keeping only T2M
Logistic Regression Models
A. Full Logistic Regression Model
We fit a logistic regression model excluding the highly correlated variables T2Mmax and T2Mmin.
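The model assumes the log-odds of an above-median day are linear in the predictors:
\[
\log\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \beta_0 + \beta_1\,\mathrm{RH} + \beta_2\,\mathrm{SSRD} + \cdots + \beta_8\,\mathrm{TOY}.
\]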
# Fit logistic regression
log_model <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Model summary
summary(log_model)
##
## Call:
## glm(formula = Y ~ RH + SSRD + STRD + T2M + Covid + Holidays +
## DOW + TOY, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.460e+01 3.398e+00 -7.239 4.53e-13 ***
## RH 5.806e-02 3.367e-02 1.724 0.084677 .
## SSRD 8.866e-06 1.714e-06 5.171 2.32e-07 ***
## STRD 6.014e-06 5.864e-06 1.026 0.305083
## T2M 3.005e-01 1.681e-01 1.787 0.073873 .
## Covid -1.898e-02 3.514e-03 -5.399 6.69e-08 ***
## Holidays -3.027e+00 7.861e-01 -3.850 0.000118 ***
## DOW -4.370e-01 4.868e-02 -8.977 < 2e-16 ***
## TOY 7.185e-03 1.753e-03 4.099 4.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2025.38 on 1460 degrees of freedom
## Residual deviance: 866.97 on 1452 degrees of freedom
## AIC: 884.97
##
## Number of Fisher Scoring iterations: 6
# Extract coefficients
log_coef <- summary(log_model)$coefficients
Significance Analysis
# Predictors (p < 0.05)
log_significant <- log_coef[log_coef[, 4] < 0.05, ]
cat("Significant predictors (p < 0.05):\n")
## Significant predictors (p < 0.05):
print(log_significant)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.459978e+01 3.398420e+00 -7.238595 4.533573e-13
## SSRD 8.865513e-06 1.714321e-06 5.171443 2.322931e-07
## Covid -1.897541e-02 3.514402e-03 -5.399330 6.689034e-08
## Holidays -3.026540e+00 7.860669e-01 -3.850232 1.180058e-04
## DOW -4.370404e-01 4.868228e-02 -8.977402 2.772346e-19
## TOY 7.185129e-03 1.752790e-03 4.099252 4.144880e-05
# p-values
pvals <- log_coef[-1, 4] # Exclude intercept
names_vars <- rownames(log_coef)[-1]
par(mar=c(8, 4, 4, 2))
barplot(sort(pvals)
, names.arg = names_vars[order(pvals)]
, las=2
, col = ifelse(sort(pvals) < 0.05, "#06d6a0", "#e63946")
, border = ifelse(sort(pvals) < 0.05, "#059669", "#b91c1c")
, main = "P-values of Predictors"
, ylab = "P-value"
, ylim = c(0, max(pvals)*1.1)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=0.05, col="#fb8500", lwd=3, lty=2)
legend("topleft", legend=c("Significant (p<0.05)", "Not Significant"),
fill=c("#06d6a0", "#e63946"), bty="n", border=c("#059669", "#b91c1c"))
The significant predictors \((p < 0.05)\) are SSRD, Covid, Holidays, DOW, and TOY. The variables STRD and T2M are not significant in this model, indicating that once multicollinearity is removed, they do not meaningfully contribute to predicting whether electricity generation is above or below the median.
B. Model Predictions
# Compare prediction types
cat("Prediction type='link' (log-odds scale):\n")
## Prediction type='link' (log-odds scale):
head(predict(log_model, type="link"))
## 1 2 3 4 5 6
## -9.255574 -6.000108 -6.137141 -6.586506 -6.902523 -6.304666
cat("\n\nPrediction type='response' (probability scale):\n")
##
##
## Prediction type='response' (probability scale):
head(predict(log_model, type="response"))
## 1 2 3 4 5 6
## 9.556825e-05 2.472356e-03 2.156434e-03 1.376950e-03 1.004237e-03 1.824421e-03
Note: type="link" returns log-odds
\(\in (-\infty, +\infty)\), while
type="response" returns probabilities \(\in [0, 1]\).
C. Odds Ratios Analysis
# Calculate odds ratios
log_odd_ratio <- exp(coef(log_model))
# 95% CI
CI <- exp(confint(log_model))
# Create odds-ratio table
log_odd_ratio_table <- data.frame(
Odds_Ratio = log_odd_ratio,
CI_Lower = CI[, 1],
CI_Upper = CI[, 2]
)
cat("Odds Ratios with 95% Confidence Intervals:\n")
## Odds Ratios with 95% Confidence Intervals:
print(round(log_odd_ratio_table, 4))
## Odds_Ratio CI_Lower CI_Upper
## (Intercept) 0.0000 0.0000 0.0000
## RH 1.0598 0.9924 1.1326
## SSRD 1.0000 1.0000 1.0000
## STRD 1.0000 1.0000 1.0000
## T2M 1.3505 0.9753 1.8859
## Covid 0.9812 0.9744 0.9879
## Holidays 0.0485 0.0086 0.1985
## DOW 0.6459 0.5859 0.7092
## TOY 1.0072 1.0038 1.0108
# Visualize odds ratios
or_plot <- log_odd_ratio_table[-1, ]
or_names <- rownames(or_plot)
par(mar=c(5, 8, 4, 2))
or_colors <- ifelse(or_plot$Odds_Ratio > 1, "#1982c4", "#e63946")
plot(or_plot$Odds_Ratio
, 1:nrow(or_plot)
, xlim=c(0, max(or_plot$CI_Upper)*1.1)
, yaxt="n"
, ylab=""
, xlab="Odds Ratio"
, main="Odds Ratios with 95% CI"
, pch=19
, col=or_colors
, cex=2
, col.main="#023047"
, col.lab="#023047"
)
axis(2, at=1:nrow(or_plot), labels=or_names, las=2, col.axis="#023047")
segments(or_plot$CI_Lower
, 1:nrow(or_plot)
, or_plot$CI_Upper
, 1:nrow(or_plot)
, col=or_colors
, lwd=3
)
abline(v=1, col="#023047", lty=2, lwd=2.5)
legend("topright", legend=c("OR > 1 (Positive)", "OR < 1 (Negative)"),
col=c("#1982c4", "#e63946"), pch=19, bty="n", cex=0.9)
par(mar=c(5, 4, 4, 2))
Interpretation:
- Positive effects (OR > 1): SSRD and TOY increase the likelihood of high electricity generation.
- Negative effects (OR < 1): Covid, Holidays, and DOW significantly reduce the odds of above-median generation.
- Not significant: RH, STRD, and T2M show no meaningful effect (their confidence intervals include 1).
- Most influential: Holidays, DOW, Covid, TOY, and SSRD.
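As a worked example using the table above, the DOW odds ratio follows directly from exponentiating its coefficient:
\[
\mathrm{OR}_{\mathrm{DOW}} = e^{\hat\beta_{\mathrm{DOW}}} = e^{-0.437} \approx 0.646,
\]
so each step of DOW toward the weekend multiplies the odds of an above-median day by about 0.65 (roughly a 35% reduction in odds), holding the other predictors fixed.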
D. Model Performance - Confusion Matrix
# Predict
log_p_hat <- predict(log_model, type="response")
log_y_hat <- ifelse(log_p_hat > 0.5, 1, 0)
# Confusion matrix
log_conf_mat <- table(Predicted = log_y_hat, Actual = data$Y)
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(log_conf_mat)
## Actual
## Predicted 0 1
## 0 624 81
## 1 107 649
# Extract TP, FN, FP, TN
TN <- log_conf_mat["0","0"]
FN <- log_conf_mat["0","1"]
FP <- log_conf_mat["1","0"]
TP <- log_conf_mat["1","1"]
# Calculate metrics
accuracy <- (TP + TN) / sum(log_conf_mat)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
FP_rate <- FP / (FP + TN)
FN_rate <- FN / (FN + TP)
# Display metrics
cat("\n\n=== Performance Metrics ===\n")
##
##
## === Performance Metrics ===
cat(sprintf("Accuracy: (%.2f%%)\n", accuracy, accuracy*100))
## Accuracy: (0.87%)
cat(sprintf("Precision: \n", precision))
## Precision:
cat(sprintf("Recall: \n", recall))
## Recall:
cat(sprintf("F1-Score: \n", f1_score))
## F1-Score:
cat(sprintf("FP Rate: (%.2f%%)\n", FP_rate, FP_rate*100))
## FP Rate: (0.15%)
cat(sprintf("FN Rate: (%.2f%%)\n", FN_rate, FN_rate*100))
## FN Rate: (0.11%)
# Visualize
par(mfrow=c(2,2))
# 1. Confusion matrix
log_conf_mat_df <- as.data.frame(log_conf_mat)
colnames(log_conf_mat_df) <- c("Predicted", "Actual", "Freq")
ggplot(log_conf_mat_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", size=1.5) +
geom_text(aes(label = Freq), size = 10, fontface="bold", color = "#023047") +
scale_fill_gradient(low = "#e0f2fe", high = "#0284c7") +
labs(title = "Confusion Matrix - Logistic Regression",
x = "Actual Class", y = "Predicted Class") +
theme_minimal(base_size = 14) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold", color = "#023047"),
axis.text = element_text(color = "#023047"))
The confusion matrix shows a high number of correct predictions in both classes; the false positive rate (\(\approx 15\%\)) and false negative rate (\(\approx 11\%\)) are both low and acceptable.
# 2. ROC Curve
roc_obj <- roc(data$Y, log_p_hat)
plot(roc_obj
, main=paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")")
, col="#1982c4"
, lwd=4
, col.main="#023047"
, col.lab="#023047"
)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
# 3. Predicted probabilities distribution
plot(density(log_p_hat[data$Y==0])
, col=col_low
, lwd=3
, main="Predicted Probability Distribution"
, xlab="Predicted Probability", xlim=c(0,1)
, col.main="#023047"
, col.lab="#023047"
)
lines(density(log_p_hat[data$Y==1])
, col=col_high
, lwd=3)
abline(v=0.5, lty=2, col="#023047", lwd=2.5)
legend("topright"
, legend=c("Class 0", "Class 1", "Threshold")
, col=c(col_low, col_high, "#023047")
, lty=c(1,1,2)
, lwd=3
, bty="n"
)
# 4. Metrics bar plot
metrics_vals <- c(accuracy, precision, recall, f1_score)
metrics_names <- c("Accuracy", "Precision", "Recall", "F1-Score")
barplot(metrics_vals
, names.arg=metrics_names
, col=c("#1982c4", "#8ac926", "#ffca3a", "#ff595e")
, border=c("#0d5c8e", "#6b9e1f", "#c79a2c", "#cc444c")
, main="Performance Metrics"
, ylim=c(0,1)
, las=2
, ylab="Score"
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=0.5, lty=2, col="#6c757d", lwd=2)
Interpretation:
ROC Curve: The AUC of \(0.948\), with the curve hugging the top-left corner, indicates excellent discrimination: the logistic model can reliably separate high from low energy-generation days.
Predicted Probability Distribution: Class 0 has predicted probabilities concentrated near \(0.0-0.3\), whereas Class 1 peaks sharply near \(0.8-1.0\). The two distributions are well separated, meaning the model makes high-confidence predictions.
Performance Metrics: Accuracy, Precision, Recall, and F1-Score all lie around \(0.86-0.89\), indicating stable, strong performance across all aspects.
E. K-Fold Cross-Validation
We evaluate model stability using K-fold cross-validation with K=5 and K=10.
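The folds are implemented manually below for transparency. As an alternative sketch (shown commented out; the argument names are caret's, and the exact metrics reported may differ), caret can run a comparable cross-validation directly via train():
# Alternative (not run): caret's built-in cross-validation
# train(factor(Y) ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY,
#       data = data, method = "glm", family = binomial,
#       trControl = trainControl(method = "cv", number = 10))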
set.seed(42)
# K-fold CV function
kfold_logistic_cv <- function(data, K) {
folds <- createFolds(data$Y, k=K, list=TRUE, returnTrain = FALSE)
results <- data.frame(
Fold = integer()
, Accuracy = numeric()
, Precision = numeric()
, Recall = numeric()
, F1 = numeric()
)
for (i in 1:K) {
test_idx <- folds[[i]]
train_idx <- setdiff(seq_len(nrow(data)), test_idx)
train_data <- data[train_idx, ]
test_data <- data[test_idx, ]
model <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = train_data
, family = binomial
)
p_hat <- predict(model, newdata=test_data, type="response")
y_hat <- ifelse(p_hat > 0.5, 1, 0)
cm <- table(Pred = y_hat, True = test_data$Y)
TP <- cm[2,2]
TN <- cm[1,1]
FP <- cm[2,1]
FN <- cm[1,2]
acc <- (TP + TN) / sum(cm)
prec <- TP / (TP + FP)
rec <- TP / (TP + FN)
f1 <- 2 * (prec * rec) / (prec + rec)
results <- rbind(results, data.frame(
Fold = i,
Accuracy = acc,
Precision = prec,
Recall = rec,
F1 = f1
))
}
return(results)
}
# Run for K = 5 and K = 10
results_k5 <- kfold_logistic_cv(data, 5)
results_k10 <- kfold_logistic_cv(data, 10)
# Summary
cat("=== K=5 Cross-Validation Results ===\n")
## === K=5 Cross-Validation Results ===
cat(sprintf("Mean Accuracy: %.4f ± %.4f\n", mean(results_k5$Accuracy), sd(results_k5$Accuracy)))
## Mean Accuracy: 0.8679 ± 0.0135
cat(sprintf("Mean F1-Score: %.4f ± %.4f\n", mean(results_k5$F1), sd(results_k5$F1)))
## Mean F1-Score: 0.8699 ± 0.0143
cat("\n=== K=10 Cross-Validation Results ===\n")
##
## === K=10 Cross-Validation Results ===
cat(sprintf("Mean Accuracy: %.4f ± %.4f\n", mean(results_k10$Accuracy), sd(results_k10$Accuracy)))
## Mean Accuracy: 0.8679 ± 0.0297
cat(sprintf("Mean F1-Score: %.4f ± %.4f\n", mean(results_k10$F1), sd(results_k10$F1)))
## Mean F1-Score: 0.8702 ± 0.0300
# Visualization
par(mfrow=c(1,2))
# Boxplot
cv_results <- data.frame(
Accuracy = c(results_k5$Accuracy, results_k10$Accuracy)
, K = factor(c(rep("K=5", 5), rep("K=10", 10)), levels=c("K=5", "K=10"))  # keep K=5 first so labels and mean points match
)
boxplot(Accuracy ~ K
, data = cv_results
, col=c("#ffca3a", "#1982c4")
, border=c("#c79a2c", "#0d5c8e")
, main = "K-fold Cross-Validation Accuracy"
, ylab = "Accuracy"
, ylim=c(0.8, 0.95)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
points(1:2
, c(mean(results_k5$Accuracy), mean(results_k10$Accuracy))
, pch=18
, cex=2.5
, col="#e63946")
# Multiple metrics comparison for K=10
metric_colors <- c("#1982c4", "#8ac926", "#ffca3a", "#ff595e")
boxplot(results_k10[,c("Accuracy", "Precision", "Recall", "F1")]
, col=metric_colors
, border=c("#0d5c8e", "#6b9e1f", "#c79a2c", "#cc444c")
, main="K=10 CV - All Metrics"
, ylab="Score"
, las=2
, ylim=c(0.8, 0.95)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
par(mfrow=c(1,1))
Both \(K=5\) and \(K=10\) cross-validation give similar accuracy (around \(0.868\)) and F1-score (around \(0.87\)), confirming that the model performs consistently. Although \(K=5\) shows lower variance, and therefore more stable estimates, \(K=10\) has higher variability without improving average performance.
Model Selection Methods
We compare automatic variable selection methods: Forward, Backward, and Stepwise selection using the AIC criterion.
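For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\]
so a lower AIC rewards goodness of fit while penalizing model complexity.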
# Full model
resall <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Null model (intercept only)
res_0 <- glm(
Y ~ 1
, data = data
, family = binomial
)
# Forward selection
res_forward <- step(
res_0
, scope = list(upper = resall)
, direction = "forward"
, trace = 0
)
# Backward selection
res_backward <- step(
resall
, direction = "backward"
, trace = 0
)
# Stepwise selection
res_step <- step(
resall
, direction = "both"
, trace = 0
)
# Compare selected models
cat("=== Forward Selection ===\n")
## === Forward Selection ===
cat("Formula:", deparse(formula(res_forward)), "\n")
## Formula: Y ~ T2M + DOW + Covid + Holidays + RH + SSRD + TOY
cat("AIC:", AIC(res_forward), "\n\n")
## AIC: 884.0148
cat("=== Backward Selection ===\n")
## === Backward Selection ===
cat("Formula:", deparse(formula(res_backward)), "\n")
## Formula: Y ~ RH + SSRD + T2M + Covid + Holidays + DOW + TOY
cat("AIC:", AIC(res_backward), "\n\n")
## AIC: 884.0148
cat("=== Stepwise Selection ===\n")
## === Stepwise Selection ===
cat("Formula:", deparse(formula(res_step)), "\n")
## Formula: Y ~ RH + SSRD + T2M + Covid + Holidays + DOW + TOY
cat("AIC:", AIC(res_step), "\n\n")
## AIC: 884.0148
# Compare AICs
models_aic <- data.frame(
Model = c("Null", "Full", "Forward", "Backward", "Stepwise")
, AIC = c(AIC(res_0), AIC(resall), AIC(res_forward), AIC(res_backward), AIC(res_step))
, N_Predictors = c(0, 8, length(coef(res_forward))-1,
length(coef(res_backward))-1, length(coef(res_step))-1)
)
cat("=== Model Comparison ===\n")
## === Model Comparison ===
print(models_aic)
## Model AIC N_Predictors
## 1 Null 2027.3754 0
## 2 Full 884.9660 8
## 3 Forward 884.0148 7
## 4 Backward 884.0148 7
## 5 Stepwise 884.0148 7
# Visualize AIC comparison
par(mfrow=c(1,2))
model_colors <- c("#6c757d", "#e63946", "#1982c4", "#8ac926", "#ffca3a")
barplot(models_aic$AIC
, names.arg=models_aic$Model
, col=model_colors
, border=c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
, main="AIC Comparison"
, ylab="AIC (lower is better)"
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
barplot(models_aic$N_Predictors
, names.arg=models_aic$Model
, col=model_colors
, border=c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
, main="Number of Predictores"
, ylab="Count"
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
Interpretation: Forward, backward, and stepwise selection all converge to the same 7-predictor model, achieving the lowest AIC (\(884\)), slightly better than the full model's AIC (\(885\)). This shows that removing STRD improves model parsimony without harming fit. The null model performs poorly (AIC \(= 2027\)), underscoring the necessity of the predictors.
Regularized Logistic Regression
We now apply L1 (Lasso) and L2 (Ridge) regularization to handle multicollinearity and prevent overfitting. Data is split 80/20 for training and testing.
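Both penalties are special cases of the elastic-net objective minimized by glmnet, where \(\alpha = 0\) gives Ridge and \(\alpha = 1\) gives Lasso:
\[
\min_{\beta_0,\,\beta} \; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda \left[ \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha \lVert\beta\rVert_1 \right].
\]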
Data Splitting
set.seed(42)
# Split 80/20 train/test
n <- nrow(data)
shuffled_idx <- sample(1:n)
train_size <- floor(0.8 * n)
train_idx <- shuffled_idx[1:train_size]
test_idx <- shuffled_idx[(train_size + 1):n]
train_data <- data[train_idx, ]
test_data <- data[test_idx, ]
# Prepare matrices
predictors <- c("RH","SSRD","STRD","T2M","T2Mmax","T2Mmin",
"Covid","Holidays","DOW","TOY")
x_train <- as.matrix(train_data[, predictors])
y_train <- as.numeric(train_data$Y)
x_test <- as.matrix(test_data[, predictors])
y_test <- as.numeric(test_data$Y)
# Summary
cat("=== Data Split Summary ===\n")
## === Data Split Summary ===
cat("Training size:", nrow(train_data), "\n")
## Training size: 1168
cat("Testing size :", nrow(test_data), "\n\n")
## Testing size : 293
cat("Training class distribution:\n")
## Training class distribution:
print(table(y_train))
## y_train
## 0 1
## 576 592
cat("\nTesting class distribution:\n")
##
## Testing class distribution:
print(table(y_test))
## y_test
## 0 1
## 155 138
A. Ridge Regression (L2 Regularization)
Ridge regression applies an L2 penalty, shrinking coefficients toward zero without eliminating variables.
# Ridge with 10-fold CV
ridge_cv <- cv.glmnet(
x_train
, y_train
, family="binomial"
, alpha=0
, nfolds=10
)
# Optimal lambdas: lambda.min minimizes the CV deviance; lambda.1se is the largest
# lambda within one standard error of that minimum (a more conservative, more shrunken fit)
ridge_min <- ridge_cv$lambda.min
ridge_1se <- ridge_cv$lambda.1se
cat("=== Ridge Lambda Selection ===\n")
## === Ridge Lambda Selection ===
cat(sprintf("Lambda Min: %.6f\n", ridge_min))
## Lambda Min: 0.036774
cat(sprintf("Lambda 1SE: %.6f\n", ridge_1se))
## Lambda 1SE: 0.064264
# Visualizations
par(mfrow=c(2,2))
# 1. Regularization path
plot(glmnet(x_train, y_train, family="binomial", alpha=0)
, xvar="lambda"
, label=TRUE
, lwd=2
)
title("Ridge Regularization Path (L2)")
abline(v=log(ridge_min), col="red", lty=2, lwd=2)
# 2. Cross-validation curve
plot(ridge_cv, main="Ridge 10-Fold CV")
abline(v=log(ridge_min), col="red", lty=2)
abline(v=log(ridge_1se), col="blue", lty=2)
# 3. Coefficient comparison across the two lambdas
# as.matrix() keeps the row names so the barplots below are labelled
ridge_coef_min <- as.matrix(coef(ridge_cv, s="lambda.min"))[-1, , drop=FALSE]
ridge_coef_1se <- as.matrix(coef(ridge_cv, s="lambda.1se"))[-1, , drop=FALSE]
barplot(as.numeric(ridge_coef_min)
, names.arg=rownames(ridge_coef_min)
, las=2
, col="#4ECDC4"
, main="Ridge Coefficients (lambda.min)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
barplot(as.numeric(ridge_coef_1se)
, names.arg=rownames(ridge_coef_1se)
, las=2
, col="#FF6B6B"
, main="Ridge Coefficients (lambda.1se)"
, ylab="Coefficient Value")
abline(h=0, lty=2)
# Prediction
ridge_pred <- predict(ridge_cv, newx = x_test, s=ridge_min, type="response")
ridge_yhat <- ifelse(ridge_pred >= 0.5, 1, 0)
# Confusion Matrix
ridge_cm <- table(Predicted = ridge_yhat, Actual = y_test)
cat("\n=== Ridge Performance (Test Set) ===\n")
##
## === Ridge Performance (Test Set) ===
print(ridge_cm)
## Actual
## Predicted 0 1
## 0 131 17
## 1 24 121
# Metrics
ridge_TN <- ridge_cm["0", "0"]
ridge_FN <- ridge_cm["0", "1"]
ridge_FP <- ridge_cm["1", "0"]
ridge_TP <- ridge_cm["1", "1"]
ridge_accuracy <- (ridge_TP + ridge_TN) / sum(ridge_cm)
ridge_precision <- ridge_TP / (ridge_TP + ridge_FP)
ridge_recall <- ridge_TP / (ridge_TP + ridge_FN)
ridge_f1 <- 2 * (ridge_precision * ridge_recall) / (ridge_precision + ridge_recall)
ridge_FP_rate <- ridge_FP / (ridge_FP + ridge_TN)
ridge_FN_rate <- ridge_FN / (ridge_FN + ridge_TP)
cat(sprintf("\nAccuracy: %.4f (%.2f%%)\n", ridge_accuracy, ridge_accuracy*100))
##
## Accuracy: 0.8601 (86.01%)
cat(sprintf("Precision: %.4f\n", ridge_precision))
## Precision: 0.8345
cat(sprintf("Recall: %.4f\n", ridge_recall))
## Recall: 0.8768
cat(sprintf("F1-Score: %.4f\n", ridge_f1))
## F1-Score: 0.8551
cat(sprintf("FP Rate: %.4f\n", ridge_FP_rate))
## FP Rate: 0.1548
cat(sprintf("FN Rate: %.4f\n", ridge_FN_rate))
## FN Rate: 0.1232
Interpretation: Ridge regression improves model stability under multicollinearity and maintains strong predictive accuracy (\(86\%\)). The model exhibits balanced precision and recall, with low error rates and smoother coefficient estimates, making it more robust than the unregularized logistic regression.
B. Lasso Regression (L1 Regularization)
Lasso applies an L1 penalty, which can shrink coefficients to exactly zero, performing automatic variable selection.
# Lasso with 10-fold CV
lasso_cv <- cv.glmnet(
x_train
, y_train
, family="binomial"
, alpha=1
, nfolds=10
)
# Optimal lambdas
lasso_min <- lasso_cv$lambda.min
lasso_1se <- lasso_cv$lambda.1se
cat("=== Lasso Lambda Selection ===\n")
## === Lasso Lambda Selection ===
cat(sprintf("Lambda Min: %.6f\n", lasso_min))
## Lambda Min: 0.000453
cat(sprintf("Lambda 1SE: %.6f\n", lasso_1se))
## Lambda 1SE: 0.020560
# Visualizations
par(mfrow=c(2,2))
# 1. Regularization path
plot(glmnet(x_train
, y_train
, family="binomial"
, alpha=1)
, xvar="lambda"
, label=TRUE
, lwd=2
)
title("Lasso Regularization Path (L1)")
abline(v=log(lasso_min), col="red", lty=2, lwd=2)
# 2. Cross-validation curve
plot(lasso_cv, main="Lasso 10-Fold CV")
abline(v=log(lasso_min), col="red", lty=2)
abline(v=log(lasso_1se), col="blue", lty=2)
# 3. Coefficient comparison
lasso_coef_min <- coef(lasso_cv, s="lambda.min")[-1]
lasso_coef_1se <- coef(lasso_cv, s="lambda.1se")[-1]
barplot(as.numeric(lasso_coef_min)
, names.arg=rownames(lasso_coef_min)
, las=2
, col="#4ECDC4"
, main="Lasso Coefficients (lambda.min)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
barplot(as.numeric(lasso_coef_1se)
, names.arg=rownames(lasso_coef_1se)
, las=2
, col="#FF6B6B"
, main="Lasso Coefficients (lambda.1se)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
# Show selected variables
cat("\n=== Lasso Variable Selection ===\n")
##
## === Lasso Variable Selection ===
cat("Non-zero coefficients (lambda.min):\n")
## Non-zero coefficients (lambda.min):
print(lasso_coef_min[lasso_coef_min != 0])
## [1] 1.822798e-03 8.634209e-06 -1.982212e-07 -2.683754e-01 6.908147e-01
## [6] -2.060080e-02 -2.915653e+00 -4.609036e-01 8.143636e-03
cat("\nNon-zero coefficients (lambda.1se):\n")
##
## Non-zero coefficients (lambda.1se):
print(lasso_coef_1se[lasso_coef_1se != 0])
## [1] 1.610895e-06 5.004947e-01 -7.424221e-03 -9.158956e-01 -2.485956e-01
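Subsetting the sparse coefficient matrix with [-1] drops the variable names, which is why the printouts above are unnamed. A small workaround sketch, assuming the lasso_cv object fitted above:
# Keep variable names: convert the sparse matrix before subsetting
lasso_coef_named <- as.matrix(coef(lasso_cv, s="lambda.min"))
lasso_coef_named[lasso_coef_named[, 1] != 0, , drop=FALSE]  # named non-zero rows (intercept included)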
# Predictions and evaluation
lasso_pred <- predict(lasso_cv, newx = x_test, s=lasso_min, type="response")
lasso_yhat <- ifelse(lasso_pred >= 0.5, 1, 0)
# Confusion Matrix
lasso_cm <- table(Predicted = lasso_yhat, Actual = y_test)
cat("\n=== Lasso Performance (Test Set) ===\n")
##
## === Lasso Performance (Test Set) ===
print(lasso_cm)
## Actual
## Predicted 0 1
## 0 131 15
## 1 24 123
# Metrics
lasso_TN <- lasso_cm["0", "0"]
lasso_FN <- lasso_cm["0", "1"]
lasso_FP <- lasso_cm["1", "0"]
lasso_TP <- lasso_cm["1", "1"]
lasso_accuracy <- (lasso_TP + lasso_TN) / sum(lasso_cm)
lasso_precision <- lasso_TP / (lasso_TP + lasso_FP)
lasso_recall <- lasso_TP / (lasso_TP + lasso_FN)
lasso_f1 <- 2 * (lasso_precision * lasso_recall) / (lasso_precision + lasso_recall)
lasso_FP_rate <- lasso_FP / (lasso_FP + lasso_TN)
lasso_FN_rate <- lasso_FN / (lasso_FN + lasso_TP)
cat(sprintf("\nAccuracy: %.4f (%.2f%%)\n", lasso_accuracy, lasso_accuracy*100))
##
## Accuracy: 0.8669 (86.69%)
cat(sprintf("Precision: %.4f\n", lasso_precision))
## Precision: 0.8367
cat(sprintf("Recall: %.4f\n", lasso_recall))
## Recall: 0.8913
cat(sprintf("F1-Score: %.4f\n", lasso_f1))
## F1-Score: 0.8632
cat(sprintf("FP Rate: %.4f\n", lasso_FP_rate))
## FP Rate: 0.1548
cat(sprintf("FN Rate: %.4f\n", lasso_FN_rate))
## FN Rate: 0.1087
Interpretation: Lasso performs both regularization and variable selection. It identifies Holidays, DOW, Covid, TOY, and SSRD as the most important predictors while eliminating noisy or collinear variables. On the test set, Lasso achieves the best performance among all models, with high accuracy, strong recall, and the highest F1-score.
Final Model Comparison
We now compare all models on the test set to determine the best approach.
# Evaluate stepwise model on test set
step_pred_test <- predict(res_step, newdata=test_data, type="response")
step_yhat_test <- ifelse(step_pred_test > 0.5, 1, 0)
step_cm_test <- table(Predicted = step_yhat_test, Actual = y_test)
step_TP <- step_cm_test[2,2]
step_TN <- step_cm_test[1,1]
step_FP <- step_cm_test[2,1]
step_FN <- step_cm_test[1,2]
step_accuracy <- (step_TP + step_TN) / sum(step_cm_test)
step_precision <- step_TP / (step_TP + step_FP)
step_recall <- step_TP / (step_TP + step_FN)
step_f1 <- 2 * (step_precision * step_recall) / (step_precision + step_recall)
# Evaluate full logistic on test set
full_pred_test <- predict(log_model, newdata=test_data, type="response")
full_yhat_test <- ifelse(full_pred_test > 0.5, 1, 0)
full_cm_test <- table(Predicted = full_yhat_test, Actual = y_test)
full_TP <- full_cm_test[2,2]
full_TN <- full_cm_test[1,1]
full_FP <- full_cm_test[2,1]
full_FN <- full_cm_test[1,2]
full_accuracy <- (full_TP + full_TN) / sum(full_cm_test)
full_precision <- full_TP / (full_TP + full_FP)
full_recall <- full_TP / (full_TP + full_FN)
full_f1 <- 2 * (full_precision * full_recall) / (full_precision + full_recall)
# Null model baseline
null_accuracy <- max(table(y_test)) / length(y_test)
# Create comparison table
comparison_table <- data.frame(
Model = c("Null (Baseline)", "Logistic (Full)", "Stepwise", "Ridge", "Lasso")
, Accuracy = c(null_accuracy, full_accuracy, step_accuracy, ridge_accuracy, lasso_accuracy)
, Precision = c(NA, full_precision, step_precision, ridge_precision, lasso_precision)
, Recall = c(NA, full_recall, step_recall, ridge_recall, lasso_recall)
, F1_Score = c(NA, full_f1, step_f1, ridge_f1, lasso_f1)
, N_Predictors = c(0, 8, length(coef(res_step))-1, 10, sum(lasso_coef_min != 0))
)
cat("=== Final Model Comparison (Test Set) ===\n")
## === Final Model Comparison (Test Set) ===
comparison_table_print <- comparison_table
comparison_table_print[, -1] <- round(comparison_table[, -1], 4)
print(comparison_table_print)
## Model Accuracy Precision Recall F1_Score N_Predictors
## 1 Null (Baseline) 0.5290 NA NA NA 0
## 2 Logistic (Full) 0.8567 0.8333 0.8696 0.8511 8
## 3 Stepwise 0.8567 0.8333 0.8696 0.8511 7
## 4 Ridge 0.8601 0.8345 0.8768 0.8551 10
## 5 Lasso 0.8669 0.8367 0.8913 0.8632 9
# Visualizations
par(mfrow=c(2,2))
# Professional color palette for final comparison
final_colors <- c("#6c757d", "#e63946", "#1982c4", "#8ac926", "#ffca3a")
final_borders <- c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
# 1. Accuracy comparison
barplot(comparison_table$Accuracy
, names.arg=comparison_table$Model
, col=final_colors
, border=final_borders
, main="Test Set Accuracy Comparison"
, ylab="Accuracy"
, ylim=c(0, 1)
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=null_accuracy, lty=2, col="#fb8500", lwd=2.5)
# 2. F1-Score comparison
barplot(comparison_table$F1_Score[-1]
, names.arg=comparison_table$Model[-1]
, col=final_colors[-1]
, border=final_borders[-1]
, main="F1-Score Comparison"
, ylab="F1-Score"
, ylim=c(0, 1)
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5)
# 3. Precision vs Recall
plot(comparison_table$Recall[-1]
, comparison_table$Precision[-1]
, pch=21
, cex=3
, col=final_borders[-1]
, bg=final_colors[-1]
, lwd=2
, xlim=c(0.8, 0.95)
, ylim=c(0.8, 0.95)
, xlab="Recall"
, ylab="Precision"
, main="Precision-Recall Trade-off"
, col.main="#023047"
, col.lab="#023047"
)
text(comparison_table$Recall[-1], comparison_table$Precision[-1],
labels=comparison_table$Model[-1], pos=3, cex=0.75, col="#023047", font=2)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
grid(col="#e9ecef", lty=1)
# 4. Model complexity vs performance
plot(comparison_table$N_Predictors
, comparison_table$Accuracy
, pch=21
, cex=3
, col=final_borders
, bg=final_colors
, lwd=2
, xlab="Number of Predictors"
, ylab="Accuracy"
, main="Model Complexity vs Accuracy"
, xlim=c(-1, 11)
, ylim=c(0.4, 1)
, col.main="#023047"
, col.lab="#023047"
)
text(comparison_table$N_Predictors, comparison_table$Accuracy,
labels=comparison_table$Model, pos=3, cex=0.75, col="#023047", font=2)
grid(col="#e9ecef", lty=1)
# ROC curves comparison
roc_full <- roc(y_test, full_pred_test, quiet=TRUE)
roc_step <- roc(y_test, step_pred_test, quiet=TRUE)
roc_ridge <- roc(y_test, as.numeric(ridge_pred), quiet=TRUE)
roc_lasso <- roc(y_test, as.numeric(lasso_pred), quiet=TRUE)
# Professional ROC curve colors
roc_colors <- c("#e63946", "#1982c4", "#8ac926", "#ffca3a")
plot(roc_full, col=roc_colors[1], lwd=3.5,
main="ROC Curves Comparison",
col.main="#023047", col.lab="#023047")
lines(roc_step, col=roc_colors[2], lwd=3.5)
lines(roc_ridge, col=roc_colors[3], lwd=3.5)
lines(roc_lasso, col=roc_colors[4], lwd=3.5)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
legend("bottomright"
, legend=c(
sprintf("Logistic (AUC=%.3f)", auc(roc_full))
, sprintf("Stepwise (AUC=%.3f)", auc(roc_step))
, sprintf("Ridge (AUC=%.3f)", auc(roc_ridge))
, sprintf("Lasso (AUC=%.3f)", auc(roc_lasso))
)
, col=roc_colors
, lwd=3.5
, bty="n"
, cex=0.95)
Conclusions
In this study, we successfully achieved the objective of predicting whether daily electricity generation in Mexico is above or below the median using meteorological and temporal predictors. Our exploratory analysis revealed strong seasonal patterns, with electricity levels largely driven by solar radiation and temperature-related variables.
Logistic regression showed that SSRD, Holidays, DOW, Covid, and TOY are the most significant factors influencing electricity generation. Ridge regression improved model stability in the presence of multicollinearity, while Lasso offered the best trade-off between sparsity and predictive performance.
Among all evaluated models, Lasso performed best on the test set, achieving an accuracy of approximately \(86.7\%\), recall of about \(89\%\), and an F1-score of \(0.86\), along with the highest AUC (about \(0.936\)). These results indicate excellent classification ability. Its high sensitivity and moderate specificity show that the model reliably identifies high-demand days while maintaining an acceptable false-positive rate.
In conclusion, sparse logistic regression—particularly Lasso—effectively identifies the key factors influencing electricity generation and provides the most reliable, stable, and interpretable predictive model for this dataset.