Practical Session - Linear Regression Models
Introduction
This study focuses on the Mexico electricity dataset with meteorological and temporal variables. The goal is to predict whether daily electricity generation is above or below the median using logistic regression and regularized methods (Ridge, Lasso). We’ll explore multicollinearity, model selection, and performance comparison.
Data Description
The dataset contains daily electricity generation data with the following variables:
- RH: Relative humidity \((\%)\)
- SSRD: Surface solar radiation downward \((J.m^{-2})\)
- STRD: Surface thermal radiation downward \((J.m^{-2})\)
- T2M: Average daily temperature at \(2m\) \((^\circ C)\)
- T2Mmax: Maximum daily temperature at \(2m\) \((^\circ C)\)
- T2Mmin: Minimum daily temperature at \(2m\) \((^\circ C)\)
- Covid: COVID-19 stringency index
- Holidays: Holidays or public holidays, \(1\) = holidays, \(0\) otherwise
- DOW: Day of week, \(0\) for Monday, \(1\) for Tuesday, \(\cdots\)
- TOY: Day of the year (from \(1\) to \(366\))
- Total: Electricity generated daily in the territory \((GWh)\) - target variable
Data Loading and Exploration
# Packages (assumed to have been loaded in a setup chunk not shown in this extract)
library(ggplot2)     # ggplot(), geom_tile()
library(ggcorrplot)  # ggcorrplot()
library(car)         # vif()
library(pROC)        # roc(), auc()
library(caret)       # createFolds()
library(glmnet)      # glmnet(), cv.glmnet()
# Load data
data <- read.csv("Mexico_data.csv")
# Basic structure
cat("Dataset dimensions:", dim(data), "\n")
## Dataset dimensions: 1461 12
cat("Number of observations:", nrow(data), "\n")
## Number of observations: 1461
cat("Number of variables:", ncol(data), "\n\n")
## Number of variables: 12
# Summary
summary(data)
## X0 RH SSRD STRD
## Length:1461 Min. :28.37 Min. : 407045 Min. :1019376
## Class :character 1st Qu.:50.12 1st Qu.: 719323 1st Qu.:1156350
## Mode :character Median :57.66 Median : 870071 Median :1235306
## Mean :57.55 Mean : 869055 Mean :1244835
## 3rd Qu.:64.55 3rd Qu.:1022083 3rd Qu.:1353442
## Max. :80.54 Max. :1217895 Max. :1431876
## T2M T2Mmax T2Mmin Covid
## Min. :11.23 Min. :17.82 Min. : 5.536 Min. : 0.00
## 1st Qu.:18.09 1st Qu.:25.27 1st Qu.:12.064 1st Qu.: 0.00
## Median :22.32 Median :28.72 Median :16.220 Median :33.33
## Mean :21.27 Mean :27.93 Mean :15.623 Mean :33.00
## 3rd Qu.:24.63 3rd Qu.:30.83 3rd Qu.:19.747 3rd Qu.:63.89
## Max. :27.39 Max. :33.83 Max. :21.841 Max. :82.41
## Holidays DOW TOY Total
## Min. :0.00000 Min. :0 Min. : 1.0 Min. : 578.5
## 1st Qu.:0.00000 1st Qu.:1 1st Qu.: 92.0 1st Qu.: 816.0
## Median :0.00000 Median :3 Median :183.0 Median : 871.4
## Mean :0.03012 Mean :3 Mean :183.1 Mean : 880.7
## 3rd Qu.:0.00000 3rd Qu.:5 3rd Qu.:274.0 3rd Qu.: 958.1
## Max. :1.00000 Max. :6 Max. :366.0 Max. :1107.4
# Check missing values
cat("\nMissing values per column:\n")
##
## Missing values per column:
print(colSums(is.na(data)))
## X0 RH SSRD STRD T2M T2Mmax T2Mmin Covid
## 0 0 0 0 0 0 0 0
## Holidays DOW TOY Total
## 0 0 0 0
# Data types
cat("\nNumerical columns:", names(data)[sapply(data, is.numeric)], "\n")
##
## Numerical columns: RH SSRD STRD T2M T2Mmax T2Mmin Covid Holidays DOW TOY Total
cat("Categorical columns:", names(data)[sapply(data, is.character)], "\n")
## Categorical columns: X0
Creating Binary Target Variable
We create a binary variable Y indicating whether electricity generation is above (1) or below (0) the median.
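Formally, with \(m\) the sample median of Total,
\[
Y_i = \mathbf{1}\{\mathrm{Total}_i > m\} =
\begin{cases} 1 & \text{if } \mathrm{Total}_i > m \\ 0 & \text{otherwise.} \end{cases}
\]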
# Create binary target
median_energy <- median(data$Total)
data$Y <- ifelse(data$Total > median_energy, 1, 0)
cat("Median electricity generation:", round(median_energy, 2), "GWh\n")
## Median electricity generation: 871.43 GWh
cat("Class distribution:\n")
## Class distribution:
print(table(data$Y))
##
## 0 1
## 731 730
cat("Proportion: ", prop.table(table(data$Y)), "\n")
## Proportion: 0.5003422 0.4996578
Exploratory Data Analysis
Distribution Analysis
par(mfrow=c(2,2), bg="white")
# Class colors used throughout (assumed palette; their definitions were missing from this extract)
col_low  <- "#e63946"  # below-median days
col_high <- "#06d6a0"  # above-median days
# Histogram with median line
hist(data$Total
, breaks=30
, col="#8ecae6"
, main="Distribution of Total Electricity",
, xlab="Total Electricity (GWh)",
, border="white"
, col.main="#023047"
, col.lab="#023047"
)
abline(v=median_energy, col="#fb8500", lwd=3, lty=2)
legend("topright", legend="Median", col="#fb8500", lty=2, lwd=3, bty="n")
# Boxplot by class
boxplot(Total ~ Y
, data=data
, col=c(col_low, col_high)
, names=c("Below Median", "Above Median")
, main="Electricity by Class"
, ylab="Total Electricity (GWh)"
, border=c("#b91c1c", "#059669")
, col.main="#023047", col.lab="#023047", lwd=1.5)
# Density plot
plot(density(data$Total)
, main="Density Plot of Electricity"
, xlab="Total Electricity (GWh)", lwd=3, col="#1982c4"
, col.main="#023047"
, col.lab="#023047"
)
polygon(density(data$Total), col="#8ecae680", border="#1982c4", lwd=2)
abline(v=median_energy, col="#fb8500", lwd=3, lty=2)
# QQ plot to check normality
qqnorm(data$Total
, main="Q-Q Plot"
, pch=19
, col="#8ecae6"
, col.main="#023047"
, col.lab="#023047"
)
qqline(data$Total, col="#fb8500", lwd=3)
The distribution shows reasonable separation between low and high electricity classes, and the data are fairly balanced around the median. Electricity generation shows a slightly skewed, mildly bimodal distribution with distinct differences between below-median and above-median days. The Q-Q plot indicates non-normality driven by tail behavior, which aligns with known seasonal fluctuations in electricity demand.
Temporal Patterns
Next, we examine the temporal patterns to see how the data behave.
par(mfrow=c(1,2))
# Electricity by DOW
dow_colors <- colorRampPalette(c("#1982c4", "#8ac926", "#ffca3a"))(7)
boxplot(Total ~ DOW
, data=data
, col=dow_colors
, border="#023047"
, main="Electricity by Day of Week"
, xlab="Day (0=Mon, 6=Sun)"
, ylab="Total (GWh)"
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
# Electricity over a year
plot(data$TOY
, data$Total
, col=ifelse(data$Y==1, col_high, col_low)
, pch=16
, cex=0.8
, main="Electricity Generation Throughout Year"
, xlab="Day of Year"
, ylab="Total (GWh)"
, col.main="#023047"
, col.lab="#023047"
)
abline(h=median_energy, col="#023047", lwd=2.5, lty=2)
legend("topright", legend=c("Above Median", "Below Median"),
col=c(col_high, col_low), pch=16, bty="n", cex=0.9)
The plots show clear seasonality. The left plot shows that electricity demand is significantly higher during the work week and drops at the weekend, especially on Sunday, consistent with industrial and commercial activity being high on workdays and lower at the weekend. The right plot shows that generation is low early in the year (roughly days 0-100) and again after day 250, peaking roughly over days 200-250, which reflects the summer demand spike.
Correlation Analysis
num_data <- data[, sapply(data, is.numeric)]
corr_matrix <- cor(num_data, use="complete.obs")
# Correlation
ggcorrplot(
corr_matrix
, hc.order = TRUE
, lab = TRUE
, lab_size = 2.5
, method = "circle"
, type = "lower"
, colors = c("#4361ee", "#f8f9fa", "#e63946")
, title = "Correlation Matrix - Mexico Electricity Dataset"
, ggtheme = theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", color = "#023047"),
axis.text = element_text(color = "#023047"))
)
From these correlations we can state:
| Variable | \(Corr(Total, X_{ij})\) | Interpretation |
|---|---|---|
| STRD | 0.72 | On clear days the conditions are optimal and more electricity is produced, driving up the total. |
| SSRD | 0.53 | More direct sunlight improves photovoltaic (solar) generation. |
| T2M | 0.76 | Strong positive correlation: hotter days correspond to more energy generated. |
| T2Mmax | 0.71 | Days with a hotter peak temperature generate more electricity. |
| T2Mmin | 0.77 | Days with warmer nights (higher minimum temperature) also tend to generate more electricity. |
| RH | 0.27 | The weak correlation suggests that humidity has little effect on total generation. |
| COVID-19 | 0.01 | Essentially no linear relationship with generation; this variable is a candidate for removal from the model. |
| Holidays | -0.23 | Generation is lower on holidays than on regular days, plausibly reflecting reduced industrial activity. |
| DOW | -0.27 | The negative correlation suggests that generation declines toward the weekend relative to weekdays. |
Checking Variance Inflation Factor (VIF)
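For reference, the VIF of predictor \(X_j\) is computed from the \(R^2\) of the regression of \(X_j\) on all the other predictors:
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]
so a VIF of 10 corresponds to \(R_j^2 = 0.9\), i.e. 90% of the variance of \(X_j\) is explained by the other predictors.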
# Full model with all predictors
full_model_vif <- glm(
Y ~ RH + SSRD + STRD + T2M + T2Mmax + T2Mmin + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Calculate VIF
vif_values <- vif(full_model_vif)
cat("VIF Values (Full Model):\n")
## VIF Values (Full Model):
print(vif_values)
## RH SSRD STRD T2M T2Mmax T2Mmin Covid
## 18.088931 11.192520 45.455237 301.728881 78.222212 146.634867 1.347205
## Holidays DOW TOY
## 1.021233 1.281240 3.564355
# VIF > 10 indicates problematic multicollinearity
cat("\n\nVariables with VIF > 10:\n")
##
##
## Variables with VIF > 10:
print(vif_values[vif_values > 10])
## RH SSRD STRD T2M T2Mmax T2Mmin
## 18.08893 11.19252 45.45524 301.72888 78.22221 146.63487
In the full model, several predictors show severe multicollinearity, especially the temperature variables (T2M, T2Mmax, T2Mmin), with VIFs ranging from about 78 to over 300. STRD and SSRD also have VIF > 10, indicating strong correlation with each other and with temperature.
From this we conclude:
- Temperature variables (T2M, T2Mmax, T2Mmin) are highly correlated (> 0.95), indicating multicollinearity
- SSRD and STRD show strong positive correlation with Total electricity
- T2M and temperature-related variables have strong relationship with electricity generation
- To avoid multicollinearity issues, we’ll remove T2Mmax and T2Mmin from the model, keeping only T2M
Logistic Regression Models
A. Full Logistic Regression Model
We fit a logistic regression model excluding the highly correlated variables T2Mmax and T2Mmin.
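The model assumes the log-odds of an above-median day are linear in the predictors:
\[
\log\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \beta_0 + \beta_1\,\mathrm{RH} + \beta_2\,\mathrm{SSRD} + \cdots + \beta_8\,\mathrm{TOY}.
\]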
# Fit logistic regression
log_model <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Model summary
summary(log_model)
##
## Call:
## glm(formula = Y ~ RH + SSRD + STRD + T2M + Covid + Holidays +
## DOW + TOY, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.460e+01 3.398e+00 -7.239 4.53e-13 ***
## RH 5.806e-02 3.367e-02 1.724 0.084677 .
## SSRD 8.866e-06 1.714e-06 5.171 2.32e-07 ***
## STRD 6.014e-06 5.864e-06 1.026 0.305083
## T2M 3.005e-01 1.681e-01 1.787 0.073873 .
## Covid -1.898e-02 3.514e-03 -5.399 6.69e-08 ***
## Holidays -3.027e+00 7.861e-01 -3.850 0.000118 ***
## DOW -4.370e-01 4.868e-02 -8.977 < 2e-16 ***
## TOY 7.185e-03 1.753e-03 4.099 4.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2025.38 on 1460 degrees of freedom
## Residual deviance: 866.97 on 1452 degrees of freedom
## AIC: 884.97
##
## Number of Fisher Scoring iterations: 6
# Extract coefficients
log_coef <- summary(log_model)$coefficients
Significance Analysis
# Predictors (p < 0.05)
log_significant <- log_coef[log_coef[, 4] < 0.05, ]
cat("Significant predictors (p < 0.05):\n")
## Significant predictors (p < 0.05):
print(log_significant)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.459978e+01 3.398420e+00 -7.238595 4.533573e-13
## SSRD 8.865513e-06 1.714321e-06 5.171443 2.322931e-07
## Covid -1.897541e-02 3.514402e-03 -5.399330 6.689034e-08
## Holidays -3.026540e+00 7.860669e-01 -3.850232 1.180058e-04
## DOW -4.370404e-01 4.868228e-02 -8.977402 2.772346e-19
## TOY 7.185129e-03 1.752790e-03 4.099252 4.144880e-05
# p-values
pvals <- log_coef[-1, 4] # Exclude intercept
names_vars <- rownames(log_coef)[-1]
par(mar=c(8, 4, 4, 2))
barplot(sort(pvals)
, names.arg = names_vars[order(pvals)]
, las=2
, col = ifelse(sort(pvals) < 0.05, "#06d6a0", "#e63946")
, border = ifelse(sort(pvals) < 0.05, "#059669", "#b91c1c")
, main = "P-values of Predictors"
, ylab = "P-value"
, ylim = c(0, max(pvals)*1.1)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=0.05, col="#fb8500", lwd=3, lty=2)
legend("topleft", legend=c("Significant (p<0.05)", "Not Significant"),
fill=c("#06d6a0", "#e63946"), bty="n", border=c("#059669", "#b91c1c"))
The significant predictors \((p < 0.05)\) are SSRD, Covid, Holidays, DOW, and TOY. The variables STRD and T2M are not significant in this model, indicating that once multicollinearity is removed, they do not meaningfully contribute to predicting whether electricity generation is above or below the median.
B. Model Predictions
# Compare prediction types
cat("Prediction type='link' (log-odds scale):\n")
## Prediction type='link' (log-odds scale):
head(predict(log_model, type="link"))
## 1 2 3 4 5 6
## -9.255574 -6.000108 -6.137141 -6.586506 -6.902523 -6.304666
cat("\n\nPrediction type='response' (probability scale):\n")
##
##
## Prediction type='response' (probability scale):
head(predict(log_model, type="response"))
## 1 2 3 4 5 6
## 9.556825e-05 2.472356e-03 2.156434e-03 1.376950e-03 1.004237e-03 1.824421e-03
Note: type="link" returns log-odds
\(\in (-\infty, +\infty)\), while
type="response" returns probabilities \(\in [0, 1]\).
C. Odds Ratios Analysis
# Calculate odds ratios
log_odd_ratio <- exp(coef(log_model))
# 95% CI
CI <- exp(confint(log_model))
# Create odds-ratio table
log_odd_ratio_table <- data.frame(
Odds_Ratio = log_odd_ratio,
CI_Lower = CI[, 1],
CI_Upper = CI[, 2]
)
cat("Odds Ratios with 95% Confidence Intervals:\n")
## Odds Ratios with 95% Confidence Intervals:
print(round(log_odd_ratio_table, 4))
## Odds_Ratio CI_Lower CI_Upper
## (Intercept) 0.0000 0.0000 0.0000
## RH 1.0598 0.9924 1.1326
## SSRD 1.0000 1.0000 1.0000
## STRD 1.0000 1.0000 1.0000
## T2M 1.3505 0.9753 1.8859
## Covid 0.9812 0.9744 0.9879
## Holidays 0.0485 0.0086 0.1985
## DOW 0.6459 0.5859 0.7092
## TOY 1.0072 1.0038 1.0108
# Visualize odds ratios
or_plot <- log_odd_ratio_table[-1, ]
or_names <- rownames(or_plot)
par(mar=c(5, 8, 4, 2))
or_colors <- ifelse(or_plot$Odds_Ratio > 1, "#1982c4", "#e63946")
plot(or_plot$Odds_Ratio
, 1:nrow(or_plot)
, xlim=c(0, max(or_plot$CI_Upper)*1.1)
, yaxt="n"
, ylab=""
, xlab="Odds Ratio"
, main="Odds Ratios with 95% CI"
, pch=19
, col=or_colors
, cex=2
, col.main="#023047"
, col.lab="#023047"
)
axis(2, at=1:nrow(or_plot), labels=or_names, las=2, col.axis="#023047")
segments(or_plot$CI_Lower
, 1:nrow(or_plot)
, or_plot$CI_Upper
, 1:nrow(or_plot)
, col=or_colors
, lwd=3
)
abline(v=1, col="#023047", lty=2, lwd=2.5)
legend("topright", legend=c("OR > 1 (Positive)", "OR < 1 (Negative)"),
col=c("#1982c4", "#e63946"), pch=19, bty="n", cex=0.9)
par(mar=c(5, 4, 4, 2))
Interpretation:
- Positive effects (OR > 1): SSRD and TOY increase the likelihood of high electricity generation.
- Negative effects (OR < 1): Covid, Holidays, and DOW significantly reduce the odds of above-median generation.
- Not significant: RH, STRD, and T2M show no meaningful effect (their confidence intervals include 1).
- Most influential: Holidays, DOW, Covid, TOY, and SSRD.
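As a worked example using the table above, the DOW odds ratio follows directly from exponentiating its coefficient:
\[
\mathrm{OR}_{\mathrm{DOW}} = e^{\hat\beta_{\mathrm{DOW}}} = e^{-0.437} \approx 0.646,
\]
so each step of DOW toward the weekend multiplies the odds of an above-median day by about 0.65 (roughly a 35% reduction in odds), holding the other predictors fixed.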
D. Model Performance - Confusion Matrix
# Predict
log_p_hat <- predict(log_model, type="response")
log_y_hat <- ifelse(log_p_hat > 0.5, 1, 0)
# Confusion matrix
log_conf_mat <- table(Predicted = log_y_hat, Actual = data$Y)
cat("Confusion Matrix:\n")
## Confusion Matrix:
print(log_conf_mat)
## Actual
## Predicted 0 1
## 0 624 81
## 1 107 649
# Extract TP, FN, FP, TN
TN <- log_conf_mat["0","0"]
FN <- log_conf_mat["0","1"]
FP <- log_conf_mat["1","0"]
TP <- log_conf_mat["1","1"]
# Calculate metrics
accuracy <- (TP + TN) / sum(log_conf_mat)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1_score <- 2 * (precision * recall) / (precision + recall)
FP_rate <- FP / (FP + TN)
FN_rate <- FN / (FN + TP)
# Display metrics
cat("\n\n=== Performance Metrics ===\n")
##
##
## === Performance Metrics ===
cat(sprintf("Accuracy: (%.2f%%)\n", accuracy, accuracy*100))
## Accuracy: (0.87%)
cat(sprintf("Precision: \n", precision))
## Precision:
cat(sprintf("Recall: \n", recall))
## Recall:
cat(sprintf("F1-Score: \n", f1_score))
## F1-Score:
cat(sprintf("FP Rate: (%.2f%%)\n", FP_rate, FP_rate*100))
## FP Rate: (0.15%)
cat(sprintf("FN Rate: (%.2f%%)\n", FN_rate, FN_rate*100))
## FN Rate: (0.11%)
# Visualize
par(mfrow=c(2,2))
# 1. Confusion matrix
log_conf_mat_df <- as.data.frame(log_conf_mat)
colnames(log_conf_mat_df) <- c("Predicted", "Actual", "Freq")
ggplot(log_conf_mat_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", size=1.5) +
geom_text(aes(label = Freq), size = 10, fontface="bold", color = "#023047") +
scale_fill_gradient(low = "#e0f2fe", high = "#0284c7") +
labs(title = "Confusion Matrix - Logistic Regression",
x = "Actual Class", y = "Predicted Class") +
theme_minimal(base_size = 14) +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, face = "bold", color = "#023047"),
axis.text = element_text(color = "#023047"))
The confusion matrix shows a high number of correct predictions in both classes; the false positive rate (\(\approx 15\%\)) and false negative rate (\(\approx 11\%\)) are both low and acceptable.
# 2. ROC Curve
roc_obj <- roc(data$Y, log_p_hat)
plot(roc_obj
, main=paste("ROC Curve (AUC =", round(auc(roc_obj), 3), ")")
, col="#1982c4"
, lwd=4
, col.main="#023047"
, col.lab="#023047"
)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
# 3. Predicted probabilities distribution
plot(density(log_p_hat[data$Y==0])
, col=col_low
, lwd=3
, main="Predicted Probability Distribution"
, xlab="Predicted Probability", xlim=c(0,1)
, col.main="#023047"
, col.lab="#023047"
)
lines(density(log_p_hat[data$Y==1])
, col=col_high
, lwd=3)
abline(v=0.5, lty=2, col="#023047", lwd=2.5)
legend("topright"
, legend=c("Class 0", "Class 1", "Threshold")
, col=c(col_low, col_high, "#023047")
, lty=c(1,1,2)
, lwd=3
, bty="n"
)
# 4. Metrics bar plot
metrics_vals <- c(accuracy, precision, recall, f1_score)
metrics_names <- c("Accuracy", "Precision", "Recall", "F1-Score")
barplot(metrics_vals
, names.arg=metrics_names
, col=c("#1982c4", "#8ac926", "#ffca3a", "#ff595e")
, border=c("#0d5c8e", "#6b9e1f", "#c79a2c", "#cc444c")
, main="Performance Metrics"
, ylim=c(0,1)
, las=2
, ylab="Score"
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=0.5, lty=2, col="#6c757d", lwd=2)
Interpretation:
ROC Curve: The AUC of \(0.948\), with the curve hugging the top-left corner, indicates excellent discrimination: the logistic model can reliably separate high from low energy-generation days.
Predicted Probability Distribution: Class 0 has predicted probabilities concentrated near \(0.0-0.3\), whereas Class 1 peaks sharply near \(0.8-1.0\). The two distributions are well separated, meaning the model makes high-confidence predictions.
Performance Metrics: Accuracy, Precision, Recall, and F1-Score all lie around \(0.86-0.89\), indicating stable, strong performance across all aspects.
E. K-Fold Cross-Validation
We evaluate model stability using K-fold cross-validation with K=5 and K=10.
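The folds are implemented manually below for transparency. As an alternative sketch (shown commented out; the argument names are caret's, and the exact metrics reported may differ), caret can run a comparable cross-validation directly via train():
# Alternative (not run): caret's built-in cross-validation
# train(factor(Y) ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY,
#       data = data, method = "glm", family = binomial,
#       trControl = trainControl(method = "cv", number = 10))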
set.seed(42)
# K-fold CV function
kfold_logistic_cv <- function(data, K) {
folds <- createFolds(data$Y, k=K, list=TRUE, returnTrain = FALSE)
results <- data.frame(
Fold = integer()
, Accuracy = numeric()
, Precision = numeric()
, Recall = numeric()
, F1 = numeric()
)
for (i in 1:K) {
test_idx <- folds[[i]]
train_idx <- setdiff(seq_len(nrow(data)), test_idx)
train_data <- data[train_idx, ]
test_data <- data[test_idx, ]
model <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = train_data
, family = binomial
)
p_hat <- predict(model, newdata=test_data, type="response")
y_hat <- ifelse(p_hat > 0.5, 1, 0)
cm <- table(Pred = y_hat, True = test_data$Y)
TP <- cm[2,2]
TN <- cm[1,1]
FP <- cm[2,1]
FN <- cm[1,2]
acc <- (TP + TN) / sum(cm)
prec <- TP / (TP + FP)
rec <- TP / (TP + FN)
f1 <- 2 * (prec * rec) / (prec + rec)
results <- rbind(results, data.frame(
Fold = i,
Accuracy = acc,
Precision = prec,
Recall = rec,
F1 = f1
))
}
return(results)
}
# Run for K = 5 and K = 10
results_k5 <- kfold_logistic_cv(data, 5)
results_k10 <- kfold_logistic_cv(data, 10)
# Summary
cat("=== K=5 Cross-Validation Results ===\n")
## === K=5 Cross-Validation Results ===
cat(sprintf("Mean Accuracy: %.4f ± %.4f\n", mean(results_k5$Accuracy), sd(results_k5$Accuracy)))
## Mean Accuracy: 0.8679 ± 0.0135
cat(sprintf("Mean F1-Score: %.4f ± %.4f\n", mean(results_k5$F1), sd(results_k5$F1)))
## Mean F1-Score: 0.8699 ± 0.0143
cat("\n=== K=10 Cross-Validation Results ===\n")
##
## === K=10 Cross-Validation Results ===
cat(sprintf("Mean Accuracy: %.4f ± %.4f\n", mean(results_k10$Accuracy), sd(results_k10$Accuracy)))
## Mean Accuracy: 0.8679 ± 0.0297
cat(sprintf("Mean F1-Score: %.4f ± %.4f\n", mean(results_k10$F1), sd(results_k10$F1)))
## Mean F1-Score: 0.8702 ± 0.0300
# Visualization
par(mfrow=c(1,2))
# Boxplot
cv_results <- data.frame(
Accuracy = c(results_k5$Accuracy, results_k10$Accuracy)
, K = factor(c(rep("K=5", 5), rep("K=10", 10)), levels=c("K=5", "K=10"))  # keep K=5 first so labels and mean points match
)
boxplot(Accuracy ~ K
, data = cv_results
, col=c("#ffca3a", "#1982c4")
, border=c("#c79a2c", "#0d5c8e")
, main = "K-fold Cross-Validation Accuracy"
, ylab = "Accuracy"
, ylim=c(0.8, 0.95)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
points(1:2
, c(mean(results_k5$Accuracy), mean(results_k10$Accuracy))
, pch=18
, cex=2.5
, col="#e63946")
# Multiple metrics comparison for K=10
metric_colors <- c("#1982c4", "#8ac926", "#ffca3a", "#ff595e")
boxplot(results_k10[,c("Accuracy", "Precision", "Recall", "F1")]
, col=metric_colors
, border=c("#0d5c8e", "#6b9e1f", "#c79a2c", "#cc444c")
, main="K=10 CV - All Metrics"
, ylab="Score"
, las=2
, ylim=c(0.8, 0.95)
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
par(mfrow=c(1,1))
Both \(K=5\) and \(K=10\) cross-validation give similar accuracy (around \(0.868\)) and F1-score (around \(0.87\)), confirming that the model performs consistently. Although \(K=5\) shows lower variance, and therefore more stable estimates, \(K=10\) has higher variability without improving average performance.
Model Selection Methods
We compare automatic variable selection methods: Forward, Backward, and Stepwise selection using the AIC criterion.
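For a model with \(k\) estimated parameters and maximized likelihood \(\hat{L}\), the criterion is
\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\]
so a lower AIC rewards goodness of fit while penalizing model complexity.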
# Full model
resall <- glm(
Y ~ RH + SSRD + STRD + T2M + Covid + Holidays + DOW + TOY
, data = data
, family = binomial
)
# Null model (intercept only)
res_0 <- glm(
Y ~ 1
, data = data
, family = binomial
)
# Forward selection
res_forward <- step(
res_0
, scope = list(upper = resall)
, direction = "forward"
, trace = 0
)
# Backward selection
res_backward <- step(
resall
, direction = "backward"
, trace = 0
)
# Stepwise selection
res_step <- step(
resall
, direction = "both"
, trace = 0
)
# Compare selected models
cat("=== Forward Selection ===\n")
## === Forward Selection ===
cat("Formula:", deparse(formula(res_forward)), "\n")
## Formula: Y ~ T2M + DOW + Covid + Holidays + RH + SSRD + TOY
cat("AIC:", AIC(res_forward), "\n\n")
## AIC: 884.0148
cat("=== Backward Selection ===\n")
## === Backward Selection ===
cat("Formula:", deparse(formula(res_backward)), "\n")
## Formula: Y ~ RH + SSRD + T2M + Covid + Holidays + DOW + TOY
cat("AIC:", AIC(res_backward), "\n\n")
## AIC: 884.0148
cat("=== Stepwise Selection ===\n")
## === Stepwise Selection ===
cat("Formula:", deparse(formula(res_step)), "\n")
## Formula: Y ~ RH + SSRD + T2M + Covid + Holidays + DOW + TOY
cat("AIC:", AIC(res_step), "\n\n")
## AIC: 884.0148
# Compare AICs
models_aic <- data.frame(
Model = c("Null", "Full", "Forward", "Backward", "Stepwise")
, AIC = c(AIC(res_0), AIC(resall), AIC(res_forward), AIC(res_backward), AIC(res_step))
, N_Predictors = c(0, 8, length(coef(res_forward))-1,
length(coef(res_backward))-1, length(coef(res_step))-1)
)
cat("=== Model Comparison ===\n")
## === Model Comparison ===
print(models_aic)
## Model AIC N_Predictors
## 1 Null 2027.3754 0
## 2 Full 884.9660 8
## 3 Forward 884.0148 7
## 4 Backward 884.0148 7
## 5 Stepwise 884.0148 7
# Visualize AIC comparison
par(mfrow=c(1,2))
model_colors <- c("#6c757d", "#e63946", "#1982c4", "#8ac926", "#ffca3a")
barplot(models_aic$AIC
, names.arg=models_aic$Model
, col=model_colors
, border=c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
, main="AIC Comparison"
, ylab="AIC (lower is better)"
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
barplot(models_aic$N_Predictors
, names.arg=models_aic$Model
, col=model_colors
, border=c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
, main="Number of Predictores"
, ylab="Count"
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
Interpretation: Forward, backward, and stepwise selection all converge to the same 7-predictor model, achieving the lowest AIC (\(884\)), slightly better than the full model's AIC (\(885\)). This shows that removing STRD improves model parsimony without harming fit. The null model performs poorly (AIC \(= 2027\)), underscoring the necessity of the predictors.
Regularized Logistic Regression
We now apply L1 (Lasso) and L2 (Ridge) regularization to handle multicollinearity and prevent overfitting. Data is split 80/20 for training and testing.
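Both penalties are special cases of the elastic-net objective minimized by glmnet, where \(\alpha = 0\) gives Ridge and \(\alpha = 1\) gives Lasso:
\[
\min_{\beta_0,\,\beta} \; -\frac{1}{n}\,\ell(\beta_0, \beta) + \lambda \left[ \frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha \lVert\beta\rVert_1 \right].
\]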
Data Splitting
set.seed(42)
# Split 80/20 train/test
n <- nrow(data)
shuffled_idx <- sample(1:n)
train_size <- floor(0.8 * n)
train_idx <- shuffled_idx[1:train_size]
test_idx <- shuffled_idx[(train_size + 1):n]
train_data <- data[train_idx, ]
test_data <- data[test_idx, ]
# Prepare matrices
predictors <- c("RH","SSRD","STRD","T2M","T2Mmax","T2Mmin",
"Covid","Holidays","DOW","TOY")
x_train <- as.matrix(train_data[, predictors])
y_train <- as.numeric(train_data$Y)
x_test <- as.matrix(test_data[, predictors])
y_test <- as.numeric(test_data$Y)
# Summary
cat("=== Data Split Summary ===\n")
## === Data Split Summary ===
cat("Training size:", nrow(train_data), "\n")
## Training size: 1168
cat("Testing size :", nrow(test_data), "\n\n")
## Testing size : 293
cat("Training class distribution:\n")
## Training class distribution:
print(table(y_train))
## y_train
## 0 1
## 576 592
cat("\nTesting class distribution:\n")
##
## Testing class distribution:
print(table(y_test))
## y_test
## 0 1
## 155 138
A. Ridge Regression (L2 Regularization)
Ridge regression applies an L2 penalty, shrinking coefficients toward zero without eliminating variables.
# Ridge with 10-fold CV
ridge_cv <- cv.glmnet(
x_train
, y_train
, family="binomial"
, alpha=0
, nfolds=10
)
# Optimal lambdas: lambda.min minimizes the CV deviance; lambda.1se is the largest
# lambda within one standard error of that minimum (a more conservative, more shrunken fit)
ridge_min <- ridge_cv$lambda.min
ridge_1se <- ridge_cv$lambda.1se
cat("=== Ridge Lambda Selection ===\n")
## === Ridge Lambda Selection ===
cat(sprintf("Lambda Min: %.6f\n", ridge_min))
## Lambda Min: 0.036774
cat(sprintf("Lambda 1SE: %.6f\n", ridge_1se))
## Lambda 1SE: 0.064264
# Visualizations
par(mfrow=c(2,2))
# 1. Regularization path
plot(glmnet(x_train, y_train, family="binomial", alpha=0)
, xvar="lambda"
, label=TRUE
, lwd=2
)
title("Ridge Regularization Path (L2)")
abline(v=log(ridge_min), col="red", lty=2, lwd=2)
# 2. Cross-validation curve
plot(ridge_cv, main="Ridge 10-Fold CV")
abline(v=log(ridge_min), col="red", lty=2)
abline(v=log(ridge_1se), col="blue", lty=2)
# 3. Coefficient comparison across the two lambdas
# as.matrix() keeps the row names so the barplots below are labelled
ridge_coef_min <- as.matrix(coef(ridge_cv, s="lambda.min"))[-1, , drop=FALSE]
ridge_coef_1se <- as.matrix(coef(ridge_cv, s="lambda.1se"))[-1, , drop=FALSE]
barplot(as.numeric(ridge_coef_min)
, names.arg=rownames(ridge_coef_min)
, las=2
, col="#4ECDC4"
, main="Ridge Coefficients (lambda.min)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
barplot(as.numeric(ridge_coef_1se)
, names.arg=rownames(ridge_coef_1se)
, las=2
, col="#FF6B6B"
, main="Ridge Coefficients (lambda.1se)"
, ylab="Coefficient Value")
abline(h=0, lty=2)
# Prediction
ridge_pred <- predict(ridge_cv, newx = x_test, s=ridge_min, type="response")
ridge_yhat <- ifelse(ridge_pred >= 0.5, 1, 0)
# Confusion Matrix
ridge_cm <- table(Predicted = ridge_yhat, Actual = y_test)
cat("\n=== Ridge Performance (Test Set) ===\n")
##
## === Ridge Performance (Test Set) ===
print(ridge_cm)
## Actual
## Predicted 0 1
## 0 131 17
## 1 24 121
# Metrics
ridge_TN <- ridge_cm["0", "0"]
ridge_FN <- ridge_cm["0", "1"]
ridge_FP <- ridge_cm["1", "0"]
ridge_TP <- ridge_cm["1", "1"]
ridge_accuracy <- (ridge_TP + ridge_TN) / sum(ridge_cm)
ridge_precision <- ridge_TP / (ridge_TP + ridge_FP)
ridge_recall <- ridge_TP / (ridge_TP + ridge_FN)
ridge_f1 <- 2 * (ridge_precision * ridge_recall) / (ridge_precision + ridge_recall)
ridge_FP_rate <- ridge_FP / (ridge_FP + ridge_TN)
ridge_FN_rate <- ridge_FN / (ridge_FN + ridge_TP)
cat(sprintf("\nAccuracy: %.4f (%.2f%%)\n", ridge_accuracy, ridge_accuracy*100))
##
## Accuracy: 0.8601 (86.01%)
cat(sprintf("Precision: %.4f\n", ridge_precision))
## Precision: 0.8345
cat(sprintf("Recall: %.4f\n", ridge_recall))
## Recall: 0.8768
cat(sprintf("F1-Score: %.4f\n", ridge_f1))
## F1-Score: 0.8551
cat(sprintf("FP Rate: %.4f\n", ridge_FP_rate))
## FP Rate: 0.1548
cat(sprintf("FN Rate: %.4f\n", ridge_FN_rate))
## FN Rate: 0.1232
Interpretation: Ridge regression improves model stability under multicollinearity and maintains strong predictive accuracy (\(86\%\)). The model exhibits balanced precision and recall, with low error rates and smoother coefficient estimates, making it more robust than the unregularized logistic regression.
B. Lasso Regression (L1 Regularization)
Lasso applies an L1 penalty, which can shrink coefficients to exactly zero, performing automatic variable selection.
# Lasso with 10-fold CV
lasso_cv <- cv.glmnet(
x_train
, y_train
, family="binomial"
, alpha=1
, nfolds=10
)
# Optimal lambdas
lasso_min <- lasso_cv$lambda.min
lasso_1se <- lasso_cv$lambda.1se
cat("=== Lasso Lambda Selection ===\n")
## === Lasso Lambda Selection ===
cat(sprintf("Lambda Min: %.6f\n", lasso_min))
## Lambda Min: 0.000453
cat(sprintf("Lambda 1SE: %.6f\n", lasso_1se))
## Lambda 1SE: 0.020560
# Visualizations
par(mfrow=c(2,2))
# 1. Regularization path
plot(glmnet(x_train
, y_train
, family="binomial"
, alpha=1)
, xvar="lambda"
, label=TRUE
, lwd=2
)
title("Lasso Regularization Path (L1)")
abline(v=log(lasso_min), col="red", lty=2, lwd=2)
# 2. Cross-validation curve
plot(lasso_cv, main="Lasso 10-Fold CV")
abline(v=log(lasso_min), col="red", lty=2)
abline(v=log(lasso_1se), col="blue", lty=2)
# 3. Coefficient comparison
lasso_coef_min <- coef(lasso_cv, s="lambda.min")[-1]
lasso_coef_1se <- coef(lasso_cv, s="lambda.1se")[-1]
barplot(as.numeric(lasso_coef_min)
, names.arg=rownames(lasso_coef_min)
, las=2
, col="#4ECDC4"
, main="Lasso Coefficients (lambda.min)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
barplot(as.numeric(lasso_coef_1se)
, names.arg=rownames(lasso_coef_1se)
, las=2
, col="#FF6B6B"
, main="Lasso Coefficients (lambda.1se)"
, ylab="Coefficient Value"
)
abline(h=0, lty=2)
# Show selected variables
cat("\n=== Lasso Variable Selection ===\n")
##
## === Lasso Variable Selection ===
cat("Non-zero coefficients (lambda.min):\n")
## Non-zero coefficients (lambda.min):
print(lasso_coef_min[lasso_coef_min != 0])
## [1] 1.822798e-03 8.634209e-06 -1.982212e-07 -2.683754e-01 6.908147e-01
## [6] -2.060080e-02 -2.915653e+00 -4.609036e-01 8.143636e-03
cat("\nNon-zero coefficients (lambda.1se):\n")
##
## Non-zero coefficients (lambda.1se):
print(lasso_coef_1se[lasso_coef_1se != 0])
## [1] 1.610895e-06 5.004947e-01 -7.424221e-03 -9.158956e-01 -2.485956e-01
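Subsetting the sparse coefficient matrix with [-1] drops the variable names, which is why the printouts above are unnamed. A small workaround sketch, assuming the lasso_cv object fitted above:
# Keep variable names: convert the sparse matrix before subsetting
lasso_coef_named <- as.matrix(coef(lasso_cv, s="lambda.min"))
lasso_coef_named[lasso_coef_named[, 1] != 0, , drop=FALSE]  # named non-zero rows (intercept included)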
# Predictions and evaluation
lasso_pred <- predict(lasso_cv, newx = x_test, s=lasso_min, type="response")
lasso_yhat <- ifelse(lasso_pred >= 0.5, 1, 0)
# Confusion Matrix
lasso_cm <- table(Predicted = lasso_yhat, Actual = y_test)
cat("\n=== Lasso Performance (Test Set) ===\n")
##
## === Lasso Performance (Test Set) ===
print(lasso_cm)
## Actual
## Predicted 0 1
## 0 131 15
## 1 24 123
# Metrics
lasso_TN <- lasso_cm["0", "0"]
lasso_FN <- lasso_cm["0", "1"]
lasso_FP <- lasso_cm["1", "0"]
lasso_TP <- lasso_cm["1", "1"]
lasso_accuracy <- (lasso_TP + lasso_TN) / sum(lasso_cm)
lasso_precision <- lasso_TP / (lasso_TP + lasso_FP)
lasso_recall <- lasso_TP / (lasso_TP + lasso_FN)
lasso_f1 <- 2 * (lasso_precision * lasso_recall) / (lasso_precision + lasso_recall)
lasso_FP_rate <- lasso_FP / (lasso_FP + lasso_TN)
lasso_FN_rate <- lasso_FN / (lasso_FN + lasso_TP)
cat(sprintf("\nAccuracy: %.4f (%.2f%%)\n", lasso_accuracy, lasso_accuracy*100))
##
## Accuracy: 0.8669 (86.69%)
cat(sprintf("Precision: %.4f\n", lasso_precision))
## Precision: 0.8367
cat(sprintf("Recall: %.4f\n", lasso_recall))
## Recall: 0.8913
cat(sprintf("F1-Score: %.4f\n", lasso_f1))
## F1-Score: 0.8632
cat(sprintf("FP Rate: %.4f\n", lasso_FP_rate))
## FP Rate: 0.1548
cat(sprintf("FN Rate: %.4f\n", lasso_FN_rate))
## FN Rate: 0.1087
Interpretation: Lasso performs both regularization and variable selection. It identifies Holidays, DOW, Covid, TOY, and SSRD as the most important predictors while eliminating noisy or collinear variables. On the test set, Lasso achieves the best performance among all models, with high accuracy, strong recall, and the highest F1-score.
Final Model Comparison
We now compare all models on the test set to determine the best approach.
# Evaluate stepwise model on test set
step_pred_test <- predict(res_step, newdata=test_data, type="response")
step_yhat_test <- ifelse(step_pred_test > 0.5, 1, 0)
step_cm_test <- table(Predicted = step_yhat_test, Actual = y_test)
step_TP <- step_cm_test[2,2]
step_TN <- step_cm_test[1,1]
step_FP <- step_cm_test[2,1]
step_FN <- step_cm_test[1,2]
step_accuracy <- (step_TP + step_TN) / sum(step_cm_test)
step_precision <- step_TP / (step_TP + step_FP)
step_recall <- step_TP / (step_TP + step_FN)
step_f1 <- 2 * (step_precision * step_recall) / (step_precision + step_recall)
# Evaluate full logistic on test set
full_pred_test <- predict(log_model, newdata=test_data, type="response")
full_yhat_test <- ifelse(full_pred_test > 0.5, 1, 0)
full_cm_test <- table(Predicted = full_yhat_test, Actual = y_test)
full_TP <- full_cm_test[2,2]
full_TN <- full_cm_test[1,1]
full_FP <- full_cm_test[2,1]
full_FN <- full_cm_test[1,2]
full_accuracy <- (full_TP + full_TN) / sum(full_cm_test)
full_precision <- full_TP / (full_TP + full_FP)
full_recall <- full_TP / (full_TP + full_FN)
full_f1 <- 2 * (full_precision * full_recall) / (full_precision + full_recall)
# Null model baseline
null_accuracy <- max(table(y_test)) / length(y_test)
# Create comparison table
comparison_table <- data.frame(
Model = c("Null (Baseline)", "Logistic (Full)", "Stepwise", "Ridge", "Lasso")
, Accuracy = c(null_accuracy, full_accuracy, step_accuracy, ridge_accuracy, lasso_accuracy)
, Precision = c(NA, full_precision, step_precision, ridge_precision, lasso_precision)
, Recall = c(NA, full_recall, step_recall, ridge_recall, lasso_recall)
, F1_Score = c(NA, full_f1, step_f1, ridge_f1, lasso_f1)
, N_Predictors = c(0, 8, length(coef(res_step))-1, 10, sum(lasso_coef_min != 0))
)
cat("=== Final Model Comparison (Test Set) ===\n")
## === Final Model Comparison (Test Set) ===
comparison_table_print <- comparison_table
comparison_table_print[, -1] <- round(comparison_table[, -1], 4)
print(comparison_table_print)
## Model Accuracy Precision Recall F1_Score N_Predictors
## 1 Null (Baseline) 0.5290 NA NA NA 0
## 2 Logistic (Full) 0.8567 0.8333 0.8696 0.8511 8
## 3 Stepwise 0.8567 0.8333 0.8696 0.8511 7
## 4 Ridge 0.8601 0.8345 0.8768 0.8551 10
## 5 Lasso 0.8669 0.8367 0.8913 0.8632 9
# Visualizations
par(mfrow=c(2,2))
# Professional color palette for final comparison
final_colors <- c("#6c757d", "#e63946", "#1982c4", "#8ac926", "#ffca3a")
final_borders <- c("#495057", "#b91c1c", "#0d5c8e", "#6b9e1f", "#c79a2c")
# 1. Accuracy comparison
barplot(comparison_table$Accuracy
, names.arg=comparison_table$Model
, col=final_colors
, border=final_borders
, main="Test Set Accuracy Comparison"
, ylab="Accuracy"
, ylim=c(0, 1)
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5
)
abline(h=null_accuracy, lty=2, col="#fb8500", lwd=2.5)
# 2. F1-Score comparison
barplot(comparison_table$F1_Score[-1]
, names.arg=comparison_table$Model[-1]
, col=final_colors[-1]
, border=final_borders[-1]
, main="F1-Score Comparison"
, ylab="F1-Score"
, ylim=c(0, 1)
, las=2
, col.main="#023047"
, col.lab="#023047"
, lwd=1.5)
# 3. Precision vs Recall
plot(comparison_table$Recall[-1]
, comparison_table$Precision[-1]
, pch=21
, cex=3
, col=final_borders[-1]
, bg=final_colors[-1]
, lwd=2
, xlim=c(0.8, 0.95)
, ylim=c(0.8, 0.95)
, xlab="Recall"
, ylab="Precision"
, main="Precision-Recall Trade-off"
, col.main="#023047"
, col.lab="#023047"
)
text(comparison_table$Recall[-1], comparison_table$Precision[-1],
labels=comparison_table$Model[-1], pos=3, cex=0.75, col="#023047", font=2)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
grid(col="#e9ecef", lty=1)
# 4. Model complexity vs performance
plot(comparison_table$N_Predictors
, comparison_table$Accuracy
, pch=21
, cex=3
, col=final_borders
, bg=final_colors
, lwd=2
, xlab="Number of Predictors"
, ylab="Accuracy"
, main="Model Complexity vs Accuracy"
, xlim=c(-1, 11)
, ylim=c(0.4, 1)
, col.main="#023047"
, col.lab="#023047"
)
text(comparison_table$N_Predictors, comparison_table$Accuracy,
labels=comparison_table$Model, pos=3, cex=0.75, col="#023047", font=2)
grid(col="#e9ecef", lty=1)
# ROC curves comparison
roc_full <- roc(y_test, full_pred_test, quiet=TRUE)
roc_step <- roc(y_test, step_pred_test, quiet=TRUE)
roc_ridge <- roc(y_test, as.numeric(ridge_pred), quiet=TRUE)
roc_lasso <- roc(y_test, as.numeric(lasso_pred), quiet=TRUE)
# Professional ROC curve colors
roc_colors <- c("#e63946", "#1982c4", "#8ac926", "#ffca3a")
plot(roc_full, col=roc_colors[1], lwd=3.5,
main="ROC Curves Comparison",
col.main="#023047", col.lab="#023047")
lines(roc_step, col=roc_colors[2], lwd=3.5)
lines(roc_ridge, col=roc_colors[3], lwd=3.5)
lines(roc_lasso, col=roc_colors[4], lwd=3.5)
abline(a=0, b=1, lty=2, col="#6c757d", lwd=2)
legend("bottomright"
, legend=c(
sprintf("Logistic (AUC=%.3f)", auc(roc_full))
, sprintf("Stepwise (AUC=%.3f)", auc(roc_step))
, sprintf("Ridge (AUC=%.3f)", auc(roc_ridge))
, sprintf("Lasso (AUC=%.3f)", auc(roc_lasso))
)
, col=roc_colors
, lwd=3.5
, bty="n"
, cex=0.95)
Conclusions
In this study, we successfully achieved the objective of predicting whether daily electricity generation in Mexico is above or below the median using meteorological and temporal predictors. Our exploratory analysis revealed strong seasonal patterns, with electricity levels largely driven by solar radiation and temperature-related variables.
Logistic regression showed that SSRD, Holidays, DOW, Covid, and TOY are the most significant factors influencing electricity generation. Ridge regression improved model stability in the presence of multicollinearity, while Lasso offered the best trade-off between sparsity and predictive performance.
Among all evaluated models, Lasso performed best on the test set, achieving an accuracy of approximately \(86.7\%\), recall of about \(89\%\), and an F1-score of \(0.86\), along with the highest AUC (about \(0.936\)). These results indicate excellent classification ability. Its high sensitivity and moderate specificity show that the model reliably identifies high-demand days while maintaining an acceptable false-positive rate.
In conclusion, sparse logistic regression—particularly Lasso—effectively identifies the key factors influencing electricity generation and provides the most reliable, stable, and interpretable predictive model for this dataset.