Predicting the impact of renewable energy adoption on UK carbon emissions using Decision Tree and Random Forest regressors
As the UK's energy sector transitions towards renewable sources, quantifying the actual impact of that shift on CO₂ emissions is far from straightforward. Energy consumption data is high-dimensional, noisy, and shaped by factors ranging from industrial policy to seasonal weather patterns.
This project — completed as part of the Applied AI unit of my MSc in Computer Science & AI — used machine learning to explore whether predictive models could identify meaningful relationships between green energy adoption and carbon emissions, and which modelling approaches best handle the complexity of that relationship.
The pre-optimised Random Forest regressor outperformed all other configurations, achieving an RMSE of 5.84. Counterintuitively, hyperparameter optimisation via GridSearchCV worsened the Decision Tree model (RMSE rising from 10.13 to 17.70) while only marginally affecting Random Forest performance, suggesting that ensemble methods are more robust to overfitting on small datasets.
Two datasets were merged on year and country to form the modelling dataset: CO₂ emissions by source (Ricardo Energy & Environment / ONS) and UK energy consumption (Kaggle).
The merged dataset contained only 31 usable rows — a significant constraint acknowledged throughout the analysis. Extensive missing data ruled out interpolation and KNN imputation as reliable strategies; instead, selective deletion was used to preserve only years with reliable values across both datasets.
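The merge-and-selective-deletion step can be sketched as below. The column names and values are hypothetical, since the actual schemas of the two datasets are not reproduced in this write-up:

```python
import pandas as pd

# Hypothetical column names and values -- illustrative only
emissions = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018],
    "country_code": ["GBR"] * 4,
    "co2_emissions": [404.7, 389.3, 379.7, None],
})
energy = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018],
    "country_code": ["GBR"] * 4,
    "renewable_share": [24.6, 24.5, None, 33.0],
})

# Inner merge keeps only year/country pairs present in both sources
merged = emissions.merge(energy, on=["year", "country_code"], how="inner")

# Selective deletion: keep only years with reliable values in both datasets
usable = merged.dropna()
print(usable.shape)  # only complete rows survive
```

An inner merge followed by `dropna()` is the simplest way to express "years with reliable values across both datasets" without imputing anything.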
Preprocessing steps included one-hot encoding of the ISO country code (nominal attribute) for model compatibility, and exploratory data analysis to understand distributions and identify potential confounders.
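The encoding step can be sketched with pandas, assuming a `country_code` column holds the ISO codes (the real column name may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "country_code": ["GBR", "GBR", "FRA"],  # illustrative ISO codes
    "renewable_share": [24.6, 33.0, 19.1],
})

# One-hot encode the nominal ISO code so tree-based regressors can consume it
encoded = pd.get_dummies(df, columns=["country_code"], prefix="iso")
print(list(encoded.columns))
```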
A dataset of 31 rows places hard limits on what any model can reliably learn. Results should be interpreted as a methodological demonstration rather than a production-ready forecasting tool. The study recommends future work incorporate larger, multi-country datasets with additional covariates including climate variables, regional policy data, and technology adoption rates.
Genetic algorithms were used for feature selection — chosen over alternatives like Recursive Feature Elimination (RFE) for their global search capability, avoiding the local optima risk of hill-climbing approaches. The GA evaluated feature subsets across multiple generations and population sizes, identifying the most predictive variables for CO₂ emissions without the computational memory overhead of Tabu Search.
A second model was built using only GA-selected features, enabling direct comparison against the full-feature model for efficiency and overfitting risk.
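A compact sketch of GA feature selection over binary feature masks, using synthetic stand-in data in place of the merged dataset. Fitness here is a Decision Tree's mean cross-validated negative MSE; the population size, generation count, and mutation rate are illustrative, not the values explored in the project:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: the real features and target come from the merged dataset
X = rng.normal(size=(31, 8))
y = X[:, 0] * 3 + X[:, 2] - X[:, 5] * 2 + rng.normal(scale=0.1, size=31)

def fitness(mask):
    """Mean CV score (neg MSE) of a tree trained on the masked feature subset."""
    if not mask.any():
        return -np.inf
    model = DecisionTreeRegressor(random_state=0)
    return cross_val_score(model, X[:, mask], y, cv=3,
                           scoring="neg_mean_squared_error").mean()

pop_size, n_generations, n_features = 20, 15, X.shape[1]
population = rng.integers(0, 2, size=(pop_size, n_features)).astype(bool)

for _ in range(n_generations):
    scores = np.array([fitness(ind) for ind in population])
    # Tournament selection: the better of two random individuals becomes a parent
    idx = rng.integers(0, pop_size, size=(pop_size, 2))
    parents = population[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                                  idx[:, 0], idx[:, 1])]
    # Single-point crossover between consecutive parents
    cut = rng.integers(1, n_features, size=pop_size)
    children = parents.copy()
    for i in range(0, pop_size - 1, 2):
        children[i, cut[i]:], children[i + 1, cut[i]:] = (
            parents[i + 1, cut[i]:].copy(), parents[i, cut[i]:].copy())
    # Bit-flip mutation with a low per-gene probability
    flip = rng.random(children.shape) < 0.05
    population = children ^ flip

best = population[np.argmax([fitness(ind) for ind in population])]
print("Selected features:", np.flatnonzero(best))
```

Because the search operates on whole subsets rather than greedily adding or removing one feature at a time, it can escape local optima that would trap a hill-climbing selector.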
The primary model used a Decision Tree regressor, selected for its interpretability and natural handling of non-linear relationships, which matters when feature interactions in energy data are poorly understood. An 80/20 train/test split was combined with K-Fold cross-validation during training to reduce overfitting risk.
GridSearchCV was applied for hyperparameter tuning, searching across combinations of max depth, min samples split, and min samples leaf on training data only, to prevent information leakage from the test set.
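The leakage-safe tuning procedure can be sketched as follows. The grid values and the synthetic data are illustrative, as the report's exact search space is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 5))           # stand-in for the merged features
y = X[:, 0] * 2 + rng.normal(size=31)  # stand-in target (CO2 emissions)

# 80/20 split; the grid search only ever sees the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {  # illustrative grid -- the report's exact values aren't listed
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # the test set is never touched during tuning
print(search.best_params_)
```

Fitting the search on `X_train` alone is what prevents information leakage: the held-out 20% plays no role in choosing hyperparameters.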
A Random Forest regressor was introduced as a comparison model. Random Forest builds an ensemble of decision trees, each trained on a bootstrapped subset of data with random feature selection at each split — reducing the variance and instability that Decision Trees exhibit on small datasets.
The same GridSearchCV optimisation was applied to Random Forest, and results compared across four configurations: DT pre/post-optimisation, RF pre/post-optimisation.
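In outline, the four-configuration comparison might look like this, again on synthetic stand-in data with an illustrative grid:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(31, 5))           # stand-in features
y = X[:, 0] * 2 + rng.normal(size=31)  # stand-in target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 2, 4]}
results = {}
for name, model in [("DT", DecisionTreeRegressor(random_state=42)),
                    ("RF", RandomForestRegressor(random_state=42))]:
    # Pre-optimisation: default hyperparameters
    pre = model.fit(X_train, y_train)
    results[f"{name} pre"] = mean_squared_error(
        y_test, pre.predict(X_test)) ** 0.5
    # Post-optimisation: the same GridSearchCV procedure for both models
    tuned = GridSearchCV(model, param_grid, cv=3,
                         scoring="neg_mean_squared_error").fit(X_train, y_train)
    results[f"{name} post"] = mean_squared_error(
        y_test, tuned.predict(X_test)) ** 0.5

for config, rmse in results.items():
    print(f"{config}: RMSE {rmse:.2f}")
```

Running both models through an identical split, grid, and scoring function is what makes the four RMSE figures directly comparable.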
Models were evaluated using Root Mean Squared Error (RMSE) — preferred over MAE for its stronger penalisation of large errors, giving a sharper picture of generalisation quality. K-Fold cross-validation scores provided a secondary measure of stability across different training subsets.
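A quick illustration of why RMSE penalises large errors more heavily than MAE, using made-up values that include one large miss:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 9.0, 30.0])   # illustrative targets
y_pred = np.array([11.0, 12.0, 10.0, 20.0])  # one large error (30 vs 20)

mae = mean_absolute_error(y_true, y_pred)          # weights all errors linearly
rmse = mean_squared_error(y_true, y_pred) ** 0.5   # squaring makes the miss of 10 dominate
print(f"MAE {mae:.2f}, RMSE {rmse:.2f}")  # MAE 3.00, RMSE 5.05
```

The single error of 10 pulls RMSE well above MAE, which is exactly the sensitivity to outlying mistakes that makes it a sharper generalisation metric here.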
| Model | Configuration | RMSE | CV Mean Score |
|---|---|---|---|
| Decision Tree | Pre-optimisation | ~10.13 | ~-5.67 |
| Decision Tree | Post-optimisation (GridSearchCV) | 17.70 ↑ worse | ~-5.28 |
| Random Forest | Pre-optimisation ★ Best overall | 5.84 | ~-1.88 |
| Random Forest | Post-optimisation (GridSearchCV) | ~6.39 | ~-2.09 |
The most striking result is that GridSearchCV optimisation made the Decision Tree worse: RMSE increased from 10.13 to 17.70 post-tuning. This likely reflects the grid search overfitting its hyperparameter choices to a very small training set. With only 31 rows, even small changes to hyperparameters can dramatically shift performance.
The Random Forest pre-optimisation result (RMSE 5.84, CV ~-1.88) was the best across all configurations. The slight degradation post-optimisation — and the widening CV standard deviation — hints at the same overfitting dynamic: GridSearchCV is pushing parameters toward the training set at the expense of generalisation.
The CV mean scores tell a consistent story: Random Forest is substantially more stable than Decision Tree across folds (~-1.88 vs ~-5.67), confirming that ensemble diversity compensates for the dataset's small size in ways a single tree cannot.
Decision Trees were the natural starting point for this dataset: they perform implicit feature selection (assigning importance scores via impurity reduction), handle non-linear relationships without transformation, and produce interpretable rules that can be interrogated directly. For a stakeholder-facing model — where explainability matters as much as accuracy — these properties are valuable.
The trade-off is well-documented: Decision Trees are sensitive to small changes in training data and prone to high variance, particularly on small samples. The results confirmed this.
RFE was considered but ruled out due to its computational intensity and the risk of missing important feature interactions when the feature count is high relative to the sample size. GAs search the full feature space globally, are less memory-dependent than Tabu Search, and avoid the local optima traps of hill-climbing methods. The cost is that they require careful parameter tuning (generations, population size), which was explored across multiple combinations.
The fact that the best-performing model was the un-optimised Random Forest is an important result in its own right. It illustrates that hyperparameter tuning is not inherently beneficial — on small datasets, GridSearchCV can overfit the tuning process itself. The practical implication: with limited data, simpler configurations and ensemble diversity are more reliable than exhaustive search.
This study is honest about its constraints, and those constraints point directly toward a more robust version of this work: larger, multi-country datasets with additional covariates such as climate variables, regional policy data, and technology adoption rates.
Language: Python
Libraries: scikit-learn (DecisionTreeRegressor, RandomForestRegressor, GridSearchCV, KFold), pandas, numpy, matplotlib
Feature selection: Genetic algorithms (custom implementation)
Data sources: Ricardo Energy & Environment / ONS (CO₂ by source), Kaggle UK energy consumption dataset
Context: MSc Computer Science & AI — Applied AI unit, University coursework
Open to collaboration on environmental data science projects and actively seeking opportunities in geospatial analysis and ecological restoration