CO₂ Emissions Prediction Using Machine Learning

Predicting the impact of renewable energy adoption on UK carbon emissions using Decision Tree and Random Forest regressors


The Challenge

As the UK's energy sector transitions towards renewable sources, quantifying the actual impact of that shift on CO₂ emissions is far from straightforward. Energy consumption data is high-dimensional, noisy, and shaped by factors ranging from industrial policy to seasonal weather patterns.

This project — completed as part of the Applied AI unit of my MSc in Computer Science & AI — used machine learning to explore whether predictive models could identify meaningful relationships between green energy adoption and carbon emissions, and which modelling approaches best handle the complexity of that relationship.

- 5.84: Best RMSE (Random Forest)
- 31: Data points post-merge
- 2: Models compared
- 1990–2020: Dataset time range

Primary Finding

The pre-optimised Random Forest regressor outperformed all other configurations, achieving an RMSE of 5.84. Counterintuitively, hyperparameter optimisation via GridSearchCV worsened the Decision Tree model (RMSE rising from 10.13 to 17.70), while only marginally affecting Random Forest performance — suggesting that, on small datasets, ensemble averaging makes Random Forest markedly more robust to overfitting than a single tree.

Data & Preprocessing

Two datasets, the Ricardo Energy & Environment / ONS CO₂-by-source data and a Kaggle UK energy consumption dataset, were merged on year and country to form the modelling dataset.

The merged dataset contained only 31 usable rows — a significant constraint acknowledged throughout the analysis. Extensive missing data ruled out interpolation and KNN imputation as reliable strategies; instead, selective deletion was used to preserve only years with reliable values across both datasets.

Preprocessing steps included one-hot encoding of the ISO country code (nominal attribute) for model compatibility, and exploratory data analysis to understand distributions and identify potential confounders.
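The merge-and-clean pipeline can be sketched as below. The toy frames and column names (`Year`, `ISO_Code`, `CO2`, `RenewableShare`) are illustrative stand-ins, not the actual dataset schema:

```python
import pandas as pd

# Toy stand-ins for the two source datasets; real columns will differ.
emissions = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "ISO_Code": ["GBR", "GBR", "GBR"],
    "CO2": [598.0, 602.1, None],
})
consumption = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "ISO_Code": ["GBR", "GBR", "GBR"],
    "RenewableShare": [0.020, 0.021, 0.022],
})

# Merge on year and country; an inner join keeps only overlapping years
merged = emissions.merge(consumption, on=["Year", "ISO_Code"], how="inner")

# Selective deletion: drop rows with missing values rather than impute
merged = merged.dropna()

# One-hot encode the nominal ISO country code for model compatibility
merged = pd.get_dummies(merged, columns=["ISO_Code"])
print(len(merged))  # rows surviving the merge and deletion
```

With real multi-source data, an inner join plus `dropna` is exactly the "selective deletion" trade: fewer rows, but every surviving row is complete.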

Data Limitation — Flagged Upfront

A dataset of 31 rows places hard limits on what any model can reliably learn. Results should be interpreted as a methodological demonstration rather than a production-ready forecasting tool. The study recommends future work incorporate larger, multi-country datasets with additional covariates including climate variables, regional policy data, and technology adoption rates.

Methodology

  1. Feature Selection (Genetic Algorithm)

    Genetic algorithms were used for feature selection — chosen over alternatives like Recursive Feature Elimination (RFE) for their global search capability, avoiding the local optima risk of hill-climbing approaches. The GA evaluated feature subsets across multiple generations and population sizes, identifying the most predictive variables for CO₂ emissions without the computational memory overhead of Tabu Search.

    A second model was built using only GA-selected features, enabling direct comparison against the full-feature model for efficiency and overfitting risk.
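A minimal GA of the kind described, sketched on synthetic data (this is an illustrative re-implementation, not the coursework code; population size, mutation rate, and generation count are arbitrary choices):

```python
# GA feature selection: individuals are boolean feature masks; fitness is
# mean negative MSE from cross-validation on the selected subset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=31, n_features=8, noise=5.0, random_state=0)

def fitness(mask):
    if not mask.any():
        return -np.inf  # empty subsets are invalid
    scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                             X[:, mask], y, cv=3,
                             scoring="neg_mean_squared_error")
    return scores.mean()

pop_size, n_gen, n_feat = 12, 10, X.shape[1]
population = rng.random((pop_size, n_feat)) > 0.5  # random initial masks

for _ in range(n_gen):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    parents = population[order[: pop_size // 2]]       # truncation selection
    children = []
    for _ in range(pop_size - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_feat)                  # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_feat) < 0.1                # bit-flip mutation
        children.append(child ^ flip)
    population = np.vstack([parents] + children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```

Because the whole population explores the mask space in parallel, a GA can escape local optima that greedy elimination methods get stuck in, which is the global-search property cited above.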

  2. Model Training — Decision Tree Regressor

    The primary model used a Decision Tree regressor, selected for its interpretability and natural handling of non-linear relationships — important when feature interactions in energy data are poorly understood. The 80/20 train/test split was combined with K-Fold cross-validation during training to reduce overfitting risk.

    GridSearchCV was applied for hyperparameter tuning, searching across combinations of max depth, min samples split, and min samples leaf on training data only, to prevent information leakage from the test set.
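The split-then-tune procedure described here looks roughly like the following; the exact hyperparameter grid used in the coursework is not recorded, so the values below are plausible placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)

# 80/20 split; tuning only ever sees the training portion (no leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # the held-out X_test is untouched until evaluation
print(search.best_params_)
```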

  3. Model Comparison — Random Forest Regressor

    A Random Forest regressor was introduced as a comparison model. Random Forest builds an ensemble of decision trees, each trained on a bootstrapped subset of data with random feature selection at each split — reducing the variance and instability that Decision Trees exhibit on small datasets.

    The same GridSearchCV optimisation was applied to Random Forest, and results compared across four configurations: DT pre/post-optimisation, RF pre/post-optimisation.
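The four-way comparison can be reproduced in miniature as follows (synthetic data, an assumed small grid; the RMSE values printed here will not match the report's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 2, 4]}
results = {}
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("RF", RandomForestRegressor(random_state=0))]:
    # Pre-optimisation: default hyperparameters
    pre = model.fit(X_tr, y_tr)
    results[f"{name} pre"] = np.sqrt(
        mean_squared_error(y_te, pre.predict(X_te)))
    # Post-optimisation: GridSearchCV on training data only
    tuned = GridSearchCV(model, grid, cv=3,
                         scoring="neg_mean_squared_error").fit(X_tr, y_tr)
    results[f"{name} post"] = np.sqrt(
        mean_squared_error(y_te, tuned.predict(X_te)))

for k, v in results.items():
    print(f"{k}: RMSE = {v:.2f}")
```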

  4. Evaluation

    Models were evaluated using Root Mean Squared Error (RMSE) — preferred over MAE for its stronger penalisation of large errors, giving a sharper picture of generalisation quality. K-Fold cross-validation scores provided a secondary measure of stability across different training subsets.
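Both metrics are straightforward to compute. One note on the negative CV means reported below: with scikit-learn's default R² scorer, a negative fold score means the model did worse than predicting the mean, which is one plausible reading of those values (sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)
model = DecisionTreeRegressor(random_state=0)

# RMSE: square root of MSE, penalising large errors more heavily than MAE
model.fit(X[:25], y[:25])
rmse = np.sqrt(mean_squared_error(y[25:], model.predict(X[25:])))

# K-Fold CV scores as a secondary stability measure (default scorer: R²)
cv_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"RMSE: {rmse:.2f}, CV mean: {cv_scores.mean():.2f}")
```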

Results

| Model | Configuration | RMSE | CV Mean Score |
|---|---|---|---|
| Decision Tree | Pre-optimisation | ~10.13 | ~-5.67 |
| Decision Tree | Post-optimisation (GridSearchCV) | 17.70 (worse) | ~-5.28 |
| Random Forest | Pre-optimisation (best overall) | **5.84** | ~-1.88 |
| Random Forest | Post-optimisation (GridSearchCV) | ~6.39 | ~-2.09 |

The most striking result is that GridSearchCV optimisation made the Decision Tree worse — RMSE increased from 10.13 to 17.70 post-tuning. This likely reflects the model overfitting to the training set during the grid search, which had very little data to work with. With only 31 rows, even small changes to hyperparameters can dramatically shift performance.

The Random Forest pre-optimisation result (RMSE 5.84, CV ~-1.88) was the best across all configurations. The slight degradation post-optimisation — and the widening CV standard deviation — hints at the same overfitting dynamic: GridSearchCV is pushing parameters toward the training set at the expense of generalisation.

The CV mean scores tell a consistent story: Random Forest is substantially more stable than Decision Tree across folds (~-1.88 vs ~-5.67), confirming that ensemble diversity compensates for the dataset's small size in ways a single tree cannot.

Technical Approach

Why Decision Tree first?

Decision Trees were the natural starting point for this dataset: they perform implicit feature selection (assigning importance scores via impurity reduction), handle non-linear relationships without transformation, and produce interpretable rules that can be interrogated directly. For a stakeholder-facing model — where explainability matters as much as accuracy — these properties are valuable.
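The implicit feature selection mentioned here is exposed directly by scikit-learn's `feature_importances_` attribute, shown on a fitted tree (synthetic data; the importances on the real dataset would of course differ):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=4, noise=5.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances: sum to 1; higher = more impurity reduction
print(tree.feature_importances_)
```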

The trade-off is well-documented: Decision Trees are sensitive to small changes in training data and prone to high variance, particularly on small samples. The results confirmed this.

Why Genetic Algorithms for feature selection?

RFE was considered but ruled out due to its computational intensity and risk of missing important feature interactions on a dataset that is high-dimensional relative to its sample size. GAs search the full feature space globally, are less memory-dependent than Tabu Search, and avoid the local optima traps of hill-climbing methods. The cost is that they require careful parameter tuning (generations, population size), which was explored across multiple combinations.

The optimisation paradox

The fact that the best-performing model was the un-optimised Random Forest is an important result in its own right. It illustrates that hyperparameter tuning is not inherently beneficial — on small datasets, GridSearchCV can overfit the tuning process itself. The practical implication: with limited data, simpler configurations and ensemble diversity are more reliable than exhaustive search.

Limitations & Future Work

This study is honest about its constraints, and those constraints point directly toward what a more robust version of this work would look like:

- Sample size: 31 usable rows after merging places hard limits on what any model can reliably learn.
- Broader data: larger, multi-country datasets would support stronger generalisation claims.
- Additional covariates: climate variables, regional policy data, and technology adoption rates.

Tools & Technologies

Language: Python

Libraries: scikit-learn (DecisionTreeRegressor, RandomForestRegressor, GridSearchCV, KFold), pandas, numpy, matplotlib

Feature selection: Genetic algorithms (custom implementation)

Data sources: Ricardo Energy & Environment / ONS (CO₂ by source), Kaggle UK energy consumption dataset

Context: MSc Computer Science & AI — Applied AI unit, University coursework

Get In Touch

Open to collaboration on environmental data science projects, and actively seeking opportunities in geospatial analysis and ecological restoration.