CO₂ Emissions Prediction Using Machine Learning

Predicting the impact of renewable energy adoption on UK carbon emissions using Decision Tree and Random Forest regressors


The Challenge

As the UK's energy sector transitions towards renewable sources, quantifying the actual impact of that shift on CO₂ emissions is far from straightforward. Energy consumption data is high-dimensional, noisy, and shaped by factors ranging from industrial policy to seasonal weather patterns.

This project — completed as part of the Applied AI unit of my MSc in Computer Science & AI — used machine learning to explore whether predictive models could identify meaningful relationships between green energy adoption and carbon emissions, and which modelling approaches best handle the complexity of that relationship.

- 5.84: Best RMSE (Random Forest)
- 31: Data points post-merge
- 2: Models compared
- 1990–2020: Dataset time range

Primary Finding

The pre-optimised Random Forest regressor outperformed all other configurations, achieving an RMSE of 5.84. Counterintuitively, hyperparameter optimisation via GridSearchCV worsened the Decision Tree model (RMSE rising from 10.13 to 17.70), while only marginally affecting Random Forest performance — suggesting that, on small datasets, ensemble averaging makes Random Forest markedly more robust to overfitting than a single tree.

Data & Preprocessing

Two datasets, the Ricardo Energy & Environment / ONS CO₂-by-source data and a Kaggle UK energy consumption dataset, were merged on year and country to form the modelling dataset.

The merged dataset contained only 31 usable rows — a significant constraint acknowledged throughout the analysis. Extensive missing data ruled out interpolation and KNN imputation as reliable strategies; instead, selective deletion was used to preserve only years with reliable values across both datasets.

Preprocessing steps included one-hot encoding of the ISO country code (nominal attribute) for model compatibility, and exploratory data analysis to understand distributions and identify potential confounders.
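The merge-and-clean pipeline can be sketched as below. The toy frames and column names (`Year`, `ISO_Code`, `CO2`, `RenewableShare`) are illustrative stand-ins, not the actual dataset schema:

```python
import pandas as pd

# Toy stand-ins for the two source datasets; real columns will differ.
emissions = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "ISO_Code": ["GBR", "GBR", "GBR"],
    "CO2": [598.0, 602.1, None],
})
consumption = pd.DataFrame({
    "Year": [1990, 1991, 1992],
    "ISO_Code": ["GBR", "GBR", "GBR"],
    "RenewableShare": [0.020, 0.021, 0.022],
})

# Merge on year and country; an inner join keeps only overlapping years
merged = emissions.merge(consumption, on=["Year", "ISO_Code"], how="inner")

# Selective deletion: drop rows with missing values rather than impute
merged = merged.dropna()

# One-hot encode the nominal ISO country code for model compatibility
merged = pd.get_dummies(merged, columns=["ISO_Code"])
print(len(merged))  # rows surviving the merge and deletion
```

With real multi-source data, an inner join plus `dropna` is exactly the "selective deletion" trade: fewer rows, but every surviving row is complete.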

Data Limitation — Flagged Upfront

A dataset of 31 rows places hard limits on what any model can reliably learn. Results should be interpreted as a methodological demonstration rather than a production-ready forecasting tool. The study recommends future work incorporate larger, multi-country datasets with additional covariates including climate variables, regional policy data, and technology adoption rates.

Methodology

  1. Feature Selection (Genetic Algorithm)

    Genetic algorithms were used for feature selection — chosen over alternatives like Recursive Feature Elimination (RFE) for their global search capability, avoiding the local optima risk of hill-climbing approaches. The GA evaluated feature subsets across multiple generations and population sizes, identifying the most predictive variables for CO₂ emissions without the computational memory overhead of Tabu Search.

    A second model was built using only GA-selected features, enabling direct comparison against the full-feature model for efficiency and overfitting risk.
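A minimal GA of the kind described, sketched on synthetic data (this is an illustrative re-implementation, not the coursework code; population size, mutation rate, and generation count are arbitrary choices):

```python
# GA feature selection: individuals are boolean feature masks; fitness is
# mean negative MSE from cross-validation on the selected subset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=31, n_features=8, noise=5.0, random_state=0)

def fitness(mask):
    if not mask.any():
        return -np.inf  # empty subsets are invalid
    scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                             X[:, mask], y, cv=3,
                             scoring="neg_mean_squared_error")
    return scores.mean()

pop_size, n_gen, n_feat = 12, 10, X.shape[1]
population = rng.random((pop_size, n_feat)) > 0.5  # random initial masks

for _ in range(n_gen):
    scores = np.array([fitness(ind) for ind in population])
    order = np.argsort(scores)[::-1]
    parents = population[order[: pop_size // 2]]       # truncation selection
    children = []
    for _ in range(pop_size - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_feat)                  # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_feat) < 0.1                # bit-flip mutation
        children.append(child ^ flip)
    population = np.vstack([parents] + children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected features:", np.flatnonzero(best))
```

Because the whole population explores the mask space in parallel, a GA can escape local optima that greedy elimination methods get stuck in, which is the global-search property cited above.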

  2. Model Training — Decision Tree Regressor

    The primary model used a Decision Tree regressor, selected for its interpretability and natural handling of non-linear relationships — important when feature interactions in energy data are poorly understood. The 80/20 train/test split was combined with K-Fold cross-validation during training to reduce overfitting risk.

    GridSearchCV was applied for hyperparameter tuning, searching across combinations of max depth, min samples split, and min samples leaf on training data only, to prevent information leakage from the test set.
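The split-then-tune procedure described here looks roughly like the following; the exact hyperparameter grid used in the coursework is not recorded, so the values below are plausible placeholders:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)

# 80/20 split; tuning only ever sees the training portion (no leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # the held-out X_test is untouched until evaluation
print(search.best_params_)
```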

  3. Model Comparison — Random Forest Regressor

    A Random Forest regressor was introduced as a comparison model. Random Forest builds an ensemble of decision trees, each trained on a bootstrapped subset of data with random feature selection at each split — reducing the variance and instability that Decision Trees exhibit on small datasets.

    The same GridSearchCV optimisation was applied to Random Forest, and results compared across four configurations: DT pre/post-optimisation, RF pre/post-optimisation.
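The four-way comparison can be reproduced in miniature as follows (synthetic data, an assumed small grid; the RMSE values printed here will not match the report's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 2, 4]}
results = {}
for name, model in [("DT", DecisionTreeRegressor(random_state=0)),
                    ("RF", RandomForestRegressor(random_state=0))]:
    # Pre-optimisation: default hyperparameters
    pre = model.fit(X_tr, y_tr)
    results[f"{name} pre"] = np.sqrt(
        mean_squared_error(y_te, pre.predict(X_te)))
    # Post-optimisation: GridSearchCV on training data only
    tuned = GridSearchCV(model, grid, cv=3,
                         scoring="neg_mean_squared_error").fit(X_tr, y_tr)
    results[f"{name} post"] = np.sqrt(
        mean_squared_error(y_te, tuned.predict(X_te)))

for k, v in results.items():
    print(f"{k}: RMSE = {v:.2f}")
```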

  4. Evaluation

    Models were evaluated using Root Mean Squared Error (RMSE) — preferred over MAE for its stronger penalisation of large errors, giving a sharper picture of generalisation quality. K-Fold cross-validation scores provided a secondary measure of stability across different training subsets.
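Both metrics are straightforward to compute. One note on the negative CV means reported below: with scikit-learn's default R² scorer, a negative fold score means the model did worse than predicting the mean, which is one plausible reading of those values (sketch on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=6, noise=5.0, random_state=0)
model = DecisionTreeRegressor(random_state=0)

# RMSE: square root of MSE, penalising large errors more heavily than MAE
model.fit(X[:25], y[:25])
rmse = np.sqrt(mean_squared_error(y[25:], model.predict(X[25:])))

# K-Fold CV scores as a secondary stability measure (default scorer: R²)
cv_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"RMSE: {rmse:.2f}, CV mean: {cv_scores.mean():.2f}")
```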

Results

| Model | Configuration | RMSE | CV Mean Score |
|---|---|---|---|
| Decision Tree | Pre-optimisation | ~10.13 | ~-5.67 |
| Decision Tree | Post-optimisation (GridSearchCV) | 17.70 (worse) | ~-5.28 |
| Random Forest | Pre-optimisation (best overall) | **5.84** | ~-1.88 |
| Random Forest | Post-optimisation (GridSearchCV) | ~6.39 | ~-2.09 |

The most striking result is that GridSearchCV optimisation made the Decision Tree worse — RMSE increased from 10.13 to 17.70 post-tuning. This likely reflects the model overfitting to the training set during the grid search, which had very little data to work with. With only 31 rows, even small changes to hyperparameters can dramatically shift performance.

The Random Forest pre-optimisation result (RMSE 5.84, CV ~-1.88) was the best across all configurations. The slight degradation post-optimisation — and the widening CV standard deviation — hints at the same overfitting dynamic: GridSearchCV is pushing parameters toward the training set at the expense of generalisation.

The CV mean scores tell a consistent story: Random Forest is substantially more stable than Decision Tree across folds (~-1.88 vs ~-5.67), confirming that ensemble diversity compensates for the dataset's small size in ways a single tree cannot.

Technical Approach

Why Decision Tree first?

Decision Trees were the natural starting point for this dataset: they perform implicit feature selection (assigning importance scores via impurity reduction), handle non-linear relationships without transformation, and produce interpretable rules that can be interrogated directly. For a stakeholder-facing model — where explainability matters as much as accuracy — these properties are valuable.
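The implicit feature selection mentioned here is exposed directly by scikit-learn's `feature_importances_` attribute, shown on a fitted tree (synthetic data; the importances on the real dataset would of course differ):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=31, n_features=4, noise=5.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances: sum to 1; higher = more impurity reduction
print(tree.feature_importances_)
```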

The trade-off is well-documented: Decision Trees are sensitive to small changes in training data and prone to high variance, particularly on small samples. The results confirmed this.

Why Genetic Algorithms for feature selection?

RFE was considered but ruled out due to its computational intensity and risk of missing important feature interactions on a dataset that is high-dimensional relative to its sample size. GAs search the full feature space globally, are less memory-dependent than Tabu Search, and avoid the local optima traps of hill-climbing methods. The cost is that they require careful parameter tuning (generations, population size), which was explored across multiple combinations.

The optimisation paradox

The fact that the best-performing model was the un-optimised Random Forest is an important result in its own right. It illustrates that hyperparameter tuning is not inherently beneficial — on small datasets, GridSearchCV can overfit the tuning process itself. The practical implication: with limited data, simpler configurations and ensemble diversity are more reliable than exhaustive search.

Limitations & Future Work

This study is honest about its constraints, and those constraints point directly toward what a more robust version of this work would look like:

- Sample size: 31 usable rows after merging places hard limits on what any model can reliably learn.
- Broader data: larger, multi-country datasets would support stronger generalisation claims.
- Additional covariates: climate variables, regional policy data, and technology adoption rates.

Tools & Technologies

Language: Python

Libraries: scikit-learn (DecisionTreeRegressor, RandomForestRegressor, GridSearchCV, KFold), pandas, numpy, matplotlib

Feature selection: Genetic algorithms (custom implementation)

Data sources: Ricardo Energy & Environment / ONS (CO₂ by source), Kaggle UK energy consumption dataset

Context: MSc Computer Science & AI — Applied AI unit, University coursework

Get In Touch

Open to collaboration on environmental data science projects, and actively seeking opportunities in geospatial analysis and ecological restoration.