Every summer, rising temperatures cause electricity shortages in many regions, including the city of Tabriz in Iran. Predicting the load (electricity consumption) would help utility providers as well as consumers manage the system better. One of the factors contributing to electricity consumption is the weather, especially in summer. In this project, I have tried to explain (and predict) the summer load profile of Tabriz using weather data and some other features.
Methodology
There are various techniques and approaches for predicting electricity consumption. These methods include, but are not limited to:
- Time Series
- Component Decomposition
- Artificial Intelligence (ML, DL, RL)
In this project, I decided to use Supervised Machine Learning to try to predict the load profile.
Weather Data
Using Python and the API of www.weather.com, I extracted hourly weather variables into an Excel file. These variables included temperature, wind velocity, cloud cover, and relative humidity.
After retrieving the raw data, I needed to clean it. Cleaning consisted of looking for missing data, duplicates, invalid data, and outliers. Linear interpolation was used to fill in the missing data. It's worth noting that all the cleaning took place in Excel.
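The cleaning itself was done in Excel, but the same two steps (dropping duplicates and filling gaps by linear interpolation) could be sketched in Python roughly as follows. The column name, timestamps, and values are made up for illustration and are not taken from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings; NaN marks a missing temperature value.
weather = pd.DataFrame(
    {"temperature": [24.0, np.nan, 28.0, 29.0, 29.0]},
    index=pd.date_range("2020-07-01 00:00", periods=5, freq=pd.Timedelta(hours=1)),
)

# Drop records with a duplicated timestamp, keeping the first one.
weather = weather[~weather.index.duplicated(keep="first")]

# Fill gaps by linear interpolation between the neighbouring hours.
weather["temperature"] = weather["temperature"].interpolate(method="linear")
```

With these sample values, the missing reading between 24.0 and 28.0 is filled with 26.0.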
Load Data
With the help of a friend, I obtained load data for the specified period (90 days). The data came as a set of CSV files for 48 substations. Using the Pandas library in Python, I summed the hourly load data of all substations and saved the result to an Excel file.
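A minimal sketch of the summation step is shown below. The layout of the real CSV files is not described here, so the `timestamp` and `load` column names, the substation names, and the values are assumptions; two in-memory strings stand in for the 48 files.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the per-substation CSV files (assumed layout).
substation_csvs = {
    "substation_01": "timestamp,load\n2020-07-01 00:00,10.0\n2020-07-01 01:00,12.0\n",
    "substation_02": "timestamp,load\n2020-07-01 00:00,5.5\n2020-07-01 01:00,6.0\n",
}

# Read each file into a frame indexed by timestamp.
frames = [
    pd.read_csv(io.StringIO(text), parse_dates=["timestamp"], index_col="timestamp")
    for text in substation_csvs.values()
]

# Align the frames on the timestamp and sum across substations for each hour.
# Note: sum() skips missing values, so gaps should be handled beforehand.
total_load = pd.concat(frames, axis=1).sum(axis=1).rename("load")
```

The resulting series could then be written out with `total_load.to_excel(...)`, which requires an Excel writer such as openpyxl to be installed.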
Final Dataset
Now that both the weather data and the load data were ready, I combined them in a third Excel file. The final file contained all the features: date, hour, temperature, load, etc.
Feature Engineering
After extraction, cleaning, and integration, I decided to add some features to help expose the relationships hidden in the data. First, I replaced the date with the day of the week. Then, I labeled each record by day type: weekday or weekend. Also, based on some quick visualizations, the 24 hours of the day were grouped into 3 categories: 1. Night 2. Working Hour 3. Evening. The load profile showed quite distinct trends in these three periods.
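The three derived features can be sketched as below. The exact hour boundaries and the weekend definition used in the project are not stated, so the cut-offs (night before 07:00, working hours until 15:00) and the choice of Friday as the weekend are assumptions made only for this example.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Assumption: Friday is treated as the weekend; adjust to the local calendar.
WEEKEND_DAYS = {"Friday"}


def hour_type(hour: int) -> str:
    """Map an hour (0-23) to one of three categories (assumed boundaries)."""
    if hour < 7:
        return "Night"
    if hour < 15:
        return "Working Hour"
    return "Evening"


def engineer_features(timestamp: datetime) -> dict:
    """Derive the day name, day type, and hour type for one record."""
    day_name = DAY_NAMES[timestamp.weekday()]
    return {
        "weekday": day_name,
        "day_type": "Weekend" if day_name in WEEKEND_DAYS else "Weekday",
        "hour_type": hour_type(timestamp.hour),
    }


# Hypothetical record: 3 July 2020 at 16:00.
features = engineer_features(datetime(2020, 7, 3, 16))
```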
Visualization
Weather Data
The histograms above depict the distributions of temperature, wind velocity, and relative humidity. While the temperature distribution is approximately normal, those of wind velocity and relative humidity are skewed and look more like a chi-squared distribution. Although the temperature ranges from 11 to 39 degrees Celsius, it lies in the 25-30 degree interval most of the time.
The temperature values are quite widely spread, with an ascending trend in some ranges and a descending trend in others. The relationship will become clearer later on with the grouped plots.
There is a strong, inverse, linear relationship between relative humidity and temperature.
Demand
The box plots above show different patterns of demand when grouped by day type (weekday/weekend) and hour type (Night, Work, Evening). As expected, the median demand on weekdays is higher, and its IQR (interquartile range) is larger as well. Similarly, demand at night tends to be lower than during working hours. Demand is highest in the evenings: although the offices are closed, people at home consume electricity for cooling, watching TV, etc.
Wind Speed & Cloud Cover
There seems to be no significant relationship between demand and either wind speed or cloud cover.
Demand vs Hour & Temperature
According to the grouped scatter plot above, demand shows 3 different patterns across the three hour groups. For example, at night it is descending and less dispersed. As expected, the demand for electricity in summer has a strong relationship with temperature, mainly because of the widespread use of air conditioners (ACs).
Curve Fitting
A cubic polynomial is fitted to the average hourly data. The curve for the average temperature fits the actual data pretty well. Due to the cooling demand, electricity consumption reaches its daily peak at around 2 p.m. After that, electricity demand drops even though the temperature continues to rise. This occurs because most offices close at 2 p.m., which in turn reduces consumption.
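A cubic fit of this kind can be sketched with NumPy as follows. The tool used for the original fit is not stated, and the hourly averages below are synthetic placeholders generated from a known cubic, not the project data.

```python
import numpy as np

hours = np.arange(24)

# Synthetic placeholder for the average hourly values (an exact cubic).
true_coeffs = [-0.02, 0.5, -1.5, 90.0]
avg_values = np.polyval(true_coeffs, hours)

# Least-squares fit of a degree-3 polynomial; coefficients are returned
# from the highest power down to the constant term.
coeffs = np.polyfit(hours, avg_values, deg=3)

# Evaluate the fitted curve at each hour for plotting or comparison.
fitted = np.polyval(coeffs, hours)
```

Because the placeholder data is itself a cubic, the fit recovers it almost exactly; real hourly averages would leave some residual.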
Modelling
Due to the small size of the data, I preferred to use MATLAB, which provides a toolbox with a handy user interface. We simply pass the final Excel file (after importing it into the MATLAB workspace) to the app. The data is divided into two datasets:
- Training set
- Test set
Only the training set is used in modelling, while the test set is used to evaluate the performance of the model. I used 75 percent of the records for training and the remaining 25 percent for testing.
All the available algorithms in the toolbox are applied to the data, and the performance (accuracy) of each is then calculated. We focus on three criteria:
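The 75/25 split was done inside the MATLAB app. For illustration, a comparable split in plain Python might look like the sketch below. It assumes a random split with a fixed seed (how the app partitions the records is not described here), and the row count assumes one record per hour over the 90 days with no gaps.

```python
import random


def train_test_split(records, train_fraction=0.75, seed=0):
    """Shuffle the records reproducibly and split them into two lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical row ids: 90 days x 24 hours.
records = list(range(90 * 24))
train, test = train_test_split(records)
```

Under these assumptions the split yields 1620 training records and 540 test records.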
- RMSE (Root Mean Square Error)
- MAE (Mean Absolute Error)
- R^2 (R-Squared)
Note that these are calculated on the TEST data, so that an overfitted model is not rewarded. We want RMSE and MAE to be as small as possible and R^2 to be as close to 1 as possible.
Results
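For reference, the three criteria can be computed from actual and predicted values as in this small sketch. It uses the standard definitions; the sample numbers are invented, and the MATLAB toolbox reports these values itself.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Invented test-set values for illustration only.
actual = [100, 110, 120, 130]
predicted = [102, 108, 123, 129]

scores = (rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```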
Model Name | RMSE | MAE | Model Name | RMSE | MAE |
Linear Regression--Interactions Linear | 15.9 | 11.7 | Support Vector Machine--Medium Gaussian | 12.2 | 8.5 |
Linear Regression--Linear | 17.09 | 13.2 | Support Vector Machine--Cubic | 12.8 | 8.6 |
Linear Regression--Robust Linear | 17.1 | 13.2 | Support Vector Machine--Quadratic | 13.3 | 9.4 |
Tree--Medium Tree | 12.3 | 9.0 | Support Vector Machine--Coarse Gaussian | 15.8 | 12.0 |
Tree--Fine Tree | 12.6 | 9.2 | Support Vector Machine--Linear | 17.2 | 13.2 |
Tree--Coarse Tree | 13.6 | 9.9 | Gaussian Process Regression--Exponential | 11.5 | 8.1 |
Ensemble--Bagged Trees | 12.0 | 8.5 | Gaussian Process Regression--Matern 5/2 | 11.5 | 8.1 |
Ensemble--Boosted Trees | 13.2 | 9.7 | Gaussian Process Regression--Rational Quadratic | 11.6 | 8.1 |
Stepwise Linear Regression--Stepwise Linear | 15.1 | 11.4 | Gaussian Process Regression--Squared Exponential | 11.7 | 8.2 |
The table above summarizes the results. It's evident that GPR (Gaussian Process Regression), Ensemble--Bagged Trees, and SVM (Support Vector Machine) have done quite a good job of predicting the demand.
The toolbox provides hyperparameter optimisation as well, so I optimised GPR in order to reach even smaller values of RMSE (and MAE). The result is as follows:
RMSE | 10.5 |
MAE | 7.3 |
R^2 | 0.92 |
R-squared shows how well the model fits the data. Its value typically ranges from 0 to 1. The closer it is to 1, the better the model explains the data.
The plot shows that the points are concentrated around the red line, which represents all points for which the actual demand equals the demand predicted by the model.
Future Work
Since we are dealing with time series data, we could also add lagged features such as temp(t-1), temp(t-2), Demand(t-1), etc. This may improve model performance, since temperature and demand may have delayed effects.
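As a rough sketch of this idea, lagged columns can be built with pandas `shift`. The column names and values are hypothetical, and the rows are assumed to be in chronological order with no missing hours.

```python
import pandas as pd

# Hypothetical hourly records in chronological order.
data = pd.DataFrame(
    {
        "temp": [25.0, 27.0, 30.0, 29.0],
        "demand": [100.0, 110.0, 125.0, 120.0],
    }
)

# temp(t-1) and temp(t-2): the temperature one and two hours earlier.
for lag in (1, 2):
    data[f"temp_lag{lag}"] = data["temp"].shift(lag)

# Demand(t-1): the demand one hour earlier.
data["demand_lag1"] = data["demand"].shift(1)

# The first rows have no history, so drop rows with missing lag values.
lagged = data.dropna().reset_index(drop=True)
```

Each lag shortens the usable dataset by one row at the start, which is a small cost for 90 days of hourly data.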