Analyzing Power Outages
Project for EECS 398 at University of Michigan
By Jackson Gertner
Introduction
For my analysis, I looked at a dataset of major power outages in the U.S. from January 2000 to July 2016. The Department of Energy defines these major outages as events that impacted at least 50,000 customers or caused an unplanned energy demand loss of at least 300 megawatts. This dataset was accessed from Purdue University’s Laboratory for Advancing Sustainable Critical Infrastructure, available at Purdue LASCI.
In my analysis, I aim to answer the question: what are the characteristics of major power outages with higher severity, and how can we use them to predict the duration of a severe outage? In answering this question, I hope to have some positive impact in improving predictive measures for determining outage length and severity.
Data Overview
The original raw DataFrame contains 1,534 rows (corresponding to 1,534 outages) and 57 columns. I focus on the following key columns for my analysis:
Column | Description |
---|---|
YEAR | Indicates the year when the outage event occurred. |
MONTH | Indicates the month when the outage event occurred. |
U.S._STATE | Represents all the states in the continental U.S. |
NERC.REGION | The North American Electric Reliability Corporation (NERC) regions involved in the outage event. |
CLIMATE.REGION | U.S. climate regions as specified by the National Centers for Environmental Information (9 regions). |
ANOMALY.LEVEL | The oceanic El Niño/La Niña (ONI) index, referring to cold and warm episodes by season. |
CAUSE.CATEGORY | Categories of all the events causing the major power outages. |
OUTAGE.DURATION | Duration of outage events (in minutes). |
TOTAL.PRICE | Average monthly electricity price in the U.S. state (cents/kilowatt-hour). |
TOTAL.SALES | Total electricity consumption in the U.S. state (megawatt-hours). |
TOTAL.CUSTOMERS | Annual number of total customers served in the U.S. state. |
POPPCT_URBAN | Percentage of the total population of the U.S. state represented by the urban population (in %). |
POPDEN_URBAN | Population density of the urban areas (persons per square mile). |
AREAPCT_URBAN | Percentage of the land area of the U.S. state represented by the land area of the urban areas (in %). |
UTIL.CONTRI | Utility industry’s contribution to the total real GDP of the state (in %). |
PC.REALGSP.STATE | Per capita real gross state product (GSP) in the U.S. state (measured in 2009 chained U.S. dollars). |
Data Cleaning and EDA
Cleaning:
Steps I took in cleaning my data for analysis:
- Selected Relevant Columns: Filtered to only focus on columns listed above.
- Replaced 0 Values: Replaced `0` values in our target column `OUTAGE.DURATION` with `NaN`, as `0` likely represents missing data.
- Deleted Duplicates: There were 63 duplicates in our cleaned DataFrame with its limited columns; even though these rows are not duplicates in the larger dataset, we drop them here for our analysis. A minimal code sketch of these cleaning steps is shown below, followed by the head of the cleaned DataFrame.
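Below is a minimal sketch of these cleaning steps in pandas; the file name and the exact layout of the raw Purdue workbook are assumptions here.

```python
import numpy as np
import pandas as pd

# Hypothetical file name; the Purdue LASCI data ships as an Excel workbook.
outages = pd.read_excel("outage.xlsx")

# 1. Keep only the columns used in this analysis.
keep_cols = [
    "YEAR", "MONTH", "U.S._STATE", "NERC.REGION", "CLIMATE.REGION",
    "ANOMALY.LEVEL", "CAUSE.CATEGORY", "OUTAGE.DURATION", "TOTAL.PRICE",
    "TOTAL.SALES", "TOTAL.CUSTOMERS", "POPPCT_URBAN", "POPDEN_URBAN",
    "AREAPCT_URBAN", "UTIL.CONTRI", "PC.REALGSP.STATE",
]
outages = outages[keep_cols]

# 2. Treat a duration of 0 minutes as missing data.
outages["OUTAGE.DURATION"] = outages["OUTAGE.DURATION"].replace(0, np.nan)

# 3. Drop rows that are duplicated across this reduced set of columns.
outages = outages.drop_duplicates()
```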
YEAR | MONTH | U.S._STATE | NERC.REGION | CLIMATE.REGION | ANOMALY.LEVEL | CAUSE.CATEGORY | OUTAGE.DURATION | TOTAL.PRICE | TOTAL.SALES | TOTAL.CUSTOMERS | POPPCT_URBAN | POPDEN_URBAN | AREAPCT_URBAN | UTIL.CONTRI | PC.REALGSP.STATE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 | 7 | Minnesota | MRO | East North Central | -0.3 | severe weather | 3060 | 9.28 | 6.56252e+06 | 2595696 | 73.27 | 2279 | 2.14 | 1.75139 | 51268 |
2014 | 5 | Minnesota | MRO | East North Central | -0.1 | intentional attack | 1 | 9.28 | 5.28423e+06 | 2640737 | 73.27 | 2279 | 2.14 | 1.79 | 53499 |
2010 | 10 | Minnesota | MRO | East North Central | -1.5 | severe weather | 3000 | 8.15 | 5.22212e+06 | 2586905 | 73.27 | 2279 | 2.14 | 1.70627 | 50447 |
2012 | 6 | Minnesota | MRO | East North Central | -0.1 | severe weather | 2550 | 9.19 | 5.78706e+06 | 2606813 | 73.27 | 2279 | 2.14 | 1.93209 | 51598 |
2015 | 7 | Minnesota | MRO | East North Central | 1.2 | severe weather | 1740 | 10.43 | 5.97034e+06 | 2673531 | 73.27 | 2279 | 2.14 | 1.6687 | 54431 |
EDA:
Univariate Analysis:
In my first univariate analysis, I looked at the distribution of our target variable `OUTAGE.DURATION` to see the range of values we are predicting. Initially I plotted the distribution of the entire column, but the outliers were so extreme that it was hard to see the shape of the distribution. For this plot, I therefore filtered to only include `OUTAGE.DURATION` values within the variable's IQR:
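A sketch of how this filtered histogram could be produced, continuing from the cleaned `outages` DataFrame above; the use of plotly express as the plotting library is an assumption.

```python
import plotly.express as px

# Keep only durations between the first and third quartiles.
q1 = outages["OUTAGE.DURATION"].quantile(0.25)
q3 = outages["OUTAGE.DURATION"].quantile(0.75)
in_iqr = outages[outages["OUTAGE.DURATION"].between(q1, q3)]

# Histogram of the filtered durations.
fig = px.histogram(in_iqr, x="OUTAGE.DURATION",
                   title="Distribution of OUTAGE.DURATION (within IQR)")
fig.show()
```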
Bivariate Analysis:
I wanted to see what kinds of outages were taking place in each `CLIMATE.REGION`, in order to inspect whether there are trends or a prominence of certain causes in certain regions:
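A sketch of one way to build this bivariate plot, again assuming plotly express; the exact chart type used in the original analysis is an assumption.

```python
import plotly.express as px

# Count outages by climate region and cause category.
counts = (
    outages.groupby(["CLIMATE.REGION", "CAUSE.CATEGORY"])
    .size()
    .reset_index(name="COUNT")
)

# Grouped bar chart: one group per climate region, one bar per cause category.
fig = px.bar(counts, x="CLIMATE.REGION", y="COUNT",
             color="CAUSE.CATEGORY", barmode="group")
fig.show()
```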
Aggregated Tables:
Given this theme of looking at outage causes in different climate regions, mainly in hopes of finding some sort of geographical pattern that makes practical sense, I grouped by `CLIMATE.REGION` and `CAUSE.CATEGORY` to look at the mean `OUTAGE.DURATION` in each of these pairings (a code sketch of this aggregation follows the table below):
Climate Region | Cause Category | Average Outage Duration (minutes) | Count of Cases |
---|---|---|---|
Central | equipment failure | 322 | 7 |
Central | fuel supply emergency | 10035.2 | 4 |
Central | intentional attack | 346.059 | 38 |
Central | islanding | 125.333 | 3 |
Central | public appeal | 1410 | 2 |
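A minimal sketch of the aggregation behind this table, continuing from the cleaned `outages` DataFrame; the display column names are just for presentation.

```python
# Mean outage duration and number of cases per (region, cause) pairing.
agg = (
    outages.groupby(["CLIMATE.REGION", "CAUSE.CATEGORY"])["OUTAGE.DURATION"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "Average Outage Duration (minutes)",
                     "count": "Count of Cases"})
    .reset_index()
)
```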
Imputation:
We have a very small number of missing values in our data. The column with the largest missing-value count is `OUTAGE.DURATION`, with 58 null values and 78 zeros (essentially null values as far as we are concerned). Because this amounts to under 9% of our total dataset, I took the liberty of simply dropping these null values for simplicity. Therefore, we did not impute any values in this dataset. A sketch of this step is shown below.
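A one-line sketch of this drop; because zeros were already converted to `NaN` during cleaning, both kinds of missing durations are removed here.

```python
# Drop rows where the target duration is missing (originally null or 0).
outages = outages.dropna(subset=["OUTAGE.DURATION"])
```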
My Prediction Problem
My goal is to predict the duration of an outage event, `OUTAGE.DURATION`, which is measured in minutes. This is a regression problem, as `OUTAGE.DURATION` is a continuous variable.
I chose this problem mainly because outage duration is crucial for utility companies to allocate resources effectively, improve restoration times, and minimize customer dissatisfaction during outages. A model that predicts outage duration could help operators better prepare for prolonged outages caused by severe weather, technical failures, or other factors.
The metric I am going to use to evaluate my model is mean absolute error (MAE) instead of the more standard mean squared error. The main reason for this is that in my EDA, I found `OUTAGE.DURATION` to have some extremely large outliers, and MAE avoids over-emphasizing these large values, unlike MSE or RMSE.
In order to create this model, we need to focus only on variables that will be available to us at the time of prediction. Fortunately, we already filtered out of our DataFrame the columns that would only be realized later on, so all variables in our cleaned DataFrame are fair game to be used in our model.
Baseline Model
For my baseline model, I used a fairly simple linear regression model to predict `OUTAGE.DURATION`. I used 4 different variables, one of which is categorical (`CAUSE.CATEGORY`) and the other 3 numerical (`ANOMALY.LEVEL`, `MONTH`, `YEAR`).
I chose this mix of variables because I believed that, as a concise group, these 4 could have a strong and holistic predictive outcome. They cover an aspect of climate, as they include both time of year (`MONTH`) and `ANOMALY.LEVEL`. On top of this, they include the year, which in our EDA we saw to have an effect on the number of outages, as well as `CAUSE.CATEGORY` to cover what type of outage this was.
I encoded `CAUSE.CATEGORY` using a one-hot encoder. As for our numerical columns, I used a standard scaler transformation, in case we decided later down the line to inspect our coefficients for any reason.
I used an 80-20 train-test split to evaluate the model's performance and ultimately used mean absolute error (MAE) to measure it, for the reasons stated above. MAE is straightforward, as it tells me how many minutes off my predictions are on average.
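A minimal sketch of this baseline pipeline with scikit-learn; the random seed is an assumption, and the feature lists follow the description above.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

features = ["CAUSE.CATEGORY", "ANOMALY.LEVEL", "MONTH", "YEAR"]
X = outages[features]
y = outages["OUTAGE.DURATION"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One-hot encode the categorical column; standard-scale the numerical ones.
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["CAUSE.CATEGORY"]),
    ("num", StandardScaler(), ["ANOMALY.LEVEL", "MONTH", "YEAR"]),
])

baseline = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", LinearRegression()),
])

baseline.fit(X_train, y_train)
print(mean_absolute_error(y_test, baseline.predict(X_test)))
```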
For our initial model, which I did not expect to perform very well, we saw an MAE of 2522.05 minutes, which seemed very high. We did see a lot of variance in our initial distribution plot of `OUTAGE.DURATION`, which somewhat justifies such a high error, but we will do our best to improve on it.
Final Model
For my final model, I aimed to improve on the baseline by adding more features and conducting a quantile transformation on the numerical columns in the dataset. I also needed to impute some values, as there were 12 missing values in the `TOTAL.SALES` and `TOTAL.PRICE` columns that were interfering with the model.
For this model I used all the columns in our cleaned dataset, scaling the numerical columns with the quantile transformer and encoding the categorical ones with a one-hot encoder again.
I experimented with a few different regression models, including Lasso, but was unable to lower the MAE until I arrived at the random forest regressor. I then used GridSearchCV to tune the hyperparameters `n_estimators`, `max_depth`, and `min_samples_split`.
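A sketch of what this final pipeline and search could look like; the hyperparameter grid values, the mean-imputation strategy, and the `n_quantiles` setting are assumptions, since the write-up does not list them.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

numeric_cols = [
    "YEAR", "MONTH", "ANOMALY.LEVEL", "TOTAL.PRICE", "TOTAL.SALES",
    "TOTAL.CUSTOMERS", "POPPCT_URBAN", "POPDEN_URBAN", "AREAPCT_URBAN",
    "UTIL.CONTRI", "PC.REALGSP.STATE",
]
categorical_cols = ["U.S._STATE", "NERC.REGION", "CLIMATE.REGION", "CAUSE.CATEGORY"]

X = outages[numeric_cols + categorical_cols]
y = outages["OUTAGE.DURATION"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Impute and quantile-transform numerical columns; one-hot encode categoricals.
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),       # assumed strategy
        ("scale", QuantileTransformer(n_quantiles=100)),  # assumed setting
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

final_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# Illustrative grid; the exact values searched in the project are not listed.
param_grid = {
    "regressor__n_estimators": [100, 200],
    "regressor__max_depth": [None, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
}

search = GridSearchCV(final_pipeline, param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)
print(mean_absolute_error(y_test, search.predict(X_test)))
```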
This time the model performed better, lowering the mean absolute error by almost 300 minutes, to 2240.52 minutes. This is definitely a result of using a better regressor, as well as the use of GridSearchCV.