Analyzing Power Outages

Project for EECS 398 at University of Michigan

By Jackson Gertner


Introduction

For my analysis, I looked at a dataset of major power outages in the U.S. from January 2000 to July 2016. A major outage is defined by the Department of Energy as one that impacted at least 50,000 customers or caused an unplanned loss of at least 300 megawatts of energy demand. This dataset was accessed from Purdue University’s Laboratory for Advancing Sustainable Critical Infrastructure, available at Purdue LASCI.

In my analysis, I aim to answer the question: what are the characteristics of more severe major power outages, and how can we use them to predict the duration of a severe outage? In answering this question, I hope to have some positive impact in improving predictive measures for determining outage length and severity.


Data Overview

The original raw DataFrame contains 1,534 rows (corresponding to 1,534 outages) and 57 columns. I focus on the following key columns for my analysis:

| Column | Description |
|--------|-------------|
| YEAR | Indicates the year when the outage event occurred. |
| MONTH | Indicates the month when the outage event occurred. |
| U.S._STATE | Represents all the states in the continental U.S. |
| NERC.REGION | The North American Electric Reliability Corporation (NERC) regions involved in the outage event. |
| CLIMATE.REGION | U.S. climate regions as specified by the National Centers for Environmental Information (9 regions). |
| ANOMALY.LEVEL | The oceanic El Niño/La Niña (ONI) index, referring to cold and warm episodes by season. |
| CAUSE.CATEGORY | Categories of all the events causing the major power outages. |
| OUTAGE.DURATION | Duration of outage events (in minutes). |
| TOTAL.PRICE | Average monthly electricity price in the U.S. state (cents/kilowatt-hour). |
| TOTAL.SALES | Total electricity consumption in the U.S. state (megawatt-hours). |
| TOTAL.CUSTOMERS | Annual number of total customers served in the U.S. state. |
| POPPCT_URBAN | Percentage of the total population of the U.S. state represented by the urban population (in %). |
| POPDEN_URBAN | Population density of the urban areas (persons per square mile). |
| AREAPCT_URBAN | Percentage of the land area of the U.S. state represented by the land area of urban areas (in %). |
| UTIL.CONTRI | Utility industry’s contribution to the total real GDP of the state (in %). |
| PC.REALGSP.STATE | Per capita real gross state product (GSP) in the U.S. state (measured in 2009 chained U.S. dollars). |

Data Cleaning and EDA

Cleaning:

Steps I took in cleaning my data for analysis:

  • Selected Relevant Columns: Filtered to only focus on columns listed above.
  • Replaced 0 Values: Replaced 0 values in our target column, OUTAGE.DURATION, with NaN, since a 0 likely represents missing data.
  • Deleted Duplicates: There were 63 duplicate rows in the cleaned DataFrame with its reduced set of columns; even though these rows are not duplicates in the larger dataset, I dropped them for this analysis.
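
Roughly, these steps look like the following in pandas (a sketch; the DataFrame name outages_raw and the loading step are assumptions of mine, while the column names match the table above):

```python
import numpy as np

# Columns listed in the Data Overview table
key_cols = [
    'YEAR', 'MONTH', 'U.S._STATE', 'NERC.REGION', 'CLIMATE.REGION',
    'ANOMALY.LEVEL', 'CAUSE.CATEGORY', 'OUTAGE.DURATION', 'TOTAL.PRICE',
    'TOTAL.SALES', 'TOTAL.CUSTOMERS', 'POPPCT_URBAN', 'POPDEN_URBAN',
    'AREAPCT_URBAN', 'UTIL.CONTRI', 'PC.REALGSP.STATE',
]

# 1. Keep only the relevant columns
outages = outages_raw[key_cols].copy()

# 2. Treat 0-minute durations as missing data
outages['OUTAGE.DURATION'] = outages['OUTAGE.DURATION'].replace(0, np.nan)

# 3. Drop rows that are duplicated within this reduced set of columns
outages = outages.drop_duplicates()
```

The first few rows of the cleaned DataFrame look like this: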
| YEAR | MONTH | U.S._STATE | NERC.REGION | CLIMATE.REGION | ANOMALY.LEVEL | CAUSE.CATEGORY | OUTAGE.DURATION | TOTAL.PRICE | TOTAL.SALES | TOTAL.CUSTOMERS | POPPCT_URBAN | POPDEN_URBAN | AREAPCT_URBAN | UTIL.CONTRI | PC.REALGSP.STATE |
|------|-------|------------|-------------|----------------|---------------|----------------|-----------------|-------------|-------------|-----------------|--------------|--------------|---------------|-------------|------------------|
| 2011 | 7 | Minnesota | MRO | East North Central | -0.3 | severe weather | 3060 | 9.28 | 6.56252e+06 | 2595696 | 73.27 | 2279 | 2.14 | 1.75139 | 51268 |
| 2014 | 5 | Minnesota | MRO | East North Central | -0.1 | intentional attack | 1 | 9.28 | 5.28423e+06 | 2640737 | 73.27 | 2279 | 2.14 | 1.79 | 53499 |
| 2010 | 10 | Minnesota | MRO | East North Central | -1.5 | severe weather | 3000 | 8.15 | 5.22212e+06 | 2586905 | 73.27 | 2279 | 2.14 | 1.70627 | 50447 |
| 2012 | 6 | Minnesota | MRO | East North Central | -0.1 | severe weather | 2550 | 9.19 | 5.78706e+06 | 2606813 | 73.27 | 2279 | 2.14 | 1.93209 | 51598 |
| 2015 | 7 | Minnesota | MRO | East North Central | 1.2 | severe weather | 1740 | 10.43 | 5.97034e+06 | 2673531 | 73.27 | 2279 | 2.14 | 1.6687 | 54431 |

EDA:

Univariate Analysis:

In my first univariate analysis, I looked at the distribution of our target variable, OUTAGE.DURATION, to see the range of values we are predicting. Initially I plotted the distribution of the full variable, but the outliers were so extreme that the bulk of the distribution was hard to see. For this plot, I therefore filtered to only the OUTAGE.DURATION values within the variable’s interquartile range (IQR):
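
A sketch of that filter, assuming the cleaned DataFrame from above is called outages (the plotting code here is illustrative, not the exact code behind the figure):

```python
import plotly.express as px

dur = outages['OUTAGE.DURATION'].dropna()
q1, q3 = dur.quantile(0.25), dur.quantile(0.75)

# Keep only durations between the first and third quartiles so the
# extreme outliers do not swamp the histogram.
in_iqr = dur[(dur >= q1) & (dur <= q3)]

fig = px.histogram(x=in_iqr, nbins=50,
                   title='Distribution of OUTAGE.DURATION (within the IQR)')
fig.show()
```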

Bivariate Analysis:

I wanted to see what kinds of outages were taking place in each CLIMATE.REGION, in order to inspect whether there are trends or a prominence of certain causes in certain regions:
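
One way to build that view, again as a sketch on the outages DataFrame rather than the exact figure code:

```python
import plotly.express as px

# Count outages for each (climate region, cause) pair and plot grouped bars.
counts = (
    outages.groupby(['CLIMATE.REGION', 'CAUSE.CATEGORY'])
    .size()
    .reset_index(name='NUM.OUTAGES')
)

fig = px.bar(counts, x='CLIMATE.REGION', y='NUM.OUTAGES',
             color='CAUSE.CATEGORY', barmode='group',
             title='Outage causes by climate region')
fig.show()
```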

Aggregated Tables:

Given this theme of looking at outage causes in different climate regions, mainly in hopes of finding some sort of geographical pattern that makes practical sense, I grouped by CLIMATE.REGION and CAUSE.CATEGORY and looked at the mean OUTAGE.DURATION for each of these pairings:

| Climate Region | Cause Category | Average Outage Duration (minutes) | Count of Cases |
|----------------|----------------|-----------------------------------|----------------|
| Central | equipment failure | 322 | 7 |
| Central | fuel supply emergency | 10035.2 | 4 |
| Central | intentional attack | 346.059 | 38 |
| Central | islanding | 125.333 | 3 |
| Central | public appeal | 1410 | 2 |
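
The table above (shown here truncated to the Central region) comes from a groupby along these lines (a sketch on the same outages DataFrame):

```python
# Mean duration and number of cases for each (climate region, cause) pairing
agg = (
    outages.groupby(['CLIMATE.REGION', 'CAUSE.CATEGORY'])['OUTAGE.DURATION']
    .agg(['mean', 'count'])
    .rename(columns={'mean': 'AVG.DURATION', 'count': 'NUM.CASES'})
    .reset_index()
)
```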

Imputation:

We have very few missing values in our data. The column with the largest missing-value count is OUTAGE.DURATION, with 58 null values and 78 zeros (effectively null values as far as we are concerned). Because this amounts to under 9% of our total dataset, I simply dropped these rows for simplicity. Therefore, I did not impute any values at this stage.
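
In code, this is just a dropna on the target (a small sketch continuing from the cleaning steps above):

```python
# Drop rows with a missing outage duration rather than imputing them
# (the 0-minute durations were already converted to NaN during cleaning).
outages = outages.dropna(subset=['OUTAGE.DURATION'])
```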

My Prediction Problem

My goal is to predict the duration of an outage event, OUTAGE.DURATION, which is measured in minutes. This is a regression problem, as OUTAGE.DURATION is a continuous variable.

I chose this problem mainly because outage duration is crucial for utility companies to allocate resources effectively, improve restoration times, and minimize customer dissatisfaction during outages. A model that predicts outage duration could help operators better prepare for prolonged outages caused by severe weather, technical failures, or other factors.

The metric I am going to use to evaluate my model is mean absolute error (MAE) instead of the more standard mean squared error. The main reason is that, in my EDA, I found OUTAGE.DURATION to have some extremely large outliers, and MAE avoids emphasizing those outliers the way MSE or RMSE would.
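
A quick illustration of the difference, using a made-up set of residuals rather than anything from the model:

```python
import numpy as np

# Most predictions are off by ~100 minutes, one by 20,000 minutes.
errors = np.array([100, 120, 80, 150, 20_000])

mae = np.mean(np.abs(errors))          # ~4,090 minutes
rmse = np.sqrt(np.mean(errors ** 2))   # ~8,945 minutes
```

The single extreme residual roughly doubles RMSE relative to MAE, which is exactly the kind of distortion I want to avoid when judging the model.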

In order to create this model, we need to focus only on variables that would be available at the time of prediction. Fortunately, the columns that would only be realized later on were already filtered out of the DataFrame, so all variables in our cleaned DataFrame are fair game to be used in our model.

Baseline Model

For my baseline model, I used a simple linear regression model to predict OUTAGE.DURATION. I used 4 variables, one of which is categorical (CAUSE.CATEGORY) and the other 3 numerical (ANOMALY.LEVEL, MONTH, YEAR).

I chose this mix of variables because I believed that, as a concise group, these 4 could have a strong and holistic predictive outcome. They cover an aspect of climate by including both the time of year (MONTH) and ANOMALY.LEVEL. On top of this, they include the year, which in our EDA we saw to have an effect on the number of outages, as well as CAUSE.CATEGORY to cover what type of outage it was.

I encoded CAUSE.CATEGORY using a one-hot encoder. For the numerical columns, I applied a standard scaler transformation, in case we decide later down the line to inspect the coefficients.

I used an 80-20 train-test split to evaluate the model's performance and, for the reasons stated above, used mean absolute error (MAE) as the metric. MAE is straightforward to interpret: it tells me how many minutes off my predictions are on average.
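
Putting this together, the baseline looks roughly like the following sklearn pipeline (a sketch: the variable names, the random_state, and the assumption that rows with missing durations were already dropped are mine):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

features = ['CAUSE.CATEGORY', 'ANOMALY.LEVEL', 'MONTH', 'YEAR']
X = outages[features]
y = outages['OUTAGE.DURATION']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=398
)

preprocess = ColumnTransformer([
    ('cause', OneHotEncoder(handle_unknown='ignore'), ['CAUSE.CATEGORY']),
    ('nums', StandardScaler(), ['ANOMALY.LEVEL', 'MONTH', 'YEAR']),
])

baseline = Pipeline([
    ('preprocess', preprocess),
    ('regressor', LinearRegression()),
])

baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
```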

For this initial model, which I did not expect to perform very well, we saw an MAE of 2522.05 minutes, which seemed very high. We did see a lot of variance in our initial distribution plot of OUTAGE.DURATION, which helps justify such a high error, but we will do our best to improve it.

Final Model

For my final model, I aimed to improve on the baseline by adding more features and applying a quantile transformation to the numerical columns in the dataset. I also needed to impute some values, as there were 12 missing values in the TOTAL.SALES and TOTAL.PRICE columns that were interfering with the model.

For this model, I used all the columns in the cleaned dataset, scaling the numerical columns with a quantile transformer and one-hot encoding the categorical ones again.

I experimented with a few different regression models, including Lasso, but was unable to lower the MAE until I arrived at a random forest regressor. I then used GridSearchCV to tune the hyperparameters n_estimators, max_depth, and min_samples_split.
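
In outline, the final pipeline looks something like this (a sketch: the imputation strategy, the exact hyperparameter grid, and the random_state are illustrative assumptions, not necessarily the settings I used):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer

categorical = ['U.S._STATE', 'NERC.REGION', 'CLIMATE.REGION', 'CAUSE.CATEGORY']
numerical = [
    'YEAR', 'MONTH', 'ANOMALY.LEVEL', 'TOTAL.PRICE', 'TOTAL.SALES',
    'TOTAL.CUSTOMERS', 'POPPCT_URBAN', 'POPDEN_URBAN', 'AREAPCT_URBAN',
    'UTIL.CONTRI', 'PC.REALGSP.STATE',
]

X = outages[categorical + numerical]
y = outages['OUTAGE.DURATION']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=398
)

preprocess = ColumnTransformer([
    ('cats', OneHotEncoder(handle_unknown='ignore'), categorical),
    # Fill the handful of missing TOTAL.SALES / TOTAL.PRICE values, then
    # rescale every numerical column with a quantile transformation.
    ('nums', Pipeline([
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', QuantileTransformer()),
    ]), numerical),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('regressor', RandomForestRegressor(random_state=398)),
])

param_grid = {
    'regressor__n_estimators': [100, 200, 400],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5, 10],
}

search = GridSearchCV(model, param_grid,
                      scoring='neg_mean_absolute_error', cv=5)
search.fit(X_train, y_train)
```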

This time the model performed better, lowering the mean absolute error by almost 300 minutes, to 2240.52 minutes. This is largely a result of using a better regressor, along with the hyperparameter tuning from GridSearchCV.