
This is the final flask application for my Data Incubator Capstone Project


rohithdesikan/SolarInformationApp


FlaskAppFinal

This is the final Flask application for my Data Incubator Capstone Project: https://rdflaskappfinal.herokuapp.com/

In this project, I attempt to predict annual solar savings by zip code for 72,000 areas around the United States. The data comes from REPLICA (Rooftop Energy Potential of Low Income Communities in America), a project in the National Renewable Energy Lab's Open Data catalog. The dataset contains information about residential rooftop solar technical potential at the tract level. It was first released in April 2018, which makes it very new, and it is the most accurate data on residential rooftop solar potential derived from LiDAR. Combining it with demographic information yields a unique dataset for researchers, planners, policy makers and private companies. Estimates of potential electric bill savings accompany the dataset; I downloaded them separately from a related dataset. Given this unprecedented accuracy, the dataset was worth delving into to find value.

For private companies, the largest cost is customer acquisition, so any chance to shave off costs in this part of the business will sharply improve the bottom line.

The total size of the dataset is ~10.5 GB. I downloaded each of the zip files within the dataset, along with the supplemental data containing demographic information, and used Python to open all the zip files, inspect their contents and join each table on GEOID. I then downloaded solar irradiation data and potential electric bill savings from a separate source and joined them by GEOID as well. The resulting dataset can be used to figure out which areas around the US are most likely to see savings from rooftop solar photovoltaics, pointing towards the most favorable locations within each state to acquire customers.
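The zip-and-join step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes pandas is used, that each archive contains CSV tables, and that every table has a `GEOID` column. The file names, the other column names and the tiny in-memory archive are all hypothetical stand-ins for the real ~10.5 GB download.

```python
import io
import zipfile
from functools import reduce

import pandas as pd


def read_zipped_tables(zip_bytes):
    """Return one DataFrame per CSV file found inside a zip archive."""
    tables = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".csv"):
                with archive.open(name) as handle:
                    # Read GEOID as text so leading zeros are preserved.
                    tables.append(pd.read_csv(handle, dtype={"GEOID": str}))
    return tables


def make_demo_zip():
    """Build a tiny in-memory archive (hypothetical files and columns)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "rooftop.csv",
            "GEOID,developable_roof_sqft\n01001020100,1200\n06037101110,3400\n",
        )
        archive.writestr(
            "demographics.csv",
            "GEOID,median_income\n"
            "01001020100,41000\n06037101110,58000\n36061000100,72000\n",
        )
    return buffer.getvalue()


tables = read_zipped_tables(make_demo_zip())

# Join every table on GEOID; the inner join keeps only areas present in all tables.
joined = reduce(
    lambda left, right: left.merge(right, on="GEOID", how="inner"), tables
)
print(joined)
```

Reading `GEOID` as a string matters because Census identifiers for states such as Alabama start with `0`, which would be lost if the column were parsed as an integer. An inner join is only one possible choice; a left join would keep areas that are missing from some tables.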

Before this, solar irradiation (the amount of solar energy falling on a surface) was one of the few features used to estimate savings. This dataset, however, also contains the amount of developable roof area within each zip code. For example, even if an area receives a lot of solar energy, it will not be a good location to acquire customers if it has only tall or sparse apartment buildings, because the cost per customer ($/customer) would be very high. Furthermore, demographics are vital for figuring out where customers are most likely to be receptive to rooftop solar. After creating the base dataset, I removed the features I did not need and converted the categorical features, such as climate zone, heating degree days (a measure of the weather) and the type of utility that serves each region, into a form the models could use.
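The text above does not say how the categorical features were converted. One common option is one-hot encoding, sketched below with `pandas.get_dummies`. The column names and category values are hypothetical, and the choice of one-hot encoding is an assumption rather than the project's documented method.

```python
import pandas as pd

# Hypothetical sample of the joined dataset; real column names may differ.
features = pd.DataFrame(
    {
        "GEOID": ["01001020100", "06037101110", "36061000100"],
        "climate_zone": ["Hot-Dry", "Cold", "Hot-Dry"],
        "utility_type": ["Investor Owned", "Cooperative", "Municipal"],
        "developable_roof_sqft": [1200.0, 3400.0, 800.0],
    }
)

# Replace each categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(
    features, columns=["climate_zone", "utility_type"], dtype=int
)
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, which works with both the tree-based models and the ridge regression. For categories with a natural order, an ordinal encoding would be an alternative.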

Once I had created this final dataset, a literature review showed that demographic data has a nonlinear relationship with annual solar savings, and with monetary incentives or savings in general. For the demographic features, I therefore used a Random Forest Regressor with 300 estimators and a max depth of 8. For the LiDAR data, I tried both a ridge regression and a gradient boosting regression, and the latter performed better; its hyperparameters were 600 estimators, a depth of 12 and a learning rate of 0.02. Finally, I combined these two models, one for each section of the data, into an ensemble model (a FeatureUnion in scikit-learn) and ran a simple ridge regression with alpha = 2 on top to make the final predictions. As the metric, I used the mean absolute error because I did not want outliers to be penalized heavily, and obtained a training error of $50 and a test error of $70 for annual savings.
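One way such an ensemble could be wired together in scikit-learn is sketched below. The hyperparameters come from the description above, but everything else is an assumption: the `ColumnModelPredictions` wrapper is a hypothetical helper (a `FeatureUnion` needs transformers, so each regressor's prediction is exposed as a single feature), the column split is made up, and the data is synthetic.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline


class ColumnModelPredictions(TransformerMixin, BaseEstimator):
    """Hypothetical helper: fit a regressor on some columns and
    output its predictions as a single feature column."""

    def __init__(self, estimator, columns):
        self.estimator = estimator
        self.columns = columns

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X[:, self.columns], y)
        return self

    def transform(self, X):
        return self.estimator_.predict(X[:, self.columns]).reshape(-1, 1)


# Synthetic stand-in data: columns 0-1 play the role of demographic
# features, columns 2-3 the role of LiDAR-derived features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (500 + 80 * np.sin(X[:, 0]) + 40 * X[:, 1] ** 2
     + 120 * X[:, 2] + rng.normal(scale=10, size=200))
DEMOGRAPHIC_COLS, LIDAR_COLS = [0, 1], [2, 3]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stacked = Pipeline([
    ("base_models", FeatureUnion([
        ("demographic_rf", ColumnModelPredictions(
            RandomForestRegressor(n_estimators=300, max_depth=8,
                                  random_state=0),
            DEMOGRAPHIC_COLS)),
        ("lidar_gbr", ColumnModelPredictions(
            GradientBoostingRegressor(n_estimators=600, max_depth=12,
                                      learning_rate=0.02, random_state=0),
            LIDAR_COLS)),
    ])),
    # The ridge regression combines the two base-model predictions.
    ("final_ridge", Ridge(alpha=2.0)),
])
stacked.fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, stacked.predict(X_train))
test_mae = mean_absolute_error(y_test, stacked.predict(X_test))
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```

In this sketch the ridge regression is fitted on in-sample predictions of the base models, which tends to make the training error look optimistic. scikit-learn's `StackingRegressor` uses cross-validated predictions instead and may be worth considering. The mean absolute error grows linearly with the size of each error, so a few outliers affect it less than they would affect a squared-error metric.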

As for visualization, the main page provides an interactive map that shows predicted solar savings for the entire country. The Solar Information App links to predicted solar savings by state, so you can drill down to see which county in each state sees the maximum savings. The Exploratory Graphs page shows scatter plots of several important variables as they relate to annual solar savings: choose the variable you are interested in, and a visualization created with Bokeh will show its relationship to solar savings.

The ML Flowchart page shows a visualization of this entire process.

If similar data is available for other countries, the same type of analysis can be performed and extended to that data as well. The machine learning model can be continuously updated to reflect new training data, potentially from other countries. Note that, as the original dataset intended, this analysis can also be used by policy makers to enact better solar policy, especially for low-income communities, since the features include annual median income and the LiDAR data is available for both single-family and multi-family homes.
