For this project, I dug into historical game data to see which factors play a role in fan turnout. I scraped Baseball-Reference for data on every MLB game from 1990 to 2016. Based on my findings, I built a regression model to predict the attendance for a given game. I built a single model that can predict attendance for any team, and while it proved accurate, the same process could be repeated for a specific team to improve the model's performance. That team could then use the results to anticipate low-attendance nights and develop effective marketing or promotional strategies.
One of the most interesting projects I recently worked on involved collecting salary information on data science jobs in order to predict salaries from the location, title, and job summary. The project was a real test of a few newly acquired data science skills: gathering data from a webpage, performing natural language processing on text data, and building a classification model. After collecting the data, I built a model to predict whether a job's salary would fall above or below the median salary for a data scientist.
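As a rough illustration of the setup, the sketch below turns salaries into an above/below-median label and job titles into word counts that a classifier could consume. The postings, field names, and tokenization are hypothetical stand-ins rather than the scraped data or the exact features used in the project, and the classifier itself is left out.

```python
from collections import Counter
from statistics import median

# Hypothetical postings; the real project scraped these from a job site.
postings = [
    {"title": "Senior Data Scientist", "salary": 150000},
    {"title": "Data Scientist", "salary": 110000},
    {"title": "Junior Data Analyst", "salary": 65000},
    {"title": "Data Analyst", "salary": 80000},
]

# Binary target: 1 if the salary is above the median, 0 otherwise.
median_salary = median(p["salary"] for p in postings)
labels = [1 if p["salary"] > median_salary else 0 for p in postings]

# Bag-of-words features: one count per vocabulary word for each title.
tokenized = [p["title"].lower().split() for p in postings]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
features = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]
```

Each row of `features` lines up with one entry in `labels`, which is the shape most classification libraries expect.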
The state of Iowa provides many datasets on its website, including one that contains transactions for all liquor stores in the state from January 2015 through March 2016. With this information available, my goal was to analyze the data and build a linear regression model to predict sales for the rest of 2016. I created a model that described the relationship between first-quarter 2015 sales and first-quarter 2016 sales. Then, using that model, I predicted sales for quarters 2 through 4 of 2016 from sales for quarters 2 through 4 of 2015.
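The two-step approach can be sketched as follows: fit a line from first-quarter 2015 sales to first-quarter 2016 sales, then apply that line to the 2015 totals for quarters 2 through 4. This is a minimal sketch with made-up per-store figures and a hand-written least-squares fit. The numbers, the per-store aggregation, and the variable names are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical per-store sales totals in dollars (not the Iowa data).
q1_2015 = [10000.0, 25000.0, 40000.0, 55000.0]
q1_2016 = [10500.0, 26250.0, 42000.0, 57750.0]
q2_q4_2015 = [32000.0, 80000.0, 120000.0, 170000.0]

# Step 1: learn the year-over-year relationship from the first quarters.
slope, intercept = fit_line(q1_2015, q1_2016)

# Step 2: apply it to the remaining 2015 quarters to project 2016.
q2_q4_2016_pred = [slope * sales + intercept for sales in q2_q4_2015]
```

The design assumes that the year-over-year relationship seen in the first quarter holds for the rest of the year, which is the core assumption of the approach described above.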
The Global Terrorism Database (GTD) is an open-source database of terrorist attacks around the world from 1970 through 2015. It includes systematic data on domestic as well as international terrorist incidents during this period and now covers more than 150,000 cases. For this project, I analyzed the 45 years of global terrorism data and created numerous visualizations to better understand the exceptionally large dataset. I applied Bayesian statistics to compare two countries and see whether one was significantly more dangerous than the other during the 21st century. Additionally, the year 1993 is missing from the dataset, so I attempted to estimate the number of bombings for that year using a time series model.
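One way such a two-country comparison could be set up is with a Gamma-Poisson model: treat yearly attack counts as Poisson draws, put a Gamma prior on each country's rate, and sample from the two posteriors to estimate the probability that one rate exceeds the other. The sketch below is an assumption about the general technique, not necessarily the model used in the project, and the counts and prior values are invented.

```python
import random
from typing import List, Sequence


def posterior_rate_samples(
    counts: Sequence[int],
    n_draws: int,
    rng: random.Random,
    prior_shape: float = 1.0,   # assumed weak prior, chosen for illustration
    prior_rate: float = 0.01,
) -> List[float]:
    """Draw from the Gamma posterior of a Poisson rate (conjugate update)."""
    shape = prior_shape + sum(counts)
    rate = prior_rate + len(counts)
    # random.gammavariate takes a scale parameter, which is 1 / rate.
    return [rng.gammavariate(shape, 1.0 / rate) for _ in range(n_draws)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical yearly attack counts for two countries (not GTD figures).
country_a = [120, 135, 150, 110]
country_b = [40, 55, 35, 50]

n_draws = 5000
samples_a = posterior_rate_samples(country_a, n_draws, rng)
samples_b = posterior_rate_samples(country_b, n_draws, rng)

# Estimated posterior probability that country A has the higher yearly rate.
prob_a_higher = sum(a > b for a, b in zip(samples_a, samples_b)) / n_draws
```

A probability close to 1 or 0 would indicate a clear difference between the two countries, while a value near 0.5 would mean the data cannot separate them.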