We want to express the predictions as probabilities so that we can compare our model's predicition to betting lines published by Vegas, Fanduel, Draftkings et al.
- Frame the problem and look at the big picture.
- The goal is not to build an algorithm that can predict games perfectly. There is too much randomness in baseball for that. The goal is to make probabalistic statements (the Phillies will win this game 45% of the time) and compare those statements to those made by betting institutions. I will be happy if we even come close to their level of accuracy.
- I should note that the only way to measure that accuracy will be against live data during the baseball season, so I am aiming to have this model in production by March 2021.
- There are thousands of baseball statistics that are collected every day of the season. Part of the challenge is going to be determining which are important and which are just noise.
- Get the data
- see data_gathering directory.
- I used pybaseball to scrape
- pitch information from statcast
- team and player information from fangraphs
- lineup, starting pitcher, and outcome information from Retrosheet
- TODO - for production I will need to get today's lineups and starting pitchers to make predictions against.
- Explore the data
- see data_exploration folder.
- I made detailed notes on every feature of the fangraphs data, see fangraphs_variable_description.xlsx
- I analyzed win to stat correlations on the team level to glean some guidance on which stats to pay attention to in the massive fangraphs dataset. See Team_Stats_Analysis.ipynb
- Prepare the data
- Shortlist promising models
- Fine-tune the system
- Launch!
Pandas, Numpy, Matplotlib, Seaborn, Scikit-Learn https://github.com/jldbc/pybaseball - python library to scrape baseball information from Fangraphs, Statcast, Retrosheet, etc. I am also a contributor to this project.
Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow by Aurelien Geron
Retrosheet Data Notice
Recipients of Retrosheet data are free to make any desired use of the information, including (but not limited to) selling it, giving it away, or producing a commercial product based upon the data. Retrosheet has one requirement for any such transfer of data or product development, which is that the following statement must appear prominently:
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".
Retrosheet makes no guarantees of accuracy for the information that is supplied. Much effort is expended to make our website as correct as possible, but Retrosheet shall not be held responsible for any consequences arising from the use the material presented here. All information is subject to corrections as additional data are received. We are grateful to anyone who discovers discrepancies and we appreciate learning of the details.