Hack NFL data using Postgres (and maybe win your fantasy draft)

CoolAssPuppy · on July 27, 2021

Today is opening day for most of the NFL’s training camps. I’m a huge fan of public datasets, and my coworkers and I happened upon this dataset from the NFL. We loaded it into our company’s product (TimescaleDB) and did some number crunching. We wanted to definitively answer some questions about whether or not there is a quantifiable difference in performance when playing at Mile High Stadium (there is) or if Tyreek Hill really is as fast as he seems (he is). If you’re a football fan, there’s some good data in there. At the very least, you may settle a few bar bets. Maybe you’ll even find something to help you win your fantasy league this year.

Cyclone_ · on July 27, 2021

What's the most interesting conclusion you were able to find using the advanced data that you wouldn't be able to see with the basic stats like QBR or yards per carry?

CoolAssPuppy · on July 27, 2021

I was suuuuper curious about the fastest player on the field (i.e., in full pads on game day), and the play-specific data includes acceleration as one of its data points. Miranda on our team (co-author of the post) dialed up a query to show the top 3, one of whom is Tyreek Hill. (The other two don’t play that much.)
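
For anyone wondering what a "top 3 by acceleration" query might look like, here is a minimal sketch. It is not the query from the post: the `tracking` table, its column names, and the rows are all made up for illustration, and it runs against an in-memory SQLite database rather than TimescaleDB so it can be tried anywhere.

```python
import sqlite3

# Hypothetical schema and synthetic rows; the real dataset's table and
# column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracking (player TEXT, ts TEXT, acceleration REAL)")
conn.executemany(
    "INSERT INTO tracking VALUES (?, ?, ?)",
    [
        ("Player A", "2018-09-09T13:00:00.0", 4.1),
        ("Player A", "2018-09-09T13:00:00.1", 6.3),
        ("Player B", "2018-09-09T13:00:00.0", 5.2),
        ("Player B", "2018-09-09T13:00:00.1", 7.8),
        ("Player C", "2018-09-09T13:00:00.0", 3.9),
        ("Player D", "2018-09-09T13:00:00.0", 6.9),
        ("Player D", "2018-09-09T13:00:00.1", 2.0),
    ],
)


def top_accelerators(conn, limit=3):
    """Return (player, peak acceleration) pairs, highest peak first."""
    return conn.execute(
        """
        SELECT player, MAX(acceleration) AS peak
        FROM tracking
        GROUP BY player
        ORDER BY peak DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()


top3 = top_accelerators(conn)
print(top3)
```

Taking the per-player maximum is only one way to define "fastest"; a single noisy sample can dominate it, so a real analysis would probably want some smoothing or a minimum number of plays.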

CoolAssPuppy · on July 27, 2021

Should add, I’m a big fan of all the football metrics providers and follow them all religiously during the season. The NFL dataset we found isn’t as comprehensive, but it’s still really fun!

Maven911 · on July 28, 2021

What are some of the other football data providers?

CoolAssPuppy · on July 28, 2021

It’s not time-series data, but my favorite is Pro Football Focus: https://www.pff.com/subscribe

If the NFL made its data available weekly, you could probably join it with PFF data for some interesting insight. There’s a ton of power in joining time-series metrics with purely relational data.
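
A rough sketch of that kind of join, assuming a hypothetical `tracking` table of time-stamped speed samples and a hypothetical `player` table holding relational attributes such as position. The schema and rows are invented for illustration and do not come from the NFL or PFF data, and SQLite stands in for Postgres here.

```python
import sqlite3

# Hypothetical tables and synthetic rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE player (player_id INTEGER PRIMARY KEY, position TEXT);
    CREATE TABLE tracking (player_id INTEGER, ts TEXT, speed REAL);
    """
)
conn.executemany(
    "INSERT INTO player VALUES (?, ?)",
    [(1, "WR"), (2, "WR"), (3, "QB")],
)
conn.executemany(
    "INSERT INTO tracking VALUES (?, ?, ?)",
    [
        (1, "2018-09-09T13:00:00.0", 8.0),
        (1, "2018-09-09T13:00:00.1", 10.0),
        (2, "2018-09-09T13:00:00.0", 6.0),
        (3, "2018-09-09T13:00:00.0", 2.0),
        (3, "2018-09-09T13:00:00.1", 4.0),
    ],
)

# Join the time-series samples to the relational player table, then
# aggregate per position.
speed_by_position = conn.execute(
    """
    SELECT p.position, AVG(t.speed) AS avg_speed, COUNT(*) AS samples
    FROM tracking AS t
    JOIN player AS p ON p.player_id = t.player_id
    GROUP BY p.position
    ORDER BY avg_speed DESC
    """
).fetchall()
print(speed_by_position)
```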

exdsq · on July 27, 2021

This seems like a great way to get your NFL-loving ORM-crazy colleagues into the world of SQL

akulkarni · on July 27, 2021

It's also a great showcase of the power of SQL :-)

akulkarni · on July 27, 2021

The blog post is interesting, but for anyone who wants to play with the data themselves, instructions are here:

https://docs.timescale.com/timescaledb/latest/tutorials/nfl-...

swasheck · on July 27, 2021

i've always wanted to test the hypothesis that the timing and type of penalties within a game have more of an impact than the overall number of penalties. is there a way to leverage a penalty for strategic advantage at any given point in the game?

i still suspect that would be hard to determine given the subjective nature of what constitutes a penalty and what (dis)advantage the penalized team was carrying at any given moment.

CoolAssPuppy · on July 27, 2021

This reminds me of Belichick intentionally taking defensive penalties to run out the clock.

avthar · on July 27, 2021

Cool dataset! I wish there was a similar dataset for Premier League football (or even international soccer). Does anyone know of a good resource?

LoriP · on July 28, 2021

Yep, that was my thought too. This one (open data commons license) seems to end at 2019 https://datahub.io/sports-data/english-premier-league and it originated from a gambling site that has later data and lots of leagues... Not sure it will include plays, assists etc. Would work wonders for fantasy football errmmm I mean soccer. There must be something around Twenty20 cricket surely :)

autokad · on July 27, 2021

has anyone found a data set that has all years? football seems kinda protected. it's really easy to get all baseball data

LoriP · on July 28, 2021

You can get 2020 data from the same source that the NFL tutorial uses but that's only two years. It must exist I guess?

jonatasdp · on July 28, 2021

This tool looks promising for keeping the data up to date.

> The lack of publicly available National Football League (NFL) data sources has been a major obstacle in the creation of modern, reproducible research in football analytics. While clean play-by-play data is available via open-source software packages in other sports (e.g. nhlscrapr for hockey; PitchF/x data in baseball; the Basketball Reference for basketball), the equivalent datasets are not freely available for researchers interested in the statistical analysis of the NFL. To solve this issue, a group of Carnegie Mellon University statistical researchers including Maksim Horowitz, Ron Yurko, and Sam Ventura, built and released nflscrapR an R package which uses an API maintained by the NFL to scrape, clean, parse, and output clean datasets at the individual play, player, game, and season levels.

https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016