Query Questions #30
Great question! There isn't anything in the API that will directly give you aggregated team statistics, but that's because it's very easy to do. For example, rushing yards for the Patriots in week 17 last season:

import nfldb

db = nfldb.connect()
q = nfldb.Query(db)
q.game(season_year=2013, season_type='Regular', week=17, team='NE')
# An important line! Without it, you'll get `rushing_yds` from both teams.
q.play_player(team='NE')
pps = q.as_aggregate()
print(sum(pp.rushing_yds for pp in pps))

Of course, this isn't as efficient as doing it in SQL. If you have any ideas on how this might be added to the API proper, I might be open to them. (However, I do like the simplicity of the API now.)
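The same pattern generalizes to any statistical field. Here is a minimal sketch, assuming only that the aggregated rows expose stats as attributes the way `pp.rushing_yds` does above. The helper name `total_stat` and the stand-in rows are hypothetical, so the example runs without a database.

```python
from collections import namedtuple


def total_stat(play_players, field):
    """Sum one stat field over an iterable of aggregated player rows.

    Rows that lack the field (or hold None) contribute zero.
    """
    return sum(getattr(pp, field, 0) or 0 for pp in play_players)


# Hypothetical stand-in rows; with nfldb these would come from q.as_aggregate().
Row = namedtuple('Row', ['player', 'rushing_yds', 'rushing_att'])
rows = [Row('A', 62, 14), Row('B', 31, 7), Row('C', -2, 1)]

print(total_stat(rows, 'rushing_yds'))
print(total_stat(rows, 'rushing_att'))
```

With real data you would pass the result of `q.as_aggregate()` in place of `rows`.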
This isn't part of the source data, unfortunately. However, I think it may exist in other places. For example: http://www.nflgsis.com/2013/Reg/17/56084/Gamebook.xml --- Look in that first You can trivially construct those URLs using the |
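For what it's worth, building such a URL might look like the sketch below. The helper name `gamebook_url` is hypothetical, and the layout (year, season type, two-digit week, game key) is only inferred from the one example URL above, so treat it as an assumption rather than a documented scheme.

```python
def gamebook_url(year, season_type, week, gamekey):
    """Build a Gamebook URL following the pattern of the example above.

    Assumes the week segment is zero-padded to two digits.
    """
    return 'http://www.nflgsis.com/{0}/{1}/{2:02d}/{3}/Gamebook.xml'.format(
        year, season_type, int(week), gamekey)


print(gamebook_url(2013, 'Reg', 17, '56084'))
```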
Oh man, it's a good thing |
@ochawkeye You're back! ❤️ Hmm. College football. I don't really follow it myself, but I wonder if there are any data sources... There certainly would be a lot more of it! |
@BurntSushi and @ochawkeye, there are certainly data sources for college football. I've wanted to do this for some time now, and the most comprehensive I've seen are SI's JSON data feeds. But I've only just recently learned Python web scraping, and need to improve my skills. Maybe you guys can help out? :) Edit: It appears SI has changed their entire website design, and now it's very unwieldy for me to navigate the scoreboards. It also appears that they've completely redesigned their server-side code in how they serve the JSON files :( Example: This doesn't make it any easier to scrape college football data. Turns out that they've made this change across their entire site. I used to scrape college basketball data from JSON feeds that look like the below, but now the link is broken: |
@BurntSushi Have you done any data verification between the json data and the Gamebook.xml data? I wasn't aware that those were available. Any idea if that contains the corrected and final stats for a game? |
@gojonesy If only I knew how :P The plays look like this:
The holy grail is being able to parse any play description like this and turning it into structured data (with sufficient granularity to be in |
@BurntSushi Yep :) Happened to have a kid right at the time you were fielding input on Python3 and conversation had died down by the time I returned... I have a feeling @gojonesy is referring to using this sort of data to highlight irregularities:

<RusherVisitor Player="F.Jackson" Attempts="9" Yards="47" Average="5.2" Touchdowns="0" Long="16" UniformNumber="22"/>
<ReceiverVisitor Player="F.Jackson" Number="1" Yards="13" Average="13.0" Touchdowns="0" Long="13" UniformNumber="22" PassTarget="1"/>

Aggregating those stats and comparing to aggregate |
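A rough sketch of pulling per-player yardage out of elements like these with the standard library's `xml.etree.ElementTree`. The wrapping `<Gamebook>` root and the helper name `player_yards` are assumptions for illustration; the real file's nesting may differ, which is why the lookup uses `iter()` to search at any depth.

```python
import xml.etree.ElementTree as ET

# The two elements quoted above, wrapped in an assumed root element.
SAMPLE = """<Gamebook>
<RusherVisitor Player="F.Jackson" Attempts="9" Yards="47" Average="5.2" Touchdowns="0" Long="16" UniformNumber="22"/>
<ReceiverVisitor Player="F.Jackson" Number="1" Yards="13" Average="13.0" Touchdowns="0" Long="13" UniformNumber="22" PassTarget="1"/>
</Gamebook>"""


def player_yards(xml_text, tag):
    """Map player name -> total Yards over every element named `tag`."""
    root = ET.fromstring(xml_text)
    totals = {}
    for el in root.iter(tag):
        name = el.get('Player')
        totals[name] = totals.get(name, 0) + int(el.get('Yards'))
    return totals


print(player_yards(SAMPLE, 'RusherVisitor'))
print(player_yards(SAMPLE, 'ReceiverVisitor'))
```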
@BurntSushi @ochawkeye The Gamebook actually contains the aggregates for the game already:

<TeamStatistics Description="TOTAL FIRST DOWNS" VisitorStats="19" HomeStats="24"/>
<TeamStatistics Description="By Rushing" VisitorStats="8" HomeStats="12"/>
<TeamStatistics Description="By Passing" VisitorStats="10" HomeStats="7"/>
<TeamStatistics Description="By Penalty" VisitorStats="1" HomeStats="5"/>
<TeamStatistics Description="THIRD DOWN EFFICIENCY" VisitorStats="4-13-31%" HomeStats="4-13-31%"/>
<TeamStatistics Description="FOURTH DOWN EFFICIENCY" VisitorStats="0-3-0%" HomeStats="0-0-0%"/>

etc. That would make the job a little easier... I'll look into it a bit more after lunch. |
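As a starting point, the `TeamStatistics` rows could be read into a dictionary roughly like this. The `<Gamebook>` wrapper and the name `team_statistics` are assumptions. Values are kept as strings because some rows (the efficiency lines) are compound rather than plain integers.

```python
import xml.etree.ElementTree as ET

# A few of the rows quoted above, wrapped in an assumed root element.
SAMPLE = """<Gamebook>
<TeamStatistics Description="TOTAL FIRST DOWNS" VisitorStats="19" HomeStats="24"/>
<TeamStatistics Description="By Rushing" VisitorStats="8" HomeStats="12"/>
<TeamStatistics Description="THIRD DOWN EFFICIENCY" VisitorStats="4-13-31%" HomeStats="4-13-31%"/>
</Gamebook>"""


def team_statistics(xml_text):
    """Map each Description to a (visitor, home) tuple of raw strings."""
    root = ET.fromstring(xml_text)
    stats = {}
    for el in root.iter('TeamStatistics'):
        stats[el.get('Description')] = (el.get('VisitorStats'),
                                        el.get('HomeStats'))
    return stats


stats = team_statistics(SAMPLE)
print(stats['TOTAL FIRST DOWNS'])
```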
@ochawkeye Congrats on the kid! :-) I'm very happy to see you return. :-) @gojonesy Yeah, that's aggregated data. It's certainly useful and it would definitely be worth using it as a test. But as @ochawkeye said, it doesn't totally solve our problem. I plan on working on testing at some point this season, and those XML files will undoubtedly be part of it. (Unless it turns out that they are exactly derived from the JSON feed---or the reverse---in which case they'll be useless for testing.) At the very least, it could help us pinpoint games (or categories) where statistics are off. |
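To make the "pinpoint games or categories where statistics are off" idea concrete, here is a small sketch of a comparison step. It assumes both sources have already been reduced to plain dictionaries keyed by category; the function name `find_mismatches` and the numbers are invented for illustration and are not real game data.

```python
def find_mismatches(gamebook_totals, db_totals):
    """Return {category: (gamebook_value, db_value)} wherever the two differ.

    A category missing from one side is reported with None on that side.
    """
    mismatches = {}
    for key in sorted(set(gamebook_totals) | set(db_totals)):
        a = gamebook_totals.get(key)
        b = db_totals.get(key)
        if a != b:
            mismatches[key] = (a, b)
    return mismatches


# Invented totals, purely to show the shape of the output.
gamebook = {'rushing_yds': 120, 'passing_yds': 250, 'penalty_yds': 45}
database = {'rushing_yds': 120, 'passing_yds': 247}

print(find_mismatches(gamebook, database))
```

This only flags the category that disagrees. It says nothing about which play caused the difference.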
@BurntSushi Thanks for both responses. I can think of some funny analyses that could be completed with the referee data. I don't know much about Python or SQL, unfortunately. My use case requires a large amount of "box score" level data, so looping through like that wouldn't work very well. I ended up just importing all the data into R using the
|
@bayesrules Ah, I see. Caching aggregated stats for box score data is definitely a good approach. |
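A sketch of what such a cache could look like: one pass over player-level rows, grouped by game and team, so box-score totals are computed once instead of per query. The row layout (dicts with `gsis_id` and `team` keys), the name `box_scores`, and the sample values are all assumptions for illustration.

```python
from collections import defaultdict


def box_scores(rows, fields):
    """Group rows by (game id, team) and sum each requested field."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for row in rows:
        key = (row['gsis_id'], row['team'])
        for f in fields:
            # Missing or None values count as zero.
            totals[key][f] += row.get(f, 0) or 0
    return dict(totals)


# Invented rows; real ones would come from the database.
rows = [
    {'gsis_id': 'game-1', 'team': 'NE', 'rushing_yds': 62},
    {'gsis_id': 'game-1', 'team': 'NE', 'rushing_yds': 31, 'passing_yds': 200},
    {'gsis_id': 'game-1', 'team': 'BUF', 'rushing_yds': 47},
]

cache = box_scores(rows, ['rushing_yds', 'passing_yds'])
print(cache[('game-1', 'NE')])
```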
It feels like NFLDB was written with fantasy football in mind. Nothing wrong with that, but I need to make some adjustments for my purposes. Thanks again for putting all this together. I'm very optimistic about where my analysis could go now. |
Well, fantasy football requires summing team statistics as well. The API shouldn't do everything for you. If you need every drop of performance, I think it's reasonable to expect to write some SQL. On the other hand, I'm open to adding things to the API, if it's done elegantly. |
I really didn't know where to put this, but I am going through the process of analyzing the Gamebooks and wanted to share the script I wrote to download them all. Since we discussed it here, I thought this might be a good place for it. My first use for these will be to validate aggregated season totals that I am getting with nfldb. I'm mostly interested right now in analyzing things like Passing Efficiency, Rushing Efficiency, Turnover Differential, Penalty Yards, etc. I have found that those counts don't always mesh with some of the stats available at some of the popular statistics sites. Hopefully, the Gamebooks will help to shed some light on why the totals are off at times.

import os
import urllib

import nfldb
import nflgame


def get_team_gamebooks(year, s_type, team):
    # Return [gamekey, week] pairs for each of the team's games in the season.
    db = nfldb.connect()
    q = nfldb.Query(db)
    q.game(season_year=year, season_type=s_type, team=team)
    game_list = []
    for g in q.as_games():
        game_list.append([g.gamekey, g.week])
    return game_list


year = 2013
s_year = str(year)
s_type = 'Reg'  # season-type segment used in the Gamebook URL

for t in nflgame.teams:
    path = "/your/path/Dev/nfldb/gamebooks/" + s_year
    if not os.path.exists(path):
        os.makedirs(path)
    game_list = get_team_gamebooks(year, 'Regular', t[0])
    for g in game_list:
        # Build the URL; the week is zero-padded to two digits.
        week = str(g[1]).zfill(2)
        xmlurl = 'http://www.nflgsis.com/' + s_year + "/" + s_type + "/" + week + "/" + str(g[0]) + "/Gamebook.xml"
        xmlpath = path + "/" + str(g[0]) + ".xml"
        # Since we are iterating by team, gamekeys will appear twice (once for each team in the game).
        # We only need to retrieve the file one time.
        if not os.path.exists(xmlpath):
            urllib.urlretrieve(xmlurl, xmlpath)

|
@gojonesy Bit of a crossover conversation, but this overlaps a bit with some of the work @BurntSushi did to validate season long totals a couple of years ago: Both before and after that testing, it was acknowledged that
The ability exists to find 1. & 2. But without being able to pinpoint the exact play that causes the mismatch, there's not really a piece of data in the database that we can update to help eliminate the mismatch. Just a personal observation - and I am the farthest thing from a Python function naming expert! - but your function name |
Ha! I had more in the function to begin with and just never changed the name. Thanks for pointing that out! Noting that stats are changed after games are played, is there a need/desire for data that is "final"? Would it be beneficial to anyone else to have this data available in nfldb as well? |
@BurntSushi I know this issue is closed, but it seemed like an appropriate place to comment. First off, love the project, thanks for continuing to improve it. Looking forward to another Fantasy season. With regard to the aggregation of data, have you ever looked at using any tools for viewing the aggregated data in a friendly web interface? Last season I aggregated the data from nflgame and nfldb to elasticsearch for easy viewing and data analysis using http://www.elasticsearch.org/overview/kibana. Kibana was designed for log aggregation but I found it equally helpful for viewing, filtering, searching, and analyzing statistics. I'd be happy to share more once I get the 2014 season up and running with Kibana if this sort of thing is something you've been thinking about evolving to with the project. |
@gojonesy In an ideal world, stats don't change after a game is over. As far as the JSON feed goes, this is probably true. (Some of the recent bundles of missing data were my fault, I think, and not a result of an initial incomplete JSON feed.) I am quite certain that official stat corrections are not applied to the JSON feed. Once again, the issue with stat corrections is that it is hard to get your hands on meaningful data. We need to know the exact play in which the correction is issued to fix it. I am extremely reluctant to add a different source of aggregated data to I know it doesn't feel good, but until we have the ability to precisely parse human readable play descriptions, I don't think there is really much else that can be done as far as fixing the data goes. |
@albertlyu Looks like the play-by-play data is still available: http://www.si.com/pbp/liveupdate?json=1&sport=football%2Fcfb&id=1300292&box=true&pbp=true&linescore=true --- It definitely looks like it has enough granularity to fit into |
A web interface is a big hammer. Have you seen my nflcmd project? Its specialty is viewing and sorting aggregated data. Here are some examples:
Or for aggregating by season:
Notice the fuzzy player name matching. :-) You can also rank statistics:
With all of that said, I am working on a web interface. It will be for integrating fantasy football and broadcast footage mostly, but searching and filtering aggregated statistics will probably be part of it. You can monitor that progress here: https://github.com/BurntSushi/nflfan (It is not in any way, shape or form ready to be used yet.) |
Of course, I think we would definitely love to hear about your experience with ElasticSearch. Perhaps you could create a wiki page and tell us about it? :-) |
Without dropping to the SQL layer, is there a way to obtain aggregate team statistics? For instance, I'd like to know the home team's rushing yards in a given game.
Additionally, I'd be curious to know the stadium a given game is played in. Ultimately, I'd like to derive a field which determines whether a given game is truly a home game for the home team.
Thanks in advance.