-
Notifications
You must be signed in to change notification settings - Fork 263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
penalty_yds is always positive (in nflgame) #20
Comments
Hmm. I'm looking at the schema for the Aside from that though, you are absolutely right that this is a problem. But it's a problem with the data, unfortunately. Even in There's really only one good way to solve this. But as you said, this is complex and hard to get right. That's a good reason for it to be in the library. :-) Anyway, I'll accept this issue. I'm happy to receive a PR (for |
I've had this issue in the back of my mind all day. One thing that popped into my mind was that making the play.penalty_yds value negative if the possession team is the offending team, yet having a positive value for a defensive penalty, may be problematic... It would force a reference to play.pos_team for negative values in play.penalty_yds, yet an ambiguous join with game.away_team and game_home_team where they are not equal to play.pos_team for finding who was on defense. I think the best solution would be to add a column to the play table (maybe play.penalty_team) to identify the offending team. This would allow one to do a simple group by to find the total penalties and yards for each team. Thoughts? |
I think a little SQL trickery is in order. :-) SELECT
p.penalty_yds, p.pos_team,
COALESCE(
NULLIF(g.home_team, p.pos_team),
NULLIF(g.away_team, p.pos_team)
) AS def_team
FROM play AS p
LEFT JOIN game AS g ON g.gsis_id = p.gsis_id
WHERE p.gsis_id = '2014020200' AND p.penalty_yds > 0; This outputs:
Despite the SQL trickery above, I think I have to agree with you to some extent. Although, part of me thinks it might be better to add a column Or we could add both columns. I will say that I very much would like to stick with negative and positive values of I'm still kind of chewing on this. I think the negative/positive change is the only thing I feel particularly strong about at the moment, although I suspect adding at least one column will also be part of the solution too. |
It's settled then. Now it's just a simple matter of programming. :-) |
Great idea for the def_team, but would have to include the penalty_team as well to indicate which team the penalty is on... or would you do negative for offensive and positive for defensive values? |
Precisely. For example, |
SELECT That would work! Fantastic!!! BTW, I'm a ColdFusion developer and have worked with Oracle, SQLServer, MySQL, etc... but, never with Python / PostgreSql. However, I've been coding for 32 years, so languages are pretty easy to switch between... let me know if you would like some help somewhere down the road or need a 2nd set of eyes on something. Also, pretty decent with statistical analysis if you need a hand in that arena. ;-) ~Robbie |
Yup, that looks like it would work to me. Although, one think I should mention is that in Postgres, you actually won't be able to use
Thanks! I'll ping this issue when I've pushed something, and I'd definitely appreciate any feedback at that point too. Particularly if you can accomplish what you initially set out to accomplish. And... ColdFusion you say? I haven't used that in almost 10 years. I was using it back when Macromedia was still running things. Might even have a couple of their books still lying around... |
He he he... yeah, I'm really enjoying CF right now... every developer tends to teethe on CF then move on. That leaves me to be a "big fish in a small pond" when it comes to contracting, so I tend to get the contracts I want at the price I want. Before long, I'll have to move on (probably Java) because most all sizable businesses are bailing on CF. But, I'll cross that bridge when I get there. |
...and if you still happen to have an installation of CF on your server:
This may give you some ideas for the Python code? There are about 20k records created here, so it's a chunk... you might want to narrow down the SQL's WHERE clause to a more reasonable set of data before running. Also note, I've been pulling the data from PostgreSql into MySQL for my own preferences, so the table names are not what you'd expect. |
Oh heavens no, I've long since purge CF from my life completely. :-) You might want to check out The wiki has lots of neat examples that might even get a CF programmer to take a dive into Python. :-) (But there's nothing too special for penalties, other than standard comparisons on whatever is in the DB.) |
Joins are only slow if you've got bad SQL, have a Commodore 64, haven't indexed your tables appropriately, or have a crap-o-la database. ;-) I appreciate the query.py package, but I'm waaaaaaaaaay too old-school to leave my queries to someone else (no offense)... old DBA control-freak issues, ya know! Yeah, I've started playing around with Python since I found your cool little JSON scrape-o-matic. My first impressions are that it's a good little language with a lot of maturing ahead of it. I think I'd prefer something more complete (such as Perl), but that's just my personal nerd bias. ColdFusion pays the bills for the time being, but I'm realistic in that the future is short lived for the language unless Adobe starts paying more attention to their investment. Java is going to be the big-boy for the foreseeable future because of it's flexibility, power, and dominance in the current web market, and since ColdFusion is just a "wrapper language" for Java, it's the logical step in my career (baring Jennifer Aniston starts knocking on my door). |
This is a little trickier than I thought. How would a play like this be resolved?
Notice that there are two |
Regular expression?
Not sure how this works across all plays though. I would imagine you wouldn't run this unless there was a penalty on the play. So... That's one of the things I don't like about the NFL's play-by-play format. It's rather inconsistent, which makes it difficult to pull information out of. I'm currently scraping pro-reference, and the data is much cleaner, but the information is a bit more difficult to access. |
Yeah, I don't really want to spend my time writing regexes for all the various play-by-play formats. :-) |
Not really too time-consuming man, you could start here, and then elaborate . I'm working on one for pro-football-reference, and this is where it's at so far. Yea, it's hard to read, but right now I'm converting it into Python. Maybe you can adapt the two sources to suit your needs. I'm sure people would appreciate the additional information that this provides. /* Built in LinqPad. I was experimenting with Linq at the time, so please forgive the harsh use of the ternary operator. */
var temp = (from game in Game_data
select new {
data = game,
compare = game.Detail.ToUpper()
}).AsEnumerable();
var stats =
(from game in temp
let pass = game.compare.Contains("PASS")
let passIncomplete = game.compare.Contains("INCOMPLETE")
let passRange = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?<range>DEEP|SHORT)")
let passDirection = Regex.Match(game.compare, @"(?:COMPLETE|INCOMPLETE) (?:\w+) (?<direction>LEFT|MIDDLE|RIGHT)")
let rushDirection = Regex.Match(game.compare, @"(?<direction>RIGHT|LEFT) (?:GUARD|TACKLE|END)|UP THE (?<direction>MIDDLE)")
let rush = (rushDirection.Success) ? true : Regex.Match(game.compare, @"(\w+\s)+FOR").Success
let rushLineman = Regex.Match(game.compare, @"(?:RIGHT|LEFT) (?<lineman>GUARD|TACKLE|END)")
let kickoff = game.compare.Contains("KICKS OFF")
let punt = game.compare.Contains("PUNT")
let yards = Regex.Matches(game.compare, @"(-?\d+) YARDS?")
let isExtraPoint = game.compare.Contains("EXTRA POINT")
let isFieldGoal = game.compare.Contains("FIELD GOAL")
let isFieldGoalNoGood = game.compare.Contains("NO GOOD")
let isTouchback = game.compare.Contains("TOUCHBACK")
let isTouchdown = game.compare.Contains("TOUCHDOWN")
let isOnside = game.compare.Contains("ONSIDE")
let isBlocked = game.compare.Contains("BLOCKED")
let isSack = game.compare.Contains("SACK")
let isInterception = game.compare.Contains("INTERCEPT")
let isPenalty = game.compare.Contains("PENALTY")
let isTimeout = game.compare.Contains("TIMEOUT")
let isTwoPoint = game.compare.Contains("TWO POINT")
let isChallenge = game.compare.Contains("CHALLENGE")
let isOverturned = game.compare.Contains("OVERTURN")
let isKneel = game.compare.Contains("KNEEL")
let isSpike = game.compare.Contains("SPIKE")
let isSafety = game.compare.Contains("SAFETY")
let isNoPlay = game.compare.Contains("NO PLAY") || game.compare.StartsWith("PENALTY ON")
select new
{
PlayID = game.data.Play_id,
PlayType = (isTwoPoint) ? "TWO POINT" :
(isNoPlay) ? "NOPLAY" :
(pass || isSack) ? "PASS" :
(rush) ? "RUSH" :
(kickoff) ? "KICKOFF" :
(isOnside) ? "ONSIDE" :
(punt) ? "PUNT" :
(isExtraPoint) ? "EXTRA POINT" :
(isFieldGoal) ? "FIELD GOAL" :
(isKneel) ? "KNEEL" :
(isSpike) ? "SPIKE" : "",
PlayResult = ( ((pass) ?
(((passIncomplete) ? "INCOMPLETE" : "COMPLETE") +
((passRange.Success) ? "|" + passRange.Groups["range"] : "") +
((passDirection.Success) ? "|" + passDirection.Groups["direction"] : "") +
((isSack) ? "|SACK" : "") +
((isInterception) ? "|INTERCEPTION" : "")) :
(rush) ?
(((rushDirection.Success) ? "|" + rushDirection.Groups["direction"] : "") +
((rushLineman.Success) ? "|" + rushLineman.Groups["lineman"] : "")) :
(kickoff) ?
((isTouchback) ? "TOUCHBACK" :
(isBlocked) ? "BLOCKED" : "") :
(isFieldGoal) ?
((isFieldGoalNoGood) ? "NO GOOD" : "GOOD") :
"") +
((isTouchdown) ? "|TOUCHDOWN" :
(isSafety) ? "|SAFETY" : "")
).Trim('|'),
Yards1 = (yards.Count > 0) ? (int?)int.Parse(yards[0].Groups[1].Value) : (int?)null,
Yards2 = (yards.Count > 1) ? (int?)int.Parse(yards[1].Groups[1].Value) : (int?)null,
IsPenalty = isPenalty,
IsTimeout = isTimeout,
IsTwoPoint = isTwoPoint,
IsChallenge = isChallenge,
IsOverturned = isOverturned,
IsNoPlay = isNoPlay,
Detail = game.data.Detail,
GameID = game.data.Game_id
}).Dump(); |
And now for the rest of the categories: https://github.com/BurntSushi/nfldb/wiki/Statistical-categories And IIRC, the plays have some variation as time has moved on. Descriptions are slightly different in 2001 than now. But I haven't looked in a while. Yes it's time consuming. But hey, if it's no problem, and you write a program that extracts all of the stats that |
So, you're DB provides all of those categories already, or are those to-do? |
Yes. All of them. Because they're available in a structured format via an undocumented JSON feed going back to 2009. |
Excellent! Thanks. |
Many of those statistics are not in the feed. How are you accomplishing that? |
Yes they are. All of the code is open source, I invited you to examine for yourself: https://github.com/BurntSushi/nflgame/blob/master/nflgame/statmap.py and https://github.com/BurntSushi/nflgame/blob/master/nflgame/game.py |
Yea, just took a look at those. That is incredible! Awesome job, you and your contributors have done here. I'll probably borrow some of the code to add to my sqlite DB. I'm a fan of Postgresql, but I'd like to make the information more portable for those who just want to mess with the DB, and don't need an API. |
I found a data issue that you may/may not be able to fix, depending upon the pull from the feed. The play.penalty_yds data is unsigned, therefore, it does not accurately reflect the yardage gained/lost. All data is a positive integer value.
For example, if Seattle was in possession and had an offensive holding penalty, one would expect the play.penalty_yds to contain -10, yet the actual data contained currently would be 10. If Denver was in possession and there was a defensive encroachment penalty, one would expect the play.penalty_yds to contain 5.
Without properly signed data, it forces a string match within play.description for 'PENALTY on {team}', which may return inaccurate results in the event there were multiple penalties on the play (For example, gsis_id=2009092011, drive_id=4, play_id=836).
The column drive.penalty_yards is accurately signed, but is a composite of all plays within that drive, which does not break down the penalty yards on each particular team, let alone individual plays.
Obviously, this issue is dependent upon what data can be extracted from the data feed, so a fix may not be possible... but, it would be really nice to have! Thanks in advance for taking a look at the issue!
The text was updated successfully, but these errors were encountered: