This repository contains a collection of tools, projects and resources that enable effective analysis and visualisation of football data.
This repository contains a collection of tools, projects and resources that aim to support the generation of meaningful insight from football data. Python is used for extraction, processing, analysis and visualisation of event data, aggregated team data, market value data and more. The repository is broken down into multiple projects and sub-projects, each of which aims to perform a specific analysis, generate a specific insight, or introduce some level of automation to football data analytics. Using the contents of this repository, a number of novel and informative visuals and text threads have been created and shared with the football data analytics community via Twitter (@JKDS).
To support others who wish to develop their data analytics skills within the context of football data, I have produced a Getting Started Guide.
football-data-analytics
│
├── analysis_tools
│ ├── __init__.py
│ ├── get_football_data.py [not included in git repo]
│ ├── logos_and_badges.py
│ ├── pitch_zones.py
│ ├── statsbomb_custom_events.py
│ ├── statsbomb_data_engineering.py
│ ├── whoscored_custom_events.py
│ ├── whoscored_data_engineering.py
│ ├── wyscout_data_engineering.py
│
├── data_directory
│ ├── misc_data
│ │ ├── images
│ │ │ ├── ___.png
│ │ ├── log_regression_xg_data.pbz2
│ │ ├── neural_net_xg_data.pbz2
│ │ ├── worldcup_2010_to_2018_distcovered.xlsx
│ ├── statsbomb_data [not included in git repo]
│ ├── transfermarkt_data
│ ├── whoscored_data [not included in git repo]
│ ├── wyscout_data
│
├── projects
│ ├── 00_misc_work
│ │ ├── saudi_arabia_argentina_world_cup_def_actions.py
│ ├── 01_worldcup_b2b_midfielders
│ │ ├── import_data_statsbomb.py
│ │ ├── worldcup_b2b_mids.py
│ ├── 02_transfermarkt_scrape_and_analyse
│ │ ├── championship_forward_value_analysis.py
│ │ ├── premierleague_forward_value_analysis.py
│ │ ├── scrape_data_transfermarkt.py
│ ├── 03_xg_model
│ │ ├── shot_xg_plot.py
│ │ ├── xg_log_regression_model.py
│ │ ├── xg_neural_network.py
│ ├── 04_match_reports
│ │ ├── import_data_whoscored.py
│ │ ├── pass_report_ws.py
│ │ ├── shot_report_understat.py
│ ├── 05_competition_reports
│ │ ├── player_defensive_contribution.py
│ │ ├── player_effective_carriers.py
│ │ ├── player_effective_passers.py
│ │ ├── player_high_defensive_actions.py
│ │ ├── player_penalty_takers.py
│ │ ├── player_threat_creators.py
│ │ ├── team_ball_winning.py
│ │ ├── team_fullback_combinations.py
│ │ ├── team_threat_creation.py
│ ├── 06_player_reports
│ │ ├── ws_full_back_report.py
│
├── .gitignore
│
├── LICENSE
│
├── README.md
As shown in the folder structure above, the repository contains three key folders:
- data_directory: Storage of raw football data used for projects.
- analysis_tools: Custom Python package containing modules that support football data import, processing, manipulation and visualisation.
- projects: Series of projects that cover various elements of football data analytics. Also contains any template scripts used to import raw data from various football data APIs, websites or data services.
In general, each project follows a number of logical steps:
- Create a folder within the projects folder to store files associated with the project.
- Use the get_football_data module of the analysis_tools package (note that this module is not available within the git repo) to import raw data from a football data API, website or data service:
- If imported dataset is large, save to data_directory area in compressed BZ2 format and create a new script for analysis.
- If imported dataset is small, data import and analysis can be completed in the same script (without the need to store/save data).
- Within the analysis script, import any required modules from the analysis_tools package.
- Pre-process and format data using data_engineering modules within the analysis_tools package.
- Synthesise additional information using custom_events and pitch_zones modules within the analysis_tools package.
- With data formatted appropriately, create visuals and generate insight for end-consumer.
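The "save large datasets in compressed BZ2 format" step above can be sketched with the standard library alone. This is a minimal illustration, assuming the `.pbz2` files are BZ2-compressed pickles; the helper names `save_compressed` and `load_compressed` and the toy `events` list are hypothetical and are not taken from the analysis_tools package.

```python
import bz2
import pickle
import tempfile
from pathlib import Path


def save_compressed(obj, path):
    """Pickle obj and write it to path using BZ2 compression."""
    with bz2.open(path, "wb") as f:
        pickle.dump(obj, f)


def load_compressed(path):
    """Read a BZ2-compressed pickle back into memory."""
    with bz2.open(path, "rb") as f:
        return pickle.load(f)


# Toy stand-in for an imported event dataset (not real match data).
events = [
    {"match_id": 1, "type": "Pass", "x": 34.0, "y": 52.5},
    {"match_id": 1, "type": "Shot", "x": 88.0, "y": 48.0},
]

# Write to a temporary folder here; a real project would target data_directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_events.pbz2"
    save_compressed(events, path)
    restored = load_compressed(path)

print(restored == events)
```

Separating the import script from the analysis script in this way means the (often slow) data download only has to run once. Pickle files should only be loaded from sources you trust.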
Project table of contents:
01 - World Cup 2018 Box to Box Midfielder Analysis
02 - Transfermarkt Web-Scrape and Analyse
03 - Expected Goals Modelling
04 - Automated Match Reporting
05 - Automated Competition Reporting
Summary: Use Statsbomb data to identify the most effective box-to-box midfielders at the 2018 World Cup. Throughout the work a number of custom metrics are used to score central midfielders in ball winning, ball retention & creativity, and mobility. A good box-to-box midfielder is defined as a central midfielder that excels in each of these areas. Of key interest in this work is the use of convex hulls as a proxy for player mobility / distance covered. The work also includes the development of a number of appealing visuals, as shown below.
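To illustrate the convex hull idea, the sketch below computes the area enclosed by a player's action locations. It is a dependency-free approximation of the concept and not the project's own code: it uses Andrew's monotone chain algorithm and the shoelace formula, whereas in practice a library routine such as `scipy.spatial.ConvexHull` would likely be used. The coordinates are made up and assume a 0 to 100 pitch grid.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up (x, y) locations of one player's actions on a 0-100 grid.
actions = [(20, 20), (60, 20), (60, 60), (20, 60), (40, 40), (30, 50)]

hull = convex_hull(actions)
area = polygon_area(hull)
print(hull, area)
```

A larger hull area suggests the player was active over more of the pitch. A few isolated actions can inflate the hull, so trimming outlying locations before computing it may be worthwhile.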
Summary: Scrape team and player market value information from transfermarkt.co.uk. This work includes the development of a "scouting tool" that highlights players from a given league that have a favourable combination of Age and Goal Contribution per £m market value. The work also explores the use of statistical models to predict market value based on player performance, and identifies teams that under- and over-performed (by league position) relative to squad value.
Summary: Implementation and testing of basic expected goals probabilistic models. This work includes development and comparison of a logistic regression expected goals model and a neural network expected goals model, each trained on over 40,000 shots taken across Europe's 'big five' leagues during the 2017/2018 season. The models are used to calculate expected goals for specific players, clubs and leagues over a specified time period.
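The "Goal Contribution per £m" idea can be sketched as a simple filter and ranking. The player records, the `Player` class, the `shortlist` function and the thresholds below are all hypothetical and exist only to show the calculation; the actual scouting tool may weight age and output differently.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    age: int
    goals: int
    assists: int
    market_value_gbp_m: float  # market value in millions of pounds


def contribution_per_million(player):
    """Goals plus assists per £1m of market value."""
    return (player.goals + player.assists) / player.market_value_gbp_m


def shortlist(players, max_age, min_ratio):
    """Keep young, good-value players and rank them by value ratio."""
    picks = [
        p for p in players
        if p.market_value_gbp_m > 0  # avoid dividing by zero
        and p.age <= max_age
        and contribution_per_million(p) >= min_ratio
    ]
    return sorted(picks, key=contribution_per_million, reverse=True)


# Made-up records, not scraped data.
players = [
    Player("Player A", 21, 10, 5, 5.0),
    Player("Player B", 29, 20, 8, 40.0),
    Player("Player C", 23, 6, 6, 3.0),
    Player("Player D", 33, 12, 3, 2.0),
]

targets = shortlist(players, max_age=24, min_ratio=2.0)
print([p.name for p in targets])
```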
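As a rough illustration of how a logistic regression expected goals model turns shot location into a probability, the sketch below derives two common features (distance to goal and the angle subtended by the goalposts) and passes them through a logistic function. The pitch dimensions, feature choice and coefficient values are assumptions made for this example: the coefficients are placeholders rather than fitted values, and the repository's models may use different features.

```python
import math

# Assumed geometry: 105 m x 68 m pitch, attacking the goal at x = 105.
GOAL_X = 105.0
GOAL_Y = 34.0
GOAL_WIDTH = 7.32


def shot_features(x, y):
    """Return (distance to goal centre in metres, goal angle in radians)."""
    distance = math.hypot(GOAL_X - x, GOAL_Y - y)
    left_post = math.atan2(GOAL_Y + GOAL_WIDTH / 2 - y, GOAL_X - x)
    right_post = math.atan2(GOAL_Y - GOAL_WIDTH / 2 - y, GOAL_X - x)
    return distance, abs(left_post - right_post)


def predict_xg(x, y, intercept=-1.0, w_distance=-0.1, w_angle=1.5):
    """Logistic model: xG = 1 / (1 + exp(-(b0 + b1*distance + b2*angle))).

    The default coefficients are illustrative placeholders, not fitted values.
    """
    distance, angle = shot_features(x, y)
    linear = intercept + w_distance * distance + w_angle * angle
    return 1.0 / (1.0 + math.exp(-linear))


close_shot = predict_xg(94.0, 34.0)  # central, about 11 m out
far_shot = predict_xg(75.0, 20.0)    # wide and roughly 33 m out
print(round(close_shot, 3), round(far_shot, 3))
```

In a real workflow the coefficients would be estimated from labelled shots (goal or no goal), for example with a library such as scikit-learn, and then summed per player, club or league to give expected goals totals.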
Summary: Development of automated scripts to produce match reports immediately after a match has concluded. This work includes collection and processing of public-domain match event data, and the production of multiple visuals that together constitute informative and appealing match reports. Visuals currently include shot maps, inter-zone pass flows, pass plots and offensive action convex hulls.
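An inter-zone pass flow boils down to binning each pass by its start and end zone and counting the pairs. The sketch below assumes a 0 to 100 coordinate system and a simple 6 x 3 rectangular grid; the `zone_of` function and the grid layout are hypothetical and are not the zones defined in the pitch_zones module.

```python
from collections import Counter


def zone_of(x, y, n_cols=6, n_rows=3, pitch_length=100.0, pitch_width=100.0):
    """Map an (x, y) location to a zone index on a rectangular grid."""
    col = int(x / pitch_length * n_cols)
    row = int(y / pitch_width * n_rows)
    # Clamp so locations on or beyond the boundary fall in an edge zone.
    col = max(0, min(col, n_cols - 1))
    row = max(0, min(row, n_rows - 1))
    return row * n_cols + col


def pass_flows(passes):
    """Count passes for each (start_zone, end_zone) pair."""
    return Counter(
        (zone_of(x, y), zone_of(end_x, end_y))
        for x, y, end_x, end_y in passes
    )


# Made-up passes as (x, y, end_x, end_y) tuples.
passes = [
    (10, 50, 30, 50),
    (12, 45, 25, 55),
    (40, 10, 70, 90),
]

flows = pass_flows(passes)
print(flows.most_common(1))
```

Each counted pair can then be drawn as an arrow between zone centres, with line width scaled by the count.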
Summary: Development of automated scripts to produce competition reports and multi-match player evaluations at any point throughout a competition. This work includes collection and processing of public-domain match event data, and the production of multiple visuals that generate novel and meaningful insight at a team and player level. Visuals currently include an assessment of progressive passes, defensive actions and penalty placement.
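Definitions of a progressive pass vary between data providers and analysts. The sketch below uses one simple convention as an assumption: a pass is progressive if it reduces the distance to the centre of the opposition goal by at least 25%. The coordinate system (0 to 100, attacking towards x = 100), the threshold and the function names are illustrative, not the definition used in these reports.

```python
import math

# Assumed 0-100 coordinate system, attacking towards x = 100.
GOAL = (100.0, 50.0)


def distance_to_goal(x, y):
    """Straight-line distance from (x, y) to the opposition goal centre."""
    return math.hypot(GOAL[0] - x, GOAL[1] - y)


def is_progressive(x, y, end_x, end_y, min_reduction=0.25):
    """True if the pass cuts the distance to goal by at least min_reduction."""
    start = distance_to_goal(x, y)
    end = distance_to_goal(end_x, end_y)
    if start == 0:
        return False  # already at the goal centre, nothing to gain
    return (start - end) / start >= min_reduction


# Made-up passes as (x, y, end_x, end_y) tuples.
passes = [
    (50, 50, 80, 50),  # long forward pass
    (50, 50, 55, 50),  # short forward pass
    (50, 50, 40, 50),  # backward pass
]

labels = [is_progressive(*p) for p in passes]
print(labels)
```

Counting the `True` labels per player or team, ideally normalised per 90 minutes, gives a basic progressive passing measure.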