A web scraper for collecting data from the Transfermarkt website. It recurses into the Transfermarkt hierarchy to find competitions, games, clubs, players and appearances, and extracts them as JSON objects.
====> Confederations ====> Competitions ====> (Clubs, Games) ====> Players ====> Appearances
Each one of these entities can be discovered and refreshed separately by invoking the corresponding crawler.
This is a scrapy project, so it needs to be run with the scrapy command line util. This and all other required dependencies can be installed using poetry.
cd transfermarkt-datasets
poetry install
poetry shell
⚠️ This project will not run without a user agent string being set. This can be done in one of two ways:
- add ROBOTSTXT_USER_AGENT = <your user agent> to your tfmkt/settings.py file, or
- specify the user agent token in the command line request (for example, scrapy crawl players -s USER_AGENT=<your user agent>)
These are some usage examples for how the scraper may be run.
# discover confederations and competitions on separate invocations
scrapy crawl confederations > confederations.json
scrapy crawl competitions -a parents=confederations.json > competitions.json
# you can use intermediate files or pipe crawlers one after the other to traverse the hierarchy
cat competitions.json | head -2 \
| scrapy crawl clubs \
| scrapy crawl players \
| scrapy crawl appearances
Alternatively, you can also use the dcaribou/transfermarkt-scraper docker image:
docker run \
-ti -v "$(pwd)"/.:/app \
dcaribou/transfermarkt-scraper:main \
scrapy crawl competitions -a parents=samples/confederations.json
Items are extracted in JSON format with one JSON object per item (confederation, league, club, player or appearance), which get printed to stdout. Samples of extracted data are provided in the samples folder.
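To consume the extracted items from another program, something like the following minimal sketch may help. It assumes the output is newline-delimited (one JSON object per line), which is what the `head -2` usage above suggests but is not stated outright. The helper name `iter_items` and the sample fields are hypothetical, chosen only for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_items(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between items
        yield json.loads(line)


# Hypothetical sample lines; real items carry whatever fields the crawler emits.
sample = [
    '{"type": "club", "href": "/example-club"}',
    "",
    '{"type": "player", "href": "/example-player"}',
]
items = list(iter_items(sample))
print(len(items))
```

The same function accepts an open file handle (for example, one opened on competitions.json) or `sys.stdin`, since both iterate line by line.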
Check out transfermarkt-datasets to see transfermarkt-scraper in action on a real project.
- parents: Crawler "parents" are either a file or a piped output with the parent entities. For example, competitions is the parent of clubs, which in turn is a parent of players.
- season: The season that the crawler is to run for. It defaults to the most recent season.
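As a rough illustration of how these arguments fit on the command line, the sketch below assembles a `scrapy crawl` invocation as an argument list. The helper `build_crawl_command` is hypothetical and not part of this project. Passing `season` with `-a`, the same way `parents` is passed in the usage examples, is an assumption, as is the season value format (a year here).

```python
from typing import List, Optional


def build_crawl_command(
    crawler: str,
    parents: Optional[str] = None,
    season: Optional[int] = None,
    user_agent: Optional[str] = None,
) -> List[str]:
    """Build an argument list for `scrapy crawl`; omitted options are left out."""
    cmd = ["scrapy", "crawl", crawler]
    if parents is not None:
        # Crawler arguments are passed with -a, as in the usage examples.
        cmd += ["-a", f"parents={parents}"]
    if season is not None:
        # Assumption: season is passed with -a like parents.
        cmd += ["-a", f"season={season}"]
    if user_agent is not None:
        # Scrapy settings are overridden on the command line with -s.
        cmd += ["-s", f"USER_AGENT={user_agent}"]
    return cmd


cmd = build_crawl_command("clubs", parents="competitions.json", season=2020)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd)` from an environment where scrapy is installed. Leaving `parents` unset matches the piped usage, where parent entities arrive on stdin.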
Check tfmkt/settings.py for a reference of available configuration options.
Extending existing crawlers in this project to scrape additional data, or even creating new crawlers, is quite straightforward. If you want to contribute an enhancement to transfermarkt-scraper, I suggest that you follow a workflow similar to this:
- Fork the repository
- Modify or add new crawlers to tfmkt/spiders. Here is an example PR that extends the games crawler to scrape a few additional fields from the Transfermarkt games page.
- Create a PR with your changes and a short description for the enhancement and send it over 🚀
It is usually also a good idea to have a short discussion about the enhancement beforehand. If you want to propose a change and collect some feedback before you start coding, you can do so by creating an issue with your idea in the Issues section.