Code used to scrape data from Wikipedia on coaching changes in the Brazilian football league. The resulting dataset is available on Kaggle.
There is a `src/config.py` file where you can define the target paths and target years. The scripts were made to work with years in the range [2008, 2023]. The data comes from the Portuguese Wikipedia pages on the Brazilian Série A (First Division) league for those years.
The final aggregated CSV file will be at the path indicated by the following configs in the config file:

- `COACHES_FIRED_CSV_TREATED_TABLE_DIR_PATH`
- `COACHES_FIRED_CSV_TREATED_TABLE_PATH`
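
For reference, below is a minimal sketch of what such a config module might look like. Only the two `COACHES_FIRED_*` names come from the repository; `TARGET_YEARS`, `DATA_DIR`, and the concrete paths are assumptions for illustration.

```python
# src/config.py -- illustrative sketch; only the COACHES_FIRED_* names
# are taken from this project, everything else is an assumption.
from pathlib import Path

# The scripts were written for seasons 2008 through 2023 (inclusive).
TARGET_YEARS = range(2008, 2024)  # assumed name

DATA_DIR = Path("data")  # assumed directory layout
COACHES_FIRED_CSV_TREATED_TABLE_DIR_PATH = DATA_DIR / "treated"
COACHES_FIRED_CSV_TREATED_TABLE_PATH = (
    COACHES_FIRED_CSV_TREATED_TABLE_DIR_PATH / "coaches_fired.csv"
)
```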
The full procedure takes 4 distinct steps to complete (a rough sketch of step 1 is shown after this list):

- Download the HTMLs from the target Wikipedia links for the indicated years (`src/download_coaches_fired_wikis.py`)
- Extract the HTML tables from the HTML files downloaded in step 1 (`src/extract_coaches_fired_tables.py`)
- Parse the raw HTML tables from step 2 into CSV files (`src/process_coaches_fired_tables.py`)
- Treat and aggregate the CSV files from step 3 into one single CSV file (`src/treat_coaches_fired_tables.py`)
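
As a rough illustration of what step 1 does, the sketch below downloads one season's page from the Portuguese Wikipedia with `requests`. The URL template, file naming, and output directory are assumptions made for this example, not the actual logic of `src/download_coaches_fired_wikis.py`.

```python
# Illustrative sketch of step 1 (downloading the season pages).
# The URL template and output paths below are assumptions.
from pathlib import Path

import requests

# Assumed URL pattern for the Portuguese Wikipedia season articles.
URL_TEMPLATE = (
    "https://pt.wikipedia.org/wiki/"
    "Campeonato_Brasileiro_de_Futebol_de_{year}_-_S%C3%A9rie_A"
)
OUT_DIR = Path("data/html")  # assumed output directory


def download_season_page(year: int) -> Path:
    """Fetch one season's Wikipedia page and save the raw HTML."""
    response = requests.get(URL_TEMPLATE.format(year=year), timeout=30)
    response.raise_for_status()
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    out_path = OUT_DIR / f"coaches_fired_{year}.html"
    out_path.write_text(response.text, encoding="utf-8")
    return out_path


if __name__ == "__main__":
    for year in range(2008, 2024):
        print("saved", download_season_page(year))
```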
First of all, it's recommended that you create a dedicated Python virtual environment inside the project's root folder. This can be done with:

```
python -m venv <environment name here>
```

You can name the virtual environment `venv`, for example. Then, you need to activate the environment:
- On Windows: `venv\Scripts\activate.bat`
- On Linux: `source venv/bin/activate`
Now, with the environment activated, you have to install the dependencies listed in `requirements.txt`. You can do this via `pip install -r requirements.txt`. Now you can run the scripts.
There are two ways of generating the data:

- If you are able to run a makefile, go to the `src/` directory and run `make all_coaches`. This will run all 4 steps mentioned above.
- You can run each step, in order, manually with Python:
  - `python download_coaches_fired_wikis.py`
  - `python extract_coaches_fired_tables.py`
  - `python process_coaches_fired_tables.py`
  - `python treat_coaches_fired_tables.py`
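
Once the pipeline finishes, the aggregated file can be inspected with pandas. A minimal sketch is shown below, assuming it is run from the `src/` directory so that the path can be imported from `config.py`; the column names depend on the Wikipedia tables and are not shown here.

```python
# Minimal sketch for inspecting the aggregated output.
# Assumes this is run from src/ so that config.py is importable.
import pandas as pd

from config import COACHES_FIRED_CSV_TREATED_TABLE_PATH

df = pd.read_csv(COACHES_FIRED_CSV_TREATED_TABLE_PATH)
print(df.shape)   # number of rows and columns
print(df.head())  # first few coaching-change records
```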