Build a web scraper in Python that extracts data from HTML web pages and loads it into a pandas DataFrame.
Scrape data from IMDb’s Top 1,000 movies
Each page lists 50 movies.
To collect the information for every movie, the crawler sends a GET request for each page, then moves on to the next page until all the data has been retrieved.
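The page-by-page crawl described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the URL pattern in `BASE_URL` and its `start` parameter are assumptions about how the listing is paged, and `page_urls` and `crawl` are hypothetical helper names. With 1,000 movies at 50 per page, the loop visits 20 pages.

```python
from typing import Callable, Iterator, List

# ASSUMPTION: the listing is paged with a 1-based "start" offset.
# The real URL and its paging parameter may differ.
BASE_URL = "https://www.imdb.com/search/title/?groups=top_1000&start={start}"


def page_urls(total: int = 1000, per_page: int = 50) -> List[str]:
    """Return one URL per results page (offsets 1, 51, 101, ...)."""
    return [BASE_URL.format(start=s) for s in range(1, total + 1, per_page)]


def crawl(fetch: Callable[[str], str],
          total: int = 1000,
          per_page: int = 50) -> Iterator[str]:
    """Yield the HTML of each page in order.

    `fetch` is any function that takes a URL and returns the page's HTML,
    so the loop can be exercised without touching the network.
    """
    for url in page_urls(total, per_page):
        yield fetch(url)
```

In a real run, `fetch` could be a small wrapper that returns `requests.get(url, timeout=10).text`. Passing the fetcher in as an argument keeps the paging logic separate from the HTTP call.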
Extract the following fields for each movie.
- title: name of the movie
- year: the year the movie was released
- time: length of the movie in minutes
- imdb_rating: the movie’s IMDb rating
- metascore: the movie’s Metascore rating
- votes: number of votes the movie received
- us_gross: the movie’s US gross earnings, in millions of dollars
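Once the raw text for each field has been pulled out of the page, it still needs to be converted into numbers. The sketch below shows one way to do that with regular expressions. The raw formats it expects (for example `(1994)`, `142 min`, `1,234,567`, `$28.34M`) are assumptions about how the page renders these values, and `first_number` and `clean_movie` are hypothetical names.

```python
import re
from typing import Optional


def first_number(text: Optional[str], pattern: str) -> Optional[str]:
    """Return the first match of `pattern` in `text` with commas removed,
    or None when the text is missing or contains no match."""
    if not text:
        return None
    match = re.search(pattern, text)
    return match.group().replace(",", "") if match else None


def clean_movie(raw: dict) -> dict:
    """Convert raw scraped strings into typed values, one key per field."""
    year = first_number(raw.get("year"), r"\d{4}")
    time = first_number(raw.get("time"), r"\d+")
    rating = first_number(raw.get("imdb_rating"), r"\d+(?:\.\d+)?")
    metascore = first_number(raw.get("metascore"), r"\d+")
    votes = first_number(raw.get("votes"), r"\d[\d,]*")
    gross = first_number(raw.get("us_gross"), r"\d+(?:\.\d+)?")
    return {
        "title": raw.get("title"),
        "year": int(year) if year else None,
        "time": int(time) if time else None,                 # minutes
        "imdb_rating": float(rating) if rating else None,
        "metascore": int(metascore) if metascore else None,
        "votes": int(votes) if votes else None,
        "us_gross": float(gross) if gross else None,         # millions of dollars
    }


# Made-up sample values, used only to illustrate the assumed raw formats.
sample = {
    "title": "Example Movie",
    "year": "(I) (2015)",
    "time": "142 min",
    "imdb_rating": "8.1",
    "metascore": "82   ",
    "votes": "1,234,567",
}
cleaned = clean_movie(sample)
```

Every key is always present in the result, so a list of these dictionaries has a consistent set of columns when it is later turned into a DataFrame.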
Not every field is available for every movie. The code should not stop or break when data is missing.
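One way to keep the scraper running when a field is absent is to wrap each lookup so that a failure produces a default value instead of an exception. The helper below is a minimal sketch under that assumption; `safe` is a hypothetical name, and the dictionary stands in for a parsed movie entry. With BeautifulSoup, `find` returns `None` when nothing matches, so calling `.get_text()` on the result raises `AttributeError`, which this wrapper catches.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def safe(extract: Callable[[], T], default: Optional[T] = None) -> Optional[T]:
    """Run `extract` and return its result, or `default` if the lookup fails.

    Only the errors a missing or malformed field typically raises are
    caught; anything else still propagates so real bugs stay visible.
    """
    try:
        return extract()
    except (AttributeError, IndexError, KeyError, TypeError, ValueError):
        return default


# Stand-in for one parsed movie entry with no Metascore (made-up data).
card = {"title": "Example Movie"}

title = safe(lambda: card["title"])
metascore = safe(lambda: int(card["metascore"]))  # KeyError becomes None
```

Catching a narrow set of exceptions rather than a bare `except` is a deliberate choice here: missing data is skipped quietly, while unexpected failures such as network errors are not hidden.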
The script uses the following packages:
- jupyterlab
- pandas
- beautifulsoup4
- requests
Install the required packages:
If you want to use a virtual environment, create and activate it with the following commands:
python3 -m venv .venv
source ./.venv/bin/activate
To install the required packages:
pip3 install -r requirements.txt
You can get the data with either the Jupyter notebook or the script. To run the script,
activate your virtual environment and run the following command:
python imdb_scraper.py