
This repository extracts data from the clothing website Puma using the Python framework Scrapy.


jpradas1/Crawling_Engineer_Retail

Crawling Engineer Challenge

This repository extracts data from the clothing website Puma using the Python framework Scrapy. The scraper has a single spider, which extracts the following data for each product:

  • Product id
  • Product title
  • Product brand
  • Product description
  • Product current price
  • Product original price
  • Product availability
  • A list of all the image URLs
  • Product URL
  • All available colors for the product
  • All available sizes for the product
  • Category paths leading to the product (e.g. Women > Footwear > Running)
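As an illustration of how the fields listed above could be grouped into one record, here is a minimal sketch using a standard-library dataclass (Scrapy accepts dataclass objects as items since version 2.2). The class name `ProductItem`, the field names, the types, and the sample values are all assumptions made for this example; the repository's actual item definition may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ProductItem:
    """Hypothetical item mirroring the fields listed above."""

    product_id: str
    title: str
    brand: str
    url: str
    description: str = ""
    current_price: Optional[float] = None
    original_price: Optional[float] = None
    available: bool = False
    image_urls: List[str] = field(default_factory=list)
    colors: List[str] = field(default_factory=list)
    sizes: List[str] = field(default_factory=list)
    # Each entry is one breadcrumb trail, e.g. ["Women", "Footwear", "Running"].
    category_paths: List[List[str]] = field(default_factory=list)


# Made-up sample values, used only to show the shape of a record.
example = ProductItem(
    product_id="000000-01",
    title="Example Running Shoe",
    brand="PUMA",
    url="https://example.com/product/000000-01",
    current_price=59.99,
    original_price=79.99,
    available=True,
    colors=["Black", "White"],
    sizes=["38", "39", "40"],
    category_paths=[["Women", "Footwear", "Running"]],
)

# asdict() yields a plain dictionary that a storage backend can consume.
record = asdict(example)
```

Keeping every category path as its own list lets a product that appears under several menus retain all of its breadcrumb trails.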

Running the Code

To run this scraper properly, follow these steps.

Virtual Environment

First, create a virtual environment to isolate this project's dependencies. Either conda or the Python module venv works; here we use venv.

python -m venv venv

To activate this environment:

  • Linux case:
source venv/bin/activate
  • Windows case:
venv\Scripts\activate

Installing Libraries

pip install scrapy pymongo

Run Spiders

Before running any spider, we need a MongoDB server to store the data. This server can be local or in the cloud. To create a local one, use Docker.

sudo docker pull mongo
sudo docker run -d -p 27017:27017 --name mongodb mongo
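Before launching a spider, it can help to confirm that the container is accepting connections on the published port. The following is a minimal sketch that uses only the standard library; the helper name `mongo_port_open` is hypothetical, and the check only verifies that a TCP connection succeeds, not that the listener is actually MongoDB.

```python
import socket


def mongo_port_open(host="localhost", port=27017, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timeout, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("MongoDB port reachable:", mongo_port_open())
```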

Because the Puma crawler needs the MongoDB server, pass the connection settings when running the spider:

scrapy crawl -s MONGODB_URI="mongodb://localhost:27017/" -s MONGODB_DATABASE="Products" puma

To suppress the log output, redirect stderr to /dev/null:

scrapy crawl -s MONGODB_URI="mongodb://localhost:27017/" -s MONGODB_DATABASE="Products" puma 2>/dev/null
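The `-s` options set the Scrapy settings `MONGODB_URI` and `MONGODB_DATABASE` for that run. One plausible way a pipeline could consume them is sketched below, following the common Scrapy item-pipeline pattern. This is an assumption about the design, not the repository's actual code: the class name `MongoPipeline`, the collection name `products`, and the default values are hypothetical.

```python
from dataclasses import asdict, is_dataclass


class MongoPipeline:
    """Sketch of an item pipeline that stores each item in MongoDB."""

    collection_name = "products"  # hypothetical collection name

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.client = None
        self.db = None

    @classmethod
    def from_crawler(cls, crawler):
        # Values passed with `-s NAME=value` are available in crawler.settings.
        return cls(
            mongo_uri=crawler.settings.get("MONGODB_URI", "mongodb://localhost:27017/"),
            mongo_db=crawler.settings.get("MONGODB_DATABASE", "Products"),
        )

    def open_spider(self, spider):
        # Imported here so the module loads even where pymongo is not installed.
        import pymongo

        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Accept dataclass items as well as dict-like items.
        document = asdict(item) if is_dataclass(item) else dict(item)
        self.db[self.collection_name].insert_one(document)
        return item
```

A pipeline like this would also need to be enabled through the `ITEM_PIPELINES` setting in the project's settings file.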

Dataset

The final extracted data is located in the compressed file products.tar.gz, which contains all data extracted from the website: 24544 items.

tar -xzvf products.tar.gz
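To inspect the archive without extracting it, Python's standard `tarfile` module can list its members. This is a minimal sketch: the helper name `list_archive` is hypothetical, and since the layout of the files inside the archive is not described here, the function only reports names and sizes.

```python
import tarfile


def list_archive(path="products.tar.gz"):
    """Return (name, size in bytes) for every regular file in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]
```

For example, `for name, size in list_archive(): print(name, size)` prints one line per stored file.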
