ridwan-salau/scrapy_bot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DEVEX SCRAPY_BOT

Tool Used: Scrapy Framework

How to Run

After cloning the project, cd into the scrapy_bot folder and do the following:

If using pip, do:

  • python -m venv commonshare-env
  • source commonshare-env/bin/activate (on Windows: commonshare-env\Scripts\activate)
  • pip install -r requirements.txt
  • scrapy crawl devex_details

If using conda, do:

  • conda env create -f environment.yml
  • conda activate commonshare-env
  • scrapy crawl devex_details

Brief Description

The goal of the project is to extract the following details from devex.com into separate CSV files:

  • Organization Information

    • Company Name
    • Company Logo
    • Company Description
    • Organization Type
    • Staff
    • Development Budget
    • Headquarters
    • Founded
    • Website Link
  • Sectors Information

    • Funded, comma separated in one column
    • Countries, comma separated in one column
    • Skills, comma separated in one column
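
The Organization Information fields above map onto one CSV row per organization. A minimal sketch of that layout using only the Python standard library is shown below; the helper name `write_org_rows` and the column order are illustrative, not taken from the project's source.

```python
import csv

# Column order mirroring the Organization Information fields listed above.
ORG_FIELDS = [
    "Company Name", "Company Logo", "Company Description", "Organization Type",
    "Staff", "Development Budget", "Headquarters", "Founded", "Website Link",
]

def write_org_rows(path, rows):
    """Write scraped organization dicts to a CSV file (illustrative helper)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=ORG_FIELDS)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

In the actual spider this would typically be handled by Scrapy's built-in CSV feed export rather than a hand-rolled writer.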

Challenges Faced and Solutions Attempted

I faced two major challenges, and in attempting to solve them, I encountered several more. The two challenges were:

  • Captcha protection blocking scraping by returning a 403 error.
    • This was solved by passing headers that let the scraper bot mimic a regular browser.
  • The bug detailed in this GitHub issue, which is peculiar to macOS.
    • I attempted to work around it by launching a Linux EC2 instance on AWS, but soon faced another issue:
      • Because the instance uses a public IP (so I believe), the website was quick to blacklist the instance's IP address, preventing it from scraping much. A 403 error was returned after a few scrapes.
        • I attempted to solve this by introducing a 403-error callback that fetches the cached version of the pages from Google Cache, but this was only modestly successful, as Google intermittently blocked it as well. Also, rather oddly, the component of the application that writes the Contract information was not receiving any values.
    • As the first set of solutions failed, I resorted to finding a Windows PC, which I used to run the project. I faced no issues running it on Windows. I presume it would also work fine on a Linux OS that is not cloud-hosted, or with proxy rotation.
      • This solution also began to fail after a while, which points to the possibility of the host getting fingerprinted.
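
The browser-mimicking fix for the first challenge can be sketched as a Scrapy settings fragment. The exact header values below are assumptions for illustration, not the project's actual configuration:

```python
# Illustrative browser-like headers for Scrapy's DEFAULT_REQUEST_HEADERS
# setting (settings.py). Values are assumed, not copied from the project.
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.devex.com/",
}
```

Sending a realistic User-Agent and Accept headers is often enough to get past naive bot checks that reject Scrapy's default identification.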
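
The Google Cache fallback described above can be sketched as a small helper that maps a blocked page URL to its cached counterpart; the function name and the errback wiring are hypothetical, and in the spider this URL would be re-requested when the original response returns a 403:

```python
from urllib.parse import quote

# Base of Google's public page cache (as used at the time of writing).
GOOGLE_CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"

def to_google_cache_url(url):
    """Map a blocked page URL to its Google Cache URL (hypothetical helper).

    In a Scrapy 403 errback, a new Request would be issued against the
    returned URL instead of retrying the blocked original.
    """
    return GOOGLE_CACHE_PREFIX + quote(url, safe="")
```

As noted above, this is only a partial mitigation: Google itself may rate-limit or block repeated cache lookups.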

About

Scrapy bot to extract pages from devex.com
