HuaDeity/CrawlDemo

CrawlDemo

Chinese README (中文说明)

This project demonstrates how to crawl images from image-search websites.

Features

  • Crawls a website automatically, paging through results and retrieving images.

  • Supports multiple browsers.

Installation

Requirements

Python 3.7+
Chrome / Firefox / Edge / Safari

Install

git clone https://github.com/HuaDeity/CrawlDemo.git
cd CrawlDemo
pip install -r requirements.txt

Usage

cd millitary
scrapy crawl gettyimages -a search_term=aircraftcarrier -a page_number=3 -a browser=chrome
scrapy crawl baidu -a search_term=航空母舰 -a image_number=10000 -a browser=chrome

To run a crawl, provide the following information:

  • Website (gettyimages/alamy/google/baidu)

  • Search term (keyword)

  • Number of pages to crawl (page_number), or number of images to download (image_number, Baidu only)

  • Preferred web browser (chrome/firefox/edge/safari)
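
Scrapy passes each `-a name=value` pair to the spider as a string keyword argument, so numeric values such as `page_number` need converting before use. The sketch below shows one way a spider could validate these arguments; `parse_spider_args` and `SUPPORTED_BROWSERS` are hypothetical names used for illustration and are not taken from this repository.

```python
# Hypothetical helper: validates the raw string values passed with `-a`.
SUPPORTED_BROWSERS = {"chrome", "firefox", "edge", "safari"}


def parse_spider_args(search_term, browser="chrome",
                      page_number=None, image_number=None):
    """Return a dict of cleaned arguments; raise ValueError on bad input."""
    if not search_term:
        raise ValueError("search_term is required")

    browser = browser.lower()
    if browser not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")

    args = {"search_term": search_term, "browser": browser}

    # Counts arrive as strings from the command line, so convert them.
    for name, raw in (("page_number", page_number),
                      ("image_number", image_number)):
        if raw is None:
            continue
        value = int(raw)
        if value < 1:
            raise ValueError(f"{name} must be at least 1")
        args[name] = value

    return args
```

A spider's `__init__` could call such a helper with the keyword arguments it receives and store the result on `self`.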

Currently supported websites: GettyImages, Alamy, Google, and Baidu.

Tips

  • GettyImages returns different results in different regions. If a search does not return what you expect, adjust the keyword, for example by adding hyphens.

  • Images downloaded from Alamy need a uniform crop of approximately 20 px from the bottom.

  • The Baidu spider does not obey Baidu's robots.txt because of the site's anti-crawling mechanism. This may violate the website's terms of service.
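
The Alamy crop mentioned above can be done after download. The following is a minimal sketch, assuming the Pillow library is installed; the function names and the default of 20 px are illustrative and not part of this repository.

```python
def bottom_crop_box(width, height, bottom_px=20):
    """Return the (left, upper, right, lower) box that drops the bottom strip."""
    if bottom_px < 0 or bottom_px >= height:
        raise ValueError("bottom_px must be between 0 and height - 1")
    return (0, 0, width, height - bottom_px)


def crop_image_file(src_path, dst_path, bottom_px=20):
    """Save a copy of src_path with bottom_px pixels removed from the bottom."""
    # Assumption: Pillow is available; it is imported here so that the
    # box calculation above works without it.
    from PIL import Image

    with Image.open(src_path) as img:
        box = bottom_crop_box(img.width, img.height, bottom_px)
        img.crop(box).save(dst_path)
```

For a 640 x 480 image, `bottom_crop_box(640, 480)` gives `(0, 0, 640, 460)`.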

Comparison of Download Speeds

  • Google and Baidu load results continuously as the page scrolls, so there is no per-page loading time.

  • GettyImages uses conventional pagination.

  • Alamy has to wait for the images on each page to load, so it is relatively slow.

FAQ

  1. To avoid having GettyImages restrict access from your IP address, reduce your crawling frequency. If downloads still fail, change your IP address or wait a while before trying again.
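
One way to reduce crawling frequency is through Scrapy's built-in throttling settings, which can be overridden per run with the `-s NAME=value` option. The sketch below uses real Scrapy setting names, but the values and the `to_cli_flags` helper are assumptions for illustration, not this project's configuration.

```python
# Illustrative values only; tune them for the target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 3,                  # seconds between requests to a site
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the delay around that value
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    "AUTOTHROTTLE_MAX_DELAY": 60,         # upper bound for the adaptive delay
}


def to_cli_flags(settings):
    """Turn a settings dict into `-s NAME=value` arguments for `scrapy crawl`."""
    flags = []
    for name, value in settings.items():
        flags.extend(["-s", f"{name}={value}"])
    return flags


if __name__ == "__main__":
    command = ["scrapy", "crawl", "gettyimages",
               "-a", "search_term=aircraftcarrier"]
    print(" ".join(command + to_cli_flags(POLITE_SETTINGS)))
```

The same dictionary could instead be assigned to a spider's `custom_settings` attribute so that the limits apply on every run.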

Contributing

Contributions are welcome! Please refer to our CONTRIBUTING.md for details on how to contribute.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

HuaDeity
Email
