HAipproxy

中文文档 (Chinese documentation) | README

This project crawls proxy IP resources from the Internet. Its goal is to provide an anonymous IP proxy pool with high availability and low latency for distributed spiders.

Features

  • Distributed crawlers with high performance, powered by scrapy and redis
  • Large-scale proxy IP resources
  • HA design for both crawlers and schedulers
  • Flexible architecture with task routing
  • Support for HTTP/HTTPS and SOCKS5 proxies
  • MIT license. Feel free to do whatever you want

Quick start

Standalone

Server

  • Install Python3 and Redis Server

  • Change the redis settings in config/settings.py, such as REDIS_HOST and REDIS_PASSWORD, to match your redis configuration
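
    A minimal sketch of the relevant lines in config/settings.py; REDIS_HOST and REDIS_PASSWORD are the real setting names, but the values here are placeholders, not project defaults:

    REDIS_HOST = '127.0.0.1'          # address of your redis server
    REDIS_PASSWORD = 'your_password'  # should match requirepass in redis.conf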

  • Install scrapy-splash and change SPLASH_URL in config/settings.py
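
    For example, assuming a local Splash instance on its default port 8050 (it can be started with docker run -p 8050:8050 scrapinghub/splash):

    SPLASH_URL = 'http://127.0.0.1:8050'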

  • Install dependencies

    pip install -r requirements.txt

  • Start the scrapy workers, including the ip proxy crawler and the validator

    python crawler_booter.py --usage crawler

    python crawler_booter.py --usage validator

  • Start the task schedulers, including the crawler task scheduler and the validator task scheduler

    python scheduler_booter.py --usage crawler

    python scheduler_booter.py --usage validator

Client (here we use squid)

  • Install squid, copy its conf as a backup, allow all http access and then start squid; take ubuntu for example

    sudo apt-get install squid

    sudo cp /etc/squid/squid.conf /etc/squid/squid.conf.backup

    sudo sed -i 's/http_access deny all/http_access allow all/g' /etc/squid/squid.conf

    sudo service squid start

  • Change SQUID_BIN_PATH, SQUID_CONF_PATH and SQUID_TEMPLATE_PATH in config/settings.py according to your OS
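
    A sketch for a typical ubuntu setup; the paths are illustrative, so verify them on your machine (e.g. with which squid):

    SQUID_BIN_PATH = '/usr/sbin/squid'                    # squid binary
    SQUID_CONF_PATH = '/etc/squid/squid.conf'             # conf file squid loads
    SQUID_TEMPLATE_PATH = '/etc/squid/squid.conf.backup'  # clean conf used as the template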

  • Update the squid conf periodically, e.g. from a cron job

    sudo python squid_update.py

  • After a while, you can send requests through the squid proxy; the proxy url is 'http://squid_host:3128', e.g.

    import requests

    # the proxy url uses the http scheme: squid speaks plain HTTP on port
    # 3128 and tunnels HTTPS requests via CONNECT
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)
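
    Since squid_update.py keeps rewriting the squid conf with the latest validated proxies, clients only ever talk to squid itself; proxy rotation happens behind squid and is transparent to your code.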

Dockerize

  • Install Docker

  • Install docker-compose

    pip install -U docker-compose

  • Change SPLASH_URL and REDIS_HOST in config/settings.py
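
    For example, assuming the docker-compose services for splash and redis are named splash and redis (check docker-compose.yml for the actual service names):

    SPLASH_URL = 'http://splash:8050'
    REDIS_HOST = 'redis'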

  • Start all the containers using docker-compose

    docker-compose up

  • Send requests with squid proxies

    import requests
    proxies = {'https': 'http://127.0.0.1:3128'}
    resp = requests.get('https://httpbin.org/ip', proxies=proxies)
    print(resp.text)

WorkFlow

Other important things

  • This project depends heavily on redis; if you want to replace redis with another message queue or database, do so at your own risk
  • If there is no Great Firewall in your country, set proxy_mode=0 in both gfw_spider.py and ajax_gfw_spider.py. If you don't want to crawl some websites, set enable=0 in rules.py (see the sketch after this list)
  • Issues and PRs are welcome
  • Just star it if it's useful to you
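
A minimal sketch of disabling one crawl rule; enable is the documented switch, while the CRAWLER_TASKS name and the other fields are assumptions about how rules.py is actually laid out:

    # config/rules.py (illustrative entry only)
    CRAWLER_TASKS = [
        {
            'name': 'example-proxy-site',  # hypothetical rule name
            'enable': 0,                   # 0 disables crawling this source
        },
    ]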

Reference

Thanks to all the contributors of the following projects.

dungproxy

proxyspider

ProxyPool

proxy_pool

ProxyPool

IPProxyTool

IPProxyPool

proxy_list

proxy_pool