This is just another URL scraper in Python
This is a simple example of thread-safe use of requests, with retry, backoff, and support for user-supplied headers. It's simple enough to be easy to comprehend, but it also exposes enough tunables and options to experiment with and find the point of diminishing returns for parallelized requests.
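The core pattern, very roughly, looks like the sketch below: one requests.Session per worker thread (a Session is not guaranteed to be thread-safe across threads), with urllib3's Retry handling retries and exponential backoff. This is a minimal illustration of the technique, not the repo's actual code; names like get_session and fetch are made up for the example.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    _local = threading.local()  # holds one Session per thread

    def get_session(retries=3, backoff=0.5, headers=None):
        # Lazily build a thread-local Session with retry/backoff mounted.
        if not hasattr(_local, "session"):
            session = requests.Session()
            retry = Retry(
                total=retries,
                backoff_factor=backoff,  # exponential backoff between retries
                status_forcelist=(429, 500, 502, 503, 504),
            )
            adapter = HTTPAdapter(max_retries=retry)
            session.mount("http://", adapter)
            session.mount("https://", adapter)
            if headers:
                session.headers.update(headers)
            _local.session = session
        return _local.session

    def fetch(url, timeout=10):
        # Each worker thread reuses its own Session via get_session().
        resp = get_session().get(url, timeout=timeout)
        return url, resp.status_code

    urls = ["https://example.com", "https://example.org"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, status in pool.map(fetch, urls):
            print(status, url)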
- git clone this repo
- create a virtualenv:
python3 -m venv venv
- activate venv:
. ./venv/bin/activate
- install requirements:
pip install -r requirements.txt
- run the program, as described in "usage"
./scrape.py --help
usage: scrape.py [-h] [-f FILE] [-l LOGFILE] [-t THREADS] [-H HEADERS]
[--timeout TIMEOUT] [-d]
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE file to read (list of URLs to scrape, default =>
list.txt)
-l LOGFILE, --logfile LOGFILE
logfile to write (default => log.txt)
-t THREADS, --threads THREADS
number of threads to run
-H HEADERS, --headers HEADERS
headers to pass to the worker, specified in json
format
--timeout TIMEOUT timeout for each request
-d, --debug enable debugging
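For example, a run with eight threads, a five-second timeout, and a custom User-Agent (the header value here is just illustrative) might look like:

    ./scrape.py -f list.txt -t 8 --timeout 5 -H '{"User-Agent": "my-scraper/1.0"}'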
The included list.txt is this handy gist. It can also be fun to use your personal browser history or bookmarks.
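To find the point of diminishing returns mentioned above, one simple approach is to sweep the thread count and time each run; this shell loop is just an illustration:

    for t in 1 2 4 8 16 32; do
        echo "threads=$t"
        time ./scrape.py -f list.txt -t "$t" -l "log-$t.txt"
    done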