This is just another URL scraper in Python
This is a simple example of thread-safe use of requests, with retry, backoff, and support for user-supplied headers. It's simple enough to be easy to comprehend, but it also exposes enough tunables and options to experiment with and find the point of diminishing returns for parallelized requests.
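The core pattern, very roughly, looks like the sketch below: one requests.Session per worker thread (a Session is not guaranteed to be thread-safe across threads), with urllib3's Retry handling retries and exponential backoff. This is a minimal illustration of the technique, not the repo's actual code; names like get_session and fetch are made up for the example.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    _local = threading.local()  # holds one Session per thread

    def get_session(retries=3, backoff=0.5, headers=None):
        # Lazily build a thread-local Session with retry/backoff mounted.
        if not hasattr(_local, "session"):
            session = requests.Session()
            retry = Retry(
                total=retries,
                backoff_factor=backoff,  # exponential backoff between retries
                status_forcelist=(429, 500, 502, 503, 504),
            )
            adapter = HTTPAdapter(max_retries=retry)
            session.mount("http://", adapter)
            session.mount("https://", adapter)
            if headers:
                session.headers.update(headers)
            _local.session = session
        return _local.session

    def fetch(url, timeout=10):
        # Each worker thread reuses its own Session via get_session().
        resp = get_session().get(url, timeout=timeout)
        return url, resp.status_code

    urls = ["https://example.com", "https://example.org"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, status in pool.map(fetch, urls):
            print(status, url)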
- git clone this repo
- create a virtualenv:
python3 -m venv venv
- activate venv:
. ./venv/bin/activate
- install requirements:
pip install -r requirements.txt
- run the program, as described in "usage"
./scrape.py --help
usage: scrape.py [-h] [-f FILE] [-l LOGFILE] [-t THREADS] [-H HEADERS]
[--timeout TIMEOUT] [-d]
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE file to read (list of URLs to scrape, default =>
list.txt)
-l LOGFILE, --logfile LOGFILE
logfile to write (default => log.txt)
-t THREADS, --threads THREADS
number of threads to run
-H HEADERS, --headers HEADERS
headers to pass to the worker, specified in json
format
--timeout TIMEOUT timeout for each request
-d, --debug enable debugging
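For example, a run with eight threads, a five-second timeout, and a custom User-Agent (the header value here is just illustrative) might look like:

    ./scrape.py -f list.txt -t 8 --timeout 5 -H '{"User-Agent": "my-scraper/1.0"}'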
The included list.txt is this handy gist. It can also be fun to use your personal browser history or bookmarks.
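To find the point of diminishing returns mentioned above, one simple approach is to sweep the thread count and time each run; this shell loop is just an illustration:

    for t in 1 2 4 8 16 32; do
        echo "threads=$t"
        time ./scrape.py -f list.txt -t "$t" -l "log-$t.txt"
    done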