crawler

A Web crawler.

Starting from a given URL, crawls web pages up to a specified depth.
Saves pages that contain a keyword (if provided) into an SQLite database.
Supports multi-threading.
Supports logging.
Supports self-testing.
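
The depth-limited crawl and keyword filter described above can be sketched as follows. This is a minimal, single-threaded illustration and not the code in crawler.py: the names `crawl`, `LinkParser`, the injected `fetch` callable, and the `pages` table layout are all assumptions made for the example, as is the convention that depth 0 means "the start page only".

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, depth, fetch, conn, keyword=None):
    """Breadth-first crawl up to `depth` link hops from `start_url`.

    `fetch` is an assumed callable mapping a URL to its HTML text,
    or None on failure. Pages containing `keyword` (every page when
    keyword is None) are stored in the SQLite connection `conn`.
    Returns the set of URLs that were discovered.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, level = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        if keyword is None or keyword in html:
            conn.execute(
                "INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html)
            )
        if level >= depth:
            continue  # depth limit reached: do not follow links further
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    conn.commit()
    return seen
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real fetcher could wrap `urllib.request.urlopen`. A multi-threaded variant would hand the queue to a pool of workers instead of the single loop shown here.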

Usage

main.py [-h] -u URL -d DEPTH [--logfile FILE] [--loglevel {1,2,3,4,5}]
               [--thread NUM] [--dbfile FILE] [--key KEYWORD] [--testself]

Optional arguments:

  -h, --help            show this help message and exit
  -u URL                Specify the start URL
  -d DEPTH              Specify the crawling depth
  --logfile FILE        The log file path. Default: spider.log
  --loglevel {1,2,3,4,5}
                        The level of logging detail. Larger numbers record
                        more detail. Default: 3
  --thread NUM          The number of threads. Default: 10
  --dbfile FILE         The SQLite file path. Default: data.sql
  --key KEYWORD         The keyword for crawling. Default: None. For more than
                        one word, quote them. Example: --key 'Hello world'
  --testself            Crawler self test
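
For readers who want to see how such a command line could be declared, here is a hedged sketch using the standard `argparse` module. It mirrors the flags and defaults listed above, but it is not taken from options.py: the function name `build_parser`, the `dest` names, and the `LOG_LEVELS` mapping from 1..5 onto the standard `logging` levels are assumptions for illustration.

```python
import argparse
import logging

# Assumed mapping: a larger number records more detail.
LOG_LEVELS = {
    1: logging.CRITICAL,
    2: logging.ERROR,
    3: logging.WARNING,
    4: logging.INFO,
    5: logging.DEBUG,
}


def build_parser():
    """Build a parser that accepts the options documented above."""
    parser = argparse.ArgumentParser(prog="main.py", description="A Web crawler.")
    parser.add_argument("-u", dest="url", metavar="URL", required=True,
                        help="Specify the start URL")
    parser.add_argument("-d", dest="depth", metavar="DEPTH", type=int,
                        required=True, help="Specify the crawling depth")
    parser.add_argument("--logfile", metavar="FILE", default="spider.log",
                        help="The log file path")
    parser.add_argument("--loglevel", type=int, choices=range(1, 6), default=3,
                        help="The level of logging detail")
    parser.add_argument("--thread", metavar="NUM", type=int, default=10,
                        help="The number of threads")
    parser.add_argument("--dbfile", metavar="FILE", default="data.sql",
                        help="The SQLite file path")
    parser.add_argument("--key", metavar="KEYWORD", default=None,
                        help="The keyword for crawling")
    parser.add_argument("--testself", action="store_true",
                        help="Crawler self test")
    return parser
```

With this sketch, `build_parser().parse_args(["-u", "http://example.test", "-d", "2", "--key", "Hello world"])` yields a namespace whose `thread`, `loglevel`, and `dbfile` fall back to the documented defaults.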

