GitHub - nghiaqh/pycrawler: cli crawler

This repository has been archived by the owner on Jan 1, 2019. It is now read-only.

FILES
- PyCrawler.py
- README
- config.py
- setup.py
- sitelinks.sql


PREREQUISITES
1. Python 2.6
2. MySQL 5.0

INSTALLATION
1. Create a database named sitelinks
2. Update the database configuration in config.py (hostname, username, password)
3. Run setup.py: python setup.py
4. Update config.py:
   - crawl_domain: domain you need to crawl.
   - fileext: file types you don't want to crawl.
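
A minimal sketch of how the two settings above might be used to filter URLs. Only the names crawl_domain and fileext come from this README; the example values and the should_crawl helper are hypothetical, and the code uses the Python 3 standard library rather than the Python 2.6 listed above.

```python
from urllib.parse import urlparse

# Illustrative values only; the real config.py may hold different ones.
crawl_domain = "example.com"
fileext = [".jpg", ".png", ".pdf", ".zip"]


def should_crawl(url):
    """Return True if url is on crawl_domain and has no skipped extension.

    Hypothetical helper, not taken from PyCrawler.py.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Accept the domain itself and any of its subdomains.
    on_domain = host == crawl_domain or host.endswith("." + crawl_domain)
    # Compare extensions case-insensitively against the skip list.
    skipped = parsed.path.lower().endswith(tuple(fileext))
    return on_domain and not skipped
```

With these values, should_crawl("http://example.com/a.html") is accepted, while a URL ending in .JPG or a URL on another domain is rejected.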

TO START CRAWLING
Run PyCrawler.py:
- The crawler starts with the first URL in the queue (initialized during setup).
- It parses the crawled HTML to extract links and adds them to the queue.
- It repeats these steps until no URLs remain in the queue.
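
The loop described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in PyCrawler.py: it keeps the queue in memory instead of MySQL, takes a caller-supplied fetch function (hypothetical) so that it runs without a network, and uses the Python 3 standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Crawl from start_url until the queue is empty.

    fetch(url) is a hypothetical callable returning (status, html).
    Returns a dict mapping each visited URL to its status.
    """
    queue = deque([start_url])   # URLs waiting to be crawled
    seen = {start_url}           # URLs already queued, to avoid repeats
    results = {}
    while queue:
        url = queue.popleft()
        status, html = fetch(url)
        results[url] = status
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Passing a fetch function backed by a small dictionary of pages is enough to try the loop offline; a real crawler would also filter links by domain and file extension before queuing them.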

- Each URL is saved to the database with one of these statuses:
    200 OK
    301/302 Redirect
    404 Not found
    403 Not allowed
    0 Exception raised while requesting the URL
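
One way to obtain the statuses listed above, sketched with the Python 3 standard library. The names NoRedirect and fetch_status are hypothetical and not taken from PyCrawler.py. The sketch assumes redirects should be reported as 301/302 rather than followed, and that any non-HTTP failure is recorded as 0.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so 301/302 stay visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None leaves the redirect unhandled, so urllib
        # raises HTTPError carrying the original 3xx status code.
        return None


def fetch_status(url, opener=None, timeout=10):
    """Return the HTTP status for url, or 0 if the request fails."""
    if opener is None:
        opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        # 301, 302, 403, 404 and other HTTP error statuses end up here.
        return err.code
    except Exception:
        # DNS failure, timeout, malformed URL, and so on.
        return 0
```

A malformed URL such as "not a url" yields 0 because urllib raises ValueError before any request is sent.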
