This repository has been archived by the owner on Jan 1, 2019. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
nghiaqh/pycrawler
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
PREREQUIREMENTS 1. Python 2.6 2. MySQL 5.0 INSTALLATION 1. Create a database named sitelinks 2. Update the database configuration in config.py (hostname, username, password) 3. Run setup.py: python setup.python 4. Update config.py: - crawl_domain: domain you need to crawl. - fileext: file types you don't want to crawl. TO START CRAWLING Run PyCrawler.py: - the crawler will start with the first URL in queue (initialized when setup), - parses the crawled html to get links & adds them to queue. - It repeats above steps until there's no URLs in queue. - URLs are saved into database with status: 200 OK 302/301 Redirect 404 Not found 403 Not allowed 0 Exception when request the URL
About
cli crawler
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published