titleCrawler

A multi-threaded Python crawler that collects the titles of websites.
It uses beautifulsoup4 + html5lib, so you need to install them on your system:
- pip install beautifulsoup4
- pip install html5lib
In "main.py", change QUEUE_FILE_PATH, QUEUE_FILE_NAME, DB_FILE_PATH and DB_FILE_NAME to match your system.
You also need to change the SQLite table name in "main.py".
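
As a rough illustration of the settings described above, the block below shows what the edited values might look like. The four path names come from this README; every value, the `TABLE_NAME` variable name, and the way the paths are joined are assumptions made for the example, not a copy of the real "main.py".

```python
import os

# Hypothetical values: replace them with locations on your own system.
QUEUE_FILE_PATH = "/tmp/titleCrawler"   # directory holding the URL queue file
QUEUE_FILE_NAME = "test.csv"            # file listing the URLs to crawl
DB_FILE_PATH = "/tmp/titleCrawler"      # directory holding the SQLite database
DB_FILE_NAME = "test_db"                # SQLite database file
TABLE_NAME = "url_title_rel"            # assumed variable name for the table

# Assumed usage: directory and file name are combined into full paths.
queue_file = os.path.join(QUEUE_FILE_PATH, QUEUE_FILE_NAME)
db_file = os.path.join(DB_FILE_PATH, DB_FILE_NAME)
```
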
To run the crawler, type the following on the command line:

python main.py

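You can also run the crawler in the background if that suits your requirements (see "main.py usage" further down).
If anything about titleCrawler is unclear, contact the author using the details at the end of this file.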

Note on exception handling: data is never clean when you crawl real pages in the real world. Always keep this in mind.
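
To make the point about messy real-world data concrete, here is a minimal sketch of worker threads that survive failing URLs. It is not the code from "spider.py"; `crawl_all`, `fetch_title` and `num_workers` are hypothetical names, and `fetch_title` stands for whatever function downloads a page and extracts its title.

```python
import queue
import threading


def crawl_all(urls, fetch_title, num_workers=4):
    """Call fetch_title(url) for every URL in worker threads.

    A URL whose fetch raises is recorded with a title of None instead of
    killing its worker thread.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            try:
                title = fetch_title(url)
            except Exception as exc:  # timeouts, bad encodings, broken HTML...
                print(f"failed: {url}: {exc!r}")
                title = None
            with lock:
                results[url] = title

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Catching the broad `Exception` is deliberate in this sketch: one bad page should cost one result, not the whole crawl.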

countValidCrawledTitles.py usage:

python countValidCrawledTitles.py db_path table_name
e.g.:
python countValidCrawledTitles.py test_db url_title_rel
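
For readers who want to see what such a count could look like, here is a small sketch using the standard `sqlite3` module. It is not the actual script: the `title` column name and the rule that a "valid" title is non-NULL and non-blank are assumptions, and `count_valid_titles` is a hypothetical function name.

```python
import sqlite3


def count_valid_titles(db_path, table_name):
    """Count rows whose title is neither NULL nor blank (assumed schema)."""
    # Table names cannot be bound as SQL parameters, so check the name first.
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    query = (
        f'SELECT COUNT(*) FROM "{table_name}" '
        "WHERE title IS NOT NULL AND TRIM(title) != ''"
    )
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()
```

With the example arguments above, the call would be `count_valid_titles("test_db", "url_title_rel")`.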

makeFakeUrls.py usage:

python makeFakeUrls.py filepath start_index write_cnt
e.g.:
python makeFakeUrls.py test.csv 1 100
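
The sketch below shows one way a generator with these three arguments might work. The output format (an index and a URL per CSV row), the `example.com` URL pattern, and the reading of `write_cnt` as "number of rows starting at `start_index`" are all assumptions; `make_fake_urls` is a hypothetical name, not the real script's function.

```python
import csv


def make_fake_urls(filepath, start_index, write_cnt):
    """Write write_cnt rows of (index, fake URL) to a CSV file.

    Returns the number of rows written.
    """
    with open(filepath, "w", newline="") as handle:
        writer = csv.writer(handle)
        for index in range(start_index, start_index + write_cnt):
            # Placeholder URL pattern, chosen only for illustration.
            writer.writerow([index, f"http://www.example.com/page/{index}"])
    return write_cnt
```

Under these assumptions, `make_fake_urls("test.csv", 1, 100)` would write rows for indexes 1 through 100.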

main.py usage (run in the background):

nice -n x nohup python main.py &
x is the nice value: -20 gives the highest priority and 19 the lowest. Negative values usually require root privileges.
e.g.:
nice -n -16 nohup python main.py &
You can follow the log as it is written by using:
tail -f nohup.out

date: 2016/07/29
author: zhangjinxing
email: [email protected]
tel: 15600616254
