adigaboy/web_crawler

Web Crawler CLI

About

This CLI tool crawls a given URL and calculates the URL ratio of each page it visits. Once crawling finishes (the depth limit is reached), the tool writes the results to a file and exits.
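The README does not define "URL ratio". As an illustration only, the sketch below assumes it means the share of links on a page that point to the same domain as the page itself. The names `LinkCollector` and `same_domain_ratio` and the sample HTML are hypothetical and are not taken from the repository.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def same_domain_ratio(page_url: str, html: str) -> float:
    """Assumed definition: same-domain links divided by all links on the page."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    if not collector.links:
        return 0.0
    page_host = urlparse(page_url).netloc
    same = sum(1 for link in collector.links if urlparse(link).netloc == page_host)
    return same / len(collector.links)


# Hypothetical page with two same-domain links and two external links.
sample_html = (
    '<a href="/about">About</a>'
    '<a href="https://example.com/contact">Contact</a>'
    '<a href="https://other.example.org/">Elsewhere</a>'
    '<a href="https://another.test/">Another</a>'
)
ratio = same_domain_ratio("https://example.com/", sample_html)
print(ratio)
```

If the tool uses a different definition, only the body of `same_domain_ratio` would need to change.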

System Design

The tool consists of two components:

WebCrawler

This module extracts the links found in each crawled page and calculates the ratio. The main logic uses an asyncio Queue to hold all the URLs to crawl. The core functionality is asynchronous in order to speed up web page fetches and avoid blocking.
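The queue-driven pattern described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `FAKE_PAGES`, `fetch_links`, `worker`, and `crawl` are hypothetical names, and an in-memory dictionary stands in for real HTTP fetches so the example runs without network access.

```python
import asyncio

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_PAGES = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}


async def fetch_links(url: str) -> list[str]:
    """Stand-in for an HTTP fetch plus link extraction."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_PAGES.get(url, [])


async def worker(queue: asyncio.Queue, seen: set[str], max_depth: int,
                 results: dict[str, int]) -> None:
    while True:
        url, depth = await queue.get()
        try:
            results[url] = depth
            # Pages at the depth limit are recorded but not expanded.
            if depth < max_depth:
                for link in await fetch_links(url):
                    if link not in seen:
                        seen.add(link)
                        queue.put_nowait((link, depth + 1))
        finally:
            queue.task_done()


async def crawl(start_url: str, max_depth: int, num_workers: int = 3) -> dict[str, int]:
    queue: asyncio.Queue = asyncio.Queue()
    seen = {start_url}
    results: dict[str, int] = {}
    queue.put_nowait((start_url, 0))
    workers = [asyncio.create_task(worker(queue, seen, max_depth, results))
               for _ in range(num_workers)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


crawled = asyncio.run(crawl("a", max_depth=2))
print(sorted(crawled))
```

Several workers share one queue, so slow fetches overlap instead of running one after another, and `queue.join()` gives a simple completion signal once the depth limit stops new URLs from being added.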

FileResultGenerator

This component writes the WebCrawler results to a TSV-formatted file.
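Writing TSV output can be done with Python's standard `csv` module by setting a tab delimiter. The sketch below is an assumption about how such a writer might look: the column names (`url`, `depth`, `ratio`), the sample rows, and the `write_tsv` helper are hypothetical, since the README does not list the actual output columns.

```python
import csv
import io

# Hypothetical result rows: (url, depth, ratio).
rows = [
    ("https://example.com/", 0, 0.5),
    ("https://example.com/about", 1, 0.25),
]


def write_tsv(rows, out) -> None:
    """Write a header line and one tab-separated line per result row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["url", "depth", "ratio"])
    writer.writerows(rows)


# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

To write to disk instead, the same helper could be passed a file opened with `open(path, "w", newline="", encoding="utf-8")`, where `path` is whatever output location the tool chooses.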

Technical Details

Python version

3.11

Virtual Environment Set Up

python3.11 -m venv <path_to_env>
source <path_to_env>/bin/activate # on Linux
<path_to_env>\Scripts\Activate.ps1 # on Windows (PowerShell)

python3.11 -m pip install -r requirements.txt

How To Use

Once the virtual environment is set up, run the tool with: python ./app.py

How to Test

Install the development requirements:

python3.11 -m pip install -r dev_requirements.txt

Then run the tests with a coverage report:

pytest crawler/tests --cov-report term-missing --cov=crawler

Author

Nal Zazi

About

Web crawler for home assignment
