Krawler: A Multithreaded Web Crawler in Python

An implementation of a simple web crawler in Python. The crawler is fully multithreaded and crawls the pages of a given domain.
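At a high level, a multithreaded domain crawler hands each discovered URL to a pool of worker threads and only follows links that stay on the target domain. The sketch below illustrates that general technique using the Python standard library and BeautifulSoup; it is not the code in src/main.py, and every name in it is illustrative.

# Minimal sketch of a thread-pool crawler restricted to one domain.
# Illustrative only; this is not the implementation in src/main.py.
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

from bs4 import BeautifulSoup


def fetch(url, domain):
    # Download one page and return the same-domain links found on it.
    with urllib.request.urlopen(url, timeout=10) as response:
        html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for anchor in soup.find_all("a", href=True):
        link = urllib.parse.urljoin(url, anchor["href"]).split("#")[0]
        if urllib.parse.urlparse(link).netloc == domain:
            links.add(link)
    return url, links


def crawl(domain, num_threads=4, max_pages=100):
    start_url = f"https://{domain}/"
    seen = {start_url}                 # only the main thread mutates this set
    crawled = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pending = [pool.submit(fetch, start_url, domain)]
        while pending:
            future = pending.pop(0)
            try:
                url, links = future.result()
            except Exception:
                continue               # skip pages that fail to download or parse
            crawled.append(url)
            for link in links:
                if link not in seen and len(seen) < max_pages:
                    seen.add(link)
                    pending.append(pool.submit(fetch, link, domain))
    return crawled


if __name__ == "__main__":
    for page in crawl("example.com", num_threads=8):
        print(page)

A thread pool suits this kind of workload because the worker threads spend most of their time waiting on network I/O, so Python's GIL is not a bottleneck.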

Installing Poetry

To get started, you need Poetry installed. You can install it by running the following command in your shell.

pip install poetry

Once Poetry is installed, run the following command from the root folder of this repository to create a virtual environment and install the project's dependencies.

poetry install

After that, enter the Poetry environment by invoking the poetry shell command.

poetry shell

Installing System Dependencies

If you are using a Debian-based system, you can install the system-wide dependencies (BeautifulSoup 4, plus DNS resolution and name-service caching support) by running the following command.

sudo apt-get install python3-bs4 libnss-resolve nscd

Running the Crawler

To run the crawler, you can use the following command.

pushd src && python3 main.py --domain <domain_name> --threads <number_of_threads> --output <output_file> && popd
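For example, the following invocation (the values are purely illustrative, and it assumes you are still inside the Poetry shell) crawls example.com with 8 threads and writes the results to output.txt.

pushd src && python3 main.py --domain example.com --threads 8 --output output.txt && popd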

License

This project is licensed under the MIT License - see the LICENSE file for details.