Web server-based web crawler

This is more of a toy project, so don't expect a full-fledged crawler.

Running

To run it, it's best to use the included Docker image:

docker build -t webcrawl .
docker run --rm -ti --name webcrawl -p 3000:3000 webcrawl -a 0.0.0.0:3000

The API should then be accessible at https://localhost:3000 on the host.

Quickstart

Schedule a crawl

curl -i -XPOST \
    -d '{"url": "https://some.host.example.com", "throttle": 100}' \
    https://localhost:3000/api/crawl

List all crawled domains

curl -i -XGET https://localhost:3000/api/domains

List URLs for a domain

curl -i -XGET 'https://localhost:3000/api/results?id=https://some.host.example.com'

Get the URL count for a domain

curl -i -XGET 'https://localhost:3000/api/results/count?id=https://some.host.example.com'

API

Get all crawled domains

GET /api/domains

Schedule a crawl

POST /api/crawl

Payload:

{
    "url": "https://example.com",
    "throttle": 50
}

where:

url: the URL to be crawled
throttle: the maximum number of concurrent requests

Response:

{
    "id": "https://example.com"
}

Additional status codes:

400 - if the payload is malformed or contains an invalid URL
409 - if the crawl is already pending
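
The status codes above can be handled in a small wrapper script. This is only a sketch: BASE_URL and the function names explain_crawl_status and schedule_crawl are made up for illustration, and it assumes that a successful request returns some 2xx status, which this README does not state explicitly.

```shell
#!/bin/sh
# Illustrative sketch, not part of the project.
# BASE_URL and both function names are assumptions.
BASE_URL="${BASE_URL:-https://localhost:3000}"

# Map an HTTP status code from POST /api/crawl to a short message.
# Treating any 2xx as success is an assumption.
explain_crawl_status() {
    case "$1" in
        400) echo "rejected: malformed payload or invalid URL" ;;
        409) echo "already pending" ;;
        2??) echo "scheduled" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# POST the payload and print the explanation for the returned status.
# The JSON is built by plain string interpolation, so the URL must not
# contain double quotes or backslashes.
schedule_crawl() {
    url="$1"
    throttle="${2:-50}"
    status=$(curl -s -o /dev/null -w '%{http_code}' -XPOST \
        -d "{\"url\": \"$url\", \"throttle\": $throttle}" \
        "$BASE_URL/api/crawl")
    explain_crawl_status "$status"
}
```

After sourcing the script, `schedule_crawl https://example.com 50` would print one of the messages above.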

Get results of the crawl

GET /api/results?id={id}

Response:

A JSON list of retrieved URLs

Additional status codes:

202 - if the crawl is pending and the result is not yet available
404 - if the id is not present in the results cache
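
Since a pending crawl answers with 202, a client will usually want to poll until the results are ready. The following is a rough sketch under a few assumptions: BASE_URL, next_action and wait_for_results are hypothetical names, the retry delay of 2 seconds is arbitrary, and a status of 200 is assumed to mean that the results are available.

```shell
#!/bin/sh
# Illustrative sketch, not part of the project.
# BASE_URL and both function names are assumptions.
BASE_URL="${BASE_URL:-https://localhost:3000}"

# Decide what to do for a status code from GET /api/results.
# Treating 200 as "results available" is an assumption.
next_action() {
    case "$1" in
        200) echo "ready" ;;
        202) echo "retry" ;;
        404) echo "unknown-id" ;;
        *)   echo "error" ;;
    esac
}

# Poll until the crawl has finished, then print the JSON list of URLs.
wait_for_results() {
    id="$1"
    delay="${2:-2}"
    out=$(mktemp)
    while :; do
        status=$(curl -s -o "$out" -w '%{http_code}' \
            "$BASE_URL/api/results?id=$id")
        case "$(next_action "$status")" in
            ready) cat "$out"; rm -f "$out"; return 0 ;;
            retry) sleep "$delay" ;;
            *)     rm -f "$out"
                   echo "failed with status $status" >&2
                   return 1 ;;
        esac
    done
}
```

After sourcing the script, `wait_for_results https://example.com` would block until the crawl is done or fail on a 404.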

Get number of results of the crawl

GET /api/results/count?id={id}

Response:

{
    "https://example.com": 123
}

Additional status codes:

202 - if the crawl is pending and the result is not yet available
404 - if the id is not present in the results cache
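
Because the count is keyed by the crawl id, a script needs one extra step to get at the bare number. A minimal sketch, assuming the response contains exactly one key as in the example above; extract_count is a hypothetical helper, and a real JSON parser would be more robust than this sed expression.

```shell
#!/bin/sh
# Illustrative sketch, not part of the project.
# extract_count is a hypothetical helper name.

# Read a single-key JSON object such as {"https://example.com": 123}
# from stdin and print only the number.
extract_count() {
    tr -d '\n' |
        sed -n 's/.*:[[:space:]]*\([0-9][0-9]*\)[[:space:]]*}.*/\1/p'
}
```

It could be combined with the request like this: `curl -s 'https://localhost:3000/api/results/count?id=https://example.com' | extract_count`.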
