PLNCrawler

A web crawler for dataset creation

PLNCrawler is a web crawler for the automated creation of datasets for natural language processing. The sites covered are Sensacionalista, The piauí Herald and Estadão; their datasets are available in the datasets folder of this repository.

About the article published with reference to this repository: two of the websites used previously were discontinued because (1) Nexo Jornal started requiring registration to view its content, and (2) HuffPost Brasil stopped publishing content and access to all of its news was limited. These two websites were replaced by Estadão.

Dependencies

How to use

This repository is best used with the pipenv package. If you prefer not to use it, install the necessary dependencies however you like, following the versions listed in the packages section of the Pipfile.

The following installation tutorial is based on pipenv. If you don't have it, install it with pip.

> pip install pipenv

Clone or download the repository.

https://github.com/schuberty/PLNCrawler.git

From the directory of the cloned repository, create a new virtual environment with the correct dependencies from the Pipfile using the following command:

> pipenv install

Next, activate the Pipenv shell.

> pipenv shell

This will spawn a new shell subprocess, which can be deactivated by using:

(env) > exit

README in development
