English | Portuguese
PLNCrawler is a web crawler for automatically building datasets for use in natural language processing. It currently covers Sensacionalista, The piauí Herald, and Estadão; the resulting datasets are available in this repository's datasets folder.
About the article published with reference to this repository: some of the websites used previously were discontinued because (1) Nexo Jornal started requiring registration to view its content, and (2) HuffPost Brasil stopped publishing content and access to its news became limited. These two websites were replaced by Estadão.
The recommended way to use this repository is with the pipenv package. If you prefer not to use it, install the dependencies however you like, following the versions listed in the packages section of the Pipfile.
The installation steps below are based on pipenv. If you don't have it, install it with pip.
> pip install pipenv
Clone or download the repository.
https://github.com/schuberty/PLNCrawler.git
From the directory where the repository was cloned, create a new virtual environment with the correct dependencies from the Pipfile using the following command:
> pipenv install
Next, activate the Pipenv shell.
> pipenv shell
This spawns a new shell subprocess with the virtual environment active. To leave it, use:
(env) > exit