This repository has been archived by the owner on Apr 25, 2022. It is now read-only.
English | Portuguese

PLNCrawler

A web crawler for dataset creation


PLNCrawler is a web crawler focused on the automated creation of datasets for use in natural language processing. The crawled sites are Sensacionalista, The piauí Herald, and Estadão; their datasets are available in the datasets folder of this repository.
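Once downloaded, the datasets can be loaded with standard Python tooling. Below is a minimal sketch assuming a CSV layout with a text column and a source label; the actual file names and column names in the datasets folder may differ, so the sample rows here are purely illustrative:

```python
import csv
import io

# Hypothetical in-memory sample mimicking the assumed dataset layout:
# one text column and one label column naming the source site.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["text", "source"])  # assumed header names
writer.writerow(["Example satirical headline", "sensacionalista"])
writer.writerow(["Example news headline", "estadao"])
sample.seek(0)

# A real file from the datasets folder would be read the same way:
# with open("datasets/<some-dataset>.csv", newline="") as f: ...
rows = list(csv.DictReader(sample))
texts = [row["text"] for row in rows]
labels = [row["source"] for row in rows]
print(len(rows), labels)
```

The same `csv.DictReader` pattern works for any of the dataset files once the real header names are known.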

Note on the article published with reference to this repository: some of the websites used previously were dropped because 1) Nexo Jornal started requiring registration to view its content and 2) HuffPost Brasil stopped publishing content and restricted access to all of its news. These two websites were replaced by Estadão.

Dependencies

How to use

To make good use of this repository it is recommended to use the pipenv package. If you prefer not to use it, install the necessary dependencies however you like, following the versions listed in the packages section of the Pipfile.
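For orientation, a Pipfile's packages section has the shape shown below. The package names and versions here are illustrative assumptions only, not the repository's actual dependencies; consult the Pipfile in the repository for the real list.

```toml
# Illustrative Pipfile structure only — the actual package list and
# pinned versions live in the repository's Pipfile.
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
# hypothetical examples of pinned crawler dependencies
requests = "==2.24.0"
beautifulsoup4 = "==4.9.1"

[requires]
python_version = "3.8"
```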

The following installation steps are based on pipenv. If you don't have it, install it with pip:

> pip install pipenv

Clone or download the repository:

> git clone https://github.com/schuberty/PLNCrawler.git

From the directory where the repository was cloned, create a new virtual environment and install the dependencies from the Pipfile using the following command:

> pipenv install

Next, activate the Pipenv shell.

> pipenv shell

This will spawn a new shell subprocess, which can be deactivated by using:

(env) > exit

README in development
