GitHub

Short Text Classification

This repository provides tools to perform automated classification of public spending based on its text description. To this end, we use xgboost, and the doc2vec algorithm to perform text feature learning.

Data preparation

We use several years of manually labeled examples:

2011 - 51500 examples
2012 - 47777 examples
2013 - 31890 examples
2014 - 47260 examples
2015 - 51101 examples
2016 - 27655 examples

For a total of 256764 labeled examples.

Each of these examples include a paragraph describing the expenditure. A total amount for it. The state in which it took place. And a categorical variable indicating the "type" of expenditure it belongs to.

Aditionally, a data table corresponding to the year 2017 is available. It includes all the previously mentioned fields except the "type" (the examples are unlabeled). A total of 240178 belong to this data set.

First of all, columns names must match between years. To achieve this we perform the following substitutions:

Entidad -> Estado
FOLIO/Clave de Proyecto -> Folio
Presupuesto -> Total
Nombre de Proyecto -> Destino
Tipo de gasto -> Tipo

After this, all the tables are stacked (naturally the "type" variable for the year 2017 is left empty: NA).

Then we proceed to clean the corpus (the expenditure description paragraph) by:

Making everything lowe case
Removing numbers
Removing punctuation
Removing unnecessary white spaces
Removing all special characters

Feature engineering

First, the "type" variable has some typos that must be corrected:

"au" -> "AU"
"ed" -> "ED"
"re" -> "RE"
"S" -> "SA"

Second, the "state" variable shows some differences that also need to be corrected:

"Veracruz" -> "Veracruz de Ignacio de la Llave"
"Michoacán" -> "Michoacán de Ocampo"
"AGS" -> "Aguascalientes"
"Quintana Roo." -> "Quintana Roo"
"Distrito Federal" -> "Ciudad de México"
"Coahuila" -> "Coahuila de Zaragoza"

Then the "state" variable is transformed into a set of 32 dummy variables to be able to enter the XGBoost model.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
data		data
evaluation_stats		evaluation_stats
grid_search_results		grid_search_results
models		models
.gitignore		.gitignore
README.md		README.md
classifier_hyperparameter_gridsearch.R		classifier_hyperparameter_gridsearch.R
classify_new_cases.R		classify_new_cases.R
create_classifier_final.R		create_classifier_final.R
create_classifier_with_weights.R		create_classifier_with_weights.R
create_clean_corpus.R		create_clean_corpus.R
tools.R		tools.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Short Text Classification

Data preparation

Feature engineering

About

Releases

Packages

Contributors 2

Languages

jequihua/short_text_classification

Folders and files

Latest commit

History

Repository files navigation

Short Text Classification

Data preparation

Feature engineering

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages