Draft for getting-started-preprocessing #183

Draft · wants to merge 9 commits into master

Changes from 1 commit
pre-processing draft schema
Giovanni Liotta authored and Giovanni Liotta committed Sep 11, 2020
commit f303f0bf8792280353e892e3f80cd395e46bebb5
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"python.pythonPath": "/Users/giovanniliotta/opt/anaconda3/envs/texthero/bin/python"
}
251 changes: 251 additions & 0 deletions test_gio/FirstTest.ipynb
@@ -0,0 +1,251 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import texthero as hero\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "/Users/giovanniliotta/Dev/texthero/test_gio\n"
}
],
"source": [
"!pwd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\n",
" \"../dataset/bbcsport.csv\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " text topic\n0 Claxton hunting first major medal\\n\\nBritish h... athletics\n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>topic</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Claxton hunting first major medal\\n\\nBritish h...</td>\n <td>athletics</td>\n </tr>\n <tr>\n <th>1</th>\n <td>O'Sullivan could run in Worlds\\n\\nSonia O'Sull...</td>\n <td>athletics</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 4
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df['clean_text'] = hero.clean(df['text'])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " text topic \\\n0 Claxton hunting first major medal\\n\\nBritish h... athletics \n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics \n\n clean_text \n0 claxton hunting first major medal british hurd... \n1 sullivan could run worlds sonia sullivan indic... ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>topic</th>\n <th>clean_text</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Claxton hunting first major medal\\n\\nBritish h...</td>\n <td>athletics</td>\n <td>claxton hunting first major medal british hurd...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>O'Sullivan could run in Worlds\\n\\nSonia O'Sull...</td>\n <td>athletics</td>\n <td>sullivan could run worlds sonia sullivan indic...</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"class Category:\n",
" BOOKS= \"BOOKS\"\n",
" CLOTHING = \"CLOTHING\"\n",
" \n",
"train_x = [\"I love the book\", \"this is a great book\", \"the fit is great\", \"I love the shoes\"]\n",
"train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 0 0 0 1 0 1 0]\n",
" [1 0 1 1 0 0 0 1]\n",
" [0 1 1 1 0 0 1 0]\n",
" [0 0 0 0 1 1 1 0]]\n",
"['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']\n"
]
}
],
"source": [
"vectorizer = CountVectorizer()\n",
"train_x_vectors = vectorizer.fit_transform(train_x)\n",
"print(vectors.toarray())\n",
"print(vectorizer.get_feature_names())"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import svm"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(kernel='linear')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf_svm = svm.SVC(kernel='linear')\n",
"clf_svm.fit(train_x_vectors, train_y)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['BOOKS'], dtype='<U8')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_x = vectorizer.transform(['I like the book'])\n",
"\n",
"clf_svm.predict(test_x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.3 64-bit ('texthero': conda)",
"language": "python",
"name": "python38364bittextheroconda0d8aef95e9f34e7a99e6918fcf1a4326"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
7 changes: 7 additions & 0 deletions texthero.code-workspace
@@ -0,0 +1,7 @@
{
"folders": [
{
"path": "."
}
]
}
78 changes: 56 additions & 22 deletions website/docs/getting-started-preprocessing.md
@@ -4,66 +4,100 @@ id: getting-started-preprocessing

## Getting started with <span style="color: #ff8c42">pre-processing</span>

Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing is a necessary prerequisite for the success of any text-based analysis.

## Overview

--

## Intro

When we (as humans) read text from a book or a newspaper, the _input_ our brain receives to understand that text comes in the form of individual letters, which are then combined into words, sentences, paragraphs, etc.
The problem with having a machine read text is simple: the machine doesn't know how to read letters, words or paragraphs. What the machine knows how to read are _numerical vectors_.
Text data has good properties that allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require input text that is as clean and simple as possible, in other words **pre-processed**.
Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and resolving as many ambiguities as possible (so that, for instance, the verb "run" and its forms "ran", "runs" and "running" all refer to the same concept).
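
To make the idea of a numerical representation concrete, here is a minimal sketch of one common conversion method, a bag-of-words count matrix, using scikit-learn's `CountVectorizer` (not part of Texthero; shown only for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog ran"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())  # ['cat', 'dog', 'ran', 'sat', 'the']
print(vectors.toarray())               # [[1 0 0 1 1]
                                       #  [0 1 1 0 1]]
```

Every sentence becomes a row of word counts; the cleaner and simpler the input text, the more meaningful those counts are.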

How useful is this step?
Have you ever heard the story that Data Scientists typically spend ~80% of their time obtaining a proper dataset and the remaining ~20% actually analyzing it? Well, for text it is much the same. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to implement properly and unambiguously.

With Texthero it only takes one command!
To clean text data in a reliable way, all we have to do is:

```python
df['clean_text'] = hero.clean(df['text'])
```
or, alternatively, via Pandas method chaining, as sketched below.
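
A minimal sketch of the chained form, equivalent to the assignment above (`pipe` simply hands the Series on to `hero.clean`):

```python
df['clean_text'] = df['text'].pipe(hero.clean)
```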

> NOTE. In this section we use the same [BBC Sport Dataset](http://mlg.ucd.ie/datasets/bbc.html) as in **Getting Started**. To load the `bbc sport` dataset into a Pandas DataFrame, run:
```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
```

## Clean

Texthero's `clean` method allows a rapid implementation of the key cleaning steps, which are:

- Derived from a review of the relevant academic literature (#include citations)
- Validated by a group of NLP enthusiasts with applied experience in different contexts
- Accepted by the NLP community as standard and inescapable

The default pipeline performs the following steps:

| Step | Description |
|----------------------|--------------------------------------------------------|
|`fillna()` |Replace missing values with empty strings |
|`lowercase()` |Lowercase all text to make the analysis case-insensitive|
|`remove_digits()` |Remove numbers |
|`remove_punctuation()`|Remove punctuation symbols (!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~) |
|`remove_diacritics()` |Remove accents |
|`remove_stopwords()` |Remove the most common words ("i", "me", "myself", "we", "our", etc.) |
|`remove_whitespace()` |Remove extra whitespace between words |

all in just one command:

```python
df['clean_text'] = hero.clean(df['text'])
```
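
Each of these steps is also available as a standalone function in `texthero.preprocessing` that maps a Pandas Series to a Pandas Series. A minimal sketch of applying two individual steps (the printed outputs are illustrative):

```python
import pandas as pd
from texthero import preprocessing

s = pd.Series("Héllo, Wörld!")

# lowercase and remove accents, one step at a time
print(preprocessing.lowercase(s))          # 0    héllo, wörld!
print(preprocessing.remove_diacritics(s))  # 0    Hello, World!
```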

## Custom Pipeline

Sometimes, project specifics might require a different approach to pre-processing. For instance, you might decide that digits are important to your analysis if you are analyzing movies and one of them is "007-James Bond". Or, you might decide that in your specific setting stopwords carry relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That").
If this is the case, you can easily customize the pre-processing pipeline by selecting only the cleaning steps you need:

```python
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
```

or alternatively

```python
df['clean_text'] = df['text'].pipe(hero.clean, custom_pipeline)
```

In the above example we pre-process the text while keeping accents, digits and stop words.
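
As a quick sanity check that the custom pipeline behaves as intended, compare it against the default one on a string containing digits (illustrative input; the printed outputs are approximate):

```python
s = pd.Series("007-James Bond!")

print(hero.clean(s))                   # default pipeline drops digits too, e.g. "james bond"
print(hero.clean(s, custom_pipeline))  # digits survive, e.g. "007 james bond"
```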

##### Preprocessing API

Check out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs.
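
For instance, assuming the `get_default_pipeline()` helper that the preprocessing module exposes, you can inspect which functions `hero.clean` applies by default:

```python
from texthero import preprocessing

# print the name of each default cleaning step
for step in preprocessing.get_default_pipeline():
    print(step.__name__)
```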


If you are interested in learning more about text cleaning, check out these resources:
(#Links list)




## Customize it

Let's see how Texthero standardizes this step.


### Stemming
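
A minimal sketch of stemming, assuming Texthero's `stem` function (which, per the preprocessing API, applies a Snowball stemmer by default):

```python
s = pd.Series(["run", "runs", "running", "ran"])

# a stemmer maps inflected forms onto a common root; note that the
# irregular form "ran" is not collapsed by stemming alone
print(hero.stem(s))  # run, run, run, ran
```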