Draft for getting-started-preprocessing #183

Draft · wants to merge 9 commits into master

Changes from 1 commit
pre-processing draft schema
Giovanni Liotta authored and Giovanni Liotta committed Sep 11, 2020
commit f303f0bf8792280353e892e3f80cd395e46bebb5
3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
"python.pythonPath": "/Users/giovanniliotta/opt/anaconda3/envs/texthero/bin/python"
}
251 changes: 251 additions & 0 deletions test_gio/FirstTest.ipynb
@@ -0,0 +1,251 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import texthero as hero\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "/Users/giovanniliotta/Dev/texthero/test_gio\n"
}
],
"source": [
"!pwd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\n",
" \"../dataset/bbcsport.csv\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " text topic\n0 Claxton hunting first major medal\\n\\nBritish h... athletics\n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>topic</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Claxton hunting first major medal\\n\\nBritish h...</td>\n <td>athletics</td>\n </tr>\n <tr>\n <th>1</th>\n <td>O'Sullivan could run in Worlds\\n\\nSonia O'Sull...</td>\n <td>athletics</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 4
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df['clean_text'] = hero.clean(df['text'])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": " text topic \\\n0 Claxton hunting first major medal\\n\\nBritish h... athletics \n1 O'Sullivan could run in Worlds\\n\\nSonia O'Sull... athletics \n\n clean_text \n0 claxton hunting first major medal british hurd... \n1 sullivan could run worlds sonia sullivan indic... ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>text</th>\n <th>topic</th>\n <th>clean_text</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>Claxton hunting first major medal\\n\\nBritish h...</td>\n <td>athletics</td>\n <td>claxton hunting first major medal british hurd...</td>\n </tr>\n <tr>\n <th>1</th>\n <td>O'Sullivan could run in Worlds\\n\\nSonia O'Sull...</td>\n <td>athletics</td>\n <td>sullivan could run worlds sonia sullivan indic...</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"class Category:\n",
" BOOKS= \"BOOKS\"\n",
" CLOTHING = \"CLOTHING\"\n",
" \n",
"train_x = [\"I love the book\", \"this is a great book\", \"the fit is great\", \"I love the shoes\"]\n",
"train_y = [Category.BOOKS, Category.BOOKS, Category.CLOTHING, Category.CLOTHING]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 0 0 0 1 0 1 0]\n",
" [1 0 1 1 0 0 0 1]\n",
" [0 1 1 1 0 0 1 0]\n",
" [0 0 0 0 1 1 1 0]]\n",
"['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']\n"
]
}
],
"source": [
"vectorizer = CountVectorizer()\n",
"train_x_vectors = vectorizer.fit_transform(train_x)\n",
"print(vectors.toarray())\n",
"print(vectorizer.get_feature_names())"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import svm"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"SVC(kernel='linear')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf_svm = svm.SVC(kernel='linear')\n",
"clf_svm.fit(train_x_vectors, train_y)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['BOOKS'], dtype='<U8')"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_x = vectorizer.transform(['I like the book'])\n",
"\n",
"clf_svm.predict(test_x)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.3 64-bit ('texthero': conda)",
"language": "python",
"name": "python38364bittextheroconda0d8aef95e9f34e7a99e6918fcf1a4326"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3-final"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
7 changes: 7 additions & 0 deletions texthero.code-workspace
@@ -0,0 +1,7 @@
{
"folders": [
{
"path": "."
}
]
}
78 changes: 56 additions & 22 deletions website/docs/getting-started-preprocessing.md
@@ -4,66 +4,100 @@ id: getting-started-preprocessing

## Getting started with <span style="color: #ff8c42">pre-processing</span>

Pre-processing is a fundamental step in text analysis. Consistent, methodical and reproducible pre-processing is a necessary prerequisite for the success of any text-based analysis.

## Overview

--

## Intro

When we (as humans) read text from a book or a newspaper, the _input_ our brain receives to understand that text comes in the form of individual letters, which are then combined into words, sentences, paragraphs, etc.
The problem with having a machine read text is simple: the machine doesn't know how to read letters, words or paragraphs. What the machine knows how to read are _numerical vectors_.
Text data has good properties that allow its conversion into a numerical representation. There are several sophisticated methods to make this conversion but, in order to perform well, all of them require input text that is as clean and simple as possible, in other words **pre-processed**.
Pre-processing text basically means eliminating any unnecessary information (e.g. the machine does not need to know about punctuation, page numbers or spacing between paragraphs) and resolving as many ambiguities as possible (so that, for instance, the verb "run" and its forms "ran", "runs" and "running" all refer to the same concept).
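
To make the idea of a numerical representation concrete, here is a minimal sketch of one common conversion method, a bag-of-words count matrix, using scikit-learn's `CountVectorizer` (not part of Texthero; shown only for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog ran"]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names())  # ['cat', 'dog', 'ran', 'sat', 'the']
print(vectors.toarray())               # [[1 0 0 1 1]
                                       #  [0 1 1 0 1]]
```

Every sentence becomes a row of word counts; the cleaner and simpler the input text, the more meaningful those counts are.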

How useful is this step?
Have you ever heard the story that Data Scientists typically spend ~80% of their time obtaining a proper dataset and the remaining ~20% actually analyzing it? Well, for text it is much the same. Pre-processing is a **fundamental step** in text analysis and it usually takes some time to implement properly and unambiguously.

With Texthero it only takes one command!
To clean text data in a reliable way, all we have to do is:

```python
df['clean_text'] = hero.clean(df['text'])
```
or, alternatively, via Pandas method chaining, as sketched below.
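
A minimal sketch of the chained form, equivalent to the assignment above (`pipe` simply hands the Series on to `hero.clean`):

```python
df['clean_text'] = df['text'].pipe(hero.clean)
```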

> NOTE. In this section we use the same [BBC Sport Dataset](http://mlg.ucd.ie/datasets/bbc.html) as in **Getting Started**. To load the `bbc sport` dataset into a Pandas DataFrame, run:
```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
```

## Clean

Texthero's `clean` method allows a rapid implementation of the key cleaning steps, which are:

- Derived from a review of the relevant academic literature (#include citations)
- Validated by a group of NLP enthusiasts with applied experience in different contexts
- Accepted by the NLP community as standard and inescapable

The default pipeline performs the following steps:

| Step | Description |
|----------------------|--------------------------------------------------------|
|`fillna()` |Replace missing values with empty strings |
|`lowercase()` |Lowercase all text to make the analysis case-insensitive|
|`remove_digits()` |Remove numbers |
|`remove_punctuation()`|Remove punctuation symbols (!"#$%&'()*+,-./:;<=>?@[\]^_`{\|}~) |
|`remove_diacritics()` |Remove accents |
|`remove_stopwords()` |Remove the most common words ("i", "me", "myself", "we", "our", etc.) |
|`remove_whitespace()` |Remove extra whitespace between words |

all in just one command:

```python
df['clean_text'] = hero.clean(df['text'])
```
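
Each of these steps is also available as a standalone function in `texthero.preprocessing` that maps a Pandas Series to a Pandas Series. A minimal sketch of applying two individual steps (the printed outputs are illustrative):

```python
import pandas as pd
from texthero import preprocessing

s = pd.Series("Héllo, Wörld!")

# lowercase and remove accents, one step at a time
print(preprocessing.lowercase(s))          # 0    héllo, wörld!
print(preprocessing.remove_diacritics(s))  # 0    Hello, World!
```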

## Custom Pipeline

Sometimes, project specifics might require a different approach to pre-processing. For instance, you might decide that digits are important to your analysis if you are analyzing movies and one of them is "007-James Bond". Or, you might decide that in your specific setting stopwords carry relevant information (e.g. if your data is about music bands and contains "The Who" or "Take That").
If this is the case, you can easily customize the pre-processing pipeline by selecting only the cleaning steps you need:

```python
from texthero import preprocessing

custom_pipeline = [preprocessing.fillna,
                   preprocessing.lowercase,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]
df['clean_text'] = hero.clean(df['text'], custom_pipeline)
```

or alternatively

```python
df['clean_text'] = df['text'].pipe(hero.clean, custom_pipeline)
```

In the above example we pre-process the text while keeping accents, digits and stop words.
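
As a quick sanity check that the custom pipeline behaves as intended, compare it against the default one on a string containing digits (illustrative input; the printed outputs are approximate):

```python
s = pd.Series("007-James Bond!")

print(hero.clean(s))                   # default pipeline drops digits too, e.g. "james bond"
print(hero.clean(s, custom_pipeline))  # digits survive, e.g. "007 james bond"
```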

##### Preprocessing API

Check out the complete [preprocessing API](/docs/api-preprocessing) to discover how to customize the preprocessing steps according to your specific needs.
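
For instance, assuming the `get_default_pipeline()` helper that the preprocessing module exposes, you can inspect which functions `hero.clean` applies by default:

```python
from texthero import preprocessing

# print the name of each default cleaning step
for step in preprocessing.get_default_pipeline():
    print(step.__name__)
```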


If you are interested in learning more about text cleaning, check out these resources:
(#Links list)




## Customize it

Let's see how Texthero standardizes this step.


### Stemming
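
A minimal sketch of stemming, assuming Texthero's `stem` function (which, per the preprocessing API, applies a Snowball stemmer by default):

```python
s = pd.Series(["run", "runs", "running", "ran"])

# a stemmer maps inflected forms onto a common root; note that the
# irregular form "ran" is not collapsed by stemming alone
print(hero.stem(s))  # run, run, run, ran
```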