Getting started: Kind of Series (HeroSeries) #135

Open
wants to merge 14 commits into base: master
2 changes: 1 addition & 1 deletion .travis.yml
@@ -20,7 +20,7 @@ jobs:
env: PATH=/c/Python38:/c/Python38/Scripts:$PATH
install:
- pip3 install --upgrade pip # all three OSes agree about 'pip3'
-  - pip3 install black
+  - pip3 install black==19.10b0
- pip3 install ".[dev]" .
# 'python' points to Python 2.7 on macOS but points to Python 3.8 on Linux and Windows
# 'python3' is a 'command not found' error on Windows but 'py' works on Windows only
2 changes: 1 addition & 1 deletion setup.cfg
@@ -41,7 +41,7 @@ install_requires =
# TODO pick the correct version.
[options.extras_require]
dev =
-black>=19.10b0
+black==19.10b0
pytest>=4.0.0
Sphinx>=3.0.3
sphinx-markdown-builder>=0.5.4
153 changes: 153 additions & 0 deletions website/docs/getting-started-herotypes.md
@@ -0,0 +1,153 @@
---
id: getting-started-herotypes
title: Getting started - HeroTypes
---

<h1 align="center">HeroTypes</h1>

In Texthero, we're always working with Pandas Series and Pandas DataFrames to gain insights from text data! To make things easier and more intuitive, we distinguish between several types of Series/DataFrames, depending on where we are on the road to understanding our dataset.

<h2 align="center">Overview</h2>

When working with text data, it is easy to get overwhelmed by the many different functions that can be applied to the data. We want to make the whole journey as clear as possible. For example, when we start working with a new dataset, we usually want to do some preprocessing first. At the beginning, the data is in a DataFrame or Series where every document is one string. It might look like this:
```python
text
document_id
0 "Text in the first document"
1 "Text in the second document"
2 "Text in the third document"
3 "Text in the fourth document"
4 ...

```

Consequently, in Texthero's _preprocessing_ module, the functions usually take as input a Series where every cell is a string, and return as output a Series where every cell is a string. We will call this kind of Series _TextSeries_, so users know immediately what kind of Series the functions can work on. For example, you might see a function
```python
remove_punctuation(s: TextSeries) -> TextSeries
```
in the documentation. You then know that this function takes as input a _TextSeries_ and returns as output a _TextSeries_, so it can be used in the preprocessing phase of your work, where each document is one string.
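
For instance, a quick usage sketch (the exact output depends on how the function handles whitespace, so treat it as illustrative):
```python
>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Hello, world!", "Who's there?"])
>>> hero.remove_punctuation(s)
0    Hello  world
1    Who s there
dtype: object
```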

<h3 align="center">The HeroSeries Types</h3>

These are the three types currently supported by the library; almost all of the library's functions take as input and return as output one of these types (a rough way to tell them apart programmatically is sketched after the list):

1. **TextSeries**: Every cell is a text, i.e. a string. For example,
`pd.Series(["test", "test"])` is a valid TextSeries.

2. **TokenSeries**: Every cell is a list of words/tokens, i.e. a list
of strings. For example, `pd.Series([["test"], ["token2", "token3"]])` is a valid TokenSeries.

3. **VectorSeries**: Every cell is a vector representing text, i.e.
a list of floats. For example, `pd.Series([[1.0, 2.0], [3.0, 4.0]])` is a valid VectorSeries.
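
As a rough illustration of how one might distinguish these types programmatically (a hypothetical helper written for this tutorial, not part of the Texthero API):
```python
import pandas as pd

def hero_series_type(s: pd.Series) -> str:
    # Classify a Series by inspecting its first cell.
    first = s.iloc[0]
    if isinstance(first, str):
        return "TextSeries"
    if isinstance(first, list) and all(isinstance(x, str) for x in first):
        return "TokenSeries"
    if isinstance(first, list) and all(isinstance(x, (int, float)) for x in first):
        return "VectorSeries"
    raise ValueError("not a recognized HeroSeries type")

print(hero_series_type(pd.Series(["test", "test"])))                  # TextSeries
print(hero_series_type(pd.Series([["test"], ["token2", "token3"]])))  # TokenSeries
print(hero_series_type(pd.Series([[1.0, 2.0], [3.0, 4.0]])))          # VectorSeries
```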

Additionally, some Texthero functions (mostly those that accept a
_VectorSeries_ as input) also accept a Pandas _DataFrame_
as input representing a matrix, where every cell value
is one entry in the matrix. An example is
`pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["word1", "word2", "word3"])`.
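
To see how the types fit together, here is a sketch of a typical pipeline, with the intermediate types annotated in comments (function behaviour as described in the rest of this tutorial):
```python
import texthero as hero
import pandas as pd

s = pd.Series(["Text of first document", "Text of second document"])  # TextSeries

s = hero.clean(s)            # TextSeries  -> TextSeries
s = hero.tokenize(s)         # TextSeries  -> TokenSeries
df = hero.term_frequency(s)  # TokenSeries -> DataFrame (Document-Term Matrix)
s_pca = hero.pca(df)         # DataFrame   -> VectorSeries
```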

Now, if you see a function in the documentation that looks like this:
```python
tfidf(s: TokenSeries) -> DataFrame
```

then you know that the function takes a Pandas Series
whose cells are lists of strings (tokens) and will
return a Pandas DataFrame representing a matrix (in this case a [_Document-Term-Matrix_](https://en.wikipedia.org/wiki/Document-term_matrix)).
You might call it like this:
```python
>>> import texthero as hero
>>> import pandas as pd
>>> s = pd.Series(["Text of first document", "Text of second document"])
>>> df_tfidf = s.pipe(hero.tokenize).pipe(hero.tfidf)
>>> df_tfidf

Text document first of second
0 1.0 1.0 1.405465 1.0 0.000000
1 1.0 1.0 0.000000 1.0 1.405465
```


And this function:
```python
pca(s: Union[VectorSeries, DataFrame]) -> VectorSeries
```
needs a _DataFrame_ or _VectorSeries_ as input and always returns a _VectorSeries_.

<h2 align="center">The Types in Detail</h2>

We'll now have a closer look at each of the types and learn where they are used in a typical NLP workflow.

<h3 align="left">TextSeries</h3>

In a _TextSeries_, every cell is a string. As we saw at the beginning of this tutorial, this type is mostly used in preprocessing. It is very simple and allows us to easily clean a text dataset. Additionally, many NLP functions such as `named_entities`, `noun_chunks`, and `pos_tag` take a _TextSeries_ as input.
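
For instance, a sketch of calling `named_entities` (the entities come from the underlying NLP model, so the exact output may vary):
```python
>>> s = pd.Series(["Yesterday I was in NY with Bill de Blasio"])
>>> hero.named_entities(s)[0]
[('Yesterday', 'DATE', 0, 9), ('NY', 'GPE', 19, 21), ('Bill de Blasio', 'PERSON', 27, 41)]
```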

Example of a function that takes and returns a _TextSeries_:
```python
>>> s = pd.Series(["Text: of first! document", "Text of second ... document"])
>>> hero.clean(s)
0 text first document
1 text second document
dtype: object
```

<h3 align="left">TokenSeries</h3>

In a _TokenSeries_, every cell is a list of words/tokens. We use this to prepare our data for _representation_, i.e. to gain insights from it through mathematical methods. This is why the functions that initially transform your documents to vectors, namely `tfidf`, `term_frequency`, and `count`, take a _TokenSeries_ as input.

Example of a function that takes a _TextSeries_ and returns a _TokenSeries_:
```python
>>> s = pd.Series(["text first document", "text second document"])
>>> hero.tokenize(s)
0 [text, first, document]
1 [text, second, document]
dtype: object
```

<h3 align="left">VectorSeries</h3>

In a _VectorSeries_, every cell is a vector representing text. We use this when we have a low-dimensional (e.g. vectors with length <= 1000), dense (i.e. not a lot of zeroes) representation of our texts that we want to work on. For example, the dimensionality reduction functions `pca`, `nmf`, and `tsne` all take a high-dimensional representation of our text (in the form of a _DataFrame_ (see below) or a _VectorSeries_) and return a low-dimensional representation of our text in the form of a _VectorSeries_.

Example of a function that takes as input a _DataFrame_ or _VectorSeries_ and returns a _VectorSeries_:
```python
>>> s = pd.Series(["text first document", "text second document"]).pipe(hero.tokenize).pipe(hero.term_frequency)
>>> hero.pca(s)
0 [0.118, 0.0]
1 [-0.118, 0.0]
dtype: object
```

<h3 align="left">DataFrame</h3>

In Natural Language Processing, we are often working with matrices that contain information about our dataset. For example, the output of the functions `tfidf`, `count`, and `term_frequency` is a [Document-Term Matrix](https://en.wikipedia.org/wiki/Document-term_matrix), i.e. a matrix where each row is one document and each column is one term / word.

We use a Pandas DataFrame for this for two reasons:
1. It keeps row and column labels, so every column is named after its term.
2. It can be sparse.

The second reason is worth explaining in more detail: a big Document-Term Matrix might have 10,000 different terms, so 10,000 columns in our DataFrame. At the same time, most documents only contain a small subset of all the terms, so each row of the matrix consists mostly of zeros. This is why we use a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix): a sparse matrix only stores the non-zero fields. And Pandas DataFrames support sparse data, so Texthero users fully profit from the sparseness!

This is a massive advantage when dealing with *big datasets*: in a _sparse DataFrame_, we only store the relevant data, which saves a lot of time and space!
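
As a small, Texthero-independent illustration of the effect (exact byte counts vary with the pandas version):
```python
import numpy as np
import pandas as pd

dense = pd.DataFrame(np.zeros((1000, 1000)))
sparse = dense.astype(pd.SparseDtype("float", fill_value=0.0))

print(dense.memory_usage(deep=True).sum())   # on the order of 8 MB: every zero is stored
print(sparse.memory_usage(deep=True).sum())  # a few KB: zeros are not stored
```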

Let's look at an example with some more data.
```python
>>> data = pd.read_csv("https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv")
>>> data_count = data["text"].pipe(hero.count)
>>> data_count
! " "' ", # $ % ... £62m £6m £70m £7m £7million £80,000 £8m
0 0 5 0 0 0 0 0 ... 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
2 0 14 0 0 0 0 0 ... 0 0 0 0 0 0 0
3 0 10 0 0 0 0 0 ... 0 0 0 0 0 0 0
4 0 4 0 0 0 0 0 ... 0 0 0 0 0 0 0
.. .. .. .. .. .. .. .. ... ... ... ... ... ... ... ...
732 0 2 0 0 0 0 2 ... 0 0 0 0 0 0 0
733 0 6 0 0 0 0 0 ... 0 0 0 0 0 0 0
734 0 5 0 0 0 0 0 ... 0 0 0 0 0 0 0
735 0 14 0 0 0 0 0 ... 0 0 0 0 0 0 0
736 0 6 0 0 0 0 0 ... 0 0 0 0 0 0 0

>>> data_count.sparse.density
0.010792808715706939
```
We can see that only around 1% of our DataFrame `data_count` is filled with non-zero values, so using the sparse DataFrame saves us a lot of space.
3 changes: 2 additions & 1 deletion website/sidebars.json
@@ -1,7 +1,8 @@
 {
   "docs": {
     "Getting Started": [
-      "getting-started"
+      "getting-started",
+      "getting-started-herotypes"
     ]
},
"api": {