Add Automated Readability Index (ARI). Closes #20 #46

henrifroese · 2020-07-08T21:23:14Z

The function (added to visualization.py) returns a new Series in which each entry is the ARI of the corresponding entry in the given Series. It follows the formula and description from Wikipedia exactly. NumPy is imported so that NaN can be returned for invalid entries.

Also added unit tests to test_visualization.py.

Add ARI to visualization module. Add unit tests to test_visualization. Additionally import numpy in visualization and test_visualization to be able to return NaNs in Series.

jbesomi · 2020-07-09T08:15:25Z

Hi @hf2000510, thank you for your PR and welcome!

In general, Texthero's source code should be as minimal, fast and concise as possible. We can probably change the code a bit to make it clearer, easier to read and possibly faster.

Learn from textstat

Textstat, a Python toolkit that provides text statistics, has already implemented ARI. Their solution is a bit more concise:

def automated_readability_index(self, text):
    chrs = self.char_count(text)
    words = self.lexicon_count(text)
    sentences = self.sentence_count(text)
    try:
        a = float(chrs) / float(words)
        b = float(words) / float(sentences)
        readability = (
            (4.71 * legacy_round(a, 2))
            + (0.5 * legacy_round(b, 2))
            - 21.43)
        return legacy_round(readability, 1)
    except ZeroDivisionError:
        return 0.0

We can probably learn from them.
Notice also how they use char_count, lexicon_count and sentence_count.

Pandas way

If you look carefully at Texthero's functions, you will see that almost all of them avoid apply unless it is strictly necessary, as it is slower than the built-in Pandas functions.

For the ARI function, we can probably obtain the same results by first computing the Pandas Series characters_s, words_s and sentence_s and then, using broadcasting as in NumPy, computing ari_s = 4.71 * characters_s / ...
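As a rough illustration of the broadcasting idea, here is a minimal sketch that assumes the three count Series have already been computed. The helper name ari_from_counts and the sample counts are made up for this example and are not part of Texthero. The constants are the ones from the textstat snippet above.

```python
import pandas as pd


def ari_from_counts(characters_s, words_s, sentence_s):
    """Hypothetical helper: element-wise ARI from three aligned count Series."""
    # Mask zero counts as NaN so the divisions below give NaN instead of inf.
    words_s = words_s.where(words_s > 0)
    sentence_s = sentence_s.where(sentence_s > 0)
    # Arithmetic between Series is applied element-wise, so no apply is needed.
    return 4.71 * (characters_s / words_s) + 0.5 * (words_s / sentence_s) - 21.43


# Made-up counts for three documents; the second document is empty.
characters_s = pd.Series([10, 0, 42])
words_s = pd.Series([2, 0, 7])
sentence_s = pd.Series([1, 0, 2])

ari_s = ari_from_counts(characters_s, words_s, sentence_s)
print(ari_s)
```

Unlike the textstat version, this sketch returns NaN rather than 0.0 for entries where a count is zero, which matches the NaN behaviour described in the PR.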

x_s is a convention used in Texthero's code to show that the variable is a Pandas Series. P.S. If you spot some code that does not use this convention, feel free to open a PR.

Check input is string

Using preprocessing.remove_whitespace to check whether the input is a Pandas Series of strings is not the best solution: it is not clear why it works as a check, and the function was not designed for that purpose.

Instead, we should probably use pandas.api.types.is_string_dtype from Pandas.
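A minimal sketch of such a check could look like the following. The helper name check_is_text_series is hypothetical and only serves the example. Note that how is_string_dtype treats object-dtype Series can differ between Pandas versions, so this should be read as a first approximation rather than a complete input validation.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def check_is_text_series(s):
    """Hypothetical helper: raise TypeError unless s is a Series of strings."""
    if not isinstance(s, pd.Series):
        raise TypeError("Input must be a Pandas Series.")
    # is_string_dtype inspects the dtype of the Series (object or string dtype).
    if not is_string_dtype(s):
        raise TypeError("Input Series must contain strings.")
    return s


text_s = check_is_text_series(pd.Series(["Hello world.", "Texthero is nice."]))

try:
    check_is_text_series(pd.Series([1, 2, 3]))
    rejected_numbers = False
except TypeError:
    rejected_numbers = True

print(rejected_numbers)
```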

Having a solid knowledge of Pandas helps when contributing to Texthero. If you haven't done so already, I encourage you to have a look at the Pandas API in detail. These are good pages to check: General utility functions and Pandas Series string handling.

Extra comments

To implement this function correctly, we basically need to solve three sub-tasks:

words_s is the number of spaces, so we need to count them. This is just s.str.split().str.len() - 1, right?
characters_s is the number of characters. This is s.str.len().
sentence_s is the number of sentences. This is more subtle to compute and we need spaCy, right? Splitting by . and counting would not be enough (example: Stop, F.B.I, do not move. ...)
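The first two sub-tasks, and the reason the third one needs more than counting dots, can be sketched with plain Pandas string methods. The sample texts and variable names below are chosen for illustration only. Keep in mind that s.str.split().str.len() - 1 counts the gaps between whitespace-separated tokens, which is one less than the number of tokens.

```python
import pandas as pd

s = pd.Series(["Stop, F.B.I, do not move.", "Hello world. How are you?"])

# Number of gaps between whitespace-separated tokens (token count minus one).
spaces_s = s.str.split().str.len() - 1

# Total number of characters, including spaces and punctuation.
characters_s = s.str.len()

# Naive sentence count: the number of "." characters in each text.
naive_sentence_s = s.str.count(r"\.")

print(pd.DataFrame({
    "spaces": spaces_s,
    "characters": characters_s,
    "naive_sentences": naive_sentence_s,
}))
```

The naive count reports three sentences for the first text, which is a single sentence, and one sentence for the second text, which has two. This is why a proper sentence splitter is suggested for sentence_s.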

For sentence_s, it would be interesting to add a hero function hero.count_sentences that does exactly that. We might therefore want to first open a PR that adds this new function and only later implement the ARI function.

For words_s and characters_s, an extra hero function is probably not necessary, but not everyone knows these tricks. For this part, the idea is to write a "how-to" tutorial on the blog that explains how to get the most out of Pandas.

This was quite a long PR review; I hope you got some interesting and useful hints.

Let me know your feedback!
Regards,

henrifroese · 2020-07-09T11:01:29Z

Thanks for your help! I've converted this to a draft pull request and opened #51 to first implement a count_sentences function, which makes sense independently, and also makes the automated readability index implementation easier. Will finish this when count_sentences is done.

* Add count_sentences function to nlp.py. Also add tests for the function to test_nlp.py.
* Implement suggestions from pull request. Add more tests, change style (docstring, tests naming). Remove unicode-casting to avoid unexpected behaviour.
* Add link to spacy documentation. Additionally update index tests, they're cleaner now.
Co-authored-by: Henri Froese <[email protected]>

New pull request from jbesomi#46 as we had some Git problems. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn <[email protected]>

henrifroese · 2020-07-12T20:05:40Z

We had some Git trouble (as you can probably see above 🥉), so we closed this and moved the PR to #74. Sorry about that.

Add Automated Readability Index (ARI). Closes jbesomi#20

d850ce2

jbesomi linked an issue Jul 9, 2020 that may be closed by this pull request

Add automated_readability_index(s) under visualization #20

Open

henrifroese marked this pull request as draft July 9, 2020 09:53

henrifroese mentioned this pull request Jul 9, 2020

Add count_sentences function to nlp.py #51

Merged

jbesomi and others added 15 commits July 12, 2020 20:35

Add MIT license

cecfbf8

Add MIT license without url

ad25ddc

Update README.md

5751f22

Website: fix github stars button

6186788

Website: add css media queries for better responsiveness on mobile

eb1164e

Added Remove Tags and Replace Tags (jbesomi#50)

7ac1649

* Added Remove Tags and Replace Tags * removed contributor

remove_tags and replace_tags: improve docstring

6045edf

PR for contributor addition. (jbesomi#52)

19925ee

* README.md * updated README

Update CONTRIBUTING.md

4337b07

Fix language name (jbesomi#53)

301822d

Preprocessing removed the capturing group from regex => it was unnece…

fcb286e

…ssary (jbesomi#64)

added replace hashtags and remove hashtag (jbesomi#58)

81411c2

* added replace hashtags and remove hashtag * Fixed the Documentation * Preprocessing Hashtag Regex as a raw string

Merge remote-tracking branch 'origin/master'

79a0805

Improve automated_readability_index.

7e1ad2f

Now incorporates suggested changes. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn <[email protected]>

henrifroese mentioned this pull request Jul 12, 2020

Implement Automated Readability Index, Closes #20 ; new PR; Waiting until Checking for NaNs is implemented. #74

Draft

henrifroese closed this Jul 12, 2020

henrifroese deleted the Automated_Readability_Index branch July 12, 2020 20:06
