Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Automated Readability Index (ARI). Closes #20 #46

Conversation

henrifroese
Copy link
Collaborator

The function (added to visualization.py) returns a new series where each entry corresponds to the ARI of the given series at that position. It uses exactly the wikipedia formula & description. Numpy is imported so NaNs can be returned for invalid entries.

Also added unit tests to test_visualization.py.

Add ARI to visualization module. Add unit tests to test_visualization.
Additionally import numpy in visualization and test_visualization to be
able to return NaNs in Series.
@jbesomi
Copy link
Owner

jbesomi commented Jul 9, 2020

Hi @hf2000510, thank you for your PR and welcome!

In general, Texthero's source code should be as minimal, as fast and as concise as possible. Probably, we can change a bit the code to make it more clear and easy to read and probably faster.

Learn from textstat

Textstat, a python toolkit that provide some text statistics has already implemented ari. Their solution is a bit more concise:

def automated_readability_index(self, text):
        chrs = self.char_count(text)
        words = self.lexicon_count(text)
        sentences = self.sentence_count(text)
        try:
            a = float(chrs) / float(words)
            b = float(words) / float(sentences)
            readability = (
                    (4.71 * legacy_round(a, 2))
                    + (0.5 * legacy_round(b, 2))
                    - 21.43)
            return legacy_round(readability, 1)
        except ZeroDivisionError:
            return 0.0

We can probably learn from them.
Notice also how they use char_count, lexicon_count and sentence_count.

Pandas way

If you look carefully at almost all Texthero's functions, when not strictly necessary, we try to avoid using apply as this is slower compared to the built-in Pandas function.

For the ARI function, we can probably obtain the same results by first computing the Pandas Series for characters_s, words_s and sentence_s and then, using broadcasting as in numpy, compute ari_s = 4.71 * character_s/ ...

x_s is a convention used in Textehero's code to show that the variable is a Pandas Series. p.s if you spot some code that does not use this convention, feel free to open a PR.

Check input is string

Using preprocessing.remove_whitespace for checking if the input is a Pandas Series of a string is not the best solution as is not very clear why and also as it uses a function not designed for that.

Instead, we should probably use pandas.api.types.is_string_dtype from Pandas: link.

Having a solid knowledge of Pandas might help when contributing to Texthero. If you haven't already done, I encourage you to have a look at the Pandas API in details. This are good pages to check: General utility functions and Pandas Series string handling.

Extra comments

For correctly achieving this function, we need to implement basically three sub-tasks:

  1. words_s is the number of spaces, so we need to count this. This is just s.str.split().str.len() -1, right?
  2. characters_ is the number of characters. This is s.str.len()
  3. sentence_s is the number of sentences. This is more subtle to compute and we need spaCy, right? Splitting by . and counts would not be enough (example: Stop, F.B.I, do not move. ...)

For sentence_s would be interesting to add a hero function hero.count_sentences that does exactly that. We might, therefore, want to first add a PR that adds this new function and only later implement the ari function.

For words_s and characters_s probably an extra hero function is not necessary but probably not everyone knows these tricks. For this part, the idea is to write a "how-to" tutorial on the blog that explains how to profit the most of Pandas.

This was quite a long PR review, hope you got some interesting and useful hints.

Let me know your feedback!
Regards,

@jbesomi jbesomi linked an issue Jul 9, 2020 that may be closed by this pull request
@henrifroese henrifroese marked this pull request as draft July 9, 2020 09:53
@henrifroese
Copy link
Collaborator Author

Thanks for your help! I've converted this to a draft pull request and opened #51 to first implement a count_sentences function, which makes sense independently, and also makes the automated readability index implementation easier. Will finish this when count_sentences is done.

jbesomi and others added 15 commits July 12, 2020 20:35
* Added Remove Tags and Replace Tags

* removed contributor
* README.md

* updated README
* added replace hashtags and remove hashtag

* Fixed the Documentation

* Preprocessing Hashtag Regex as a raw string
* Add count_sentences function to nlp.py

Also add tests for the function to test_nlp.py

* Implement suggestions from pull request.

Add more tests, change style (docstring, tests naming).
Remove unicode-casting to avoid unexpected behaviour.

* Add link to spacy documentation.

Additionally update index tests, they're cleaner now.

Co-authored-by: Henri Froese <[email protected]>
Now incorporates suggested changes.

Input checking done with pd.api.types.is_string_dtype. Not a
permanent solution, will be improved by jbesomi#60 etc.

Co-authored-by: Maximilian Krahn <[email protected]>
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this pull request Jul 12, 2020
New pull request from jbesomi#46 as we had some Git problems.

Input checking done with pd.api.types.is_string_dtype. Not a
permanent solution, will be improved by jbesomi#60 etc.

Co-authored-by: Maximilian Krahn <[email protected]>
@henrifroese
Copy link
Collaborator Author

We had some Git trouble (as you can probably see above 🥉 ) so we closed this and moved the PR to #74 , sorry about that :neckbeard:

@henrifroese henrifroese deleted the Automated_Readability_Index branch July 12, 2020 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add automated_readability_index(s) under visualization
4 participants