Readability

The readability component adds the following readability metrics to Doc objects under the ._.readability attribute.

Note

The hyphenation module (Pyphen) does not support all languages. If the language is not supported, a warning is raised and metrics that require hyphenation are set to np.nan.

`Gunning-Fog <https://en.wikipedia.org/wiki/Gunning_fog_index>`__ is a readability index originally developed for English writing, but it can be computed for any language. The index estimates the years of formal education needed to understand the text on a first reading. A Gunning-Fog index of 12 corresponds to the reading level of a U.S. high school senior (around 18 years old). The formula for calculating the index is:

Grade level = 0.4 × (ASL + PHW)

Where ASL is the average sentence length (total words / total sentences), and PHW is the percentage of hard words (words with three or more syllables).

Note: requires hyphenation.
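As a worked example, the formula can be sketched in plain Python. The function name and arguments are illustrative and not part of the textdescriptives API. The counts (35 words, 5 sentences, 1 word with three or more syllables) were taken by hand from the sample text in the Usage section, so treat them as assumptions.

```python
def gunning_fog(n_words: int, n_sentences: int, n_hard_words: int) -> float:
    """Gunning-Fog grade level from raw counts (illustrative helper)."""
    asl = n_words / n_sentences          # average sentence length
    phw = 100 * n_hard_words / n_words   # percentage of hard words
    return 0.4 * (asl + phw)


# Hand-counted values for the sample text (assumed, not library output).
print(round(gunning_fog(n_words=35, n_sentences=5, n_hard_words=1), 2))  # 3.94
```

Note that PHW is a percentage (0 to 100), not a fraction, which is why the hard-word ratio is multiplied by 100.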
`SMOG <https://en.wikipedia.org/wiki/SMOG>`__, or Simple Measure of Gobbledygook, is a readability formula that estimates the years of education required to understand a piece of writing. It focuses on word complexity, using the number of polysyllabic (hard) words in the text. The formula is:

SMOG Index = 1.043 × √(30 × (hard words / n_sentences)) + 3.1291

Note: requires hyphenation.
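A minimal sketch of the SMOG formula, assuming "hard words" means words with three or more syllables. The helper name is hypothetical and the counts are hand-derived from the sample text in the Usage section.

```python
import math


def smog(n_hard_words: int, n_sentences: int) -> float:
    """SMOG index from the number of polysyllabic words and sentences."""
    return 1.043 * math.sqrt(30 * (n_hard_words / n_sentences)) + 3.1291


# One polysyllabic word across five sentences (assumed counts).
print(round(smog(n_hard_words=1, n_sentences=5), 2))  # 5.68
```

With no polysyllabic words at all, the index bottoms out at the constant 3.1291.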
`Flesch reading ease <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease>`__ is a readability score that indicates how easy a text is to read. Higher scores indicate easier reading, while lower scores indicate more difficult reading. The score is calculated using the following formula:

Flesch Reading Ease = 206.835 - (1.015 × ASL) - (84.6 × ASW)

Where ASL is the average sentence length and ASW is the average number of syllables per word.

Note: requires hyphenation.
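The formula can be illustrated with a small helper. The name and signature are assumptions made for this sketch. The syllable count of 38 for the sample text in the Usage section is a hand count, and a hyphenation-based counter such as Pyphen may count differently on other texts.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch reading ease from raw counts (illustrative helper)."""
    asl = n_words / n_sentences   # average sentence length
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - (1.015 * asl) - (84.6 * asw)


# 35 words, 5 sentences, 38 syllables (assumed hand counts).
print(round(flesch_reading_ease(35, 5, 38), 2))  # 107.88
```

Short sentences made of mostly one-syllable words can push the score above 100, as this example shows.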
`Flesch-Kincaid grade <https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level>`__ is a readability metric that estimates the U.S. grade level needed to comprehend a text. It is based on the average sentence length (ASL) and the average number of syllables per word (ASW). The formula is:

Flesch-Kincaid Grade = 0.39 × (ASL) + 11.8 × (ASW) - 15.59

Note: requires hyphenation.
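A hedged sketch of the grade formula, reusing the same hand-derived counts for the sample text in the Usage section (35 words, 5 sentences, 38 syllables). The function is hypothetical and only mirrors the formula above.

```python
def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (illustrative helper)."""
    asl = n_words / n_sentences   # average sentence length
    asw = n_syllables / n_words   # average syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59


# Very simple text can yield a grade at or slightly below zero.
print(round(flesch_kincaid_grade(35, 5, 38), 2))  # -0.05
```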
`Automated readability index <https://en.wikipedia.org/wiki/Automated_readability_index>`__ is a readability test that calculates an approximate U.S. grade level needed to understand a text. It is based on the average number of characters per word and the average sentence length. The formula is:

ARI = 4.71 × (n_chars / n_words) + 0.5 × (n_words / n_sentences) - 21.43
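The following sketch evaluates the formula for the sample text in the Usage section. It assumes that n_chars counts only the characters inside word tokens (115 by hand count, punctuation excluded). That is an assumption about the inputs, not a statement about how the library counts characters.

```python
def automated_readability_index(n_chars: int, n_words: int, n_sentences: int) -> float:
    """ARI from raw counts (illustrative helper, not the library API)."""
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sentences) - 21.43


# 115 characters in 35 words across 5 sentences (assumed hand counts).
print(round(automated_readability_index(115, 35, 5), 2))  # -2.45
```

Because it needs only character counts, this metric does not depend on hyphenation.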
`Coleman-Liau index <https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index>`__ is a readability test that estimates the U.S. grade level needed to understand a text. It is based on the average number of letters per 100 words and the average number of sentences per 100 words. The original formula is:

CLI = 0.0588 × L - 0.296 × S - 15.8

Where L is the average number of characters per 100 words and S is the average number of sentences per 100 words. In our implementation we average over the entire text instead of just 100 words.
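One way to read "averaging over the entire text" is to scale the whole-text ratios to a per-100-words basis, as sketched below. The helper and its arguments are hypothetical, and the counts are hand-derived from the sample text in the Usage section.

```python
def coleman_liau_index(n_chars: int, n_words: int, n_sentences: int) -> float:
    """Coleman-Liau index with L and S computed over the whole text."""
    L = n_chars / n_words * 100       # characters per 100 words
    S = n_sentences / n_words * 100   # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8


# 115 characters, 35 words, 5 sentences (assumed hand counts).
print(round(coleman_liau_index(115, 35, 5), 2))  # -0.71
```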
`Lix <https://en.wikipedia.org/wiki/Lix_(readability_test)>`__, or Läsbarhetsindex, is a readability measure based on the average sentence length and the percentage of long words (more than six characters) in the text. The formula is:

Lix = (n_words / n_sentences) + (n_long_words × 100) / n_words
`Rix <https://www.jstor.org/stable/40031755>`__ is a readability measure that estimates the difficulty of a text based on the number of long words (more than six characters) per sentence. The formula is:

Rix = (n_long_words / n_sentences)
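Lix and Rix need only word lengths, so both can be sketched directly from raw text. The example below uses a naive regex tokenizer and sentence splitter as a stand-in. This is an assumption for illustration and not how the spaCy pipeline segments text.

```python
import re


def lix_and_rix(text: str) -> tuple[float, float]:
    """Compute (Lix, Rix) using naive regex-based word and sentence splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    n_long_words = sum(1 for w in words if len(w) > 6)  # more than six characters
    lix = len(words) / len(sentences) + (n_long_words * 100) / len(words)
    rix = n_long_words / len(sentences)
    return lix, rix


text = (
    "The world is changed. I feel it in the water. I feel it in the earth. "
    "I smell it in the air. Much that once was is lost, for none now live "
    "who remember it."
)
lix, rix = lix_and_rix(text)
print(round(lix, 2), round(rix, 2))  # 12.71 0.4
```

Here the two long words are "changed" and "remember", giving 2 long words over 35 words and 5 sentences.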

Usage

import spacy
import textdescriptives as td
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textdescriptives/readability")
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# all attributes are stored as a dict in the ._.readability attribute
doc._.readability

# extract to dataframe
td.extract_df(doc)

	text	flesch_reading_ease	flesch_kincaid_grade	smog	gunning_fog	automated_readability_index	coleman_liau_index	lix	rix	token_length_mean	token_length_median	token_length_std	sentence_length_mean	sentence_length_median	sentence_length_std	syllables_per_token_mean	syllables_per_token_median	syllables_per_token_std	n_tokens	n_unique_tokens	proportion_unique_tokens	n_characters	n_sentences
0	The world is changed(...)	107.879	-0.0485714	5.68392	3.94286	-2.45429	-0.708571	12.7143	0.4	3.28571	3	1.54127	7	6	3.09839	1.08571	1	0.368117	35	23	0.657143	121	5

Component

.. autofunction:: textdescriptives.components.readability.create_readability_component
