Text Stats

Wordview provides an overview of your text data, along with general statistics, distributions, and plots, via the TextStatsPlots class. To get started, import TextStatsPlots and instantiate it with your dataset. The examples below assume the resulting object is named ta.

Overview

Use the show_stats method to see a set of summary statistics about your dataset.

ta.show_stats()
┌───────────────────┬─────────┐
│ Language/s        │ EN      │
├───────────────────┼─────────┤
│ Unique Words      │ 48,791  │
├───────────────────┼─────────┤
│ All Words         │ 666,898 │
├───────────────────┼─────────┤
│ Documents         │ 5,000   │
├───────────────────┼─────────┤
│ Median Doc Length │ 211.0   │
├───────────────────┼─────────┤
│ Nouns             │ 28,482  │
├───────────────────┼─────────┤
│ Adjectives        │ 19,519  │
├───────────────────┼─────────┤
│ Verbs             │ 15,241  │
└───────────────────┴─────────┘
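To make the rows of this table concrete, here is a minimal standard-library sketch of how counts such as Documents, All Words, Unique Words, and Median Doc Length could be derived from a list of strings. It is not Wordview's implementation: the function name overview_stats and the regex tokenizer are assumptions made for illustration, and a different tokenizer would give different counts.

```python
import re
import statistics


def overview_stats(docs):
    """Compute a few corpus-level statistics from a list of strings.

    Hypothetical helper for illustration; not part of Wordview.
    """
    # Naive tokenization (an assumption): lowercase the text, then keep
    # runs of letters and apostrophes as words.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in docs]
    all_words = [tok for doc in tokenized for tok in doc]
    return {
        "documents": len(docs),
        "all_words": len(all_words),
        "unique_words": len(set(all_words)),
        "median_doc_length": statistics.median(len(doc) for doc in tokenized),
    }


docs = ["The cat sat on the mat.", "A dog barked.", "The end"]
stats = overview_stats(docs)
print(stats)
```

The part-of-speech rows (Nouns, Adjectives, Verbs) are left out here because they require a tagger.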

Distributions

You can inspect different distributions using the show_distplot method. For instance, the distribution of document lengths can help you choose a sequence length for sequence models with a fixed-size input, or a padding length for mini-batch training.

ta.show_distplot(plot='doc_len')

[Figure: distribution of document lengths]
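One common way to turn a document-length distribution into a sequence length is to pick a length that covers most documents rather than the longest one. The sketch below uses a nearest-rank percentile for that purpose. The function name suggest_max_len, the coverage parameter, and the sample lengths are all hypothetical, and the right coverage depends on your model and data.

```python
import math


def suggest_max_len(doc_lengths, coverage=0.95):
    """Return the smallest observed length that covers at least
    `coverage` of the documents (nearest-rank percentile).

    Hypothetical helper for illustration; not part of Wordview.
    """
    if not doc_lengths:
        raise ValueError("doc_lengths must not be empty")
    ordered = sorted(doc_lengths)
    # Nearest-rank method: 1-based rank of the percentile value.
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


# Made-up document lengths (in tokens) with one long outlier.
lengths = [12, 40, 55, 60, 75, 90, 120, 180, 211, 950]
max_len = suggest_max_len(lengths, coverage=0.9)
truncated = sum(1 for n in lengths if n > max_len)
print(max_len, truncated)
```

Choosing a percentile instead of the maximum keeps a single outlier from inflating the padded size of every batch, at the cost of truncating the few documents that exceed it.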

Or, you can see the Zipf distribution of words:

ta.show_distplot(plot='word_frequency_zipf')

[Figure: Zipf distribution of word frequencies]

See this excellent article to learn how Zipf’s law can be used to improve some NLP models.
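For readers who want to see what sits behind a Zipf plot, the following sketch builds the underlying rank-frequency table with the standard library. Zipf's law says a word's frequency is roughly inversely proportional to its rank, so on a large corpus rank × count stays within the same order of magnitude. The function name rank_frequency and the toy sentence are assumptions for illustration, and a corpus this small will not follow the law closely.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first.

    Hypothetical helper for illustration; not part of Wordview.
    """
    counts = Counter(tokens)
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)
for rank, word, count in table:
    # Under an idealized Zipf's law, rank * count is roughly constant.
    print(rank, word, count, rank * count)
```

Plotting count against rank on log-log axes turns this relationship into an approximately straight line, which is the usual shape of a Zipf plot.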

Part of Speech Tags

To view the words that carry a given part-of-speech tag as a word cloud, use the show_word_clouds method. The type argument takes a tag such as VB (verbs), NN (nouns), or JJ (adjectives).

# To see verbs
ta.show_word_clouds(type="VB")
# To see nouns
ta.show_word_clouds(type="NN")
# To see adjectives
ta.show_word_clouds(type="JJ")

[Figures: word clouds of verbs, nouns, and adjectives]
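A word cloud is essentially a picture of per-word frequencies, so the data behind each cloud can be sketched as a frequency count filtered by tag. The example below works on hand-tagged tokens with Penn Treebank-style tags. The function name words_by_tag and the prefix matching (so that VB also matches VBD and VBZ) are assumptions for illustration; whether Wordview matches tags this way is not stated here.

```python
from collections import Counter


def words_by_tag(tagged_tokens, tag_prefix):
    """Count lowercased words whose POS tag starts with `tag_prefix`.

    Hypothetical helper for illustration; not part of Wordview.
    """
    return Counter(
        word.lower() for word, tag in tagged_tokens if tag.startswith(tag_prefix)
    )


# Hand-tagged toy input; a real pipeline would get tags from a POS tagger.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
    ("fox", "NN"), ("ran", "VBD"),
]

verbs = words_by_tag(tagged, "VB")
nouns = words_by_tag(tagged, "NN")
adjectives = words_by_tag(tagged, "JJ")
print(verbs, nouns, adjectives)
```

In a word cloud, words with higher counts in these tables would be drawn larger.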