
wordninja2

wordninja2 is a faster version of wordninja. Wordninja is a word-based unigram language model that splits strings of concatenated words, written without spaces, into their constituent words:

>>> from wordninja2 import split
>>> split("waldorfastorianewyork")
['waldorf', 'astoria', 'new', 'york']
>>> split("besthotelpricesyoucanfind")
['best', 'hotel', 'prices', 'you', 'can', 'find']

Wordninja was originally defined in a Stack Overflow thread, and later rewritten as a Python package.

As the original wordninja isn't really maintained, and contains some inconsistencies, I decided to rewrite it. See below for a comparison between wordninja and wordninja2.

Usage

wordninja2 is packaged with a wordlist, which allows you to use it out of the box. To facilitate migrating from wordninja to wordninja2, we use the exact same wordlist.

>>> from wordninja2 import split
>>> split("HelloIfoundanewhousewiththreebedroomswouldwebeabletoshareit?")
['Hello',
 'I',
 'found',
 'a',
 'new',
 'house',
 'with',
 'three',
 'bedrooms',
 'would',
 'we',
 'be',
 'able',
 'to',
 'share',
 'it',
 '?']

Using wordninja2 with your own wordlist is easy; it works regardless of punctuation in the tokens or the language they are in.

>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "cat", "房子"]
>>> wn = WordNinja(my_words)
>>> wn.split("idogcat房子house")
["i", "dog", "cat", "房子", "h", "o", "u", "s", "e"]

Note that any wordlist you supply should be in descending order of importance. That is, wordninja2 assumes that words higher in the list take precedence in segmentation over words lower in the list. The examples below show the effect of this ordering.

>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "s", "a", "b", "c", "d", "e", "f", "dogs"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogs")
["dog", "s"]

>>> my_words = ["dogs", "dog", "s"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogsdog")
["dogs", "dog"]

Wordfreq integration

If you want multilingual wordlists, or a better English wordlist, you can install wordfreq, by the great rspeer (go give it a star on GitHub). This works as follows:

>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wordlist = top_n_list("de", 500_000)
>>> print(wordlist[:10])
['die', 'der', 'und', 'in', 'das', 'ich', 'ist', 'nicht', 'zu', 'den']

>>> wn = WordNinja(wordlist)
>>> wn.split("erinteressiertsichfüralles,aberbesondersfürschmetterlingeundandereinsekten")
['er',
 'interessiert',
 'sich',
 'für',
 'alles',
 ',',
 'aber',
 'besonders',
 'für',
 'schmetterlinge',
 'und',
 'andere',
 'insekten']

One interesting avenue is that you can segment a string with models for several languages and keep the segmentation with the lowest cost.

>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wns = {}
>>> for language in ["de", "nl", "en", "fr"]:
        wordlist = top_n_list(language, 500_000)
        wns[language] = WordNinja(wordlist)

>>> # This is a Dutch string.
>>> string = "ditiseennederlandsetekstmeteenheelmooiverhaalofmeerdereverhalen"
>>> segmentations = {}
>>> for language, model in wns.items():
        segmentation = model.split_with_cost(string)
        segmentations[language] = segmentation

>>> for language, segmentation in sorted(segmentations.items(), key=lambda x: x[1].cost):
        print(language)
        print(segmentation.tokens)
nl
['dit', 'is', 'een', 'nederlandse', 'tekst', 'meteen', 'heel', 'mooi', 'verhaal', 'of', 'meerdere', 'verhalen']
en
['diti', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'moo', 'iver', 'haal', 'of', 'meer', 'der', 'ever', 'halen']
fr
['dit', 'ise', 'en', 'nederlandse', 'tekst', 'me', 'te', 'en', 'heel', 'mooi', 'verh', 'aal', 'of', 'meer', 'de', 'rever', 'halen']
de
['dit', 'i', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'mooi', 'verha', 'al', 'of', 'meer', 'der', 'ever', 'halen']

Differences with wordninja

In this section I'll highlight some differences between wordninja and wordninja2.

Consistency

The original wordninja is not self-consistent; that is, the following assert fails:

string = "this,string-split it"
assert "".join(split(string)) == string

This is because wordninja removes all non-word characters from the string before processing it. As a consequence, wordninja can never detect words that contain such special characters.

wordninja2 is completely self-consistent, and does not remove any special characters from a string.
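
Given that guarantee, the round trip should hold for wordninja2. A quick check (only the round-trip property is asserted here, since the exact token boundaries depend on the wordlist):

>>> from wordninja2 import split
>>> string = "this,string-split it"
>>> "".join(split(string)) == string
True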

Speed

wordninja2 is substantially faster than wordninja: on the benchmark below it runs roughly six times as fast. The benchmark segments the entire text of Mary Shelley's Frankenstein (which you can download here):

>>> import re

>>> from wordninja2 import split
>>> from wordninja import split as old_split

>>> # Remove all spaces.
>>> txt = re.sub(r"\s", "", open("pg84.txt").read())
>>> %timeit split(txt)
299 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit old_split(txt)
1.89 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The original wordninja uses an algorithm that, for each character in the string, backtracks up to the length of the longest word. Thus, if your wordlist contains even a single very long word, the entire algorithm slows down dramatically. Coincidentally, the default wordlist used in wordninja has a really long word: llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch; see here for additional background.

To avoid backtracking, wordninja2 uses the Aho-Corasick algorithm. We use a fast Rust implementation, aho-corasick, through its Python bindings, aho-corasick-rs.
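
To make the difference concrete, here is a minimal pure-Python sketch of the idea: run a dynamic program over character positions, where each position considers every dictionary word that ends there, plus a single-character fallback. The naive inner loop below (rescanning the whole wordlist at every position) is exactly the work the Aho-Corasick automaton replaces with a single pass over the string. The cost function, cheaper for words earlier in the list and a flat penalty for fallback characters, is an illustrative assumption rather than wordninja2's exact weighting, and naive_split is a made-up helper, not part of the package.

from math import log

def naive_split(text, words):
    # Minimal dynamic-programming segmentation sketch (illustrative, not wordninja2's code).
    # `words` must be ordered from most to least important.
    # Zipf-style cost: words earlier in the list are cheaper (an assumption about the weighting).
    cost = {word: log(rank + 2) for rank, word in enumerate(words)}
    fallback_cost = 10.0  # flat penalty for an unmatched single character (assumption)

    # best[i] = (total cost, tokens) for the first i characters of `text`.
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        # Fallback: treat text[i - 1] as a single-character token.
        candidates = [(best[i - 1][0] + fallback_cost, best[i - 1][1] + [text[i - 1]])]
        # Naive scan: try every dictionary word ending at position i.
        # wordninja2 gets these matches from a single Aho-Corasick pass instead.
        for word in words:
            start = i - len(word)
            if start >= 0 and text[start:i] == word:
                candidates.append((best[start][0] + cost[word], best[start][1] + [word]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]

print(naive_split("dogsdog", ["dogs", "dog", "s"]))
# ['dogs', 'dog']

On the wordlist from the earlier Usage example this reproduces ['dogs', 'dog'], matching the behaviour shown above; the only change in the real implementation is that the per-position word scan is done by the automaton.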

Dependencies

See the pyproject.toml file. We only rely on the aforementioned aho-corasick implementation and numpy.

Installation

Clone the repo and run make install. I might put this on PyPI later.

Tests

wordninja2 has 100% test coverage; run make test to run the tests.

License

MIT

Authors
