wordninja2 is a faster version of wordninja. Wordninja is a word-based unigram language model that splits strings of words written without spaces into separate words, as follows:
>>> from wordninja2 import split
>>> split("waldorfastorianewyork")
['waldorf', 'astoria', 'new', 'york']
>>> split("besthotelpricesyoucanfind")
['best', 'hotel', 'prices', 'you', 'can', 'find']
Wordninja was originally defined in a Stack Overflow thread, and later rewritten into a Python package.
As the original wordninja isn't actively maintained and contains some inconsistencies, I decided to rewrite it. See below for a comparison between wordninja and wordninja2.
wordninja2 is packaged with a wordlist, which allows you to use it out of the box. To facilitate migrating from wordninja to wordninja2, we use the exact same wordlist.
>>> from wordninja2 import split
>>> split("HelloIfoundanewhousewiththreebedroomswouldwebeabletoshareit?")
['Hello',
'I',
'found',
'a',
'new',
'house',
'with',
'three',
'bedrooms',
'would',
'we',
'be',
'able',
'to',
'share',
'it',
'?']
Using wordninja2 with your own wordlist is easy, and works regardless of any punctuation in the tokens or the language they are written in.
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "cat", "房子"]
>>> wn = WordNinja(my_words)
>>> wn.split("idogcat房子house")
["i", "dog", "cat", "房子", "h", "o", "u", "s", "e"]
Note that any wordlist you supply should be sorted in descending order of importance: wordninja assumes that words higher in the list take precedence during segmentation over words lower in the list. The following example shows this in action.
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "s", "a", "b", "c", "d", "e", "f", "dogs"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogs")
["dog", "s"]
>>> my_words = ["dogs", "dog", "s"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogsdog")
["dogs", "dog"]
If you want multilingual wordlists, or a better English wordlist, you can install wordfreq, by the great rspeer (go give it a star on GitHub). This works as follows:
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wordlist = top_n_list("de", 500_000)
>>> print(wordlist[:10])
['die', 'der', 'und', 'in', 'das', 'ich', 'ist', 'nicht', 'zu', 'den']
>>> wn = WordNinja(wordlist)
>>> wn.split("erinteressiertsichfüralles,aberbesondersfürschmetterlingeundandereinsekten")
['er',
'interessiert',
'sich',
'für',
'alles',
',',
'aber',
'besonders',
'für',
'schmetterlinge',
'und',
'andere',
'insekten']
One interesting avenue is to segment a string with models for several languages, and keep the segmentation with the lowest cost.
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wns = {}
>>> for language in ["de", "nl", "en", "fr"]:
...     wordlist = top_n_list(language, 500_000)
...     wns[language] = WordNinja(wordlist)
>>> # This is a Dutch string.
>>> string = "ditiseennederlandsetekstmeteenheelmooiverhaalofmeerdereverhalen"
>>> segmentations = {}
>>> for language, model in wns.items():
...     segmentation = model.split_with_cost(string)
...     segmentations[language] = segmentation
>>> for language, segmentation in sorted(segmentations.items(), key=lambda x: x[1].cost):
...     print(language)
...     print(segmentation.tokens)
nl
['dit', 'is', 'een', 'nederlandse', 'tekst', 'meteen', 'heel', 'mooi', 'verhaal', 'of', 'meerdere', 'verhalen']
en
['diti', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'moo', 'iver', 'haal', 'of', 'meer', 'der', 'ever', 'halen']
fr
['dit', 'ise', 'en', 'nederlandse', 'tekst', 'me', 'te', 'en', 'heel', 'mooi', 'verh', 'aal', 'of', 'meer', 'de', 'rever', 'halen']
de
['dit', 'i', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'mooi', 'verha', 'al', 'of', 'meer', 'der', 'ever', 'halen']
In this section, I'll highlight some differences between wordninja and wordninja2.
The original wordninja is not self-consistent; that is, the following assert fails.
from wordninja import split
string = "this,string-split it"
assert "".join(split(string)) == string
This is because wordninja removes all non-word characters from the string before processing it. As a consequence, wordninja can never detect words that contain these special characters.
wordninja2 is completely self-consistent, and does not remove any special characters from a string.
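In other words, the same check from above passes with wordninja2:
>>> from wordninja2 import split
>>> string = "this,string-split it"
>>> assert "".join(split(string)) == string  # no AssertionError: every character is preserved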
wordninja2 is significantly faster than wordninja: in the benchmark below, it is roughly six times as fast. We segment the entire text of Mary Shelley's Frankenstein (which you can download here):
>>> import re
>>> from wordninja2 import split
>>> from wordninja import split as old_split
>>> # Remove all whitespace.
>>> txt = re.sub(r"\s", "", open("pg84.txt").read())
>>> %timeit split(txt)
299 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit old_split(txt)
1.89 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The original wordninja uses an algorithm that, for every character in the string, backtracks up to the length of the longest word in the wordlist. Thus, if your wordlist contains even a single very long word, the entire algorithm slows down considerably. Coincidentally, the default wordlist used in wordninja contains a really long word: llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch (see here for additional background).
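To make this concrete, here is a rough sketch of that style of algorithm. It is an illustration of the idea, not the actual wordninja code; the function name and the 9e99 cost for unknown substrings are just for demonstration.
def naive_split(s, wordcost, max_word_length, unknown_cost=9e99):
    # best[i] is the cheapest cost of segmenting s[:i];
    # backpointer[i] is the length of the last token in that segmentation.
    best = [0.0]
    backpointer = [0]
    for i in range(1, len(s) + 1):
        # This inner loop scans back up to max_word_length characters.
        # A single extremely long dictionary word widens this window
        # for *every* position in the string.
        candidates = []
        for length in range(1, min(i, max_word_length) + 1):
            word = s[i - length:i]
            candidates.append((best[i - length] + wordcost.get(word, unknown_cost), length))
        cost, length = min(candidates)
        best.append(cost)
        backpointer.append(length)
    # Walk the backpointers to recover the segmentation.
    tokens, i = [], len(s)
    while i > 0:
        tokens.append(s[i - backpointer[i]:i])
        i -= backpointer[i]
    return tokens[::-1]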
To avoid this backtracking, wordninja2 uses the Aho-Corasick algorithm. We use a fast implementation in Rust, aho-corasick, through its Python bindings, aho-corasick-rs.
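As a sketch of the idea: a single linear pass over the string collects every dictionary match (including overlapping ones), and a dynamic program then only has to consider the matches ending at each position, plus a per-character fallback. The code below is an illustration under assumptions, not wordninja2's actual implementation; in particular, it assumes ahocorasick_rs's find_matches_as_indexes with overlapping=True, and the function name and unknown_cost value are made up for the example.
from ahocorasick_rs import AhoCorasick

def aho_split(s, words, wordcost, unknown_cost=9e99):
    ac = AhoCorasick(words)
    # One linear pass yields all dictionary matches, including overlapping
    # ones, as (pattern_index, start, end) tuples -- no backtracking window.
    matches = ac.find_matches_as_indexes(s, overlapping=True)
    # Group matches by end position so the DP can look them up directly.
    ends_at = {}
    for pattern_index, start, end in matches:
        ends_at.setdefault(end, []).append((start, wordcost[words[pattern_index]]))
    best = [0.0] * (len(s) + 1)
    backpointer = [0] * (len(s) + 1)
    for i in range(1, len(s) + 1):
        # Fallback: treat s[i - 1] as a single unknown character.
        candidates = [(best[i - 1] + unknown_cost, i - 1)]
        for start, cost in ends_at.get(i, []):
            candidates.append((best[start] + cost, start))
        best[i], backpointer[i] = min(candidates)
    # Recover the segmentation by walking the backpointers.
    tokens, i = [], len(s)
    while i > 0:
        tokens.append(s[backpointer[i]:i])
        i = backpointer[i]
    return tokens[::-1]
With a Zipf-style cost per word (as in the earlier sketch), this reproduces segmentations like ['dogs', 'dog'] for "dogsdog".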
For the exact dependencies, see the pyproject.toml file. We only rely on the aforementioned aho-corasick implementation and numpy.
Clone the repo and run make install. I might put this on PyPI later.
wordninja2 has 100% test coverage; run make test to run the tests.
MIT
- Stéphan Tulkens
- The original code is by keredson
- The original algorithm was written by Generic Human