wordninja2 is a faster version of wordninja. Wordninja is a word-based unigram language model that splits strings of words written without spaces into separate words, as follows:
>>> from wordninja2 import split
>>> split("waldorfastorianewyork")
['waldorf', 'astoria', 'new', 'york']
>>> split("besthotelpricesyoucanfind")
['best', 'hotel', 'prices', 'you', 'can', 'find']
Wordninja was originally defined in a Stack Overflow thread, and later rewritten into a Python package.
As the original wordninja isn't actively maintained and contains some inconsistencies, I decided to rewrite it. See below for a comparison between wordninja and wordninja2.
wordninja2 is packaged with a wordlist, which allows you to use it out of the box. To facilitate migrating from wordninja to wordninja2, we use the exact same wordlist.
>>> from wordninja2 import split
>>> split("HelloIfoundanewhousewiththreebedroomswouldwebeabletoshareit?")
['Hello',
'I',
'found',
'a',
'new',
'house',
'with',
'three',
'bedrooms',
'would',
'we',
'be',
'able',
'to',
'share',
'it',
'?']
Using wordninja2 with your own wordlist is easy, and works regardless of any punctuation in the tokens or the language they are written in.
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "cat", "房子"]
>>> wn = WordNinja(my_words)
>>> wn.split("idogcat房子house")
["i", "dog", "cat", "房子", "h", "o", "u", "s", "e"]
Note that any wordlist you supply should be sorted in descending order of importance: wordninja assumes that words higher in the list take precedence during segmentation over words lower in the list. The following example shows this in action.
>>> from wordninja2 import WordNinja
>>> my_words = ["dog", "s", "a", "b", "c", "d", "e", "f", "dogs"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogs")
["dog", "s"]
>>> my_words = ["dogs", "dog", "s"]
>>> wn = WordNinja(my_words)
>>> wn.split("dogsdog")
["dogs", "dog"]
If you want multilingual wordlists, or a better English wordlist, you can install wordfreq, by the great rspeer (go give it a star on GitHub). This works as follows:
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wordlist = top_n_list("de", 500_000)
>>> print(wordlist[:10])
['die', 'der', 'und', 'in', 'das', 'ich', 'ist', 'nicht', 'zu', 'den']
>>> wn = WordNinja(wordlist)
>>> wn.split("erinteressiertsichfüralles,aberbesondersfürschmetterlingeundandereinsekten")
['er',
'interessiert',
'sich',
'für',
'alles',
',',
'aber',
'besonders',
'für',
'schmetterlinge',
'und',
'andere',
'insekten']
One interesting avenue is to segment a string with models for several languages, and keep the segmentation with the lowest cost.
>>> from wordfreq import top_n_list
>>> from wordninja2 import WordNinja
>>> wns = {}
>>> for language in ["de", "nl", "en", "fr"]:
...     wordlist = top_n_list(language, 500_000)
...     wns[language] = WordNinja(wordlist)
>>> # This is a Dutch string.
>>> string = "ditiseennederlandsetekstmeteenheelmooiverhaalofmeerdereverhalen"
>>> segmentations = {}
>>> for language, model in wns.items():
...     segmentation = model.split_with_cost(string)
...     segmentations[language] = segmentation
>>> for language, segmentation in sorted(segmentations.items(), key=lambda x: x[1].cost):
...     print(language)
...     print(segmentation.tokens)
nl
['dit', 'is', 'een', 'nederlandse', 'tekst', 'meteen', 'heel', 'mooi', 'verhaal', 'of', 'meerdere', 'verhalen']
en
['diti', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'moo', 'iver', 'haal', 'of', 'meer', 'der', 'ever', 'halen']
fr
['dit', 'ise', 'en', 'nederlandse', 'tekst', 'me', 'te', 'en', 'heel', 'mooi', 'verh', 'aal', 'of', 'meer', 'de', 'rever', 'halen']
de
['dit', 'i', 'seen', 'nederlandse', 'tekst', 'me', 'teen', 'heel', 'mooi', 'verha', 'al', 'of', 'meer', 'der', 'ever', 'halen']
In this section, I'll highlight some differences between wordninja and wordninja2.
The original wordninja is not self-consistent; that is, the following assert fails.
from wordninja import split
string = "this,string-split it"
assert "".join(split(string)) == string
This is because wordninja removes all non-word characters from the string before processing it. As a consequence, wordninja can never detect words that contain these special characters.
wordninja2 is completely self-consistent, and does not remove any special characters from a string.
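In other words, the same check from above passes with wordninja2:
>>> from wordninja2 import split
>>> string = "this,string-split it"
>>> assert "".join(split(string)) == string  # no AssertionError: every character is preserved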
wordninja2 is significantly faster than wordninja: in the benchmark below, it is roughly six times as fast. We segment the entire text of Mary Shelley's Frankenstein (which you can download here):
>>> import re
>>> from wordninja2 import split
>>> from wordninja import split as old_split
>>> # Remove all whitespace.
>>> txt = re.sub(r"\s", "", open("pg84.txt").read())
>>> %timeit split(txt)
299 ms ± 4.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit old_split(txt)
1.89 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The original wordninja uses an algorithm that, for every character in the string, backtracks up to the length of the longest word in the wordlist. Thus, if your wordlist contains even a single very long word, the entire algorithm slows down considerably. Coincidentally, the default wordlist used in wordninja contains a really long word: llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch (see here for additional background).
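To make this concrete, here is a rough sketch of that style of algorithm. It is an illustration of the idea, not the actual wordninja code; the function name and the 9e99 cost for unknown substrings are just for demonstration.
def naive_split(s, wordcost, max_word_length, unknown_cost=9e99):
    # best[i] is the cheapest cost of segmenting s[:i];
    # backpointer[i] is the length of the last token in that segmentation.
    best = [0.0]
    backpointer = [0]
    for i in range(1, len(s) + 1):
        # This inner loop scans back up to max_word_length characters.
        # A single extremely long dictionary word widens this window
        # for *every* position in the string.
        candidates = []
        for length in range(1, min(i, max_word_length) + 1):
            word = s[i - length:i]
            candidates.append((best[i - length] + wordcost.get(word, unknown_cost), length))
        cost, length = min(candidates)
        best.append(cost)
        backpointer.append(length)
    # Walk the backpointers to recover the segmentation.
    tokens, i = [], len(s)
    while i > 0:
        tokens.append(s[i - backpointer[i]:i])
        i -= backpointer[i]
    return tokens[::-1]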
To avoid this backtracking, wordninja2 uses the Aho-Corasick algorithm. We use a fast implementation in Rust, aho-corasick, through its Python bindings, aho-corasick-rs.
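As a sketch of the idea: a single linear pass over the string collects every dictionary match (including overlapping ones), and a dynamic program then only has to consider the matches ending at each position, plus a per-character fallback. The code below is an illustration under assumptions, not wordninja2's actual implementation; in particular, it assumes ahocorasick_rs's find_matches_as_indexes with overlapping=True, and the function name and unknown_cost value are made up for the example.
from ahocorasick_rs import AhoCorasick

def aho_split(s, words, wordcost, unknown_cost=9e99):
    ac = AhoCorasick(words)
    # One linear pass yields all dictionary matches, including overlapping
    # ones, as (pattern_index, start, end) tuples -- no backtracking window.
    matches = ac.find_matches_as_indexes(s, overlapping=True)
    # Group matches by end position so the DP can look them up directly.
    ends_at = {}
    for pattern_index, start, end in matches:
        ends_at.setdefault(end, []).append((start, wordcost[words[pattern_index]]))
    best = [0.0] * (len(s) + 1)
    backpointer = [0] * (len(s) + 1)
    for i in range(1, len(s) + 1):
        # Fallback: treat s[i - 1] as a single unknown character.
        candidates = [(best[i - 1] + unknown_cost, i - 1)]
        for start, cost in ends_at.get(i, []):
            candidates.append((best[start] + cost, start))
        best[i], backpointer[i] = min(candidates)
    # Recover the segmentation by walking the backpointers.
    tokens, i = [], len(s)
    while i > 0:
        tokens.append(s[backpointer[i]:i])
        i = backpointer[i]
    return tokens[::-1]
With a Zipf-style cost per word (as in the earlier sketch), this reproduces segmentations like ['dogs', 'dog'] for "dogsdog".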
For the exact dependencies, see the pyproject.toml file. We only rely on the aforementioned aho-corasick implementation and numpy.
Clone the repo and run make install. I might put this on PyPI later.
wordninja2 has 100% test coverage; run make test to run the tests.
MIT
- Stéphan Tulkens
- The original code is by keredson
- The original algorithm was written by Generic Human