pip install flashtext2
flashtext2 is an optimized version of the flashtext library for fast keyword extraction and replacement.
It's orders of magnitude faster than regular expressions.
- Rewritten for Better Performance: Completely rewritten in Rust, making it approximately 3-10x faster than the original version.
- Unicode Standard Annex #29: Instead of relying on an arbitrary regex pattern such as [A-Za-z0-9_]+, as flashtext does, flashtext2 uses Unicode Standard Annex #29 to split strings into tokens. This ensures compatibility with all languages, not just Latin-based ones.
- Unicode Case Folding: Instead of converting strings to lowercase for case-insensitive matches, it uses Unicode case folding, ensuring accurate normalization of characters according to the Unicode standard.
- Fully Type-Hinted API: The entire API is fully type-hinted, providing better code clarity and improved development experience.
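To illustrate the tokenization point above, here is a small standard-library sketch. It compares the ASCII-only pattern quoted above with Python's Unicode-aware `\w+`. Note that `\w+` is only a rough stand-in used for illustration; it is not an implementation of Unicode Standard Annex #29 and is not what flashtext2 uses internally.

```python
import re

# The ASCII-only pattern quoted above for the original flashtext.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")
# Rough stand-in for Unicode-aware tokenization (NOT UAX #29 itself).
UNICODE_WORD = re.compile(r"\w+")

text = "Python στο Maße"

ascii_tokens = ASCII_WORD.findall(text)
unicode_tokens = UNICODE_WORD.findall(text)

print(ascii_tokens)    # ['Python', 'Ma', 'e']
print(unicode_tokens)  # ['Python', 'στο', 'Maße']
```

The ASCII pattern drops the Greek word entirely and splits "Maße" at the "ß", so a keyword containing such characters could never match as a whole token.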
Usage
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Python')
kp.add_keyword('flashtext')
kp.add_keyword('program')
text = "I love programming in Python and using the flashtext library."
keywords_found = kp.extract_keywords(text)
print(keywords_found)
# Output: ['Python', 'flashtext']
keywords_found = kp.extract_keywords_with_span(text)
print(keywords_found)
# Output: [('Python', 22, 28), ('flashtext', 43, 52)]
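The spans in the output above can be checked by slicing the text. This sketch copies the spans from that output instead of calling the library, and it assumes the offsets are character indexes with an exclusive end. That holds for this ASCII example; how offsets are counted for non-ASCII text is not stated here.

```python
text = "I love programming in Python and using the flashtext library."

# Copied from the documented output above, not computed here.
spans = [('Python', 22, 28), ('flashtext', 43, 52)]

# Slice the text with each (start, end) pair; end is treated as exclusive.
sliced = [text[start:end] for _, start, end in spans]
print(sliced)  # ['Python', 'flashtext']
```

Note also that 'program' was added as a keyword but is not reported: matching works on whole tokens, so it does not match inside "programming".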
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('Java', 'Python')
kp.add_keyword('regex', 'flashtext')
text = "I love programming in Java and using the regex library."
new_text = kp.replace_keywords(text)
print(new_text)
# Output: "I love programming in Python and using the flashtext library."
from flashtext2 import KeywordProcessor
text = 'abc aBc ABC'
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc']
kp = KeywordProcessor(case_sensitive=False)
kp.add_keyword('aBc')
print(kp.extract_keywords(text))
# Output: ['aBc', 'aBc', 'aBc']
Overlapping keywords (returns the longest sequence)
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=True)
kp.add_keyword('machine')
kp.add_keyword('machine learning')
text = "machine learning is a subset of artificial intelligence"
print(kp.extract_keywords(text))
# Output: ['machine learning']
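The longest-match behavior above can be pictured with a token-level trie. The following is a minimal pure-Python sketch of that idea, not flashtext2's actual Rust implementation; `build_trie`, `extract_longest`, and the whitespace tokenization are hypothetical simplifications.

```python
END = object()  # sentinel key marking "a keyword ends here"

def build_trie(keywords):
    """Build a trie keyed by whitespace-separated tokens."""
    trie = {}
    for kw in keywords:
        node = trie
        for token in kw.split():
            node = node.setdefault(token, {})
        node[END] = kw
    return trie

def extract_longest(trie, text):
    """Scan left to right, keeping the longest keyword at each position."""
    tokens = text.split()
    found, i = [], 0
    while i < len(tokens):
        node, j = trie, i
        best, best_end = None, i
        # Walk the trie as far as the tokens allow.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                best, best_end = node[END], j  # remember the longest so far
        if best is not None:
            found.append(best)
            i = best_end  # skip past the matched tokens
        else:
            i += 1
    return found

trie = build_trie(["machine", "machine learning"])
result = extract_longest(trie, "machine learning is a subset of artificial intelligence")
print(result)  # ['machine learning']
```

Because the walk continues past "machine" while the next token still fits, the shorter keyword is only reported when the longer one does not match.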
Case folding
from flashtext2 import KeywordProcessor
kp = KeywordProcessor(case_sensitive=False)
kp.add_keywords_from_iter(["flour", "Maße", "ᾲ στο διάολο"])
text = "flour, MASSE, ὰι στο διάολο"
print(kp.extract_keywords(text))
# Output: ['flour', 'Maße', 'ᾲ στο διάολο']
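To see why case folding matters for the example above, Python's built-in `str.casefold()` (which applies full Unicode case folding) can be compared with `str.lower()`. This is only an illustration with the standard library; flashtext2's own folding is done in Rust and may differ in details.

```python
# (keyword, variant found in text); the Greek pair is written with escapes:
# U+1FB2 is "ᾲ", and U+1F70 U+03B9 is "ὰ" followed by "ι".
pairs = [("Maße", "MASSE"), ("\u1fb2", "\u1f70\u03b9")]

lower_matches = [a.lower() == b.lower() for a, b in pairs]
fold_matches = [a.casefold() == b.casefold() for a, b in pairs]

print(lower_matches)  # [False, False]
print(fold_matches)   # [True, True]
```

Lowercasing leaves "ß" and "ᾲ" unchanged, so the pairs never compare equal; case folding expands them to "ss" and "ὰι", which is what makes the case-insensitive matches possible.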
Performance
Extracting keywords is usually 2.5-3x faster than with the original flashtext, and replacing them is about 10x faster.
There is still room to optimize the code and improve performance.
You can find the benchmarks here.
In the benchmarks, words average 6 characters and each sentence contains 10k words, so each input is roughly 60k characters long.
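For a sense of the input size, here is a small sketch that builds a comparable sentence. `make_sentence` is a hypothetical helper, not the project's benchmark generator, and it simplifies "6 characters on average" to exactly 6 characters per word.

```python
import random
import string

def make_sentence(n_words=10_000, word_len=6, seed=0):
    """Build one space-separated sentence of random lowercase words."""
    rng = random.Random(seed)
    words = [
        "".join(rng.choices(string.ascii_lowercase, k=word_len))
        for _ in range(n_words)
    ]
    return " ".join(words)

sentence = make_sentence()
n_words = len(sentence.split())
n_letters = sum(len(w) for w in sentence.split())

print(n_words, n_letters, len(sentence))  # 10000 60000 69999
```

The 60k figure counts the letters only; with single spaces between words the full string is slightly longer.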
TODO
- Add multiple ways of normalizing strings: simple case folding, full case folding, and locale-aware folding
- Remove all clones in src code
Credit to Vikash Singh, the author of the original flashtext package.