This package implements a basic, Twitter-aware tokenizer. It was originally developed by Christopher Potts (as the Happy Fun Tokenizer) and updated by H. Andrew Schwartz. It is shared with Christopher's permission.
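# Import the Tokenizer class and create an instance with default settings
from happierfuntokenizing.happierfuntokenizing import Tokenizer
tokenizer = Tokenizer()
message = """OMG!!!! :) I looooooove this tokenizer lololol"""
# With the defaults, word tokens are lowercased
tokens = tokenizer.tokenize(message)
print(tokens)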
['omg', '!', '!', '!', '!', ':)', 'i', 'looooooove', 'this', 'tokenizer', 'lololol']
message = """OMG!!!! :) I looooooove this tokenizer LoLoLoLoLooOOOOL"""
# Pass preserve_case=True to keep the original capitalization
tokenizer = Tokenizer(preserve_case=True)
tokens = tokenizer.tokenize(message)
print(tokens)
['OMG', '!', '!', '!', '!', ':)', 'I', 'looooooove', 'this', 'tokenizer', 'LoLoLoLoLooOOOOL']
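To illustrate what "Twitter-aware" tokenization involves, here is a minimal sketch that reproduces the two outputs above with a handful of regular expressions. It is not the package's implementation: the pattern list, the function name simple_tokenize, and the choice to leave emoticons un-lowercased are all assumptions made for this example, and the real tokenizer handles many more cases.

```python
import re

# Hypothetical, heavily simplified patterns (assumptions for this sketch only).
EMOTICON = r"[:;=][\-']?[)(\]\[dDpP/\\|]"
PATTERNS = [
    EMOTICON,         # a few Western-style emoticons such as :) ;-( =D
    r"@\w+",          # @mentions
    r"#\w+",          # hashtags
    r"\w+(?:'\w+)?",  # words, optionally with an internal apostrophe
    r"\S",            # any other non-space character as its own token
]
# Alternatives are tried in order, so emoticons win over single punctuation marks.
TOKEN_RE = re.compile("|".join("(?:%s)" % p for p in PATTERNS), re.UNICODE)
EMOTICON_RE = re.compile("^(?:%s)$" % EMOTICON)


def simple_tokenize(text, preserve_case=False):
    """Split text into tokens, keeping emoticons, mentions, and hashtags whole."""
    tokens = TOKEN_RE.findall(text)
    if not preserve_case:
        # Lowercase everything except emoticons, where case can matter (:D vs :d).
        tokens = [t if EMOTICON_RE.match(t) else t.lower() for t in tokens]
    return tokens


print(simple_tokenize("OMG!!!! :) I looooooove this tokenizer lololol"))
print(simple_tokenize("Yay :D #Happy @Bob", preserve_case=True))
```

The order of the alternatives matters: if the catch-all single-character pattern came first, ":)" would be split into ":" and ")".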
The package is available through pip:
pip install happierfuntokenizing
If you do not have sudo privileges, you can use the --user flag:
pip install --user happierfuntokenizing
This package uses Python 2.7. Its dependencies are re and htmlentitydefs, both of which are part of the Python 2 standard library.
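The htmlentitydefs module exists only in Python 2; in Python 3 the same entity tables live in html.entities. Code that needs them on both versions often uses a guarded import. The following is a sketch of that general pattern, not something this package is known to do, and decode_entity is a hypothetical helper name introduced for illustration.

```python
try:
    import htmlentitydefs  # Python 2 module name
except ImportError:
    import html.entities as htmlentitydefs  # Python 3 name for the same tables


def decode_entity(name):
    """Return the character for a named HTML entity such as 'amp', or None if unknown."""
    codepoint = htmlentitydefs.name2codepoint.get(name)
    if codepoint is None:
        return None
    try:
        return unichr(codepoint)  # Python 2: build a unicode character
    except NameError:
        return chr(codepoint)     # Python 3: unichr no longer exists


print(decode_entity("amp"))
```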
Licensed under the GNU General Public License v3 (GPLv3).
Adapted by the World Well-Being Project based out of the University of Pennsylvania and Stony Brook University. Originally developed by Christopher Potts.