When I ran janitor.decontaminate(input) from janitor.py in C++ mode, my Python code threw UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 86: unexpected end of data. I tried to strip non-UTF-8 characters from my input before passing it to decontaminate() by running input = input.encode("utf-32", errors="ignore").decode("utf-32", errors="ignore"), but the error persisted. I suspect the cause is the string splitting in clean_ngram_with_indices in janitor_util.cpp. When a multibyte UTF-8 character is split, the resulting byte sequences are no longer valid UTF-8. For example, the en-dash is \xe2\x80\x93 in UTF-8, but \xe2 on its own is only a lead byte and cannot be decoded.
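A minimal sketch of the suspected failure, written in plain Python rather than taken from janitor_util.cpp: it assumes the C++ side slices the UTF-8 byte string at an arbitrary byte offset and that the result is decoded as UTF-8 on the way back to Python. The helper snap_to_char_boundary is hypothetical and only illustrates one possible way to keep a cut on a character boundary.

```python
def snap_to_char_boundary(data: bytes, index: int) -> int:
    """Hypothetical helper: move index left until it no longer points
    at a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    while 0 < index < len(data) and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


text = "a\u2013b"              # "a", en-dash, "b"
raw = text.encode("utf-8")     # b'a\xe2\x80\x93b'

# The UTF-32 round trip leaves an ordinary Python str unchanged,
# so it cannot protect against a split that happens later at byte level.
roundtrip = text.encode("utf-32", errors="ignore").decode("utf-32", errors="ignore")

# Cutting right after the lead byte 0xe2 reproduces the same kind of error.
try:
    raw[:2].decode("utf-8")
    error = ""
except UnicodeDecodeError as exc:
    error = str(exc)

# Snapping the cut back to a character boundary keeps the slice decodable.
safe_cut = snap_to_char_boundary(raw, 2)
safe_prefix = raw[:safe_cut].decode("utf-8")
```

Here the slice raw[:2] ends in the lead byte of the en-dash, which is why decoding fails with "unexpected end of data". If the split in clean_ngram_with_indices does work on byte offsets, adjusting the offset in a similar way (or decoding with errors="replace" on the Python side) might avoid the exception.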