janitor_util C++ splits multibyte characters into non-UTF bytes(?) #1452

mycoalchen · 2024-02-21T06:11:13Z

When I ran janitor.decontaminate(input) from janitor.py in C++ mode, my Python code threw UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 86: unexpected end of data. I tried cleaning the input of non-Unicode characters before passing it to decontaminate() by running input = input.encode("utf-32", errors="ignore").decode("utf-32", errors="ignore"), but the error persisted, which suggests the input itself is not the problem. I suspect the error comes from the string splitting in clean_ngram_with_indices in janitor_util.cpp. If a split falls inside a multibyte UTF-8 character, the resulting byte strings are no longer valid UTF-8. For example, the en dash is \xe2\x80\x93 in UTF-8, but \xe2 on its own is an incomplete sequence and cannot be decoded.
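The suspected failure can be reproduced in pure Python without the C++ extension. The sketch below is only an illustration of the byte-level problem, not the code in janitor_util.cpp: the sample string and the helper snap_to_char_boundary are hypothetical, and the cut position is chosen by hand to land inside the en dash. It also shows that the utf-32 round trip leaves a valid string unchanged, so it cannot prevent this error.

```python
def snap_to_char_boundary(data: bytes, index: int) -> int:
    """Hypothetical helper: move a byte offset left until it no longer
    points at a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    index = max(0, min(index, len(data)))
    while 0 < index < len(data) and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


# Sample input containing an en dash (U+2013), which is \xe2\x80\x93 in UTF-8.
text = "pages 10\u201320"

# The utf-32 round trip from the report does not alter a valid string.
cleaned = text.encode("utf-32", errors="ignore").decode("utf-32", errors="ignore")

raw = cleaned.encode("utf-8")

# Cut one byte into the en dash, as a byte-oriented splitter might.
cut = raw.index(b"\xe2") + 1

try:
    raw[:cut].decode("utf-8")
    naive_error = None
except UnicodeDecodeError as exc:
    # Same failure mode as in the report: a lone lead byte at the end.
    naive_error = exc.reason

# Snapping the cut back to a character boundary makes the prefix decodable.
safe_cut = snap_to_char_boundary(raw, cut)
safe_prefix = raw[:safe_cut].decode("utf-8")

print(naive_error)
print(repr(safe_prefix))
```

If the splitter in the extension works on byte offsets, a boundary check of this kind (or returning bytes and decoding on the Python side with an explicit error handler) would be one possible way to avoid handing Python an incomplete sequence.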
