Composition bug #128
Comments
Thanks, I can reproduce the bug.
It looks like an earlier version of utf8proc didn't have this bug, so it would be extremely helpful if you could do a bisect. (It seems to have happened somewhere between utf8proc 1.3 and 2.0.2.)
It looks like this was introduced by eeebf70. I think the relevant change is "merge 1st and 2nd comb index (saves 50kb)". This particular change may have to be reverted and the tables regenerated.
@benibela, can you look into this?
Does this help? I do not really remember what I wrote back then.
Closed by #129.
This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.
* bump utf8proc to 2.1.1. This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.
* added test for JuliaStrings/utf8proc#128
* update checksums

* bump utf8proc to 2.1.1. This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128).
* added test for JuliaStrings/utf8proc#128
* update checksums

Ref #26917 (cherry picked from commit 64040ce)
It appears that some code point pairs are being composed incorrectly: a starter followed by a non-starter seems to always produce a composition, even if the logical resulting character doesn't exist. For instance, the sequence U+72 U+307 U+323 represents a lowercase Latin r with a dot above and a dot below. Canonical ordering places U+323 (combining class 220) before U+307 (combining class 230), and LATIN SMALL LETTER R WITH DOT BELOW does exist, so composing the first pair, U+72 U+323, yields a valid character (U+1E5B). Composing U+1E5B with the remaining U+307, however, yields U+1E64 (LATIN CAPITAL LETTER S WITH ACUTE AND DOT ABOVE), even though there is no LATIN SMALL LETTER R WITH DOT BELOW AND DOT ABOVE and the pair should be left uncomposed.
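For comparison, the expected NFC result for the example sequence can be checked against an independent implementation. The sketch below uses Python's standard `unicodedata` module rather than utf8proc, so it only shows what a conforming normalizer should return; the helper name `nfc_codepoints` is made up for illustration.

```python
import unicodedata


def nfc_codepoints(text):
    """Return the NFC form of `text` as a list of U+XXXX labels."""
    normalized = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


# r + COMBINING DOT ABOVE (U+0307) + COMBINING DOT BELOW (U+0323)
sequence = "\u0072\u0307\u0323"
result = nfc_codepoints(sequence)

# The dot below is reordered first and composes with r into U+1E5B;
# the dot above has no composition with U+1E5B and stays separate.
print(result)  # ['U+1E5B', 'U+0307']
```

A result containing U+1E64 for this input would indicate the faulty composition described above.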
Many other sequences produce errors. All of the following compositions are incorrectly produced by utf8proc, but of course there are infinitely many more:
The actual character that's produced in error, of course, is the result of a reasonable index into the data table, so I doubt any check on the value of `composition` could reliably fix this.
Here's a minimal program demonstrating the bug: