Composition bug #128

KrokodileGlue · 2018-04-19T23:51:11Z

It appears that some code point pairs are being incorrectly composed; a starter followed by a non-starter seems to always produce a composition, even if the resulting character doesn't exist. For instance, the sequence U+72 U+307 U+323 represents a lowercase Latin r with a dot above and a dot below. There does exist LATIN SMALL LETTER R WITH DOT BELOW, so the composition of the first pair results in a valid character. The composition of the resulting code point (U+1E5B) together with U+323, however, results in U+1E64 (LATIN CAPITAL LETTER S WITH ACUTE AND DOT ABOVE) because there is no LATIN SMALL LETTER R WITH DOT BELOW AND DOT ABOVE.
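For readers unfamiliar with why the dot below is composed before the dot above even though it comes later in the input: normalization first sorts each run of combining marks by canonical combining class (U+0307 has class 230, U+0323 has class 220). The following is a minimal sketch of that reordering step, not utf8proc's code; the `ccc` helper is a hypothetical two-entry table that only covers the marks in this report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mini-table: canonical combining classes for the two marks
 * used in this report. Everything else is treated as a starter (class 0). */
static int ccc(uint32_t cp)
{
	switch (cp) {
	case 0x0307: return 230; /* COMBINING DOT ABOVE */
	case 0x0323: return 220; /* COMBINING DOT BELOW */
	default:     return 0;
	}
}

/* Stable insertion sort of each run of non-starters by combining class.
 * Starters never move, and marks never move across a starter. */
static void canonical_reorder(uint32_t *cps, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		uint32_t cur = cps[i];
		int cls = ccc(cur);
		size_t j = i;

		if (cls == 0)
			continue;
		while (j > 0 && ccc(cps[j - 1]) > cls) {
			cps[j] = cps[j - 1];
			j--;
		}
		cps[j] = cur;
	}
}
```

Applied to U+72 U+307 U+323, this yields U+72 U+323 U+307, so the dot below is the mark that reaches the starter first.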

Many other sequences produce errors. All of the following compositions are incorrectly produced by utf8proc, and there are certainly many more:

U+61 + U+307 = U+227
U+227 + U+323 = U+2E
U+227 + U+307 = U+1FC
U+1E5B + U+323 = U+1E64

The actual character that's produced in error, of course, is the result of a reasonable index into the data table, so I doubt any check on the value of composition could reliably fix this.
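To illustrate the point that the lookup itself has to report a missing pair, here is a minimal sketch of a pairwise composition lookup with an explicit "no composite" result. It is not utf8proc's table layout: `comp_table`, `compose_pair` and `NO_COMPOSITE` are hypothetical names, the table is a four-entry excerpt, and a linear search stands in for the real indexed tables.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_COMPOSITE ((int32_t)-1)

struct comp_pair {
	uint32_t first, second, composite;
};

/* Hypothetical excerpt; real tables are generated from the Unicode data. */
static const struct comp_pair comp_table[] = {
	{ 0x0061, 0x0307, 0x0227 }, /* a + dot above */
	{ 0x0061, 0x0323, 0x1EA1 }, /* a + dot below */
	{ 0x0072, 0x0307, 0x1E59 }, /* r + dot above */
	{ 0x0072, 0x0323, 0x1E5B }, /* r + dot below */
};

/* Returns the composite for an exact (first, second) match, or
 * NO_COMPOSITE when the pair has no entry. */
static int32_t compose_pair(uint32_t first, uint32_t second)
{
	size_t count = sizeof comp_table / sizeof comp_table[0];

	for (size_t i = 0; i < count; i++) {
		if (comp_table[i].first == first && comp_table[i].second == second)
			return (int32_t)comp_table[i].composite;
	}
	return NO_COMPOSITE;
}
```

With this shape, `compose_pair(0x1E5B, 0x323)` and `compose_pair(0x227, 0x307)` both return `NO_COMPOSITE` because both halves of the pair must match, rather than landing on an unrelated table entry.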

Here's a minimal program demonstrating the bug:

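#include <stdio.h>
#include <stdlib.h>
#include <utf8proc.h>

int main(void)
{
	/* "r" + U+0307 (dot above) + U+0323 (dot below), UTF-8 encoded */
	const char *src = "\x72\xCC\x87\xCC\xA3";
	/* utf8proc_NFC returns a newly allocated string, or NULL on error */
	utf8proc_uint8_t *norm = utf8proc_NFC((const utf8proc_uint8_t *)src);
	if (norm == NULL)
		return 1;
	puts(src);
	puts((const char *)norm);
	free(norm);
	return 0;
}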
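Since the two printed strings can be hard to tell apart in a terminal, it may help to look at the code points instead. Below is a small sketch of a UTF-8 decoder for that purpose; `decode_utf8` is a hypothetical helper, not part of utf8proc, and it only handles 1 to 3 byte sequences (enough for the characters in this report) with minimal validation.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a NUL-terminated UTF-8 string into at most `max` code points.
 * Stops at the first 4-byte or malformed sequence. Returns the number of
 * code points written to `out`. */
static size_t decode_utf8(const unsigned char *s, uint32_t *out, size_t max)
{
	size_t n = 0;

	while (*s != '\0' && n < max) {
		if (s[0] < 0x80) {
			out[n++] = s[0];
			s += 1;
		} else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
			out[n++] = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
			s += 2;
		} else if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 &&
		           (s[2] & 0xC0) == 0x80) {
			out[n++] = ((uint32_t)(s[0] & 0x0F) << 12) |
			           ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
			s += 3;
		} else {
			break;
		}
	}
	return n;
}
```

The input bytes decode to U+72 U+307 U+323. Under the Unicode composition rules, the NFC form would be expected to be U+1E5B U+307 (bytes E1 B9 9B CC 87), so any other code point in the decoded output points at a bad composition.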


stevengj · 2018-04-20T00:18:23Z

Thanks, I can reproduce the bug.

stevengj · 2018-04-20T00:24:46Z

It looks like an earlier version of utf8proc didn't have this bug, so it would be extremely helpful if you could do a bisect.

(Happened somewhere between utf8proc 1.3 and 2.0.2, it seems?)

KrokodileGlue · 2018-04-20T01:05:33Z

It looks like this was introduced by eeebf70. I think the relevant change is "merge 1st and 2nd comb index (saves 50kb)". This particular change may have to be reverted and the tables regenerated.

stevengj · 2018-04-20T02:07:22Z

@benibela, can you look into this?

Does this help? I do not really remember what I wrote back then

stevengj · 2018-04-27T12:06:29Z

Closed by #129.

This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.

* bump utf8proc to 2.1.1. This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.
* added test for JuliaStrings/utf8proc#128
* update checksums

* bump utf8proc to 2.1.1. This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128).
* added test for JuliaStrings/utf8proc#128
* update checksums

Ref #26917 (cherry picked from commit 64040ce)

stevengj added the bug label Apr 20, 2018

benibela added a commit to benibela/utf8proc that referenced this issue Apr 22, 2018

possible fix for JuliaStrings#128

5820e4d


stevengj pushed a commit that referenced this issue Apr 27, 2018

possible fix for #128 (#129)

acc204f


stevengj closed this as completed Apr 27, 2018

stevengj added a commit that referenced this issue Apr 27, 2018

added test for #128

53d7968

stevengj mentioned this issue Apr 27, 2018

version bump to 2.1.1 #131

Merged

stevengj added a commit to JuliaLang/julia that referenced this issue Apr 27, 2018

bump utf8proc to 2.1.1

0aa7e49


stevengj mentioned this issue Apr 27, 2018

bump utf8proc to 2.1.1 JuliaLang/julia#26917

Merged

stevengj added a commit to JuliaLang/julia that referenced this issue Apr 27, 2018

added test for JuliaStrings/utf8proc#128

dff4fe5

stevengj added a commit to JuliaLang/julia that referenced this issue Apr 27, 2018

added test for JuliaStrings/utf8proc#128

690a18e
