Composition bug #128

Closed
KrokodileGlue opened this issue Apr 19, 2018 · 5 comments

@KrokodileGlue

It appears that some code point pairs are being incorrectly composed; a starter followed by a non-starter seems to always produce a composition, even if the logical resulting character doesn't exist. For instance, the sequence U+72 U+307 U+323 represents a lowercase Latin r with a dot above and a dot below. Canonical ordering places U+323 (dot below) before U+307 (dot above), and there does exist LATIN SMALL LETTER R WITH DOT BELOW (U+1E5B), so the composition of the first pair results in a valid character. The composition of the resulting code point (U+1E5B) together with the remaining mark, however, results in U+1E64 (LATIN CAPITAL LETTER S WITH ACUTE AND DOT ABOVE), because there is no LATIN SMALL LETTER R WITH DOT BELOW AND DOT ABOVE.
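
As a point of comparison, the expected behaviour for this sequence can be sketched in a few lines of self-contained C. This is only a toy model, not utf8proc code: the names `toy_ccc`, `toy_compose_pair` and `toy_nfc` are hypothetical, the combining classes and the three composition pairs are a hand-picked excerpt of the Unicode data, and the input is assumed to begin with a starter. The key point is that a pair with no entry yields "no composition" rather than some other character.

```c
#include <stddef.h>
#include <stdint.h>

/* Canonical combining class for the two marks in the example (0 = starter). */
static int toy_ccc(uint32_t c)
{
    switch (c) {
    case 0x0307: return 230; /* COMBINING DOT ABOVE */
    case 0x0323: return 220; /* COMBINING DOT BELOW */
    default:     return 0;
    }
}

/* Hand-picked excerpt of canonical composition pairs; not utf8proc's table. */
static const struct { uint32_t first, second, composed; } toy_pairs[] = {
    { 0x0072, 0x0307, 0x1E59 }, /* r + dot above */
    { 0x0072, 0x0323, 0x1E5B }, /* r + dot below */
    { 0x015A, 0x0307, 0x1E64 }, /* S with acute + dot above */
};

/* Returns the composed code point, or -1 when the pair has no composition. */
static long toy_compose_pair(uint32_t first, uint32_t second)
{
    for (size_t i = 0; i < sizeof toy_pairs / sizeof toy_pairs[0]; i++)
        if (toy_pairs[i].first == first && toy_pairs[i].second == second)
            return (long)toy_pairs[i].composed;
    return -1;
}

/* Reorders and composes buf in place; returns the new length.
 * Assumption for this sketch: buf[0] is a starter. */
size_t toy_nfc(uint32_t *buf, size_t len)
{
    if (len == 0)
        return 0;
    /* Canonical ordering: stable bubble sort of adjacent non-starters. */
    for (size_t pass = 0; pass + 1 < len; pass++)
        for (size_t i = 0; i + 1 < len; i++) {
            int a = toy_ccc(buf[i]), b = toy_ccc(buf[i + 1]);
            if (a > b && b > 0) {
                uint32_t t = buf[i]; buf[i] = buf[i + 1]; buf[i + 1] = t;
            }
        }
    size_t starter = 0, out = 1;
    int last_ccc = -1; /* class of the last mark kept after the starter */
    for (size_t i = 1; i < len; i++) {
        uint32_t c = buf[i];
        int cc = toy_ccc(c);
        /* A mark is blocked if a kept mark of equal or higher class precedes it. */
        long composed = (last_ccc < cc) ? toy_compose_pair(buf[starter], c) : -1;
        if (composed >= 0) {
            buf[starter] = (uint32_t)composed;
            continue;
        }
        if (cc == 0) { starter = out; last_ccc = -1; } else { last_ccc = cc; }
        buf[out++] = c;
    }
    return out;
}
```

Calling `toy_nfc` on the array `{ 0x0072, 0x0307, 0x0323 }` leaves two code points, U+1E5B followed by U+0307, because the lookup for (U+1E5B, U+0307) finds no pair and the mark is kept as-is.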

Many other sequences produce errors. All of the following compositions are incorrectly produced by utf8proc, but there are of course many more:

  • U+61 + U+307 = U+227
  • U+227 + U+323 = U+2E
  • U+227 + U+307 = U+1FC
  • U+1E5B + U+323 = U+1E64

The actual character that's produced in error, of course, is the result of a reasonable index into the data table, so I doubt any check on the value of composition could reliably fix this.
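
To illustrate why checking the composed value cannot help, here is a purely hypothetical packed lookup table in C. It is not utf8proc's actual table layout or code: the column numbers, row offsets and the names `toy_packed`, `toy_lookup_unchecked` and `toy_lookup_checked` are all invented for this sketch. It only shows the general failure mode, where an index that is not validated against the row's range lands on a perfectly valid entry belonging to a different first character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical column numbers assigned to combining marks. */
enum { COL_0304 = 0, COL_0301 = 1, COL_0323 = 2, COL_0307 = 3 };

/* Hypothetical row offsets into toy_packed for two first characters. */
enum { ROW_1E5B = 0, ROW_015A = 3 };

/* Each row is stored as: min column, max column, then one entry per column. */
static const uint32_t toy_packed[] = {
    /* U+1E5B: only + U+0304 composes */ 0, 0, 0x1E5D,
    /* U+015A: only + U+0307 composes */ 3, 3, 0x1E64,
};

/* No range check: a column outside the row's range reads a neighbouring
 * row (or past the end of the array, for other inputs). */
uint32_t toy_lookup_unchecked(size_t row, uint32_t col)
{
    return toy_packed[row + 2 + col - toy_packed[row]];
}

/* Validates the column against the row's range; 0 means "no composition". */
uint32_t toy_lookup_checked(size_t row, uint32_t col)
{
    if (col < toy_packed[row] || col > toy_packed[row + 1])
        return 0;
    return toy_packed[row + 2 + col - toy_packed[row]];
}
```

In this toy layout, `toy_lookup_unchecked(ROW_1E5B, COL_0307)` returns 0x1E64, a well-formed code point taken from the U+015A row, while `toy_lookup_checked` returns 0 for the same arguments. The returned value alone gives no hint that anything went wrong; only the index can be validated.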

Here's a minimal program demonstrating the bug:

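#include <stdio.h>
#include <stdlib.h>
#include <utf8proc.h>

int main(void)
{
	/* U+72 U+307 U+323 encoded as UTF-8 */
	const char *src = "\x72\xCC\x87\xCC\xA3";
	/* utf8proc_NFC returns a newly allocated string, or NULL on error */
	utf8proc_uint8_t *norm = utf8proc_NFC((const utf8proc_uint8_t *)src);
	if (norm == NULL)
		return 1;
	puts(src);
	puts((const char *)norm);
	free(norm);
	return 0;
}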
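
Since the program above prints raw UTF-8, it can be easier to compare input and output as code points. The following is a minimal decoder sketch for that purpose; `toy_decode_utf8` is a hypothetical helper, not part of utf8proc, and it assumes well-formed input (it does not reject overlong or truncated sequences).

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes a NUL-terminated UTF-8 string into out[0..cap) and returns the
 * number of code points written. Sketch only: assumes well-formed input. */
size_t toy_decode_utf8(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;
    while (*p && n < cap) {
        uint32_t c;
        int extra; /* number of continuation bytes that follow the lead byte */
        if (*p < 0x80)      { c = *p;        extra = 0; }
        else if (*p < 0xE0) { c = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { c = *p & 0x0F; extra = 2; }
        else                { c = *p & 0x07; extra = 3; }
        p++;
        /* Each continuation byte contributes its low six bits. */
        while (extra-- > 0 && *p)
            c = (c << 6) | (uint32_t)(*p++ & 0x3F);
        out[n++] = c;
    }
    return n;
}
```

Applied to the source string `"\x72\xCC\x87\xCC\xA3"`, this yields the three code points U+72, U+307 and U+323.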
@stevengj stevengj added the bug label Apr 20, 2018
@stevengj
Member

Thanks, I can reproduce the bug.

@stevengj
Member

It looks like an earlier version of utf8proc didn't have this bug, so it would be extremely helpful if you could do a bisect.

(Happened somewhere between utf8proc 1.3 and 2.0.2, it seems?)

@KrokodileGlue
Author

It looks like this was introduced by eeebf70. I think the relevant change is "merge 1st and 2nd comb index (saves 50kb)". This particular change may have to be reverted and the tables regenerated.

@stevengj
Member

@benibela, can you look into this?

benibela added a commit to benibela/utf8proc that referenced this issue Apr 22, 2018
Does this help? I do not really remember what I wrote back then
stevengj pushed a commit that referenced this issue Apr 27, 2018
Does this help? I do not really remember what I wrote back then
@stevengj
Member

stevengj commented Apr 27, 2018

Closed by #129.

stevengj added a commit that referenced this issue Apr 27, 2018
stevengj added a commit to JuliaLang/julia that referenced this issue Apr 27, 2018
This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.
ararslan pushed a commit to JuliaLang/julia that referenced this issue May 2, 2018
* bump utf8proc to 2.1.1

This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128), so I would recommend backporting.

* added test for JuliaStrings/utf8proc#128

* update checksums
ararslan pushed a commit to JuliaLang/julia that referenced this issue May 2, 2018
* bump utf8proc to 2.1.1

This contains a bug fix for a normalization error (JuliaStrings/utf8proc#128).

* added test for JuliaStrings/utf8proc#128

* update checksums

Ref #26917
(cherry picked from commit 64040ce)
ararslan pushed a commit to JuliaLang/julia that referenced this issue May 8, 2018
ararslan pushed a commit to JuliaLang/julia that referenced this issue May 27, 2018