is_valid_char does not correctly follow the Unicode standard #11171

Closed
ScottPJones opened this issue May 6, 2015 · 13 comments
Labels
domain:unicode Related to unicode characters and encodings

Comments

@ScottPJones
Contributor

is_valid_char returns false for values which are valid Unicode code points.
This is due to a misunderstanding of how the 66 Unicode "noncharacter" code points are supposed to be handled. See: "FAQ - Private-Use Characters, Noncharacters, and Sentinels"

Here are the relevant sections:

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

[@jiahao - edited formatting of hyperlink]
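
To make the FAQ's point concrete, here is a minimal sketch that enumerates the noncharacters and checks that each one is still a Unicode scalar value. The helper names `is_noncharacter` and `is_scalar_value` are hypothetical and are not part of Base or utf8proc; the ranges used are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```julia
# Hypothetical helper names for illustration; not part of Base or utf8proc.

# A noncharacter is U+FDD0..U+FDEF, or the last two code points of any plane
# (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
is_noncharacter(cp::Integer) =
    (0xfdd0 <= cp <= 0xfdef) ||
    (0 <= cp <= 0x10ffff && (cp & 0xfffe) == 0xfffe)

# A Unicode scalar value is any code point outside the surrogate range.
is_scalar_value(cp::Integer) =
    (0 <= cp <= 0xd7ff) || (0xe000 <= cp <= 0x10ffff)

# Collect every noncharacter in the code space.
noncharacters = [cp for cp in 0x000000:0x10ffff if is_noncharacter(cp)]

println(length(noncharacters))                # 32 + 2 * 17 = 66
println(all(is_scalar_value, noncharacters))  # every noncharacter is a scalar value
```

Under these definitions, a validity check that rejects noncharacters is stricter than the standard requires.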

@mschauer
Contributor

mschauer commented May 6, 2015

I agree. As I understand it, this means that functions should handle noncharacters gracefully and not bail out when encountering one.

@ihnorton ihnorton added the domain:unicode Related to unicode characters and encodings label May 6, 2015
@ScottPJones
Contributor Author

Here is my proposed replacement; I'll submit a PR very shortly:
function is_valid_char(ch::Unsigned) ; !Bool((ch-0xd800<0x800)|(ch>0x10ffff)) ; end
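
The one-liner above relies on unsigned subtraction wrapping around, so a single comparison covers the whole surrogate block. The sketch below spells that out and compares it against an explicit range check. The function names are hypothetical, and restricting the argument to `UInt32` is an assumption made here to keep the wraparound easy to reason about.

```julia
# Hypothetical names for illustration; the UInt32 restriction is an assumption.

# Subtracting 0xd800 wraps values below 0xd800 around to very large numbers,
# so one comparison tests membership in U+D800..U+DFFF.
is_surrogate_wrapped(ch::UInt32) = (ch - 0x0000d800) < 0x00000800

# The same test written as an explicit range check.
is_surrogate_explicit(ch::UInt32) = 0x0000d800 <= ch <= 0x0000dfff

# Valid means: not a surrogate and not above U+10FFFF.
is_valid_scalar(ch::UInt32) = !(is_surrogate_wrapped(ch) | (ch > 0x0010ffff))

# Compare both surrogate tests over every code point plus a margin above U+10FFFF.
agree = all(is_surrogate_wrapped(c) == is_surrogate_explicit(c)
            for c in 0x00000000:0x0011ffff)
println(agree)
```

Note that noncharacters such as U+FFFE pass this check, which is the behavior the issue asks for.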

@jiahao
Member

jiahao commented May 6, 2015

What is really meant here is that is_valid_char does not correctly identify valid Unicode scalar values, as opposed to valid characters or valid Unicode code points (the surrogates U+D800 - U+DFFF are valid code points). Only valid Unicode scalar values can have a code unit sequence that can appear in a valid Unicode string.

See Unicode 7.0.0, p119 (pdf):

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.

The documentation of is_valid_char should also be changed to

Returns true if the given char or integer is a valid Unicode ~~code point~~ scalar value.

perhaps even including a reference to the definition in the Unicode standard.
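
As a rough illustration of why surrogate code points cannot appear as scalar values in a well-formed string, here is a sketch of UTF-16 encoding for a single scalar value. The helper `utf16_units` is hypothetical and assumes the caller has already excluded surrogates and values above U+10FFFF. Supplementary-plane values are encoded as a pair of code units drawn from the surrogate range, which is why that range is reserved.

```julia
# Hypothetical helper; assumes cp is already a valid Unicode scalar value.
function utf16_units(cp::UInt32)
    if cp < 0x00010000
        # Basic Multilingual Plane: a single code unit.
        return UInt16[cp % UInt16]
    else
        # Supplementary planes: split the 20-bit offset into a surrogate pair.
        v = cp - 0x00010000
        high = 0xd800 + (v >> 10) % UInt16          # top 10 bits
        low  = 0xdc00 + (v & 0x000003ff) % UInt16   # bottom 10 bits
        return UInt16[high, low]
    end
end

println(utf16_units(0x0000fffe))  # a noncharacter still encodes as one code unit
println(utf16_units(0x0001f600))  # a supplementary code point becomes a surrogate pair
```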

jiahao referenced this issue May 6, 2015
use simple rejection sampling over valid codepoint range
@jakebolewski
Member

The relevant function that is_valid_char calls in utf8proc is utf8proc_codepoint_valid:

https://github.com/JuliaLang/utf8proc/blob/7c14ef5f8371e463a01e0f1de971caa600384390/utf8proc.c#L151

@jiahao
Member

jiahao commented May 6, 2015

Ref #11033

@jiahao
Member

jiahao commented May 6, 2015

utf8proc_codepoint_valid is not documented, so its meaning could be changed to be in sync with what we have here.

@ScottPJones
Contributor Author

@jiahao Good point about specifying Unicode scalar values. It would also be good to fix utf8proc; I have several issues to submit against utf8proc where it doesn't conform to the Unicode standard.
@jakebolewski I won't use utf8proc, Julia is faster than C anyway! ;-)

@nalimilan
Member

@ScottPJones Now, contrary to what was asked in other PRs, it might be better to fix the problem in utf8proc if that's indeed a bug there. :-) As long as we depend on utf8proc at all, we had better make it work right.

@StefanKarpinski
Sponsor Member

utf8proc is justified as C code since it's used by others outside of Julia.

@ScottPJones
Contributor Author

@nalimilan I didn't say that I wouldn't get around to fixing it in utf8proc as well, which I hope to do... but I also have other things to do than fixing Julia bugs ;-)
@StefanKarpinski Yes, I understand that... as soon as I get a "round tuit", I'll fix it, but right now, I wanted to get Julia (which I'm using) fixed.

@StefanKarpinski
Sponsor Member

Thanks, @ScottPJones! Very much appreciated.

@ScottPJones
Contributor Author

@StefanKarpinski I'm positively 😊ing from the kind words today! 😉 I do owe all of you a beer (or cider) or two (or three) at the Muddy Charles during JuliaCon, for putting up with me being such a long-winded PITA!

@StefanKarpinski
Sponsor Member

No worries, @ScottPJones. Glad you've persevered.

jakebolewski added a commit that referenced this issue May 7, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 9, 2015
Add reference to issue JuliaLang#11171
mbauman pushed a commit that referenced this issue May 11, 2015
Add reference to issue #11171
mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015
Add reference to issue JuliaLang#11171
tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015
Add reference to issue JuliaLang#11171