
UTF-8 decoding should reject overlong sequences and surrogates #25452

Closed
JeffBezanson opened this issue Jan 7, 2018 · 17 comments
@JeffBezanson

These should be treated as invalid data; see e.g. rust-lang/rust#3787

The following calls to UInt32 should throw errors:

julia> UInt32("\xf0\x80\x80\x80"[1])
0x00000000

julia> UInt32("\xc0\x80"[1])
0x00000000

julia> UInt32("\xc0\xae"[1])
0x0000002e

julia> UInt32("\ud800"[1])
0x0000d800

Otherwise you get things like this:

julia> "\xc0\xae"[1] == '.'
false

julia> UInt32("\xc0\xae"[1]) == UInt32('.')
true
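For comparison, Rust's standard decoder (the behavior the linked rust-lang/rust#3787 discussion converged on) rejects all four of these byte sequences at decode time. A minimal sketch using `std::str::from_utf8`:

```rust
fn main() {
    // Overlong four-byte encoding of U+0000.
    assert!(std::str::from_utf8(&[0xF0, 0x80, 0x80, 0x80]).is_err());
    // Overlong two-byte encoding of U+0000 (Modified UTF-8's NUL).
    assert!(std::str::from_utf8(&[0xC0, 0x80]).is_err());
    // Overlong two-byte encoding of '.' (U+002E).
    assert!(std::str::from_utf8(&[0xC0, 0xAE]).is_err());
    // Direct three-byte encoding of the surrogate U+D800.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());
    // The canonical one-byte encoding of '.' is accepted.
    assert_eq!(std::str::from_utf8(&[0x2E]).unwrap(), ".");
}
```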
JeffBezanson added the domain:strings and domain:unicode labels on Jan 7, 2018
@StefanKarpinski

This is by design (see discussion in #24420). If you compare characters, you get a comparison based on the exact encoding. If you want to convert an overlong encoding to an integer value, you get the code point that it encodes, regardless of whether the encoding is standard or overlong. This allows you to express two different things in natural ways:

  • is this the same character encoded in the standard (or non-standard) way?
  • is this the same code point, regardless of how it's encoded?

Those are both distinct, useful questions one can ask, which can be expressed straightforwardly in this scheme. Note that there is a bug here, which is that this should return false:

julia> Unicode.isvalid("\xf0\x80\x80\x80"[1])
true

The alternative design would be to throw an error upon attempting to convert an overlong encoding to an integer value; you'd then need a different function for getting the code point value of an overlong character encoding. Or maybe a character normalization function? Unclear, but it requires more API surface to express the same thing the current design does with less. The trade-off is that it would be harder to accidentally match an overlong encoded character. I'm not sure whether that's a good thing or a bad thing though.

@JeffBezanson

It's quite clear that overlong sequences and surrogates are invalid UTF-8, as invalid as anything. Getting the code point value of an overlong encoding is not a normal thing to do, and I think it should require some other function.

If data has a stray byte, like "\xff", it would similarly be useful to be able to determine the value of the byte, but conversion to integer (rightly) throws an error. I think overlongs and surrogates should be treated the same. And overlong encodings are far more sinister than stray bytes.

@StefanKarpinski

StefanKarpinski commented Jan 8, 2018

I'm unclear on what is so "sinister" about overlong encodings. Sure, they're invalid, but calling them sinister is a bit dramatic. Modified UTF-8 is quite common and intentionally uses \xc0\x80 to represent the \0 code point. It would help the conversation if you explained what you're worried about happening if we don't throw an error when asking for the code point of an overlong encoding. There was a security concern in the original thread, which was addressed there. Surrogates are also decoded as their code points, but are already correctly classified as invalid, e.g. Unicode.isvalid('\ud800') == false.

@JeffBezanson

It has to do with input validation, e.g. checking that a string doesn't have certain characters. Allowing decoding of overlong sequences provides a way for characters to evade such checking.

But I also think it's better to treat all invalid UTF-8 uniformly instead of throwing errors only in some cases. Yes, there needs to be some way to look at the actual data without an error, but that also applies equally to all invalid data.
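The check-evasion risk described above can be sketched with a decoder that naively combines continuation bits without the overlong check (`naive_decode2` is a hypothetical helper for illustration, not any real API):

```rust
// Hypothetical naive two-byte decoder: combines the payload bits of a
// lead byte and a continuation byte without rejecting overlong forms.
fn naive_decode2(b: [u8; 2]) -> u32 {
    (((b[0] & 0x1F) as u32) << 6) | (b[1] & 0x3F) as u32
}

fn main() {
    // The overlong sequence C0 AE decodes to U+002E ('.'), so a filter
    // that compares code points after naive decoding can be bypassed by
    // feeding it the overlong form.
    assert_eq!(naive_decode2([0xC0, 0xAE]), '.' as u32);
    // A conforming decoder rejects any lead byte below 0xC2, because a
    // two-byte sequence must encode a code point of at least U+0080.
    assert!(std::str::from_utf8(&[0xC0, 0xAE]).is_err());
}
```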

@StefanKarpinski

Ok, can you propose a different design then?

@JeffBezanson

My proposal is for anything where isvalid is false to throw an error when passed to UInt32.

@StefanKarpinski

And how would you get the corresponding code point value if you didn't want an error? E.g. if you were looking for an overlong \0 as in Modified UTF-8.

@JeffBezanson

I think that's really a second-order concern. Modified UTF-8 and CESU-8 are the only exceptions, and most languages seem to consider them marginal and support them in external packages if at all. For example JavaCall could have a function to transcode CESU-8/Modified UTF-8 and a function to test whether a Char is an overlong \0.

@nalimilan

In general I agree we should be careful about overlong encodings, but I'm not sure the fact that UInt32(c) accepts invalid codepoints is a serious problem. When are people supposed to use UInt32(c)? It is more common to perform comparisons between Char values, so I'm mostly concerned about how these behave. Converting to UInt32 can be seen as a more advanced operation.

OTOH, what can be dangerous is that one would check c == '.', but it wouldn't match an overlong encoding, and the string would be passed further to a program which would incorrectly interpret it as valid Unicode. Making UInt32(c) throw an error wouldn't protect against that risk. Only "\xc0\xae"[1] == '.' returning true would, but doing that would imply accepting even more invalid encodings. Paradoxically, it sounds like in this particular area being more secure implies adopting a less strict definition of equality, to be sure you don't miss any problematic encoding of a character.

Other than doing that, the only way to protect against it is to call isvalid, and the only way to make Julia safe by default would be to call it on string construction. But AFAIK we've decided not to do that because it makes it impossible to read arbitrary data and output it without modification. Rust has taken a different approach: its String type is guaranteed to contain valid UTF-8, and its OsString type is used for arbitrary bytes.
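Rust's validate-at-construction approach can be seen in miniature: String checks its bytes up front, and reading arbitrary data requires an explicit lossy (or byte-oriented) path. A small sketch:

```rust
fn main() {
    // String::from_utf8 validates up front; the invalid bytes are handed
    // back intact in the error, so nothing is silently lost.
    let err = String::from_utf8(vec![0xC0, 0xAE]).unwrap_err();
    assert_eq!(err.into_bytes(), vec![0xC0, 0xAE]);
    // Reading arbitrary data takes an explicit lossy path that substitutes
    // U+FFFD for each invalid byte (here: 0xC0 and 0xAE separately).
    assert_eq!(String::from_utf8_lossy(&[0xC0, 0xAE]), "\u{FFFD}\u{FFFD}");
}
```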

@JeffBezanson

My interpretation of the standard is that a correct UTF-8 decoder should not accept invalid data, and a correct encoder should not produce it. In our case UInt32(c) happens to be where we do decoding, so that's where we'd do the check, but it would apply anywhere we decode UTF-8. In our design it does not have to apply to reading and writing data, since no encoding or decoding happens there.

Other than that, the argument is very simple: if isvalid is false (which it should be for overlongs), you should not be able to get a code point out.

I agree "\xc0\xae"[1] != '.' is an issue, but I think the only reasonable ways to deal with that are for users to call isvalid, or for us to do it for them.

@nalimilan

It's not clear how to define "decoding" in the Julia case, but it sounds weird to me to consider that UInt32(c) is decoding while c == '.' is not. The latter expression definitely needs to decode the Char to compare it with '.'. Or do we consider that . is encoded to UTF-8 when creating the Char '.', and that we then just compare bytes with c? That's technically correct, but it sounds like a play on words for practical purposes.

@JeffBezanson

Decoding means converting a UTF-8 sequence to a code point, which UInt32(c) undeniably does.

I do think throwing an error when comparing an invalid character to a valid one might make sense. I don't know if it was discussed, or if it's possible within the design, but it might be possible to represent valid characters as their code point (restoring the representation of Vector{Char} for valid data). Then x == '.' would involve decoding and throwing an error would make sense.

@StefanKarpinski

Ok, I'm sold on being conservative here. If it turns out to be a problem, we can always relax it. So the actionable items here are:

  1. Fix the isvalid bug that doesn't report overlong encoded char values as invalid. While I'm at it, I might as well implement that predicate in Julia using comparisons instead of calling utf8proc; it will be much faster as well.

  2. Make conversion of Char to integers more conservative: throw an error for any value that doesn't represent a valid Unicode code point. Note that this will break some code since (as discussed in one of the other threads), people write code like this:

 n = '0' <= c <= '9' ? n<<4 + c - '0' :
     'a' <= c <= 'f' ? n<<4 + c - 'a' + 10 :
     'A' <= c <= 'F' ? n<<4 + c - 'A' + 10 : break

This is a real case that was an error in Base; I already parenthesized it to avoid the problem:

julia/base/strings/io.jl

Lines 333 to 335 in 310667b

n = '0' <= c <= '9' ? n<<4 + (c-'0') :
'a' <= c <= 'f' ? n<<4 + (c-'a'+10) :
'A' <= c <= 'F' ? n<<4 + (c-'A'+10) : break

However, this change might break people's code, since intermediate character values can be invalid even when the result comes back into the valid range by the end of the computation. In this case the end value is supposed to be an integer anyway, and the parenthesized version is more efficient. I suspect that will generally be the case, so this is fine.
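The comparison-based predicate from item 1 can be sketched over a decoded code point plus its encoded length (`is_valid_codepoint` and `encoded_len` are illustrative names; Julia's actual implementation works directly on the Char's bytes):

```rust
// Sketch: a code point is valid only if it is in Unicode range, is not a
// surrogate, and was encoded in the minimal number of bytes.
fn is_valid_codepoint(cp: u32, encoded_len: usize) -> bool {
    let min_len = match cp {
        0x0000..=0x007F => 1,
        0x0080..=0x07FF => 2,
        0x0800..=0xFFFF => 3,
        0x10000..=0x10FFFF => 4,
        _ => return false, // beyond U+10FFFF
    };
    // Reject overlong encodings and surrogates.
    encoded_len == min_len && !(0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    assert!(!is_valid_codepoint(0x00, 2));   // overlong NUL "\xc0\x80"
    assert!(!is_valid_codepoint(0x2E, 2));   // overlong '.' "\xc0\xae"
    assert!(!is_valid_codepoint(0xD800, 3)); // surrogate
    assert!(is_valid_codepoint(0x2E, 1));    // plain '.'
}
```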

@StefanKarpinski

All but the last now throw InvalidCharErrors:

julia> UInt32("\xf0\x80\x80\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xf0\x80\x80\x80')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[34]:1

julia> UInt32("\xc0\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\x80')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[35]:1

julia> UInt32("\xc0\xae"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\xae')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[36]:1

julia> UInt32("\ud800"[1])
0x0000d800

The last one is safe from the objection given above, since 0xd800 is not the code point of any valid character and '\ud800' is correctly identified as an invalid character:

julia> isvalid("\ud800"[1])
false

So I think this can be closed (and probably should have been before 1.0).

@jocutajar

Hi, I'd like to propose a cleaner approach (IMHO). I'm not confident enough to optimize the ASCII fast path using unsafe code, but I think a state machine would describe the problem better and be more readable.

As a bonus, the state machine could be made available to other, more advanced parsers. My use case is ANSI escape sequences, which may not be valid UTF-8 but are embedded in UTF-8 streams.

let (state, act) = match (&self, input) {
    (Normal, 0x00...0x1F) => (Normal, CtrlAscii),
    (Normal, 0x20...0x7E) => (Normal, Print),
    (Normal, 0x7F...0x9F) => (Normal, CtrlC1),
    (Normal, 0xA0...0xC1) => (Normal, Invalid(1)),
    (Normal, 0xC2...0xDF) => (C___x_, NeedMore(1)),
    (Normal, 0xE0...0xE0) => (C__0__, NeedMore(2)),
    (Normal, 0xE1...0xEC) => (C__x__, NeedMore(2)),
    (Normal, 0xED...0xED) => (C__D__, NeedMore(2)),
    (Normal, 0xEE...0xEF) => (C__x__, NeedMore(2)),
    (Normal, 0xF0...0xF0) => (C_0___, NeedMore(3)),
    (Normal, 0xF1...0xF3) => (C_x___, NeedMore(3)),
    (Normal, 0xF4...0xF4) => (C_4___, NeedMore(3)),
    (Normal, 0xF5...0xFF) => (Normal, Invalid(1)),

    // Final continuation byte: emit the completed character.
    (C___x_, 0x80...0xBF) => (Normal, Print),
    (C__xx_, 0x80...0xBF) => (Normal, Print),
    (C_xxx_, 0x80...0xBF) => (Normal, Print),

    (C__0__, 0xA0...0xBF) => (C__xx_, NeedMore(1)),
    (C__D__, 0x80...0x9F) => (C__xx_, NeedMore(1)),
    (C__x__, 0x80...0xBF) => (C__xx_, NeedMore(1)),
    (C_0___, 0x90...0xBF) => (C_xx__, NeedMore(2)),
    (C_4___, 0x80...0x8F) => (C_xx__, NeedMore(2)),
    (C_x___, 0x80...0xBF) => (C_xx__, NeedMore(2)),
    (C_xx__, 0x80...0xBF) => (C_xxx_, NeedMore(1)),

    (C___x_, _) => (Normal, Invalid(2)),
    (C__0__, _) => (Normal, Invalid(2)),
    (C__D__, _) => (Normal, Invalid(2)),
    (C__x__, _) => (Normal, Invalid(2)),
    (C__xx_, _) => (Normal, Invalid(3)),
    (C_0___, _) => (Normal, Invalid(2)),
    (C_4___, _) => (Normal, Invalid(2)),
    (C_x___, _) => (Normal, Invalid(2)),
    (C_xx__, _) => (Normal, Invalid(3)),
    (C_xxx_, _) => (Normal, Invalid(4)),
};

@StefanKarpinski

StefanKarpinski commented Mar 8, 2019

Are you proposing a different behavior or a different implementation or something else entirely?

@jocutajar

jocutajar commented Mar 24, 2019

Hi @StefanKarpinski, I'd suggest the state machine approach as a replacement for Rust's UTF-8 handling. I've just realized I'm in the wrong thread, though, as this one is about Julia. I've created a separate issue for Rust.
