
UTF-8 decoding should reject overlong sequences and surrogates #25452

Closed
JeffBezanson opened this issue Jan 7, 2018 · 17 comments
@JeffBezanson

These should be treated as invalid data; see e.g. rust-lang/rust#3787

The following calls to UInt32 should throw errors:

julia> UInt32("\xf0\x80\x80\x80"[1])
0x00000000

julia> UInt32("\xc0\x80"[1])
0x00000000

julia> UInt32("\xc0\xae"[1])
0x0000002e

julia> UInt32("\ud800"[1])
0x0000d800

Otherwise you get things like this:

julia> "\xc0\xae"[1] == '.'
false

julia> UInt32("\xc0\xae"[1]) == UInt32('.')
true
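For comparison, Rust's standard decoder (the behavior the linked rust-lang/rust#3787 discussion converged on) rejects all four of these byte sequences at decode time. A minimal sketch using `std::str::from_utf8`:

```rust
fn main() {
    // Overlong four-byte encoding of U+0000.
    assert!(std::str::from_utf8(&[0xF0, 0x80, 0x80, 0x80]).is_err());
    // Overlong two-byte encoding of U+0000 (Modified UTF-8's NUL).
    assert!(std::str::from_utf8(&[0xC0, 0x80]).is_err());
    // Overlong two-byte encoding of '.' (U+002E).
    assert!(std::str::from_utf8(&[0xC0, 0xAE]).is_err());
    // Direct three-byte encoding of the surrogate U+D800.
    assert!(std::str::from_utf8(&[0xED, 0xA0, 0x80]).is_err());
    // The canonical one-byte encoding of '.' is accepted.
    assert_eq!(std::str::from_utf8(&[0x2E]).unwrap(), ".");
}
```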
JeffBezanson added the domain:strings and domain:unicode labels on Jan 7, 2018
@StefanKarpinski

This is by design (see discussion in #24420). If you compare characters, you get a comparison based on the exact encoding. If you want to convert an overlong encoding to an integer value, you get the code point that it encodes, regardless of whether the encoding is standard or overlong. This allows you to express two different things in natural ways:

  • is this the same character encoded in the standard (or non-standard) way?
  • is this the same code point, regardless of how it's encoded?

Those are both distinct, useful questions one can ask, which can be expressed straightforwardly in this scheme. Note that there is a bug here, which is that this should return false:

julia> Unicode.isvalid("\xf0\x80\x80\x80"[1])
true

The alternative design would be to throw an error upon attempting to convert an overlong encoding to an integer value; you'd then need a different function for getting the code point value of an overlong character encoding. Or maybe a character normalization function? Unclear, but it requires more API surface to express the same thing the current design does with less. The trade-off is that it would be harder to accidentally match an overlong encoded character. I'm not sure whether that's a good thing or a bad thing though.

@JeffBezanson

It's quite clear that overlong sequences and surrogates are invalid UTF-8, as invalid as anything. Getting the code point value of an overlong encoding is not a normal thing to do, and I think it should require some other function.

If data has a stray byte, like "\xff", it would similarly be useful to be able to determine the value of the byte, but conversion to integer (rightly) throws an error. I think overlongs and surrogates should be treated the same. And overlong encodings are far more sinister than stray bytes.

@StefanKarpinski

StefanKarpinski commented Jan 8, 2018

I'm unclear on what is so "sinister" about overlong encodings. Sure, they're invalid, but calling them sinister is a bit dramatic. Modified UTF-8 is quite common and intentionally uses \xc0\x80 to represent the \0 code point. It would help the conversation if you explained what you're worried about happening if we don't throw an error when asking for the code point of an overlong encoding. There was a security concern in the original thread, which was addressed there. Surrogates are also decoded as their code points, but are already correctly classified as invalid, e.g. Unicode.isvalid('\ud800') == false.

@JeffBezanson

It has to do with input validation, e.g. checking that a string doesn't have certain characters. Allowing decoding of overlong sequences provides a way for characters to evade such checking.

But I also think it's better to treat all invalid UTF-8 uniformly instead of throwing errors only in some cases. Yes, there needs to be some way to look at the actual data without an error, but that also applies equally to all invalid data.
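The check-evasion risk described above can be sketched with a decoder that naively combines continuation bits without the overlong check (`naive_decode2` is a hypothetical helper for illustration, not any real API):

```rust
// Hypothetical naive two-byte decoder: combines the payload bits of a
// lead byte and a continuation byte without rejecting overlong forms.
fn naive_decode2(b: [u8; 2]) -> u32 {
    (((b[0] & 0x1F) as u32) << 6) | (b[1] & 0x3F) as u32
}

fn main() {
    // The overlong sequence C0 AE decodes to U+002E ('.'), so a filter
    // that compares code points after naive decoding can be bypassed by
    // feeding it the overlong form.
    assert_eq!(naive_decode2([0xC0, 0xAE]), '.' as u32);
    // A conforming decoder rejects any lead byte below 0xC2, because a
    // two-byte sequence must encode a code point of at least U+0080.
    assert!(std::str::from_utf8(&[0xC0, 0xAE]).is_err());
}
```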

@StefanKarpinski

Ok, can you propose a different design then?

@JeffBezanson

My proposal is for anything where isvalid is false to throw an error when passed to UInt32.

@StefanKarpinski

And how would you get the corresponding code point value if you didn't want an error? E.g. if you were looking for an overlong \0 as in Modified UTF-8.

@JeffBezanson

I think that's really a second-order concern. Modified UTF-8 and CESU-8 are the only exceptions, and most languages seem to consider them marginal and support them in external packages if at all. For example JavaCall could have a function to transcode CESU-8/Modified UTF-8 and a function to test whether a Char is an overlong \0.

@nalimilan

In general I agree we should be careful about overlong encodings, but I'm not sure the fact that UInt32(c) accepts invalid codepoints is a serious problem. When are people supposed to use UInt32(c)? It is more common to perform comparisons between Char values, so I'm mostly concerned about how these behave. Converting to UInt32 can be seen as a more advanced operation.

OTOH, what can be dangerous is that one would check c == '.', but it wouldn't match an overlong encoding, and the string would be passed further to a program which would incorrectly interpret it as valid Unicode. Making UInt32(c) throw an error wouldn't protect against that risk. Only "\xc0\xae"[1] == '.' returning true would, but doing that would imply accepting even more invalid encodings. Paradoxically, it sounds like in this particular area being more secure implies adopting a less strict definition of equality, to be sure you don't miss any problematic encoding of a character.

Other than doing that, the only way to protect against it is to call isvalid, and the only way to make Julia safe by default would be to call it on string construction. But AFAIK we've decided not to do that because it makes it impossible to read arbitrary data and output it without modification. Rust has taken a different approach: its String type is guaranteed to contain valid UTF-8, and its OsString type is used for arbitrary bytes.
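Rust's validate-at-construction approach can be seen in miniature: String checks its bytes up front, and reading arbitrary data requires an explicit lossy (or byte-oriented) path. A small sketch:

```rust
fn main() {
    // String::from_utf8 validates up front; the invalid bytes are handed
    // back intact in the error, so nothing is silently lost.
    let err = String::from_utf8(vec![0xC0, 0xAE]).unwrap_err();
    assert_eq!(err.into_bytes(), vec![0xC0, 0xAE]);
    // Reading arbitrary data takes an explicit lossy path that substitutes
    // U+FFFD for each invalid byte (here: 0xC0 and 0xAE separately).
    assert_eq!(String::from_utf8_lossy(&[0xC0, 0xAE]), "\u{FFFD}\u{FFFD}");
}
```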

@JeffBezanson

My interpretation of the standard is that a correct UTF-8 decoder should not accept invalid data, and a correct encoder should not produce it. In our case UInt32(c) happens to be where we do decoding, so that's where we'd do the check, but it would apply anywhere we decode UTF-8. In our design it does not have to apply to reading and writing data, since no encoding or decoding happens there.

Other than that, the argument is very simple: if isvalid is false (which it should be for overlongs), you should not be able to get a code point out.

I agree "\xc0\xae"[1] != '.' is an issue, but I think the only reasonable ways to deal with that are for users to call isvalid, or for us to do it for them.

@nalimilan

It's not clear how to define "decoding" in the Julia case, but it sounds weird to me to consider that UInt32(c) is decoding while c == '.' is not. The latter expression definitely needs to decode the Char to compare it with '.'. Or do we consider that . is encoded to UTF-8 when creating the Char '.', and that we then just compare bytes with c? That's technically correct, but it sounds like a play on words for practical purposes.

@JeffBezanson

Decoding means converting a UTF-8 sequence to a code point, which UInt32(c) undeniably does.

I do think throwing an error when comparing an invalid character to a valid one might make sense. I don't know if it was discussed, or if it's possible within the design, but it might be possible to represent valid characters as their code point (restoring the representation of Vector{Char} for valid data). Then x == '.' would involve decoding and throwing an error would make sense.

@StefanKarpinski

Ok, I'm sold on being conservative here. If it turns out to be a problem, we can always relax it. So the actionable items here are:

  1. Fix the isvalid bug that doesn't report overlong encoded char values as invalid. While I'm at it, I might as well implement that predicate in Julia using comparisons instead of calling utf8proc; it will be much faster as well.

  2. Make conversion of Char to integers more conservative: throw an error for any value that doesn't represent a valid Unicode code point. Note that this will break some code since (as discussed in one of the other threads), people write code like this:

 n = '0' <= c <= '9' ? n<<4 + c - '0' :
     'a' <= c <= 'f' ? n<<4 + c - 'a' + 10 :
     'A' <= c <= 'F' ? n<<4 + c - 'A' + 10 : break

This is a real case that was an error in Base; I already parenthesized it to avoid the problem:

julia/base/strings/io.jl

Lines 333 to 335 in 310667b

n = '0' <= c <= '9' ? n<<4 + (c-'0') :
'a' <= c <= 'f' ? n<<4 + (c-'a'+10) :
'A' <= c <= 'F' ? n<<4 + (c-'A'+10) : break

However, this change might break people's code, since intermediate character values can be invalid even when the result comes back into the valid range by the end of the computation. In this case the end value is supposed to be an integer anyway, and the parenthesized version is more efficient. I suspect that will generally be the case, so this is fine.
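The comparison-based predicate from item 1 can be sketched over a decoded code point plus its encoded length (`is_valid_codepoint` and `encoded_len` are illustrative names; Julia's actual implementation works directly on the Char's bytes):

```rust
// Sketch: a code point is valid only if it is in Unicode range, is not a
// surrogate, and was encoded in the minimal number of bytes.
fn is_valid_codepoint(cp: u32, encoded_len: usize) -> bool {
    let min_len = match cp {
        0x0000..=0x007F => 1,
        0x0080..=0x07FF => 2,
        0x0800..=0xFFFF => 3,
        0x10000..=0x10FFFF => 4,
        _ => return false, // beyond U+10FFFF
    };
    // Reject overlong encodings and surrogates.
    encoded_len == min_len && !(0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    assert!(!is_valid_codepoint(0x00, 2));   // overlong NUL "\xc0\x80"
    assert!(!is_valid_codepoint(0x2E, 2));   // overlong '.' "\xc0\xae"
    assert!(!is_valid_codepoint(0xD800, 3)); // surrogate
    assert!(is_valid_codepoint(0x2E, 1));    // plain '.'
}
```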

@StefanKarpinski

All but the last now throw InvalidCharErrors:

julia> UInt32("\xf0\x80\x80\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xf0\x80\x80\x80')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[34]:1

julia> UInt32("\xc0\x80"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\x80')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[35]:1

julia> UInt32("\xc0\xae"[1])
ERROR: Base.InvalidCharError{Char}('\xc0\xae')
Stacktrace:
 [1] invalid_char(::Char) at ./char.jl:85
 [2] UInt32(::Char) at ./char.jl:130
 [3] top-level scope at REPL[36]:1

julia> UInt32("\ud800"[1])
0x0000d800

The last one is safe from the objection given above, since 0xd800 is not the code point of any valid character and '\ud800' is correctly identified as an invalid character:

julia> isvalid("\ud800"[1])
false

So I think this can be closed (and probably should have been before 1.0).

@jocutajar

Hi, I'd like to propose a cleaner approach (IMHO). I'm not confident enough to optimize the ASCII fast path using unsafe code, but I think a state machine would describe the problem better and be more readable.

As a bonus, the state machine could be made available to other, more advanced parsers. My use case is ANSI escape sequences, which may not be valid UTF-8 but are embedded in UTF-8 streams.

let (state, act) = match (&self, input) {
    (Normal, 0x00...0x1F) => (Normal, CtrlAscii),
    (Normal, 0x20...0x7E) => (Normal, Print),
    (Normal, 0x7F...0x9F) => (Normal, CtrlC1),
    (Normal, 0xA0...0xC1) => (Normal, Invalid(1)),
    (Normal, 0xC2...0xDF) => (C___x_, NeedMore(1)),
    (Normal, 0xE0...0xE0) => (C__0__, NeedMore(2)),
    (Normal, 0xE1...0xEC) => (C__x__, NeedMore(2)),
    (Normal, 0xED...0xED) => (C__D__, NeedMore(2)),
    (Normal, 0xEE...0xEF) => (C__x__, NeedMore(2)),
    (Normal, 0xF0...0xF0) => (C_0___, NeedMore(3)),
    (Normal, 0xF1...0xF3) => (C_x___, NeedMore(3)),
    (Normal, 0xF4...0xF4) => (C_4___, NeedMore(3)),
    (Normal, 0xF5...0xFF) => (Normal, Invalid(1)),

    // Final continuation byte: emit the completed character.
    (C___x_, 0x80...0xBF) => (Normal, Print),
    (C__xx_, 0x80...0xBF) => (Normal, Print),
    (C_xxx_, 0x80...0xBF) => (Normal, Print),

    (C__0__, 0xA0...0xBF) => (C__xx_, NeedMore(1)),
    (C__D__, 0x80...0x9F) => (C__xx_, NeedMore(1)),
    (C__x__, 0x80...0xBF) => (C__xx_, NeedMore(1)),
    (C_0___, 0x90...0xBF) => (C_xx__, NeedMore(2)),
    (C_4___, 0x80...0x8F) => (C_xx__, NeedMore(2)),
    (C_x___, 0x80...0xBF) => (C_xx__, NeedMore(2)),
    (C_xx__, 0x80...0xBF) => (C_xxx_, NeedMore(1)),

    (C___x_, _) => (Normal, Invalid(2)),
    (C__0__, _) => (Normal, Invalid(2)),
    (C__D__, _) => (Normal, Invalid(2)),
    (C__x__, _) => (Normal, Invalid(2)),
    (C__xx_, _) => (Normal, Invalid(3)),
    (C_0___, _) => (Normal, Invalid(2)),
    (C_4___, _) => (Normal, Invalid(2)),
    (C_x___, _) => (Normal, Invalid(2)),
    (C_xx__, _) => (Normal, Invalid(3)),
    (C_xxx_, _) => (Normal, Invalid(4)),
};

@StefanKarpinski

StefanKarpinski commented Mar 8, 2019

Are you proposing a different behavior or a different implementation or something else entirely?

@jocutajar

jocutajar commented Mar 24, 2019

Hi @StefanKarpinski, I'd suggest the state machine approach as a replacement for Rust's UTF-8 handling. I've just realized I'm in the wrong thread, though, as this one is about Julia. I've created a separate issue for Rust.
