Should not prepend gb18030 second and third when range decode fails upon fourth #110

hsivonen · 2017-05-11T09:31:56Z

Consider a run of byte pairs where each pair starts with a byte that's valid as first or third and the second byte is valid as second or fourth in a four-byte gb18030 sequence.

The first two pairs form a four-byte sequence that fails the gb18030 ranges lookup. The spec, as written, realigns the decoder such that the second pair of the first out-of-range four byte sequence is next treated as the first half of another four-byte sequence.

The test case linked to above decodes to four replacement characters in Firefox, Chromium and into four question marks in IE and Edge. (Didn't test Safari.) However, per spec, the run decodes into "�9¶ĭ¶�9".

One error causing the decoder to misalign such that subsequent non-error characters come into existence where there previously were none in browsers can't be good.

Suggested fix:
When processing the fourth byte in a sequence, first check it it is in the range 0x30 to 0x39 inclusive. If it isn't, prepend the second, third and fourth bytes in the sequence per current spec and return error.

Then set code point to the index gb18030 ranges code point. If code point is null, prepend only the fourth byte and return error.

This would deviate from the current browser behavior by unmasking the last ASCII byte in the sequence the way we always unmask failed ASCII in two-byte sequences. However, unlike in the current spec, the ASCII byte that's second in the sequence would get swallowed into a REPLACEMENT CHARACTER. This is fine, because we know that ASCII byte had non-ASCII bytes before and after.

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

hsivonen · 2017-05-11T09:38:17Z

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

Nevermind. This can be addressed by adding more state to the decoder.

hsivonen · 2017-05-11T09:42:07Z

Nevermind. This can be addressed by adding more state to the decoder.

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

hsivonen · 2017-05-11T09:44:48Z

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

Hmm. On fourth thought, maybe the machinery I put in place for ISO-2022-JP could work here.

Let's do what's best for the Web and I'll sort out the error reporting.

annevk · 2017-05-11T10:11:06Z

@jungshik any thoughts with regards to ICU?

For four byte sequences we're dealing with this:

Non-ASCII
ASCII digit
Non-ASCII
ASCII digit

Now if 4 is not an ASCII digit it sounds like @hsivonen is suggesting it's reasonable to unwind, U+FFFD 1, and reprocess from 2.

If 4 is an ASCII digit, but there's no code point, the question is what to do then:

U+FFFD 1-2-3 and create a theoretical issue for formats where ASCII digits are delimiters of sorts.
U+FFFD 1 and 3, and emit 2 and 4 (in their original order).
U+FFFD 1-2-3-4 which is what browsers do today and don't worry about ASCII digits getting consumed.

annevk · 2017-05-11T10:26:32Z

90 31 90 40 yields "�1怈" in Chrome/Firefox/Safari.
FD 31 FD 7F yields "�1��" Chrome/Firefox/Safari (although the last code point does not display in Chrome and Safari, it does seem to copy-and-paste).
FD 31 FD 21 yields "�1�!" in Chrome/Firefox/Safari.

So given that they do seem to consistently unwind there. So the main question here is what to do when the pointer lookup goes astray.

Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.

…oding#110.

Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes #110.

Instead of always unwinding if there’s no code point when consuming the fourth byte, only unwind when the fourth byte is not an ASCII digit. This does mean that ASCII digits can be masked, but since ASCII digits are not used as delimiter in any format this is highly unlikely to be used in any attacks (and also matches existing implementations better). Fixes whatwg#110.

annevk mentioned this issue May 11, 2017

gb18030 decoder: unwind from fourth byte when it's not a digit #111

Merged

hsivonen added a commit to hsivonen/encoding_rs that referenced this issue May 11, 2017

Don't prepend anything when gb18030 range decode fails per whatwg/enc…

0919915

…oding#110.

annevk closed this as completed in #111 May 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should not prepend gb18030 second and third when range decode fails upon fourth #110

Should not prepend gb18030 second and third when range decode fails upon fourth #110

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

annevk commented May 11, 2017 •

edited

Loading

annevk commented May 11, 2017

Should not prepend gb18030 second and third when range decode fails upon fourth #110

Should not prepend gb18030 second and third when range decode fails upon fourth #110

Comments

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

hsivonen commented May 11, 2017

annevk commented May 11, 2017 • edited Loading

annevk commented May 11, 2017

annevk commented May 11, 2017 •

edited

Loading