Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should not prepend gb18030 second and third when range decode fails upon fourth #110

Closed
hsivonen opened this issue May 11, 2017 · 5 comments

Comments

@hsivonen
Copy link
Member

Consider a run of byte pairs where each pair starts with a byte that's valid as first or third and the second byte is valid as second or fourth in a four-byte gb18030 sequence.

The first two pairs form a four-byte sequence that fails the gb18030 ranges lookup. The spec, as written, realigns the decoder such that the second pair of the first out-of-range four byte sequence is next treated as the first half of another four-byte sequence.

The test case linked to above decodes to four replacement characters in Firefox, Chromium and into four question marks in IE and Edge. (Didn't test Safari.) However, per spec, the run decodes into "�9¶ĭ¶�9".

One error causing the decoder to misalign such that subsequent non-error characters come into existence where there previously were none in browsers can't be good.

Suggested fix:
When processing the fourth byte in a sequence, first check it it is in the range 0x30 to 0x39 inclusive. If it isn't, prepend the second, third and fourth bytes in the sequence per current spec and return error.

Then set code point to the index gb18030 ranges code point. If code point is null, prepend only the fourth byte and return error.

This would deviate from the current browser behavior by unmasking the last ASCII byte in the sequence the way we always unmask failed ASCII in two-byte sequences. However, unlike in the current spec, the ASCII byte that's second in the sequence would get swallowed into a REPLACEMENT CHARACTER. This is fine, because we know that ASCII byte had non-ASCII bytes before and after.

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

@hsivonen
Copy link
Member Author

(Trying to emit two errors with an ASCII character in between as one step would be a novel behavior for the spec and would totally mess up the assumptions I've made about error emission in implementation.)

Nevermind. This can be addressed by adding more state to the decoder.

@hsivonen
Copy link
Member Author

Nevermind. This can be addressed by adding more state to the decoder.

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

@hsivonen
Copy link
Member Author

On third thought: Having two errors would then be unprecedented in another way, which even with the added state from the previous comment would mess up encoding_rs's mechanism for designating which bytes a given U+FFFD corresponds to (not currently made use of in Firefox).

Hmm. On fourth thought, maybe the machinery I put in place for ISO-2022-JP could work here.

Let's do what's best for the Web and I'll sort out the error reporting.

@annevk
Copy link
Member

annevk commented May 11, 2017

@jungshik any thoughts with regards to ICU?

For four byte sequences we're dealing with this:

  1. Non-ASCII
  2. ASCII digit
  3. Non-ASCII
  4. ASCII digit

Now if 4 is not an ASCII digit it sounds like @hsivonen is suggesting it's reasonable to unwind, U+FFFD 1, and reprocess from 2.

If 4 is an ASCII digit, but there's no code point, the question is what to do then:

  1. U+FFFD 1-2-3 and create a theoretical issue for formats where ASCII digits are delimiters of sorts.
  2. U+FFFD 1 and 3, and emit 2 and 4 (in their original order).
  3. U+FFFD 1-2-3-4 which is what browsers do today and don't worry about ASCII digits getting consumed.

@annevk
Copy link
Member

annevk commented May 11, 2017

90 31 90 40 yields "�1怈" in Chrome/Firefox/Safari.
FD 31 FD 7F yields "�1��" Chrome/Firefox/Safari (although the last code point does not display in Chrome and Safari, it does seem to copy-and-paste).
FD 31 FD 21 yields "�1�!" in Chrome/Firefox/Safari.

So given that they do seem to consistently unwind there. So the main question here is what to do when the pointer lookup goes astray.

annevk added a commit that referenced this issue May 11, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes #110.
annevk added a commit that referenced this issue May 11, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes #110.
annevk added a commit that referenced this issue May 11, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes #110.
annevk added a commit that referenced this issue May 15, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes #110.
ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes whatwg#110.
ricea pushed a commit to ricea/encoding that referenced this issue Nov 16, 2017
Instead of always unwinding if there’s no code point when consuming the
fourth byte, only unwind when the fourth byte is not an ASCII digit.
This does mean that ASCII digits can be masked, but since ASCII digits
are not used as delimiter in any format this is highly unlikely to be
used in any attacks (and also matches existing implementations better).

Fixes whatwg#110.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants