Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextEncoder.encodeInto may produce wrong result for one-byte non-ASCII characters #18255

Closed
lideming opened this issue Mar 17, 2023 · 2 comments · Fixed by #18518
Closed

TextEncoder.encodeInto may produce wrong result for one-byte non-ASCII characters #18255

lideming opened this issue Mar 17, 2023 · 2 comments · Fixed by #18518
Assignees
Labels
bug Something isn't working correctly

Comments

@lideming
Copy link
Contributor

TextEncoder.encodeInto may produce wrong result for one-byte non-ASCII characters (code point range 128 - 255).

To reproduce the bug:

import { assertEquals } from "https://deno.land/[email protected]/testing/asserts.ts";

const encoder = new TextEncoder();

// Looping required (maybe because of V8 optimizations)
for (let tries = 0; tries < 10000; tries++) {
  const str = String.fromCodePoint(200);
  
  const expected = encoder.encode(str);

  const buffer = new Uint8Array(expected.byteLength);
  const { written } = encoder.encodeInto(str, buffer);
  const actual = buffer.subarray(0, written);

  assertEquals(actual, expected);
}

Error:

    [Diff] Actual / Expected


+   Uint8Array(2) [
+     195,
+     136,
-   Uint8Array(1) [
-     200,
    ]

The issue was introduced in v1.31.2 (I have tested v1.31.1 - v1.31.3).

I might be wrong, but it seems to be PR #17996, which assumes:

Since input is already UTF-8, we can simply find the last UTF-8 code
point boundary from input that fits in buffer, and copy the bytes up to
that point.

Maybe input is not always UTF-8?

cc @littledivy

@andrewnester
Copy link
Contributor

I was looking into it a bit and it seems the problem comes from fast version of encode function

fn op_encoding_encode_into(
and (potentially) from SeqOneByteString optimisation. If I look at the value of input.as_bytes() inside this function for provide example it equals to [200, 0] instead of [195, 136]

Not entirely sure why it happens though, any ideas? @littledivy

@njhanley
Copy link
Contributor

Small correction to @andrewnester: input.as_bytes() returns [200] for the example.

input should be valid UTF-8 as guaranteed by str; the fact that it isn't points to unsound unsafe code.

The problem is that FastApiOneByteString.as_str() assumes V8's various OneByteStrings are UTF-8 when they are actually ISO-8859-1 (see 2, 3, 4, git -3 -i -e Latin-1 -e ISO-8859-1). Adding UTF-8 validation in rusty_v8 with the following patch causes a panic when the example is run.

diff --git a/src/fast_api.rs b/src/fast_api.rs
index e4ac272..8c64c1a 100644
--- a/src/fast_api.rs
+++ b/src/fast_api.rs
@@ -221,10 +221,10 @@ impl FastApiOneByteString {
   pub fn as_str(&self) -> &str {
     // SAFETY: The string is guaranteed to be valid UTF-8.
     unsafe {
-      std::str::from_utf8_unchecked(std::slice::from_raw_parts(
+      std::str::from_utf8(std::slice::from_raw_parts(
         self.data,
         self.length as usize,
-      ))
+      )).unwrap()
     }
   }
 }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctly
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants