`TextEncoder.encodeInto` may produce wrong result for one-byte non-ASCII characters #18255

lideming · 2023-03-17T20:32:30Z

TextEncoder.encodeInto may produce a wrong result for non-ASCII characters that fit in one byte (code point range 128 - 255).

To reproduce the bug:

import { assertEquals } from "https://deno.land/[email protected]/testing/asserts.ts";

const encoder = new TextEncoder();

// Looping required (maybe because of V8 optimizations)
for (let tries = 0; tries < 10000; tries++) {
  const str = String.fromCodePoint(200);
  
  const expected = encoder.encode(str);

  const buffer = new Uint8Array(expected.byteLength);
  const { written } = encoder.encodeInto(str, buffer);
  const actual = buffer.subarray(0, written);

  assertEquals(actual, expected);
}

Error:

    [Diff] Actual / Expected


+   Uint8Array(2) [
+     195,
+     136,
-   Uint8Array(1) [
-     200,
    ]

The issue was introduced in v1.31.2 (I have tested v1.31.1 - v1.31.3).

I might be wrong, but the cause seems to be PR #17996, which assumes:

Since input is already UTF-8, we can simply find the last UTF-8 code
point boundary from input that fits in buffer, and copy the bytes up to
that point.

Maybe input is not always UTF-8?
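For readers following along, the approach quoted above can be sketched roughly as below. This is not the code from PR #17996; `copy_utf8_prefix` is a hypothetical name, and the sketch is only sound because Rust's `&str` is guaranteed to hold valid UTF-8, which is exactly the assumption in question.

```rust
/// Copy the longest prefix of `input` that fits in `buffer` without
/// splitting a code point. Returns (UTF-16 code units read, bytes written),
/// mirroring the `read` / `written` fields of `encodeInto`.
/// Hypothetical helper for illustration only.
fn copy_utf8_prefix(input: &str, buffer: &mut [u8]) -> (usize, usize) {
    // Start from the largest byte length that could fit...
    let mut end = input.len().min(buffer.len());
    // ...and back up until we sit on a code point boundary.
    // Index 0 is always a boundary, so this loop terminates.
    while !input.is_char_boundary(end) {
        end -= 1;
    }
    buffer[..end].copy_from_slice(&input.as_bytes()[..end]);
    let read = input[..end].encode_utf16().count();
    (read, end)
}

fn main() {
    // "aÈb" is 4 bytes in UTF-8: 0x61, 0xC3 0x88, 0x62.
    let mut buf = [0u8; 2];
    let (read, written) = copy_utf8_prefix("aÈb", &mut buf);
    // Only "a" fits: a 2-byte cut would land in the middle of "È".
    println!("read={read} written={written} bytes={:?}", &buf[..written]);
}
```

If the bytes behind `input` are really Latin-1, the boundary search and the copy both operate on the wrong encoding, which would explain a single byte 200 being written.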

cc @littledivy


andrewnester · 2023-03-27T14:12:40Z

I looked into this a bit, and it seems the problem comes from the fast version of the encode function

deno/ext/web/lib.rs

Line 371 in 3487fde

fn op_encoding_encode_into(

and (potentially) from the SeqOneByteString optimisation. If I look at the value of input.as_bytes() inside this function for the provided example, it equals [200, 0] instead of [195, 136].

Not entirely sure why it happens though, any ideas? @littledivy

njhanley · 2023-03-31T05:12:11Z

Small correction to @andrewnester: input.as_bytes() returns [200] for the example.

input should be valid UTF-8 as guaranteed by str; the fact that it isn't points to unsound unsafe code.

The problem is that FastApiOneByteString.as_str() assumes V8's various OneByteStrings are UTF-8 when they are actually ISO-8859-1 (see 2, 3, 4, git -3 -i -e Latin-1 -e ISO-8859-1). Adding UTF-8 validation in rusty_v8 with the following patch causes a panic when the example is run.

diff --git a/src/fast_api.rs b/src/fast_api.rs
index e4ac272..8c64c1a 100644
--- a/src/fast_api.rs
+++ b/src/fast_api.rs
@@ -221,10 +221,10 @@ impl FastApiOneByteString {
   pub fn as_str(&self) -> &str {
     // SAFETY: The string is guaranteed to be valid UTF-8.
     unsafe {
-      std::str::from_utf8_unchecked(std::slice::from_raw_parts(
+      std::str::from_utf8(std::slice::from_raw_parts(
         self.data,
         self.length as usize,
-      ))
+      )).unwrap()
     }
   }
 }
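To make the encoding mismatch concrete, here is a minimal standalone sketch (it does not use rusty_v8, and `latin1_to_string` / `one_byte_to_str` are hypothetical names, not functions from either repository). It shows that the byte 200 alone is rejected by checked UTF-8 validation, and one possible way to handle one-byte strings safely: borrow when the bytes are pure ASCII, otherwise transcode from ISO-8859-1.

```rust
use std::borrow::Cow;

/// Decode ISO-8859-1 (Latin-1): every byte maps to the code point
/// with the same numeric value. Hypothetical helper.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Assumed fallback shape: ASCII is byte-identical in Latin-1 and UTF-8,
/// so it can be borrowed as-is; anything else is transcoded.
/// An ASCII check is used rather than UTF-8 validation because some
/// Latin-1 byte sequences (e.g. 0xC3 0x88) also happen to be valid UTF-8
/// for a different string.
fn one_byte_to_str(bytes: &[u8]) -> Cow<'_, str> {
    if bytes.is_ascii() {
        Cow::Borrowed(std::str::from_utf8(bytes).expect("ASCII is valid UTF-8"))
    } else {
        Cow::Owned(latin1_to_string(bytes))
    }
}

fn main() {
    let raw = [200u8]; // U+00C8 stored as a one-byte (Latin-1) string

    // Checked validation rejects it, which is why the patched as_str() panics.
    assert!(std::str::from_utf8(&raw).is_err());

    // Transcoding yields the two-byte UTF-8 form that encode() returns.
    let s = one_byte_to_str(&raw);
    println!("{:?} -> {:?}", s, s.as_bytes());
}
```

The transcoding branch allocates, so a real fast path would presumably want to avoid it for the common ASCII case, as the borrow branch does here.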


lideming mentioned this issue Mar 17, 2023

byte length error when Deno upgraded to v1.31.2 lideming/btrdb#2

Closed

dsherret added the bug label Mar 23, 2023

dsherret assigned littledivy Mar 23, 2023

littledivy mentioned this issue Mar 31, 2023

fix(ops): fallback when FastApiOneByteString is not utf8 #18518

Merged

bartlomieju closed this as completed in #18518 Mar 31, 2023

bartlomieju pushed a commit that referenced this issue Mar 31, 2023

fix(ops): fallback when FastApiOneByteString is not utf8 (#18518)

feab94f

Fixes #18255

mmastrac pushed a commit that referenced this issue Mar 31, 2023

fix(ops): fallback when FastApiOneByteString is not utf8 (#18518)

a23252d

Fixes #18255
