Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: validate UTF8 on fast path #54526

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ronag
Copy link
Member

@ronag ronag commented Aug 23, 2024

Fast API handles invalid UTF differently than the slow API.

13:05:25 buffers/buffer-write-string-short.js n=1000000 len=1 encoding='ascii'                            ***    266.52 %       ±7.27%  ±9.75% ±12.84%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=1 encoding='latin1'                           ***    201.16 %       ±4.74%  ±6.33%  ±8.29%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=1 encoding='utf8'                             ***    242.07 %       ±6.29%  ±8.44% ±11.14%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=16 encoding='ascii'                           ***    264.81 %       ±5.88%  ±7.86% ±10.29%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=16 encoding='latin1'                          ***    197.28 %       ±6.96%  ±9.32% ±12.26%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=16 encoding='utf8'                            ***    349.02 %       ±8.80% ±11.84% ±15.68%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=32 encoding='ascii'                           ***    271.75 %       ±8.70% ±11.69% ±15.43%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=32 encoding='latin1'                          ***    214.58 %       ±5.85%  ±7.84% ±10.32%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=32 encoding='utf8'                            ***    351.17 %       ±7.64% ±10.28% ±13.60%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=8 encoding='ascii'                            ***    273.30 %       ±8.65% ±11.63% ±15.38%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=8 encoding='latin1'                           ***    191.62 %       ±5.62%  ±7.52%  ±9.88%
13:05:25 buffers/buffer-write-string-short.js n=1000000 len=8 encoding='utf8'                             ***    224.55 %       ±4.72%  ±6.33%  ±8.33%

Fixes: #54521
PR-URL: #54525

@nodejs-github-bot nodejs-github-bot added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Aug 23, 2024
src/node_buffer.cc Outdated Show resolved Hide resolved
@ronag ronag force-pushed the fast-utf8-2 branch 2 times, most recently from a930ad4 to 0fda693 Compare August 23, 2024 14:45
@ronag ronag marked this pull request as ready for review August 23, 2024 14:46
@ronag ronag force-pushed the fast-utf8-2 branch 2 times, most recently from 23f1b23 to 8be5c43 Compare August 23, 2024 14:50
{
let i = 0;

while (i < 1_000_000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a non-deterministic test is unpleasant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can trigger fast api call using the method @targos implemented in his previous pull requests.

Copy link
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to me. I wish we could use a deterministic test. Having to loop a million times to trigger and issue is both slow and error prone.

Copy link

codecov bot commented Aug 23, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 8 lines in your changes missing coverage. Please review.

Project coverage is 87.61%. Comparing base (01f751b) to head (8c1de9b).
Report is 20 commits behind head on main.

Files with missing lines Patch % Lines
src/node_buffer.cc 88.88% 3 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #54526      +/-   ##
==========================================
+ Coverage   87.31%   87.61%   +0.29%     
==========================================
  Files         649      650       +1     
  Lines      182755   182892     +137     
  Branches    35051    35390     +339     
==========================================
+ Hits       159575   160236     +661     
+ Misses      16452    15925     -527     
- Partials     6728     6731       +3     
Files with missing lines Coverage Δ
src/node_buffer.cc 71.41% <88.88%> (+1.46%) ⬆️

... and 85 files with indirect coverage changes

@ronag ronag added the request-ci Add this label to start a Jenkins CI on a PR. label Aug 23, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Aug 23, 2024
@nodejs-github-bot

This comment was marked as outdated.

@ronag ronag added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Aug 23, 2024
@ronag ronag mentioned this pull request Aug 23, 2024
Copy link
Member

@jasnell jasnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some lint issues but otherwise LGTM

@ronag ronag added the request-ci Add this label to start a Jenkins CI on a PR. label Aug 24, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Aug 24, 2024
@nodejs-github-bot

This comment was marked as outdated.

@blexrob
Copy link

blexrob commented Aug 24, 2024

With the current commits, the following test will fail:

let i = 0;
const testStr = "\xc2\x80"; // when stored in 1-byte chars, this is a valid UTF-8 encoding
const expected = Buffer.from(testStr).toString("hex");
for(; i < 1_000_000; i++) {
  const buf = Buffer.from(testStr);
  const ashex = buf.toString("hex");
  if (ashex !== expected) {
    console.log(`Decoding changed in iteration ${i} when changing to FastWriteStringUTF8, got ${ashex}, expected ${expected}`);
    break;
  }
}

if(i<1_000_000) {
  console.error("FAILED after %d iterations",i);
} else
  console.log("PASSED after %d iterations",i);

The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.

@ronag
Copy link
Member Author

ronag commented Aug 24, 2024

The source string is stored as UTF16 (compressed to 1 byte), but still needs to be converted to UTF-8. The fast path can only be implemented as memcpy when only characters from the range 0-127 are present.

I thought FastOneByteString means that it's only one byte chars?

@ronag
Copy link
Member Author

ronag commented Aug 24, 2024

< when only characters from the range 0-127 are present

@lemira does simdutf have this? validate_ascii?

@ronag
Copy link
Member Author

ronag commented Aug 30, 2024

@ShenHongFei can you try again with your project. I think I might have found the issue.

@ronag
Copy link
Member Author

ronag commented Aug 30, 2024

For future reference:

Simple fuzzing:

const assert = require('node:assert')
const crypto = require('node:crypto')

for (let i = 0; i < 1e6; i++) {
  const len = Math.floor(Math.random() * 128)
  const buf = Buffer.allocUnsafe(len)
  crypto.getRandomValues(buf)

  for (let n = 0; n < 16; n++) {
    const start = Math.floor(Math.random() * len)
    const end = Math.min(len, start + Math.floor(Math.random() * (len - start)))
    const src = buf.subarray(start, end)
    const str = src.toString('utf8')

    let bufNew = Buffer.alloc(buf.length);
    let bufOld = Buffer.alloc(buf.length);

    bufNew = bufNew.subarray(0, bufNew.utf8Write(str, start))
    bufOld = bufOld.subarray(0, bufOld.utf8WriteLegacy(str, start)) // old implementation

    assert.deepStrictEqual(bufNew, bufOld)
  }
}

ronag added a commit to ronag/simdutf that referenced this pull request Aug 30, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
ronag added a commit to ronag/simdutf that referenced this pull request Aug 30, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
ronag added a commit to ronag/simdutf that referenced this pull request Aug 30, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
@ronag
Copy link
Member Author

ronag commented Aug 30, 2024

@ShenHongFei
Copy link
Contributor

@ronag I tried it, and the problem has been fixed. 👍

@ronag ronag added the request-ci Add this label to start a Jenkins CI on a PR. label Aug 30, 2024
@ronag ronag changed the title buffer: validate UTF8 on fast path buffer: write string with fast api Aug 30, 2024
@ronag ronag changed the title buffer: write string with fast api buffer: validate UTF8 on fast path Aug 30, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Aug 30, 2024
@nodejs-github-bot
Copy link
Collaborator

@ronag ronag added the commit-queue Add this label to land a pull request using GitHub Actions. label Aug 30, 2024
@nodejs-github-bot nodejs-github-bot added commit-queue-failed An error occurred while landing this pull request using GitHub Actions. and removed commit-queue Add this label to land a pull request using GitHub Actions. labels Aug 30, 2024
@nodejs-github-bot
Copy link
Collaborator

Commit Queue failed
- Loading data for nodejs/node/pull/54526
✔  Done loading data for nodejs/node/pull/54526
----------------------------------- PR info ------------------------------------
Title      buffer: validate UTF8 on fast path  (#54526)
Author     Robert Nagy <[email protected]> (@ronag)
Branch     ronag:fast-utf8-2 -> nodejs:main
Labels     buffer, c++, author ready, needs-ci
Commits    1
 - buffer: validate UTF8 on fast path
Committers 1
 - Robert Nagy <[email protected]>
PR-URL: https://github.com/nodejs/node/pull/54526
Fixes: https://github.com/nodejs/node/issues/54521
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
------------------------------ Generated metadata ------------------------------
PR-URL: https://github.com/nodejs/node/pull/54526
Fixes: https://github.com/nodejs/node/issues/54521
Reviewed-By: Daniel Lemire <[email protected]>
Reviewed-By: Benjamin Gruenbaum <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Paolo Insogna <[email protected]>
--------------------------------------------------------------------------------
   ℹ  This PR was created on Fri, 23 Aug 2024 14:34:31 GMT
   ✔  Approvals: 6
   ✔  - Daniel Lemire (@lemire): https://github.com/nodejs/node/pull/54526#pullrequestreview-2257470621
   ✔  - Benjamin Gruenbaum (@benjamingr) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2273156047
   ✔  - James M Snell (@jasnell) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258976699
   ✔  - Anna Henningsen (@addaleax): https://github.com/nodejs/node/pull/54526#pullrequestreview-2258958504
   ✔  - Yagiz Nizipli (@anonrig) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2272343097
   ✔  - Paolo Insogna (@ShogunPanda) (TSC): https://github.com/nodejs/node/pull/54526#pullrequestreview-2260113366
   ✘  Last GitHub CI failed
   ℹ  Last Full PR CI on 2024-08-30T13:34:03Z: https://ci.nodejs.org/job/node-test-pull-request/61707/
   ℹ  Last Benchmark CI on 2024-08-30T10:28:48Z: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1626/
- Querying data for job/node-test-pull-request/61707/
   ✔  Last Jenkins CI successful
--------------------------------------------------------------------------------
   ✔  Aborted `git node land` session in /home/runner/work/node/node/.ncu
https://github.com/nodejs/node/actions/runs/10637813307

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

ronag added a commit to ronag/simdutf that referenced this pull request Aug 31, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
ronag added a commit to ronag/simdutf that referenced this pull request Aug 31, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
ronag added a commit to ronag/simdutf that referenced this pull request Aug 31, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
ronag added a commit to ronag/simdutf that referenced this pull request Aug 31, 2024
Adds a "safe" version of convert_latin1_to_utf8 with a maximum output length.

Refs: nodejs/node#54526
@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

Fast API handles invalid UTF differently than the slow API.

Fixes: nodejs#54521
PR-URL: nodejs#54525
@nodejs-github-bot
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-failed An error occurred while landing this pull request using GitHub Actions. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Buffer.from() output breaks after optimizing in nodejs 22.7