Don't assume paths are always valid UTF8 #627

piscisaureus · 2018-08-29T19:48:18Z

This is a longer term goal.

Don't use String/str for paths on the rust side (we should start doing this today).
Use flatbuffer byte vectors instead of strings to encode paths in messages.
On the javascript side have an optional, low-level API that reports paths are typed arrays and not strings.

This was an issue in Node: nodejs/node-v0.x-archive#2387

XAMPPRocky · 2018-08-30T21:45:17Z

For APIs that take a string for a path wouldn't changing the messages to take byte vectors be redundant as aren't JavaScript strings are guaranteed to be UTF-8, and then to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy which would incur extra unneeded overhead? You couldn't use OsStrExt::from_bytes as that's only available on Unix because Windows encodes their paths as wide strings (16-bit vectors).

piscisaureus · 2018-09-02T02:25:42Z

JavaScript strings are guaranteed to be UTF-8

Not really actually, technically javascript uses a fixed length 16-bit encoding known as UCS2 (just like java and windows actually). Usually we (browsers, developers) pretend that it uses UTF16, otherwise there'd be no way to encode smileys. But try this in devtools: "😸😸😸".split('').join("-")

to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy

Conversions between Path and utf8-encoded string is actually what I'm saying we should not do! We'll use Path on the rust side and, at least at the lowest level, Uint8Array on the javascript side.

Windows encodes their paths as wide strings (16-bit vectors).

So does javascript :)
But I do agree, we shouldn't present a buffer containing a 16-bit encoded string to the user. Instead, let's convert windows filenames to wtf-8 (https://simonsapin.github.io/wtf-8/). Now there we do have a little redundancy, because we'd do the conversion UCS2 (Path) to WTF-8 (UInt8Array) to UCS2 (JS string).
But file access on windows is quite slow anyway, nobody would notice that...

You couldn't use OsStrExt::from_bytes as that's only available on Unix because.

I'm sure we can break into it somehow....

NewLunarFire · 2018-09-02T13:38:11Z

I'm working on this as part of what started as refactoring the is_remote function. We are given UTF-8 encoded strings as arguments so a conversion is inevitable. But from the moment the module resolution starts, they are converted to Rust Paths or Hyper URIs.

This won't win the war, but it's an important battle.

XAMPPRocky · 2018-09-02T14:59:33Z

@piscisaureus Thank you for explaining. I don't think that would solve the problem presented in the Node issue as you can't encode a string literal as ISO-8859-1 on the JavaScript side. On the Rust side Path can only take PathBuf, String, str, OsStr, and OsString. String requires correct UTF-8 or UTF-16, and the only way to create an OsString from a vector is platform dependent, from_bytes(&[u8]) as I mentioned before on Unix and from_wide(&[u16]) on Windows. The WTF-8 standard is explicitly against being used for interchange, it still requires being emitted as Unicode.

I think the best solution would be to have a facsimile to Rust's OsString for low level APIs.

table OsString {
  unix: [byte];
  windows: [ushort];
}

class OsString {
  private buffer: UInt8Array | UInt16Array;

  constructor(buffer: UInt8Array | UInt16Array) {
    if (windows && buffer instanceof UInt8Array)) {
       throw new Error("OsString requires UInt16Array on Windows");
    } else if (unix && buffer instanceof UInt16Array)) {
       throw new Error("OsString requires UInt8Array on Unix platforms");
    }

    this.buffer = buffer;
  }
  
  /* @internal */
  into_msg(builder: fbs.Builder): fbs.OsString {
     // Convert and return as flatbuffer message
  }
}

benjamingr · 2018-09-02T15:04:58Z

as you can't encode a string as ISO-8859-1 on the JavaScript side

You can, if you hold it in an ArrayBuffer and do the conversion with TextEncoder/TextDecoder.

Holding it in a Uint8Array (low level JavaScript API) is what is being suggested here if I understand correctly.

XAMPPRocky · 2018-09-02T15:08:44Z

@benjamingr Sorry that was explicitly about first two points and for keeping to use Strings/strs for that part of the API and that changing those parts would only add redundancy. Sorry if that was unclear.

Don't use String/str for paths on the rust side (we should start doing this today).

Use flatbuffer byte vectors instead of strings to encode paths in messages.

NewLunarFire · 2018-09-02T15:47:35Z

@Aaronepower I already started working on using Paths/URIs instead of Paths on the Rust side. I'm not sure about using byte vectors to encode paths instead of strings. We are going to have to convert paths between textual and canonical representation, this is just shuffling it around to JS's side.

ry · 2018-09-02T17:01:28Z

@NewLunarFire You can assume that import paths are strings and unicode encoded (if not ascii) - we will not be supporting importing arbitrarily encoded filenames. This issue is about supporting file system ops for non-utf8 filenames (a more general problem).

motss · 2019-01-10T16:07:57Z

Does that mean that String.prototype.normalize() is not going to work as expected?

`deno` only accepts absolute paths with file extension specified. Turned out that tests failed because of an issue in `deno`. See [denoland/deno#627] to learn more.

bartlomieju · 2019-06-25T12:33:33Z

Reference: #2572 and #2559

seishun · 2020-02-10T15:33:38Z

I think the solution to this problem will depend on the following questions:

Should deno allow accessing files with non-Unicode names?
Should deno allow creating such files?

I'd say the most reasonable answers are "yes" and "no", respectively. I propose the following approach:

Add a "Path" class that opaquely wraps an OsStr. The user can't construct it, but they can attempt to convert it to a string.
Methods such as stat take either a Path or a string as input.
Methods such as readDir return Paths as output.

What do you think?

ry · 2020-02-24T18:49:35Z

I'm not convinced it's useful to support non-unicode filenames. It's a huge complexity for what is a legacy problem at this point. It's very nice to assume paths are strings.

Related: in #4004 we skip non-unicode filenames. I think this is reasonable.

stale · 2021-01-06T13:57:30Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ry added this to the future milestone Aug 29, 2018

piscisaureus closed this as completed Sep 1, 2018

piscisaureus reopened this Sep 1, 2018

bartlomieju mentioned this issue Dec 27, 2019

Don't die on invalid utf8 source files #1489

Closed

seishun mentioned this issue Feb 10, 2020

readDir/readDirSync panics when run on a directory with non-Unicode filenames #3950

Closed

gewoonwoutje mentioned this issue May 13, 2020

error when running: deno run main.js #5280

Closed

kitsonk added the chore something that we should get around to eventually label Nov 5, 2020

stale bot added the stale label Jan 6, 2021

lucacasonato added the suggestion suggestions for new features (yet to be agreed) label Jan 6, 2021

stale bot removed the stale label Jan 6, 2021

bartlomieju mentioned this issue Apr 22, 2021

Deno unable to read files with emojis in path #10304

Closed

Sembiance mentioned this issue Dec 8, 2021

Deno.readDir won't return all files #12998

Open

bartlomieju removed this from the future milestone Feb 4, 2023

hardfist pushed a commit to hardfist/deno that referenced this issue Aug 7, 2024

refactor(core): use FastStaticString everywhere (denoland#627)

168e192

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't assume paths are always valid UTF8 #627

Don't assume paths are always valid UTF8 #627

piscisaureus commented Aug 29, 2018 •

edited by ry

Loading

XAMPPRocky commented Aug 30, 2018

piscisaureus commented Sep 2, 2018 •

edited

Loading

NewLunarFire commented Sep 2, 2018

XAMPPRocky commented Sep 2, 2018 •

edited

Loading

benjamingr commented Sep 2, 2018

XAMPPRocky commented Sep 2, 2018 •

edited

Loading

NewLunarFire commented Sep 2, 2018

ry commented Sep 2, 2018

motss commented Jan 10, 2019

bartlomieju commented Jun 25, 2019

seishun commented Feb 10, 2020

ry commented Feb 24, 2020

stale bot commented Jan 6, 2021

Don't assume paths are always valid UTF8 #627

Don't assume paths are always valid UTF8 #627

Comments

piscisaureus commented Aug 29, 2018 • edited by ry Loading

XAMPPRocky commented Aug 30, 2018

piscisaureus commented Sep 2, 2018 • edited Loading

NewLunarFire commented Sep 2, 2018

XAMPPRocky commented Sep 2, 2018 • edited Loading

benjamingr commented Sep 2, 2018

XAMPPRocky commented Sep 2, 2018 • edited Loading

NewLunarFire commented Sep 2, 2018

ry commented Sep 2, 2018

motss commented Jan 10, 2019

bartlomieju commented Jun 25, 2019

seishun commented Feb 10, 2020

ry commented Feb 24, 2020

stale bot commented Jan 6, 2021

piscisaureus commented Aug 29, 2018 •

edited by ry

Loading

piscisaureus commented Sep 2, 2018 •

edited

Loading

XAMPPRocky commented Sep 2, 2018 •

edited

Loading

XAMPPRocky commented Sep 2, 2018 •

edited

Loading