Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't assume paths are always valid UTF8 #627

Open
piscisaureus opened this issue Aug 29, 2018 · 13 comments
Open

Don't assume paths are always valid UTF8 #627

piscisaureus opened this issue Aug 29, 2018 · 13 comments
Labels
chore something that we should get around to eventually suggestion suggestions for new features (yet to be agreed)

Comments

@piscisaureus
Copy link
Member

piscisaureus commented Aug 29, 2018

This is a longer term goal.

  • Don't use String/str for paths on the rust side (we should start doing this today).
  • Use flatbuffer byte vectors instead of strings to encode paths in messages.
  • On the javascript side have an optional, low-level API that reports paths are typed arrays and not strings.

This was an issue in Node: nodejs/node-v0.x-archive#2387

@ry ry added this to the future milestone Aug 29, 2018
@XAMPPRocky
Copy link

For APIs that take a string for a path wouldn't changing the messages to take byte vectors be redundant as aren't JavaScript strings are guaranteed to be UTF-8, and then to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy which would incur extra unneeded overhead? You couldn't use OsStrExt::from_bytes as that's only available on Unix because Windows encodes their paths as wide strings (16-bit vectors).

@piscisaureus
Copy link
Member Author

piscisaureus commented Sep 2, 2018

JavaScript strings are guaranteed to be UTF-8

Not really actually, technically javascript uses a fixed length 16-bit encoding known as UCS2 (just like java and windows actually). Usually we (browsers, developers) pretend that it uses UTF16, otherwise there'd be no way to encode smileys. But try this in devtools: "😸😸😸".split('').join("-")

to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy

Conversions between Path and utf8-encoded string is actually what I'm saying we should not do! We'll use Path on the rust side and, at least at the lowest level, Uint8Array on the javascript side.

Windows encodes their paths as wide strings (16-bit vectors).

So does javascript :)
But I do agree, we shouldn't present a buffer containing a 16-bit encoded string to the user. Instead, let's convert windows filenames to wtf-8 (https://simonsapin.github.io/wtf-8/). Now there we do have a little redundancy, because we'd do the conversion UCS2 (Path) to WTF-8 (UInt8Array) to UCS2 (JS string).
But file access on windows is quite slow anyway, nobody would notice that...

You couldn't use OsStrExt::from_bytes as that's only available on Unix because.

I'm sure we can break into it somehow....

@NewLunarFire
Copy link

I'm working on this as part of what started as refactoring the is_remote function. We are given UTF-8 encoded strings as arguments so a conversion is inevitable. But from the moment the module resolution starts, they are converted to Rust Paths or Hyper URIs.

This won't win the war, but it's an important battle.

@XAMPPRocky
Copy link

XAMPPRocky commented Sep 2, 2018

@piscisaureus Thank you for explaining. I don't think that would solve the problem presented in the Node issue as you can't encode a string literal as ISO-8859-1 on the JavaScript side. On the Rust side Path can only take PathBuf, String, str, OsStr, and OsString. String requires correct UTF-8 or UTF-16, and the only way to create an OsString from a vector is platform dependent, from_bytes(&[u8]) as I mentioned before on Unix and from_wide(&[u16]) on Windows. The WTF-8 standard is explicitly against being used for interchange, it still requires being emitted as Unicode.

I think the best solution would be to have a facsimile to Rust's OsString for low level APIs.

table OsString {
  unix: [byte];
  windows: [ushort];
}
class OsString {
  private buffer: UInt8Array | UInt16Array;

  constructor(buffer: UInt8Array | UInt16Array) {
    if (windows && buffer instanceof UInt8Array)) {
       throw new Error("OsString requires UInt16Array on Windows");
    } else if (unix && buffer instanceof UInt16Array)) {
       throw new Error("OsString requires UInt8Array on Unix platforms");
    }

    this.buffer = buffer;
  }
  
  /* @internal */
  into_msg(builder: fbs.Builder): fbs.OsString {
     // Convert and return as flatbuffer message
  }
}

@benjamingr
Copy link
Contributor

as you can't encode a string as ISO-8859-1 on the JavaScript side

You can, if you hold it in an ArrayBuffer and do the conversion with TextEncoder/TextDecoder.

Holding it in a Uint8Array (low level JavaScript API) is what is being suggested here if I understand correctly.

@XAMPPRocky
Copy link

XAMPPRocky commented Sep 2, 2018

@benjamingr Sorry that was explicitly about first two points and for keeping to use Strings/strs for that part of the API and that changing those parts would only add redundancy. Sorry if that was unclear.

  • Don't use String/str for paths on the rust side (we should start doing this today).
  • Use flatbuffer byte vectors instead of strings to encode paths in messages.

@NewLunarFire
Copy link

@Aaronepower I already started working on using Paths/URIs instead of Paths on the Rust side. I'm not sure about using byte vectors to encode paths instead of strings. We are going to have to convert paths between textual and canonical representation, this is just shuffling it around to JS's side.

@ry
Copy link
Member

ry commented Sep 2, 2018

@NewLunarFire You can assume that import paths are strings and unicode encoded (if not ascii) - we will not be supporting importing arbitrarily encoded filenames. This issue is about supporting file system ops for non-utf8 filenames (a more general problem).

@motss
Copy link
Contributor

motss commented Jan 10, 2019

Does that mean that String.prototype.normalize() is not going to work as expected?

motss added a commit to motss/normalize-diacritics that referenced this issue Jan 10, 2019
`deno` only accepts absolute paths with file extension specified.

Turned out that tests failed because of an issue in `deno`.
See [denoland/deno#627] to learn more.
@bartlomieju
Copy link
Member

Reference: #2572 and #2559

@seishun
Copy link
Contributor

seishun commented Feb 10, 2020

I think the solution to this problem will depend on the following questions:

  • Should deno allow accessing files with non-Unicode names?
  • Should deno allow creating such files?

I'd say the most reasonable answers are "yes" and "no", respectively. I propose the following approach:

  • Add a "Path" class that opaquely wraps an OsStr. The user can't construct it, but they can attempt to convert it to a string.
  • Methods such as stat take either a Path or a string as input.
  • Methods such as readDir return Paths as output.

What do you think?

@ry
Copy link
Member

ry commented Feb 24, 2020

I'm not convinced it's useful to support non-unicode filenames. It's a huge complexity for what is a legacy problem at this point. It's very nice to assume paths are strings.

Related: in #4004 we skip non-unicode filenames. I think this is reasonable.

@kitsonk kitsonk added the chore something that we should get around to eventually label Nov 5, 2020
@stale
Copy link

stale bot commented Jan 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 6, 2021
@lucacasonato lucacasonato added the suggestion suggestions for new features (yet to be agreed) label Jan 6, 2021
@stale stale bot removed the stale label Jan 6, 2021
@bartlomieju bartlomieju removed this from the future milestone Feb 4, 2023
hardfist pushed a commit to hardfist/deno that referenced this issue Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore something that we should get around to eventually suggestion suggestions for new features (yet to be agreed)
Projects
None yet
Development

No branches or pull requests

10 participants