
Proposal: Unicode source files #142

Merged 10 commits into carbon-language:trunk on Oct 29, 2020

Conversation

@zygoloid (Contributor) commented Aug 14, 2020

zygoloid added the `proposal` and `WIP` labels on Aug 14, 2020
googlebot added the `cla: yes` (PR meets CLA requirements according to bot) label on Aug 14, 2020
zygoloid added the `proposal rfc` (Proposal with request-for-comment sent out) label and removed `WIP` on Aug 14, 2020
zygoloid changed the title from "Unicode source files" to "Proposal: Unicode source files" on Aug 14, 2020
@jonmeow (Contributor) commented Aug 17, 2020:

Generally seems fine. I think my only real suggestion is that you might consider moving the bulk of the content (background, details, alternatives, etc.) to something like docs/design/lexical_conventions/unicode_files.md. This suggestion mainly reflects a perspective that we should keep alternatives with the design, where they're easy to find; leaving them in the proposal as suggested here may make them harder to track. Plus, you can then incrementally build up the structure of the lexical conventions that way.

@zygoloid (Author) replied:

> I think my only real suggestion is that you might consider moving the bulk of content (background, details, alternatives, etc) to something like docs/design/lexical_conventions/unicode_files.md.

Thanks, I think this is the right place for this content to end up, and I agree that we should include rationale and alternatives and such there. For this proposal I'd like to get a decision before I start working on docs/ updates, if that's OK?

@chandlerc (Contributor) left a comment:

I like the direction; most of this is about cleaning up the proposal a bit for posterity.

Review comment on proposals/p0142.md (lines 82 to 93):

> ### Normalization
>
> Carbon source files, outside comments and string literals, are required to be
> in Unicode Normalization Form C ("NFC"). The Carbon source formatting tool
> will convert source files to NFC as necessary to satisfy this constraint.
>
> ### Characters in identifiers and whitespace
>
> We will largely follow Unicode Annex 31 in our selection of identifier and
> whitespace characters. This Annex does not provide specific rules on lexical
> syntax, instead providing a framework that permits a selection of choices of
> concrete rules.
Contributor commented:

Somewhat meta comment here and elsewhere:

While I absolutely agree on aligning this with the relevant parts of the Unicode specification, I'd like to make the proposal and especially any eventual design document built around this stand alone.

Could you rotate the proposal a bit to outline the functional constraints and then identify how they map into the Unicode standard? Some of these I'm guessing are large (annex 31 here) and in those cases maybe just provide illustrative examples of familiar subsets of the rules to help readers understand the overall direction w/o needing to dig up the annex in question.

The same contributor added:

Just to clarify this a bit: I'm not asking for lots of detail or anything long here. I like the short and directionally focused proposal, and don't want to turn it into a spec or anything else. I think it's totally fine to defer this to when you're pulling the full design docs together and/or touching the spec, and to defer to the Unicode docs here.

I just would like to be able to get a high level rough understanding of the consequences of this direction w/o the indirection.

Feel free to prod me on chat if this is feeling like a bunch of work. =]

zygoloid (Author) replied:

I've expanded this section and the section on normalization to explain a bit more about the contents and consequences of the annexes in question.

Contributor commented:

Do we actually want to use UAX#31 as our baseline here? My feeling is that it's far too permissive, which makes security very difficult (#19) and also adds unnecessary complexity (e.g. the whole discussion in #17 about allowed whitespace characters, when all we actually want are spaces, newlines, and possibly tabs).

It would be great if we could find a preexisting design that does what we want, and I do think we should adopt any parts of UAX#31 that are useful, but I think any design that solves the security issues is going to end up looking pretty different overall. My guess is it makes more sense to start with something restrictive but secure, and add support for more languages over time, rather than starting with something permissive and trying to patch up the security holes.

@zygoloid (Author) replied Aug 25, 2020:

One of the overarching unanswered questions in #19 is to what extent we want to mitigate the risk of intentionally-underhanded code -- do we want to defend against only Murphy or also Machiavelli?

To some extent I want to defer to UAX#31 for now because, with an essentially unbounded set of places in which we can explore and innovate, this seems like neither a place where we have a lot of expertise nor a place where breaking new ground is going to give the best impact in terms of meeting our goals -- I think this is more an area where we want a reasonably-standard and uncontroversial answer.

Contributor commented:

(I just want to say my original comment has been addressed... didn't really want to resolve the thread because of @mconst 's comments)

Review comment on proposals/p0142.md:

> ...identifiers than in detecting whether they are in normal form. While this
> proposal would require the implementation complexity of converting into NFC
> in the formatting tool, it would not require the conversion cost to be paid
> during compilation.
@gribozavr (Contributor) commented:

The cost will need to be paid by a compiler that cares about good error messages -- we would need to accept non-NFC identifiers, produce an error, but then normalize them to NFC, and ensure that name lookup works. Maybe we can make NFC the fast path (https://unicode.org/reports/tr15/#NFC_QC_Optimization), but a good compiler would still need to normalize in erroneous cases.
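For illustration, a minimal sketch of that fast path in Python -- `lex_identifier` and the diagnostic are hypothetical names, and `unicodedata.is_normalized` stands in for an NFC quick-check implementation:

```python
import unicodedata

def lex_identifier(text: str) -> str:
    # Fast path: is_normalized uses the NFC quick-check where possible, so
    # already-NFC identifiers (the common case) pay no conversion cost.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Erroneous case: diagnose, then normalize anyway so that name lookup
    # still works on the NFC form.
    print(f"error: identifier {text!r} is not in NFC")  # stand-in diagnostic
    return unicodedata.normalize("NFC", text)
```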

zygoloid (Author) replied:

That's a good point. I'll add that to the document, but do you think we should actually reverse the decision and allow arbitrary un-normalized text?

@gribozavr (Contributor) replied Sep 1, 2020:

I was only pointing out that this disadvantage might not be material in practice.

We could add another disadvantage if we want, explicitly calling out naive tools that process bytes. Do grep or git grep care about normalization differences for example?

Contributor replied:

Neither grep nor git grep do normalization themselves, so they do care. (Note that they're not naive byte-based tools -- they both support Unicode case-insensitivity, which is pretty complex! They just don't do normalization, which isn't a huge problem in practice because most real-world text is in NFC already.)
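As a quick illustration of the difference (Python; the same byte-level mismatch is what makes a non-normalizing `grep` miss matches):

```python
import unicodedata

nfc = "café"                              # precomposed: ends in U+00E9
nfd = unicodedata.normalize("NFD", nfc)   # decomposed: 'e' + U+0301

print(nfc == nfd)                  # False: different code point sequences
print(nfc.encode(), nfd.encode())  # b'caf\xc3\xa9' vs b'cafe\xcc\x81'
# A byte-based search for the NFC spelling will not find the NFD one.
```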

Contributor commented:

Note that I think the complexity is still significantly reduced by requiring normalization. That means the compiler can take an approach such as a fallback path that renormalizes and re-starts lexing (or other potentially costly fallback strategy) rather than needing to normalize in the "expected" path.

But perhaps the greatest reason to require it is to get diff stability and byte-equivalence for things like string literals.
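A sketch of that fallback strategy, with hypothetical `lex`/`NotNFCError` names:

```python
import unicodedata

class NotNFCError(Exception):
    """Raised when the fast-path lexer hits non-NFC input."""

def lex(source: str) -> list[str]:
    # Hypothetical fast-path lexer that assumes NFC input.
    if not unicodedata.is_normalized("NFC", source):
        raise NotNFCError
    return source.split()  # stand-in for real tokenization

def lex_file(source: str) -> list[str]:
    try:
        return lex(source)  # expected path: input is already NFC
    except NotNFCError:
        # Costly fallback: renormalize the whole file and restart lexing.
        return lex(unicodedata.normalize("NFC", source))
```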

zygoloid (Author) replied:

The document was revised to incorporate @gribozavr's comments, and now requires normalized text everywhere, based in part on @chandlerc's observations. I consider this thread resolved.

@josh11b (Contributor) commented Aug 25, 2020 via email.

@zygoloid (Author) replied:

> I would definitely be in favor of selecting a less permissive, more secure option. Do we have any good choices here?

I've added some discussion of a direction we could pursue, using UAX#39's algorithm to form a fingerprint of identifiers to detect when they're unacceptably visually similar. Given that kind of approach, I think we should be free to use the full character set suggested by UAX#31 and only diagnose actual homoglyph collisions. Is that enough to satisfy your concern here?
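A rough sketch of that fingerprinting idea -- the table below is a tiny illustrative subset, not the real UAX#39 confusables data (which also applies NFD around the mapping):

```python
# Illustrative-only subset of the UAX#39 confusables mapping.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def skeleton(identifier: str) -> str:
    # Map each character to its prototype so that visually similar
    # identifiers produce the same fingerprint.
    return "".join(CONFUSABLES.get(ch, ch) for ch in identifier)

seen: dict[str, str] = {}

def check(identifier: str) -> None:
    fp = skeleton(identifier)
    if fp in seen and seen[fp] != identifier:
        raise ValueError(f"{identifier!r} is confusable with {seen[fp]!r}")
    seen.setdefault(fp, identifier)

check("paper")           # Latin: accepted
try:
    check("p\u0430per")  # contains Cyrillic 'а': homoglyph collision
except ValueError as e:
    print(e)
```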


Review comment on proposals/p0142.md:

> The canonical representation for Carbon programs is in files stored as a
> sequence of bytes in a file system on disk, and such files are expected to be
> encoded in UTF-8. Such files may begin with an optional UTF-8 BOM, that is, the...
@je4d commented:

Do we currently have, or plan to have, raw literals in Carbon, and if so do we need to consider an exemption to the requirement of valid UTF-8 for the content of such literals?
I'm imagining use cases such as a [local encoding]->utf8 translation unit test wanting to have a literal for its input.

Contributor replied:

FWIW I thought about raising the same question, but came to the conclusion that in such situations, you really should use something like unicode escape sequences rather than literal data. Otherwise you wind up in a situation where your source code has semantically-relevant properties that are completely invisible unless you open your source code in a hex editor, which seems like a really bad place to be.

zygoloid (Author) replied:

We do plan to have raw literals, but I don't think we intend for them to be used to support arbitrary binary data in source files. I'm inclined to say that the C++ rules that retroactively undo some parts of the early phases of translation for raw literals are a mistake -- we should view raw literals as a way to include special characters such as \ and " in string literals more easily, and anything beyond that should be considered out-of-scope. (Newline characters are also a concern here; the string literals proposal suggests having single-line and multi-line string literals, orthogonal to raw and non-raw literals.)

For an encoding test of the kind you describe, I think the most reasonable source encoding (from a source readability point of view and to ensure that the source file is stable when processed with tools -- and that the literal contents don't get replaced with U+FFFDs) would be to use escape sequences or similar to express the input, and language rules that largely force that choice might even be a good thing.
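For instance, an encoding test of that kind might look like this sketch (Python standing in for Carbon, whose escape syntax is defined in the string literals proposal):

```python
# The test input is Latin-1 bytes for "café", spelled with an escape so the
# source file itself stays valid UTF-8 and the intent stays visible:
latin1_input = b"caf\xe9"

# The [local encoding] -> UTF-8 translation under test:
assert latin1_input.decode("latin-1").encode("utf-8") == b"caf\xc3\xa9"
```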

zygoloid (Author) added:

@je4d I consider this thread to be resolved. Please can you close it if you agree?

@jfbastien left a comment:

There's a proposal for something related in C++: https://isocpp.org/files/papers/P2194R0.pdf. It would be good to align both.

Oops, I had these comments un-submitted since Friday, forgot to click the GitHub button. Sorry :(

Review comment on proposals/p0142.md:

> ...form, and normalize identifiers ourselves:
>
> ```diff
> -Carbon source files, outside comments and string literals, are required to be in
> ```

A reviewer commented:

Why exclude comments? It seems like there's already checking for valid code points in comments (is there?), so requiring normalized comments isn't really more work: just change the checking in strings as well.

Further, I think you can also require that strings be normalized. If a developer wants a non-normalized string (valid use case!) then they ought to explicitly escape the sequence in the string (i.e. spell out that they want 'letter followed by accent', not 'letter with accent'). I think this simplifies the specification of valid source (it all needs to be normalized), and makes the code clearer, because it's always explicit that non-normalized strings were meant (you have to spell it out). You never allow accidental non-normalization. Further, editors probably normalize the entire file anyway, so purposefully using non-normalized strings would likely break.
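Illustrating that convention (Python escape syntax standing in for Carbon's):

```python
import unicodedata

composed = "é"        # NFC: the single code point U+00E9
explicit = "e\u0301"  # deliberately non-NFC: 'e' plus a combining accent

# The two render identically but differ byte-for-byte; only the escaped
# spelling makes the non-normalized intent visible and formatter-proof.
assert unicodedata.normalize("NFC", explicit) == composed
assert explicit != composed
```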

zygoloid (Author) replied:

These are reasonable points, and we've not really explored this axis of choice. Counterpoints: requiring string literals and comments to be normalized will increase compilation times, and at least for comments, we don't seem to get much practical benefit.

Contributor replied:

I increasingly like this point.

I think that having string literals which are the same be guaranteed to have the same byte sequence in the source code is good.

And I think having string literals that form a different byte sequence be visually distinct is also good. See the discussion around C vs. KC and homoglyphs below.

zygoloid (Author) replied:

The document was revised in the suggested direction; I consider this resolved.

Review comment on proposals/p0142.md:

> - Tools such as `grep` do not perform normalization themselves, and so would
>   be unreliable when applied to a codebase with inconsistent normalization.
> - GCC already diagnoses identifiers that are not in NFC, and WG21 is in the
>   process of adopting an NFC requirement for C++ identifiers, so development

A reviewer commented:

Please add a reference: https://wg21.link/P1949

zygoloid (Author) replied:

Done.

Review comment on proposals/p0142.md:

> will not deviate from conformance to Annex 31 in any concrete area. We may find
> cases where we wish to take a different direction than that of the Annex.
> However, we should use Annex 31 as a basis for our decisions, and should expect
> strong justification for deviations from it.

A reviewer commented:

Does this align with https://wg21.link/P1949?

zygoloid (Author) replied:

P1949 is proposing that C++ follows Unicode Annex 31, so to the extent that later Carbon proposals do as this proposal suggests, yes.

@jfbastien replied Sep 9, 2020:

Right, I just want to have a reference to show the alignment to what C++ is doing, and to say that it's on purpose.

zygoloid (Author) replied:

Note added, with a link to the C++ proposal.

Review comment on proposals/p0142.md (lines +114 to +113):

> - If we use a compatibility normalization form, ligatures are considered
>   equivalent to the character sequence that they decompose into.
Contributor commented:

Reference the discussion around homoglyphs below here? It seems relevant. See my comments there.

zygoloid (Author) replied:

Cross-reference to the homoglyphs section added.

Review comment on proposals/p0142.md:

> However, we should use Annex 31 as a basis for our decisions, and should expect
> strong justification for deviations from it.
>
> #### Homoglyphs
Contributor commented:

I feel like this should also reference some amount of the similar-appearing cases handled by the compatibility normalization forms. My understanding is that these are distinct from the "confusable" set of glyphs defined by Unicode (UAX#39 below).

Essentially, if we care about homoglyphs, I feel like we should also care about the compatibility normalization.

I can see two approaches here:

  1. Solve through normalization
  2. Solve through detection and rejection of collisions (as suggested here)

I feel like we should take a similar approach for both of these.

If we solve through normalization, we should use normalization form KC, but either canonicalize or directly reject even the use of confusable glyphs (they're always available via escape codes).

If we solve through detection, I think we should detect both KC != C cases, as well as confusables.

I'm worried about the cost of detecting KC != C cases sadly. As a consequence, I wonder if we should use normalization form KC specifically justified by our desire to not have potential confusion similar to that of homoglyphs. And I wonder if we should simply reject confusable characters completely and require their usage to route through escape codes. This approach would have the advantage of moving us further towards the byte sequence of string literals being directly reflected visually as well as differences in identifiers being reliably detected visually.

I'm not even suggesting we have to commit to a specific thing around the confusables here, but I think we need to pick a strategy as that seems likely to influence the normalization form desired.
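For concreteness, the kind of KC != C difference at issue here, shown with Python's unicodedata:

```python
import unicodedata

s = "\ufb01le"  # starts with U+FB01, LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", s))   # 'ﬁle' -- NFC keeps the ligature
print(unicodedata.normalize("NFKC", s))  # 'file' -- NFKC decomposes it
# Under an NFC-only rule, 'ﬁle' and 'file' are distinct identifiers that
# are nearly indistinguishable visually.
```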

@mconst (Contributor) replied Sep 16, 2020:

I don't think compatibility characters are going to be a problem. ICU's confusability-detection routines already check for them, and also, nearly all those characters are things we can just ban because you shouldn't be using them in code anyway.

I'd love to make string literals visually unambiguous, so you can determine their byte sequence by looking at them. In particular, this is important for strings that are going to be interpreted by code, like shell commands or HTTP headers. Banning confusable characters outright is pretty heavy-handed, though -- it would make it nearly impossible to write text in Russian, for example, because so many Cyrillic letters are confusable with Latin letters.

Also, it'll be a nontrivial task just to enumerate all the characters that might be confusable when used in a string literal. As far as I know, most real-world confusability work so far has just dealt with URLs; string literals allow a broader set of characters, like spaces and more kinds of punctuation, so there are going to be new opportunities for confusion that haven't been explored. And even in the limited world of domain names, where a lot of work has gone into preventing spoofing, people discover new spoofing opportunities pretty regularly. This isn't a solved problem.

In general, I think normalization isn't what we should be focusing on here. Normalization is easy, well-understood, and largely irrelevant in practice because most real-world text is NFC anyway; whatever normalization option we pick will be fine. Confusability detection is difficult, poorly understood, and causes security problems. That's the part we need to think about.

Contributor replied:

FWIW, the proposal largely sets this aside for now, so there doesn't seem to be anything more to do here...

I do continue to think if we want to do anything for confusability in the language, we'll need to do it through some kind of "normalization", although clearly the ones defined by Unicode are not sufficient to address most confusability concerns.

I think just looking at language-level confusability through the normalization lens helps make clear the scope we can realistically handle: we could probably normalize away accidentally pasted ligatures and similar things, but maybe not much else. Whether that is worth doing from a QoI perspective (it's clearly insufficient for any kind of security) is something I think we don't currently know, but also don't think needs to be decided right away, so I'm fine proceeding with the proposal as is.




jonmeow added the `needs decision` label and removed the `comment deadline` and `proposal rfc` (Proposal with request-for-comment sent out) labels on Oct 7, 2020
zygoloid and others added 7 commits on October 28, 2020, including:

- Add a section on homoglyph attacks, demonstrating how the problem could be
  solved under the rules proposed here.
- Based on established consensus, apply the normalization restriction to the
  entire source file, including comments and string literals.
- Move 'open question' to 'alternatives considered' based on consensus that we
  want to require pre-normalized source files.
zygoloid merged commit 3f96e5f into carbon-language:trunk on Oct 29, 2020
zygoloid added a commit that referenced this pull request Nov 4, 2020
Add design text based on the contents of two proposals:

#142 Unicode source files
#143 Numeric literals
zygoloid deleted the proposal-unicode-source-files branch on March 11, 2022
chandlerc pushed a commit that referenced this pull request Jun 28, 2022
Factored out of the Lexical Conventions proposal (#173). This proposal covers only the encoding aspect: Carbon text is Unicode, source files are encoded in UTF-8, and we will follow the Unicode Consortium's recommendations to the extent that they make sense to us.
chandlerc pushed a commit that referenced this pull request Jun 28, 2022
Add design text based on the contents of two proposals:

#142 Unicode source files
#143 Numeric literals
Labels: `cla: yes` (PR meets CLA requirements according to bot), `proposal accepted` (Decision made, proposal accepted), `proposal` (A proposal)