Reconsider digit separators #1485

jonmeow · 2022-07-21T02:38:24Z

At present Carbon restricts integer digit separators to every 3 digits, going back to https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md.

A contrary mention had been made about the Indian convention. However, it looks like CJK cultures were overlooked, maybe due to conflicting information in https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping (which says eastern countries have switched to 3 digit groups). According to https://www.statisticalconsultants.co.nz/blog/how-the-world-separates-its-digits.html, China groups every 4 digits.

In light of the greater number of convention differences, it may be worth supporting more variations (e.g., support 3 different conventions for digit groupings), or otherwise loosening the restrictions. While that could end up with ambiguous placement for some numbers, larger numbers would be less ambiguous because the groupings would repeat.
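To make the ambiguity point concrete, here is a minimal Python sketch (not Carbon, and not taken from any proposal) that checks which grouping conventions a decimal literal is consistent with. The function names, the `CONVENTIONS` table, and the choice of three conventions (groups of 3, groups of 4, and the Indian 3-then-2 pattern) are assumptions made for illustration only.

```python
def matches_grouping(literal, first, rest):
    """Check a decimal literal against one grouping convention.

    `first` is the size of the rightmost group; `rest` is the size of every
    other group (the leftmost group may be shorter). Hypothetical helper.
    """
    groups = literal.split("_")
    # Reject empty groups (leading, trailing or doubled '_') and non-digits.
    if not all(g.isascii() and g.isdigit() for g in groups):
        return False
    if len(groups) == 1:
        return True  # no separators: consistent with every convention
    *leading, last = groups
    if len(last) != first:
        return False
    # Every group between the leftmost and the rightmost must be full size.
    if any(len(g) != rest for g in leading[1:]):
        return False
    return 1 <= len(leading[0]) <= rest


# (rightmost group size, size of the remaining groups); illustrative only.
CONVENTIONS = {
    "thousands (3)": (3, 3),
    "myriads (4)": (4, 4),
    "Indian (3, then 2)": (3, 2),
}


def conventions_for(literal):
    """List the conventions that `literal` is consistent with."""
    return [name for name, (first, rest) in CONVENTIONS.items()
            if matches_grouping(literal, first, rest)]
```

Under these assumptions `conventions_for("1_000")` reports both the thousands and the Indian convention, while `conventions_for("1_000_000")` reports only thousands: once a second separator appears, the repeating group size settles which convention was meant.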

Note, I think this arose from this tweet

lexi-nadia · 2022-07-21T02:55:11Z

Besides international variations, there are also microformats. For example:

let mac_address: i64 = 0xa1_b2_c3_d4_e5_f6;
let uuid: i128 = 0x123e4567_e89b_12d3_a456_426614174000;
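As a rough illustration of what such microformat groupings encode, the Python sketch below regroups a hex value according to a list of group widths. Python is used only because it also accepts `_` in integer literals; `group_hex`, `MAC_GROUPS` and `UUID_GROUPS` are hypothetical names, not part of Carbon or of any library.

```python
def group_hex(value, group_sizes):
    """Spell `value` in hex, split into groups of the given widths.

    Groups are listed from most significant to least significant.
    """
    if value < 0:
        raise ValueError("only non-negative values are supported")
    total = sum(group_sizes)
    digits = format(value, "0{}x".format(total))  # zero-padded lowercase hex
    if len(digits) != total:
        raise ValueError("value does not fit in the given groups")
    parts, pos = [], 0
    for size in group_sizes:
        parts.append(digits[pos:pos + size])
        pos += size
    return "0x" + "_".join(parts)


MAC_GROUPS = (2, 2, 2, 2, 2, 2)   # six bytes
UUID_GROUPS = (8, 4, 4, 4, 12)    # the usual 8-4-4-4-12 UUID layout

# The separators are purely visual: they do not change the value.
mac = 0xa1_b2_c3_d4_e5_f6
uuid = 0x123e4567_e89b_12d3_a456_426614174000

mac_text = group_hex(mac, MAC_GROUPS)
uuid_text = group_hex(uuid, UUID_GROUPS)
```

Here `mac_text` and `uuid_text` reproduce the two spellings above, which a fixed "every N digits" rule could not express for the UUID case.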

mo-xiaoming · 2022-07-21T04:09:58Z

As a Chinese developer, I can say:

Yes, in our culture, we're used to 4 digit groups.
However, as a developer, I'm quite comfortable with 3 digit groups (Stockholm syndrome?).
@lexi-nadia has a very good point on hex numbers.

So, maybe adding this kind of variation is worthwhile.

nigeltao · 2022-07-22T01:28:37Z

I'm not saying you should do this, just throwing out a related idea...

In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.

In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".
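The canonicalization described here can be sketched roughly as follows. This is a Python illustration of the stated behaviour (uppercase hex digits, underscore every 6 decimal digits, underscore every 4 hex digits), not Wuffs' actual formatter code; `regroup` and `canonicalize` are hypothetical names, and only plain decimal and `0x` integer literals are handled.

```python
def regroup(digits, size):
    """Insert '_' every `size` digits, counting from the right."""
    head = len(digits) % size
    parts = [digits[:head]] if head else []
    parts += [digits[i:i + size] for i in range(head, len(digits), size)]
    return "_".join(parts)


def canonicalize(literal):
    """Return the single canonical spelling of an integer literal.

    Existing separators are discarded first, so every input spelling of
    the same number maps to the same output.
    """
    body = literal.replace("_", "")
    if body.startswith("0x"):
        return "0x" + regroup(body[2:].upper(), 4)
    return regroup(body, 6)
```

With this sketch, `canonicalize("3141592")` gives `3_141592` and `canonicalize("0xdeadbeef")` gives `0xDEAD_BEEF`, matching the two spellings quoted above. Note that such a formatter would also rewrite a MAC address written as `0xa1_b2_c3_d4_e5_f6`, which is the trade-off under discussion in this thread.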

chandlerc · 2022-07-23T04:02:44Z

> In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.
>
> In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

I think this is a pretty separate question, so if you'd like to pursue it I would move it. FWIW, we can have a near perfect recovery here in the frontend and suggest edits, so I think the difference isn't huge, but it is a difference.
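As a very rough idea of what "recover and suggest an edit" could look like for the lowercase-hex case, here is a Python sketch. It is not the Carbon toolchain's diagnostic code; `suggest_hex_fix` is a hypothetical helper that only looks at `0x` literals.

```python
def suggest_hex_fix(literal):
    """Return a suggested replacement for a hex literal, or None.

    A literal such as 0x1a is rejected because of its lowercase digits,
    but the intended value is unambiguous, so the uppercase spelling can
    be offered as a fix.
    """
    if not literal.startswith("0x"):
        return None  # not a hex literal; nothing to suggest here
    digits = literal[2:]
    fixed = digits.upper()  # '_' separators are left untouched by upper()
    return None if fixed == digits else "0x" + fixed
```

For example, `suggest_hex_fix("0x1a")` returns `0x1A`, while `suggest_hex_fix("0x1A")` returns `None` because nothing needs fixing.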

> FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".

Given the semantically meaningful different groupings mentioned here, I think this question should include not canonicalizing in the formatter. FWIW, I'm sufficiently convinced by things like credit card numbers, UUIDs, and MAC addresses that we should have this flexibility even outside of any ideas around regional differences or different bases.

nigeltao · 2022-07-29T02:48:19Z

FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:

let mac_address: i64 = 0xa1b2_c3d4_e5f6;
let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;

In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.
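For reference, the "3rd byte" question can also be checked mechanically. The Python sketch below uses a hypothetical `nth_byte` helper and shows that the two spellings of the MAC address denote the same value, so the difference between them is purely about how easily a reader can find the byte by eye.

```python
def nth_byte(value, n, width_bytes):
    """Return the n-th byte of `value`, 1-based from the most significant."""
    if not 1 <= n <= width_bytes:
        raise ValueError("byte index out of range")
    shift = 8 * (width_bytes - n)
    return (value >> shift) & 0xFF


every_two = 0xa1_b2_c3_d4_e5_f6   # one group per byte
every_four = 0xa1b2_c3d4_e5f6     # one group per two bytes

third = nth_byte(every_four, 3, 6)  # the byte written as c3
```

In the per-byte spelling the third byte is the third group; in the every-4 spelling it is the first half of the second group.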

As for credit card numbers, do people actually process them as numbers (as opposed to strings)?

chandlerc · 2022-07-30T02:37:39Z

> FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:
>
> let mac_address: i64 = 0xa1b2_c3d4_e5f6;
> let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;
>
> In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.

I still find the versions above significantly more readable than these. I agree that no digit separators would be even worse, but I don't think that's really the question. I think the readability gain of format-specific grouping is worthwhile based on the examples here.

zygoloid · 2022-08-09T18:57:05Z

We seem to have good evidence here that we should reconsider this decision, and a good level of consensus for making a change. The next step would be for someone to write a proposal presenting these arguments.

ethomag · 2022-08-11T10:44:17Z

Maybe I misinterpreted this (in docs/design/lexical_conventions/numeric_literals.md)

For real-number literals, digit separators can appear in the decimal and hexadecimal 
integer portions (prior to the period and after the optional e or mandatory p)

I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs. Consider:

let nanosecond: f64 = 0.000000001;

vs

let nanosecond: f64 = 0.000_000_001;

I think that improves readability as much as digit separators in the integer part.
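To illustrate the symmetric grouping being asked for, here is a small Python sketch that groups the integer part from the right and the fractional part from the left, so both sides are grouped outward from the decimal point. It does not reflect Carbon's grammar; `group_real` is a hypothetical helper that ignores exponents and hexadecimal reals.

```python
def group_real(literal, size=3):
    """Group digits on both sides of the decimal point in runs of `size`."""
    whole, _, frac = literal.replace("_", "").partition(".")
    # Integer part: full groups are counted from the right, so any short
    # group ends up at the far left.
    head = len(whole) % size
    whole_parts = [whole[:head]] if head else []
    whole_parts += [whole[i:i + size] for i in range(head, len(whole), size)]
    # Fractional part: full groups are counted from the left, so any short
    # group ends up at the far right.
    frac_parts = [frac[i:i + size] for i in range(0, len(frac), size)]
    grouped = "_".join(whole_parts)
    if frac_parts:
        grouped += "." + "_".join(frac_parts)
    return grouped


nanosecond_text = group_real("0.000000001")
```

Here `nanosecond_text` is `0.000_000_001`, the spelling proposed above. Python's own `float()` happens to accept that spelling, which makes it easy to confirm that the separators leave the value unchanged.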

jonmeow · 2022-08-11T18:20:17Z

Created a proposal on #1983 -- let me know if I've misunderstood the leads' direction there; I can always flip around alternatives if the leads want a different choice.

> I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs.

AFAICT your interpretation is correct, although the proposal has some conflicting examples in ties. Anyways, I think #1983 should produce clear rationale either way.

ethomag · 2022-08-11T21:01:04Z

Thanks @jonmeow for your reply. My concern was not about ties, but strictly readability. I think scientific notation is symmetric around the decimal point. To be able to group decimal digits in the integer part so that you can easily eyeball which parts are grams, kilograms etc is something that can aid avoiding making mistakes when defining constants. I just think the same argument holds for milligrams, micrograms etc.

I could not find any rationale that I could understand in the referred links, but it seems you have already considered this. I was just naïvely thinking that this was something that was overlooked.

I am truly amazed by your work, it's quite a challenge you have taken on!

chandlerc · 2022-08-12T01:40:11Z

(removing good-first-issue label as this is now in progress)

[Proposal #143: Numeric literals](https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md) added digit separators with strict rules for placement. It missed some use-cases. In order to address this, remove placement rules for numeric literals.

Related issue: #1485

Co-authored-by: Chandler Carruth
Co-authored-by: Richard Smith

jonmeow · 2022-08-25T23:30:44Z

I believe this is resolved by #1983 though I still need to update the design (but I think we can call the leads question closed).

jonmeow assigned zygoloid Jul 21, 2022

jonmeow added this to Questions in Issues for leads via automation Jul 21, 2022

jonmeow mentioned this issue Jul 21, 2022

Remove convention remark #1482

Merged

zygoloid added the good first issue Possibly a good first issue for newcomers label Aug 9, 2022

zygoloid moved this from Questions to Needs proposal in Issues for leads Aug 9, 2022

jonmeow added the leads question A question for the leads team label Aug 10, 2022

jonmeow mentioned this issue Aug 11, 2022

Weaken digit separator placement rules #1983

Merged

chandlerc removed the good first issue Possibly a good first issue for newcomers label Aug 12, 2022

jonmeow closed this as completed Aug 25, 2022
