Allow four more characters to start identifiers. #11267

Closed
wants to merge 1 commit into from


jiahao
Member

@jiahao jiahao commented May 14, 2015

  • Mathematical bold 0, 1 (U+1D7CE, U+1D7CF)
  • Mathematical double-struck 0, 1 (U+1D7D8, U+1D7D9)

which are sometimes used to represent additive and multiplicative identities.

Closes #10762
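
As a rough illustration of what the change amounts to, the check below accepts exactly the four code points listed above. The function name `is_proposed_id_start` is hypothetical and is not taken from the patch; presumably the real change adds an equivalent clause to the existing identifier-start predicate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): true only for the four
   code points this PR proposes to allow at the start of an identifier. */
static bool is_proposed_id_start(uint32_t wc)
{
    return wc == 0x1D7CE || wc == 0x1D7CF ||  /* mathematical bold 0, 1 */
           wc == 0x1D7D8 || wc == 0x1D7D9;    /* mathematical double-struck 0, 1 */
}
```

The remaining styled digits (2 through 9) and the ordinary ASCII digits are still rejected by this check.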

@stevengj
Member

I'm ambivalent about this. Should we just allow all the double-struck digits?

@StefanKarpinski
Sponsor Member

This does seem to clash with the idea of normalizing characters that could be mistaken for each other.

@jiahao
Member Author

jiahao commented May 15, 2015

Well, people do use the fancy 0 and 1 to mean additive and multiplicative identities. I'm not aware of uses for 2-9.

The fancy digits are not canonicalizable to the ordinary digits.
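
To make the distinction concrete: the block U+1D7CE through U+1D7FF holds five styled sets of the digits 0 to 9 (bold, double-struck, sans-serif, sans-serif bold, monospace). These carry compatibility mappings to the ASCII digits but no canonical ones, so only a compatibility normalization (NFKC/NFKD) would fold them into 0-9. The sketch below hard-codes that arithmetic for illustration; `math_digit_to_ascii` is a hypothetical name, and a real implementation would consult the Unicode data rather than a formula.

```c
#include <stdint.h>

/* Hypothetical helper: returns the ASCII digit that a mathematical styled
   digit (U+1D7CE..U+1D7FF) corresponds to under compatibility folding,
   or 0 if wc lies outside that block. */
static char math_digit_to_ascii(uint32_t wc)
{
    if (wc < 0x1D7CE || wc > 0x1D7FF)
        return 0;
    /* Five consecutive runs of ten digits each, so the offset
       modulo 10 is the digit value. */
    return (char)('0' + (wc - 0x1D7CE) % 10);
}
```

Canonical normalization (NFC/NFD) leaves these code points unchanged, which is why 𝟙 and 1 stay distinct unless a compatibility form is applied.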

@stevengj
Member

My main concern is that the rules for "what is an allowed identifier" are getting pretty complicated.

@sbromberger
Contributor

I disagree strongly with this proposed change. It will be too easy to create identifiers that look like numbers, and can cause great confusion depending on what fonts are in use by the user.

I like the general policy of "numberlike characters cannot be used to start identifier names" - it's simple, intuitive, and minimally subjective.

@jiahao
Member Author

jiahao commented May 22, 2015

@dpsanders it's up to you to defend this one.

@PallHaraldsson
Contributor

@sbromberger "It will be too easy to create identifiers that look like numbers" - I know it's probably not Julia's place, but if this would appear bigger would it solve the issue?

Usually, programming editors have used monospaced fonts. I'm not sure that is outdated with Unicode.. Unicode has halfwidth and fullwidth at least.. Just thinking, at least in editors/IDEs like Juno (that most will probably use anyway - with time), could there be a special case that makes "numberlikes" at least taller?

@JeffBezanson: "I don't think it's the place of a programming language to try to ban characters." - I think I agree. Security is usually for the data coming in. Is the program itself not the programmer's responsibility? Some lint program could give a warning? Anyway I will never use these letters and do not care either way, just found the issue interesting..

@sbromberger
Contributor

@PallHaraldsson so now we're going to be forced to use a specific editor (I don't use Juno/LT, btw) with a specific set of fonts just to avoid confusion with identifiers looking like numbers? I really don't think that's a reasonable suggestion.

@PallHaraldsson
Contributor

No, not "forced" to use a different editor, the code would still work.. and be secure against outside attacks, just not be readable.. - unless you either avoid the letter (in your own code) or use say Juno - or a linter that warns you. Anyway.. just a thought.. and does anyone know if halfwidth/fullwidth and proportional in general is used in editors..?

@ScottPJones
Contributor

I agree with @sbromberger on this one... julia programs are not identical to equations in a math textbook, as much as people seem to try to make it so (I'm not against that, don't get me wrong), but making something that is inconsistent with all programming languages I'm aware of (using a numberlike symbol by itself as an identifier) seems like it will just lead to confusion for many people... What's wrong with having to prefix these 4 characters with i (for identity), a (for array), or m (for matrix), when making a variable name?
I think this should be reverted... (sorry, @jiahao!)

@PallHaraldsson
Contributor

Scott and others:

I just noticed, even just in the REPL:

𝟙 = 1 #yes, easier to see with: a𝟙 = 1

Just like:

丙=3

Yes, here in my Thunderbird editor (and I guess vi) you see no size
difference. [And I do not use Juno.. yet.. I guess I should, postponing
until there is the debugger - a real good reason..]. The REPL is
amazingly just useful enough for me, and vi/less when I do edit(function)..

@PallHaraldsson
Contributor

on fullwidth #5903

@sbromberger
Contributor

I don't understand why a programming language, which can be composed and edited in any one of a number of methods (including pen!), would want to tie its hands with glyphs that can be confused with other symbols and that must rely on the user to avoid specific fonts / methods of editing in order to eliminate confusion and ambiguity.

Just from an accessibility perspective, this becomes a nightmare.

@hayd
Member

hayd commented May 29, 2015

Allowing this is very different from packages actually using it; it's not like these identifiers are going to become widely popular... or used at all outside of this small niche (they'll never end up in base for example). Unicode in julia has been incredibly useful... yet it's still often panned for the same reasons you cite (possible ambiguity, accessibility, needs a modern font).

It's easy to write terrible code with similar looking identifiers, even in ascii. Who are we protecting here?

I don't think 𝟙 should be aliased to 1.

@ScottPJones
Contributor

To me, this is horribly inconsistent... why these four, and not the double strike 2..9, if the argument is that basically any Unicode should be allowed at any position in an identifier...
I don't have any problem with allowing Unicode characters as operators, identifiers, etc., but I think they do need to follow certain fairly standard rules... (i.e. letters, letterlike, plus a few other things for initial character of an identifier... followed by those + numbers, numberlike, etc.).
I also don't think this really is the case that @JeffBezanson brought up... i.e. "I don't think it's the place of a programming language to try to ban characters."
I agree with him, in general, but this isn't banning the character, it is simply saying that its classification means it shouldn't be allowed as an identifier start character, just like 0..9, or :, or +, ...

@jiahao
Member Author

jiahao commented May 30, 2015

@ScottPJones this PR is not merged. There is nothing to revert.

Everyone, let's not complicate things here.

  • Hungarian notation is not idiomatic Julia, so saying that you can always use Hungarian notation is irrelevant.

  • The visual distinguishability issue is in general one we should be cognizant of. However, these specific characters, even written by hand, are designed to be distinguishable from ordinary letters and numerals. In fact, these characters derive from how mathematicians write them on paper and on blackboards. My own rendering of this looks like:

    [photo of the characters drawn by hand]

    Any properly designed font should respect the reason why these characters exist. Therefore I also consider this point irrelevant.

  • The main issue is whether the rules for valid identifiers are already too complicated as they stand.

@JeffBezanson
Sponsor Member

My gut feeling is that unicode character categories provide a good objective basis for decisions like this. I don't think it can be about fonts or appearances one way or the other. After all there are tons of pairs of similar-looking characters in unicode.

However I wouldn't want to normalize 𝟙 and 1 to the same character. The only reason 𝟙 exists is to have a different symbol, not to write digits in a nifty-looking font. The standard arguably got this one wrong, and 𝟙 should have category Sm (math symbol).

@ScottPJones
Contributor

@jiahao I thought it was merged because of the comment by @sbromberger in #10762, i.e.:

... but I see that a commit has already been made to allow this. I'll just go once more on the record that I think it's a bad idea, and will move on.

I also don't approve of Hungarian notation (one of the many evils foisted upon the world by M$, IMO 😀) My point about using i, m, or a, as prefixes was simply that those could retain most of the terseness of using the 𝟙 character, while still being a valid identifier using the current identifier start rules... I never meant to imply that one should use "System" Hungarian notation... (there are still some valid arguments in favor of "Apps" Hungarian notation, not that I use it anyway).

There is a huge visual distinguishability issue about using this, for the many people who use iOS,
and read julia-users, julia-dev, and GitHub on their iPhone or iPad... there's no way (unless you jailbreak your device, something most people won't do) to change the font... and these characters just come out as boxes... without @jiahao's nice photo of a hand drawing, I wouldn't have seen the character at all until I got to my Mac this morning... at least with i𝟙 or a𝟙, you can see that it's probably an identifier... Is typing one extra character so much of a burden?

There is a Unicode standard (annex) about this issue... see https://unicode.org/reports/tr31/

Finally: this is _way_ too complicated already... who can remember these rules (except maybe Dr. @JeffBezanson)? (BTW, why all the special casing of the Sm category? Which ones aren't allowed?)

    return (cat == UTF8PROC_CATEGORY_LU || cat == UTF8PROC_CATEGORY_LL ||
            cat == UTF8PROC_CATEGORY_LT || cat == UTF8PROC_CATEGORY_LM ||
            cat == UTF8PROC_CATEGORY_LO || cat == UTF8PROC_CATEGORY_NL ||
            cat == UTF8PROC_CATEGORY_SC ||  // allow currency symbols
            cat == UTF8PROC_CATEGORY_SO ||  // other symbols

            // math symbol (category Sm) whitelist
            (wc >= 0x2140 && wc <= 0x2a1c &&
             ((wc >= 0x2140 && wc <= 0x2144) || // ⅀, ⅁, ⅂, ⅃, ⅄
              wc == 0x223f || wc == 0x22be || wc == 0x22bf || // ∿, ⊾, ⊿
              wc == 0x22a4 || wc == 0x22a5 ||   // ⊤ ⊥
              (wc >= 0x22ee && wc <= 0x22f1) || // ⋮, ⋯, ⋰, ⋱

              (wc >= 0x2202 && wc <= 0x2233 &&
               (wc == 0x2202 || wc == 0x2205 || wc == 0x2206 || // ∂, ∅, ∆
                wc == 0x2207 || wc == 0x220e || wc == 0x220f || // ∇, ∎, ∏
                wc == 0x2210 || wc == 0x2211 || // ∐, ∑
                wc == 0x221e || wc == 0x221f || // ∞, ∟
                wc >= 0x222b)) || // ∫, ∬, ∭, ∮, ∯, ∰, ∱, ∲, ∳

              (wc >= 0x22c0 && wc <= 0x22c3) ||  // N-ary big ops: ⋀, ⋁, ⋂, ⋃
              (wc >= 0x25F8 && wc <= 0x25ff) ||  // ◸, ◹, ◺, ◻, ◼, ◽, ◾, ◿

              (wc >= 0x266f &&
               (wc == 0x266f || wc == 0x27d8 || wc == 0x27d9 || // ♯, ⟘, ⟙
                (wc >= 0x27c0 && wc <= 0x27c2) ||  // ⟀, ⟁, ⟂
                (wc >= 0x29b0 && wc <= 0x29b4) ||  // ⦰, ⦱, ⦲, ⦳, ⦴
                (wc >= 0x2a00 && wc <= 0x2a06) ||  // ⨀, ⨁, ⨂, ⨃, ⨄, ⨅, ⨆
                (wc >= 0x2a09 && wc <= 0x2a16) ||  // ⨉, ⨊, ⨋, ⨌, ⨍, ⨎, ⨏, ⨐, ⨑, ⨒, ⨓, ⨔, ⨕, ⨖
                wc == 0x2a1b || wc == 0x2a1c)))) || // ⨛, ⨜

            (wc >= 0x1d6c1 && // variants of \nabla and \partial
             (wc == 0x1d6c1 || wc == 0x1d6db ||
              wc == 0x1d6fb || wc == 0x1d715 ||
              wc == 0x1d735 || wc == 0x1d74f ||
              wc == 0x1d76f || wc == 0x1d789 ||
              wc == 0x1d7a9 || wc == 0x1d7c3)) ||

            // super- and subscript +-=()
            (wc >= 0x207a && wc <= 0x207e) ||
            (wc >= 0x208a && wc <= 0x208e) ||

            // angle symbols
            (wc >= 0x2220 && wc <= 0x2222) || // ∠, ∡, ∢
            (wc >= 0x299b && wc <= 0x29af) || // ⦛, ⦜, ⦝, ⦞, ⦟, ⦠, ⦡, ⦢, ⦣, ⦤, ⦥, ⦦, ⦧, ⦨, ⦩, ⦪, ⦫, ⦬, ⦭, ⦮, ⦯

            // Other_ID_Start
            wc == 0x2118 || wc == 0x212E || // ℘, ℮
            (wc >= 0x309B && wc <= 0x309C)); // katakana-hiragana sound marks

@stevengj
Member

@ScottPJones, the reason for the special-casing of category Sm is that this category is something of an intractable mess where parsing is concerned:

  • It contains symbols that we definitely want to allow in identifiers, like \nabla.
  • It contains symbols that we want to parse as infix operators, like \oplus
  • It contains punctuation-like characters such as the U+23a4 bracket fragment that we probably do not want to allow in identifiers or operators.

In practice, having things like \oplus available as infix operators, and things like x⁽⁺⁾ available as identifiers, is just way too useful to abandon, and in practice they don't seem to be confusing because they follow standard mathematical conventions, but they require special-casing.

You're right that, given all the special-casing that we already do in the name of mathematical conventions, it is not completely crazy to special-case bold 0 and 1.

@ScottPJones
Contributor

OK, thanks!
Somewhat OT: I am planning on making this table driven, unless people object... will be a lot faster, save space, etc., and adding new identifiers can be done by just running a julia program to regenerate the table (as a C file, like utf8proc_data.c) (could even be loadable at startup instead). That way it can directly pick up changes to the Unicode standard (instead of having to do it indirectly via remaking utf8proc, linking to a new version of utf8proc, etc.) Thoughts? 🍅 s?
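
One way such a table-driven check could look is a sorted array of inclusive code point ranges searched by bisection. This is only a sketch: the `cp_range` type, the `id_start_extra` table, and `in_range_table` are hypothetical names, and the table holds just three ranges copied from the quoted whitelist rather than a generated, complete list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of code points. */
typedef struct { uint32_t lo, hi; } cp_range;

/* Sorted, non-overlapping sample ranges taken from the quoted whitelist. */
static const cp_range id_start_extra[] = {
    { 0x2140, 0x2144 },  /* double-struck sum through turned sans-serif Y */
    { 0x22C0, 0x22C3 },  /* N-ary big operators */
    { 0x2A00, 0x2A06 },  /* N-ary circled and square operators */
};

/* Binary search over the half-open index interval [lo, hi). */
static bool in_range_table(const cp_range *t, size_t n, uint32_t wc)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < t[mid].lo)
            hi = mid;        /* wc is below this range */
        else if (wc > t[mid].hi)
            lo = mid + 1;    /* wc is above this range */
        else
            return true;     /* wc falls inside this range */
    }
    return false;
}
```

A generated table would replace the three sample rows, and the number of comparisons per lookup would grow only with the logarithm of the table size.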

@stevengj
Member

@ScottPJones, note that the parser also needs Unicode normalization, not just categories etc. Also, this code is not performance-critical last I checked, so that shouldn't be a major design criterion. And since we now maintain utf8proc, updating to new Unicode versions (which doesn't happen very often anyway), is easy already.

@stevengj
Member

(You can work on improving whatever you like, of course, and cleaner code is always welcome. But I just want to suggest that rewriting working, non-performance-critical, code should probably not be your priority.)

@ScottPJones
Contributor

@stevengj This particular pet project would accomplish a few things:

  1. Improve my knowledge of Julia data structures, which I definitely know I will need for our project
  2. Reduce the size of memory for Julia (no 1MB chunk of very spread out data... it's not laid out well for cache performance)
  3. Reduce the dependence of core Julia on an external C library
  4. Develop the techniques in Julia to handle the ones that will probably be more performance critical for me... (for example, normalization)
  5. Have the nice side benefit of speeding up some operations (whether or not it's performance critical) and simplifying the code (and getting it better documented)
    Does that make sense?

@ScottPJones
Contributor

@stevengj About maintaining utf8proc... it can still be maintained by various julians, but does Julia really need to depend on it in the future, if there is a better alternative?
(and I think Unicode standards are going to be coming much quicker... mostly because of adding emoji... that's become a big issue for the Unicode organization)

@stevengj
Member

No, Julia does not need to depend on it if there is a nicer alternative. However, we probably need a C library for normalization, since it has to be executed in the flisp parser. (Updating to new unicode versions should take only a few minutes of work, since the data import is fully automated: update the URLs, run make update, commit, and update the commit number in Julia.)

I just hope that, along with all the emojis, they finally add a superscript "q".

@ScottPJones
Contributor

Well, from what I saw, all that is needed in C is two id character lookup functions, and the normalization function... everything else can be done in Julia, accessing generated tables. (but tables structured so that doing normalization, checking an id, or lower/upper casing of a string doesn't wipe out your L1/L2 cache!)

@jiahao
Member Author

jiahao commented Aug 24, 2015

Closing as too contentious.

@jiahao jiahao closed this Aug 24, 2015
@jiahao jiahao deleted the cjh/doublestruck01 branch October 22, 2015 02:56
Development

Successfully merging this pull request may close these issues.

\Bbbone is not valid input
8 participants