Parse a minimal set of fullwidth punctuation as synonyms #5903

jiahao · 2014-02-22T16:58:11Z

The current Unicode normalization policy (#5576, #5434) is to canonicalize identifiers with NFC. However, NFC is too conservative a choice of canonicalization, since it does nothing to prevent obfuscated code that uses, for example, full-width punctuation characters in identifiers.

Example:

julia> b＝3:5 #full-width equals
ERROR: b＝3 not defined

julia> b＝3=-1
-1

julia> [b＝3:5]
7-element Array{Int64,1}:
 -1
  0
  1
  2
  3
  4
  5

While in general we probably don't want to get into the business of building semantic knowledge of natural languages into the parser, I think at the very least we should support as synonyms the default output produced by standard input method editors. As an example, with the input method set to Pinyin - Simplified IME on OS X 10.9, typing bing1=3 on the keyboard selects the first Chinese character with the phonetic spelling bing, then continues with =3 as part of the input stream. The result, when typed directly into the Julia REPL, is

julia> 丙＝3
ERROR: 丙＝3 not defined

which stems from the full-width ＝ being parsed as part of the identifier rather than the assignment operator, which is arguably what the typical user would have intended.
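To illustrate why NFC alone does not help here, the following is a minimal sketch in Python (not Julia), using only the standard `unicodedata` module. It compares NFC and NFKC on the same three-character input; the variable names are illustrative, and nothing here reflects how the Julia parser is actually implemented.

```python
import unicodedata

# U+4E19, then U+FF1D FULLWIDTH EQUALS SIGN, then ASCII "3"
source = "丙＝3"

nfc = unicodedata.normalize("NFC", source)
nfkc = unicodedata.normalize("NFKC", source)

# NFC only applies canonical equivalences, so the fullwidth sign survives.
print(nfc == source)

# NFKC also applies compatibility mappings, which fold U+FF1D to ASCII "=".
print(nfkc)
```

Under NFC the parser still sees a single run of non-ASCII characters, which is consistent with the `丙＝3 not defined` error above.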

IainNZ · 2014-02-22T17:14:50Z

The rabbit hole really is deep with this one. What does the average CJK user do when writing code, just put the IME into English by default and maybe switch back for comments?

IainNZ · 2014-02-22T17:18:54Z

Haha, this one is quite fun. Julia 0.2:

julia> 私の名前＝"Iain"
ERROR: @私の名前＝_str not defined

Unicode macros could make for some fun looking code...

jiahao · 2014-02-22T17:37:06Z

I can't speak for everyone else, but I do switch back and forth between US ANSI and Pinyin constantly to type the proper halfwidth punctuation. I think many people don't even bother to try typing non-Roman characters into their code.

JeffBezanson · 2014-02-22T18:07:19Z

The idea here is that we only use a small number of ASCII punctuation symbols, and so if other unicode characters are really aliases of those they should be treated the same. For example we already treat 26 unicode characters as whitespace. I think the fullwidth colon and equals are pretty obvious, but it does get murkier. I'm not sure what to do with the large number of quote characters in particular.
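One way to picture the "aliases" idea is a small explicit table that folds selected fullwidth characters to their ASCII counterparts before tokenizing. The sketch below is in Python for illustration only. The table contents, `FULLWIDTH_SYNONYMS`, and `fold_punctuation` are hypothetical names and choices, not anything the Julia parser defines, and this naive version would also rewrite characters inside string literals, which a real parser would need to leave alone.

```python
# Hypothetical minimal set: only the two characters called "pretty obvious".
FULLWIDTH_SYNONYMS = {
    "＝": "=",  # U+FF1D FULLWIDTH EQUALS SIGN
    "：": ":",  # U+FF1A FULLWIDTH COLON
}

# The fullwidth ASCII variants (U+FF01..U+FF5E) sit at a fixed offset
# of 0xFEE0 from the ASCII characters they mirror.
FULLWIDTH_OFFSET = 0xFEE0

TABLE = str.maketrans(FULLWIDTH_SYNONYMS)


def fold_punctuation(source: str) -> str:
    """Replace the listed fullwidth characters with their ASCII synonyms."""
    return source.translate(TABLE)


print(fold_punctuation("b＝3：5"))
# Curly quotes are deliberately not in the table, so they pass through.
print(fold_punctuation("“quoted”"))
```

Keeping the table explicit, rather than folding the whole fullwidth block, leaves the murkier cases such as quote characters undecided.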

wlbksy · 2014-02-23T01:36:48Z

I switch back and forth for this too. I often have trouble telling fullwidth from halfwidth commas, colons, semicolons, exclamation marks, question marks, and parentheses. Periods, quotes, and other brackets seem fine because they are easy to tell apart. This is the first time I have realized there is a fullwidth equals sign.

wlbksy · 2014-02-23T01:43:46Z

I think the equals sign should be dealt with, since it is seldom used in strings. Let's leave the others as they are; you may need fullwidth marks in a string someday.

jiahao · 2014-02-23T19:29:18Z

Here are some Unicode normalization tables that may be useful, particularly the ones for punctuation.

stevengj · 2014-02-24T18:25:37Z

Rather than starting to add custom exceptions to NFC, my preference would be to start with NFKC (which solves the issue here of multiple input modes in Asian languages, as well as e.g. ligatures in Latin scripts or µ vs. μ) and add exceptions as needed (if a convincing real-world case arises where we really want to treat two Kompatible symbols as inequivalent). See #5434.
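For reference, the effect of NFKC on the cases mentioned here can be checked with a short Python sketch using the standard `unicodedata` module. The superscript example is an added illustration of a compatibility mapping that some might want to exempt; it is not a case raised in this thread.

```python
import unicodedata

micro = "\u00b5"        # MICRO SIGN
mu = "\u03bc"           # GREEK SMALL LETTER MU
ligature = "\ufb01le"   # LATIN SMALL LIGATURE FI followed by "le"
squared = "x\u00b2"     # "x" followed by SUPERSCRIPT TWO


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


# NFC keeps micro and mu distinct; NFKC folds micro to mu.
print(nfc(micro) == mu, nfkc(micro) == mu)

# NFKC expands the ligature into plain letters.
print(nfkc(ligature))

# NFKC also flattens the superscript, which may or may not be wanted.
print(nfkc(squared))
```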

stevengj · 2014-06-29T13:46:09Z

Since we settled on NFC, it might be useful to revisit this issue and add a limited set of custom additions to our Unicode normalization.

The µ (micro) vs. μ (mu) issue just came up again (Keno/SIUnits.jl#23) for example, and I would tend to include this exception as well simply because µ is so easy to type on MacOS (option-m).
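A rough sketch of what "NFC plus a limited set of custom additions" could look like follows, written in Python for illustration. `CUSTOM_FOLDS` and `normalize_identifier` are hypothetical names, and the single entry in the table is only the micro/mu case discussed here; this is not the implementation in Julia or utf8proc.

```python
import unicodedata

# Hypothetical exception table applied on top of NFC.
CUSTOM_FOLDS = str.maketrans({
    "\u00b5": "\u03bc",  # MICRO SIGN -> GREEK SMALL LETTER MU
})


def normalize_identifier(name: str) -> str:
    """Apply NFC, then the small table of extra equivalences."""
    return unicodedata.normalize("NFC", name).translate(CUSTOM_FOLDS)


# The micro sign and Greek mu now name the same identifier.
print(normalize_identifier("\u00b5m") == normalize_identifier("\u03bcm"))

# Characters outside the table keep their NFC form; for example,
# SUPERSCRIPT TWO is not flattened the way NFKC would flatten it.
print(normalize_identifier("x\u00b2") == "x\u00b2")
```

The design choice is that every extra equivalence is opt-in, so the set of folded characters stays small and auditable.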

timholy · 2014-07-14T20:36:32Z

Bump. The distinction between micro vs mu is pretty annoying. It would be great to have a decision on this for 0.3.

staticfloat · 2014-07-18T19:45:25Z

@IainNZ I just had to try it out for myself, and your use case looks like it's working in 0.3!

julia> 私の名前="鯖"
"鯖"

julia> 私の名前
"鯖"

The best part about this is that TAB-completion actually works, so I can type 私, hit <TAB> and it'll autocomplete the rest. :P

StefanKarpinski · 2016-11-10T18:03:47Z

The lack of attention for several releases makes me think we can probably let this go until some indefinite time in the future.

stevengj · 2017-01-06T13:32:58Z

Not actually implemented in my PR, though now it's easy to add

JeffBezanson · 2017-07-20T17:33:54Z

Full-width punctuation characters now give "invalid character" parse errors, so I think adding this would be non-breaking. Can probably be deferred.

jiahao added decision labels Feb 22, 2014

stevengj mentioned this issue Feb 24, 2014

canonicalize unicode identifiers #5434

Closed

andrioni mentioned this issue Mar 14, 2014

UTF-8 combining characters and normalization in reverse() #6165

Closed

stevengj mentioned this issue Jun 29, 2014

Use the same $\mu$ character that the REPL produces with \mu-TAB Keno/SIUnits.jl#23

Closed

stevengj mentioned this issue Jul 18, 2014

add custom JULIA normalization? JuliaStrings/utf8proc#11

Closed

JeffBezanson added the parser label Mar 16, 2015

PallHaraldsson mentioned this issue May 29, 2015

Allow four more characters to start identifiers. #11267

Closed

nalimilan mentioned this issue Mar 26, 2016

Wrong LaTeX-Unicode mapping of \varepsilon #14751

Closed

StefanKarpinski added this to the 0.6.0 milestone Sep 13, 2016

StefanKarpinski modified the milestones: 1.0, 0.6.0 Nov 10, 2016

stevengj mentioned this issue Nov 30, 2016

WIP: custom Unicode normalization for Julia identifiers #19464

Merged

stevengj added a commit to stevengj/julia that referenced this issue Dec 1, 2016

normalize fullwidth characters during parsing (fixes JuliaLang#5903)

fd87792

stevengj added a commit to stevengj/julia that referenced this issue Dec 26, 2016

normalize fullwidth characters during parsing (fixes JuliaLang#5903)

61602a4

stevengj added a commit to stevengj/julia that referenced this issue Dec 29, 2016

normalize fullwidth characters during parsing (fixes JuliaLang#5903)

cf61972

stevengj added a commit to stevengj/julia that referenced this issue Dec 31, 2016

Revert "normalize fullwidth characters during parsing (fixes JuliaLan…

741267e

…g#5903)" This reverts commit cf61972.

stevengj added a commit to stevengj/julia that referenced this issue Jan 4, 2017

normalize fullwidth characters during parsing (fixes JuliaLang#5903)

a55a95b

stevengj added a commit to stevengj/julia that referenced this issue Jan 4, 2017

Revert "normalize fullwidth characters during parsing (fixes JuliaLan…

a28c90f

…g#5903)" This reverts commit cf61972.

tkelman closed this as completed in 62c423b Jan 6, 2017

stevengj reopened this Jan 6, 2017

JeffBezanson removed the kind:breaking label Jul 20, 2017

StefanKarpinski modified the milestones: 1.x, 1.0 Jul 20, 2017

yuyichao mentioned this issue Aug 28, 2017

Add \approxeq (≊) as an alias to isapprox as well #23478

Closed

DilumAluthge removed this from the 1.x milestone Mar 13, 2022
