Parse a minimal set of fullwidth punctuation as synonyms #5903

Open
jiahao opened this issue Feb 22, 2014 · 14 comments
Labels
domain:unicode Related to unicode characters and encodings needs decision A decision on this change is needed parser Language parsing and surface syntax

Comments

@jiahao
Member

jiahao commented Feb 22, 2014

The current Unicode normalization policy (#5576, #5434) is to use NFC normalization to canonicalize identifiers. However, NFC is too conservative a choice of canonicalization: it does nothing to prevent obfuscated code that uses, for example, full-width punctuation characters inside identifiers.

Example:

julia> b＝3:5 #full-width equals
ERROR: b＝3 not defined

julia> b＝3=-1
-1

julia> [b＝3:5]
7-element Array{Int64,1}:
 -1
  0
  1
  2
  3
  4
  5

While in general we probably don't want to build semantic knowledge of natural languages into the parser, I think at the very least we should support as synonyms the default output produced by standard input method editors. As an example, with the input method set to Pinyin - Simplified on OS X 10.9, typing bing1=3 on the keyboard selects the first Chinese character with the phonetic spelling bing, then continues with =3 as part of the input stream. The result, when typed directly into the Julia REPL, is

julia> 丙＝3
ERROR: 丙＝3 not defined

which stems from the full-width ＝ being parsed as part of the identifier rather than as the assignment operator, which is arguably what the typical user would have intended.
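
The behaviour above follows from how the two normalization forms treat compatibility characters. Here is a minimal sketch, using Python's standard `unicodedata` module purely to inspect the Unicode data (it does not model Julia's parser). It assumes the REPL input contained U+FF1D (fullwidth equals) and U+FF1A (fullwidth colon):

```python
import unicodedata

FULLWIDTH_EQUALS = "\uFF1D"  # FULLWIDTH EQUALS SIGN
FULLWIDTH_COLON = "\uFF1A"   # FULLWIDTH COLON

# Roughly what an IME in fullwidth mode might hand to the REPL.
source = "b" + FULLWIDTH_EQUALS + "3" + FULLWIDTH_COLON + "5"

nfc = unicodedata.normalize("NFC", source)
nfkc = unicodedata.normalize("NFKC", source)

# NFC only applies canonical mappings, so the fullwidth forms survive.
print(nfc == source)   # True
# NFKC also applies compatibility mappings, folding them to ASCII.
print(nfkc)            # b=3:5
```

So under NFC alone the fullwidth characters are still distinct code points when the parser sees them, which is consistent with the errors shown above.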

@IainNZ
Member

IainNZ commented Feb 22, 2014

The rabbit hole really is deep with this one. What does the average CJK user do when writing code, just put the IME into English by default and maybe switch back for comments?

@IainNZ
Member

IainNZ commented Feb 22, 2014

Haha, this one is quite fun. Julia 0.2:

julia> 私の名前＝"Iain"
ERROR: @私の名前＝_str not defined

Unicode macros could make for some fun looking code...

@jiahao
Member Author

jiahao commented Feb 22, 2014

I can't speak for everyone else, but I do switch back and forth between US ANSI and Pinyin constantly to type the proper halfwidth punctuation. I think many people don't even bother to try typing non-Roman characters into their code.

@JeffBezanson
Sponsor Member

The idea here is that we only use a small number of ASCII punctuation symbols, and so if other unicode characters are really aliases of those they should be treated the same. For example we already treat 26 unicode characters as whitespace. I think the fullwidth colon and equals are pretty obvious, but it does get murkier. I'm not sure what to do with the large number of quote characters in particular.
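
To make the "obvious versus murky" split concrete, here is a rough sketch that inspects the Unicode character database with Python's standard `unicodedata` module. The helper `fullwidth_form` and the particular characters checked are illustrative assumptions, not a proposed list:

```python
import unicodedata

# The fullwidth ASCII variants U+FF01..U+FF5E mirror U+0021..U+007E
# at a fixed offset.
FULLWIDTH_OFFSET = 0xFEE0

def fullwidth_form(ascii_char):
    """Return the fullwidth counterpart of a printable ASCII character (hypothetical helper)."""
    return chr(ord(ascii_char) + FULLWIDTH_OFFSET)

# The "obvious" cases: each one carries a <wide> compatibility mapping
# back to the ASCII character.
for ascii_char in "=:,;()":
    wide = fullwidth_form(ascii_char)
    print(ascii_char, unicodedata.name(wide), unicodedata.decomposition(wide))

# The murkier cases: curly quotes and CJK corner brackets have no
# decomposition at all, so no normalization form maps them to ASCII quotes.
for quote in "\u201C\u201D\u2018\u2019\u300C\u300D":
    print(unicodedata.name(quote), repr(unicodedata.decomposition(quote)))
```

The first loop reports mappings such as `<wide> 003D`, while the second reports empty decompositions. Any aliasing of those quote characters would therefore have to be a hand-made decision rather than something read off the Unicode data.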

@wlbksy
Contributor

wlbksy commented Feb 23, 2014

I switch back and forth for this. I often have trouble telling fullwidth and halfwidth commas, colons, semicolons, exclamation marks, question marks, and parentheses apart. Periods, quotes, and other brackets seem fine because they are easy to identify. This is the first time I have realized there is a fullwidth equals sign.

@wlbksy
Contributor

wlbksy commented Feb 23, 2014

I think the equals sign should be handled, since it is seldom used in strings. Let's leave the others as they are; you may need fullwidth marks in a string someday.
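
One way the concern about strings could be addressed is to fold aliases only outside string literals. The following is a rough sketch under strong assumptions. `fold_outside_strings` is a hypothetical helper, it recognizes only double-quoted literals with backslash escapes, and it ignores comments, character literals, and triple-quoted strings. It is not how any real parser does this:

```python
def fold_outside_strings(source, aliases):
    """Replace characters found in `aliases`, except inside "..." literals."""
    out = []
    in_string = False
    escaped = False
    for ch in source:
        if in_string:
            out.append(ch)          # string contents are copied verbatim
            if escaped:
                escaped = False     # this character was escaped; skip checks
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False   # closing quote
        elif ch == '"':
            in_string = True        # opening quote
            out.append(ch)
        else:
            out.append(aliases.get(ch, ch))
    return "".join(out)

ALIASES = {"\uFF1D": "="}  # fold only the fullwidth equals sign

code = 's\uFF1D"a\uFF1Db"'
print(fold_outside_strings(code, ALIASES))  # the = outside the literal is folded
```

Here the fullwidth equals sign used as an operator becomes `=`, while the one inside the string literal is left untouched.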

@jiahao
Member Author

jiahao commented Feb 23, 2014

Here are some Unicode normalization tables that may be useful, particularly the ones for punctuation.

@stevengj
Member

Rather than starting to add custom exceptions to NFC, my preference would be to start with NFKC (which solves the issue here of multiple input modes in asian languages, as well as e.g. ligatures in Latin scripts or µ vs. μ) and add exceptions as needed (if a convincing real-world case arises where we really want to treat two Kompatible symbols as inequivalent). See #5434.
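
As a rough illustration of "NFKC plus exceptions", the sketch below uses Python's standard `unicodedata` module. The helper `nfkc_except` is hypothetical, and superscript two is chosen purely as an example of a character one might want to exempt. Splitting the text at exempted characters is a simplification that could mishandle combining marks adjacent to them:

```python
import unicodedata

def nfkc_except(text, keep):
    """NFKC-normalize `text`, passing characters in `keep` through unchanged."""
    out, run = [], []
    for ch in text:
        if ch in keep:
            # Normalize the run collected so far, then emit the exempt character.
            out.append(unicodedata.normalize("NFKC", "".join(run)))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.append(unicodedata.normalize("NFKC", "".join(run)))
    return "".join(out)

# What plain NFKC does to the cases mentioned above.
print(unicodedata.normalize("NFKC", "\uFB01"))               # fi (ligature split)
print(unicodedata.normalize("NFKC", "\u00B5") == "\u03BC")   # True (micro -> mu)
print(unicodedata.normalize("NFKC", "x\u00B2"))              # x2 (superscript lost)

# With superscript two exempted, the other compatibility mappings still apply.
print(nfkc_except("x\u00B2\uFF1D\u00B5", {"\u00B2"}))
```

The last call keeps the superscript two while still folding the fullwidth equals sign and the micro sign.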

@stevengj
Member

Since we settled on NFC, it might be useful to revisit this issue and add a limited set of custom additions to our Unicode normalization.

The µ (micro) vs. μ (mu) issue just came up again (Keno/SIUnits.jl#23) for example, and I would tend to include this exception as well simply because µ is so easy to type on MacOS (option-m).
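
The opposite approach, NFC plus a short list of custom additions, could look roughly like this. `CUSTOM_FOLDS` and `normalize_source` are hypothetical names, and the three entries are only the cases raised in this thread, not a decided list:

```python
import unicodedata

# Hypothetical hand-picked additions applied on top of NFC.
CUSTOM_FOLDS = str.maketrans({
    "\u00B5": "\u03BC",  # MICRO SIGN -> GREEK SMALL LETTER MU
    "\uFF1D": "=",       # FULLWIDTH EQUALS SIGN
    "\uFF1A": ":",       # FULLWIDTH COLON
})

def normalize_source(text):
    """Apply NFC, then the small custom table; everything else is left as NFC leaves it."""
    return unicodedata.normalize("NFC", text).translate(CUSTOM_FOLDS)

print(normalize_source("\u00B5m") == "\u03BCm")   # True: micro folded to mu
print(normalize_source("b\uFF1D3\uFF1A5"))        # b=3:5
print(normalize_source("x\u00B2") == "x\u00B2")   # True: superscript untouched
```

Unlike full NFKC, characters outside the table (superscripts, ligatures, and so on) stay distinct. That is the trade-off between the two proposals.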

@timholy
Sponsor Member

timholy commented Jul 14, 2014

Bump. The distinction between micro vs mu is pretty annoying. It would be great to have a decision on this for 0.3.

@staticfloat
Sponsor Member

@IainNZ I just had to try it out for myself, and your use case looks like it's working in 0.3!

julia> 私の名前="鯖"
"鯖"

julia> 私の名前
"鯖"

The best part about this is that TAB-completion actually works, so I can type , hit <TAB> and it'll autocomplete the rest. :P

@StefanKarpinski
Sponsor Member

The lack of attention for several releases makes me think we can probably let this go until some indefinite time in the future.

stevengj added a commit to stevengj/julia that referenced this issue Dec 31, 2016
stevengj added a commit to stevengj/julia that referenced this issue Jan 4, 2017
@tkelman tkelman closed this as completed in 62c423b Jan 6, 2017
@stevengj
Member

stevengj commented Jan 6, 2017

Not actually implemented in my PR, though now it's easy to add

@stevengj stevengj reopened this Jan 6, 2017
@JeffBezanson
Sponsor Member

Full-width punctuation characters now give "invalid character" parse errors, so I think adding this would be non-breaking. Can probably be deferred.

@JeffBezanson JeffBezanson removed the kind:breaking This change will break code label Jul 20, 2017
@StefanKarpinski StefanKarpinski modified the milestones: 1.x, 1.0 Jul 20, 2017
@DilumAluthge DilumAluthge removed this from the 1.x milestone Mar 13, 2022
9 participants