utf8proc doesn't support unicode 6 #7582

Closed
johnmyleswhite opened this issue Jul 13, 2014 · 14 comments · Fixed by #7917
Labels
- domain:unicode (Related to unicode characters and encodings)
- kind:bug (Indicates an unexpected problem or unintended behavior)
- kind:upstream (The issue is with an upstream dependency, e.g. LLVM)

Comments

@johnmyleswhite
Member

When I try to produce a subscript t on the letter "a" by typing in a\_t<TAB>, I get the following behavior:

julia> aₜ
ERROR: syntax: invalid character "ₜ"

In contrast, I can safely produce aᵣ.

@JeffBezanson
Sponsor Member

This is due to utf8proc not supporting unicode 6.0.0. See also #6939
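As a rough illustration of the explanation above, the two subscripts can be looked up in a Unicode character database. This sketch uses Python's standard `unicodedata` module as a stand-in, not utf8proc itself, and `describe` is a hypothetical helper named here only for illustration. U+209C (ₜ) was added in Unicode 6.0, whereas U+1D63 (ᵣ) was assigned earlier, so a database that predates Unicode 6.0 would presumably treat ₜ as unassigned (category `Cn`) while still recognizing ᵣ.

```python
import unicodedata


# Hypothetical helper for illustration; not part of utf8proc or Julia.
def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unassigned>"),
    )


subscript_r = "\u1d63"  # ᵣ, accepted by the parser in the report above
subscript_t = "\u209c"  # ₜ, added in Unicode 6.0 and rejected in the report

# The result depends on which Unicode version the local database ships with.
print("Unicode database version:", unicodedata.unidata_version)
for ch in (subscript_r, subscript_t):
    print(describe(ch))
```

With a database at Unicode 6.0 or later, both characters should come back as modifier letters (`Lm`), which is why an identifier check built on current tables can accept both.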

@JeffBezanson changed the title from "Bug in LaTeX-style substitutions?" to "utf8proc doesn't support unicode 6" on Jul 13, 2014
@stevengj
Member

Unfortunately, I haven't been able to reach the utf8proc maintainer since February. When I contacted him at the time, he said he was willing to make a public Mercurial repo for utf8proc, and was interested in updating for Unicode 6, once he had done some cleanup.

We may have to make a fork of utf8proc in order to import the Unicode 6 tables and make other changes that would be useful for us (e.g. to import the Unicode character-width tables for #6939).

(We could also switch to something like ICU, but I'm a bit reluctant to go that route. utf8proc is a clean, small C library based on UTF-8 that does what we need and no more, while ICU is a huge C++ library based on UTF-16, with lots of extraneous functionality.)

@StefanKarpinski
Sponsor Member

I agree. Let's fork utf8proc. Maybe rename the project to avoid confusion? Calling it libutf8 might be good. We would of course make the derivation from utf8proc clear.

@stevengj
Member

Maybe libutf8proc, just to make the relationship clearer in the name.

@StefanKarpinski
Sponsor Member

Seems reasonable.

@stevengj
Member

I have a cleaned-up git repo as a starting point for work on utf8proc. Should I start a JuliaLang libutf8proc repo?

@jiahao
Member

jiahao commented Jul 15, 2014

go for it

@StefanKarpinski
Sponsor Member

yep, that sounds good.

@stevengj
Member

I don't have admin rights to JuliaLang, so I can't create a repo there. I temporarily put it at stevengj/libutf8proc and added @StefanKarpinski and @jiahao as collaborators. Either one of you should feel free to transfer the repo to JuliaLang.

@stevengj
Member

Or, no, it looks like you have to give me temporary admin rights to transfer it in.

@StefanKarpinski
Sponsor Member

I gave you admin rights. You can now transfer it or create it.

@stevengj
Member

Okay, moved to ~~JuliaLang/libutf8proc~~ JuliaLang/libmojibake

@abcsds

abcsds commented Dec 7, 2016

Hi, I reproduced this problem on Julia 0.5.0. Should I post it to JuliaLang/libmojibake?
I'm using the LaTeX symbol \mid.
Thank you.

julia>0= [0,1]
ERROR: syntax: invalid character ""

julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753 (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM)2 Duo CPU     P7550  @ 2.26GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)

@yuyichao
Contributor

yuyichao commented Dec 7, 2016

  1. It's not libmojibake anymore but back to libutf8proc
  2. Please don't post in an old closed issue. It's most likely unrelated to your problem
  3. IIUC, we intentionally don't allow certain symbols to be used in variable names since they are too confusing with other ones.
