handle Unicode uppercase/lowercase conversions correctly #774

Closed
StefanKarpinski opened this issue Apr 29, 2012 · 18 comments
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@StefanKarpinski
Member

As pointed out here, there are cases where a single Unicode character must expand into multiple characters when converted to uppercase: e.g. ß to SS and the ligature ﬄ (U+FB04) to FFL. Similar cases may exist for conversion from uppercase to lowercase as well. The towupper and towlower functions that we use for general Unicode case conversion can't handle such transformations, since their signature is Char to Char. Despite this, we should handle these conversions correctly, although I'm not sure where to get code that does this.
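
To make the limitation concrete, here is a minimal sketch in Python (not Julia, and not code from this thread). It assumes Python 3's built-in `str.upper`, which applies the full Unicode case mapping. The helper `simple_upper` is a hypothetical stand-in for a Char to Char function such as towupper: it leaves a character unchanged whenever its uppercase form is not exactly one character.

```python
def simple_upper(text: str) -> str:
    """Emulate a one-character-in, one-character-out mapping.

    Hypothetical helper: if the full uppercase form of a character is
    longer than one character, the character is left unchanged, which is
    the best a Char -> Char signature can do.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def full_upper(text: str) -> str:
    """Full string-level mapping; the result may be longer than the input."""
    return text.upper()


word = "testing\u00df"        # "testingß"
print(simple_upper(word))     # TESTINGß
print(full_upper(word))       # TESTINGSS
print(full_upper("\ufb04"))   # FFL (one ligature character becomes three)
```

The point of the sketch is only that correct uppercasing has to be a string to string operation, because the output length can differ from the input length.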

@ghost ghost assigned StefanKarpinski Apr 29, 2012
@StefanKarpinski
Member Author

More information can be found here: https://www.unicode.org/faq/casemap_charprop.html. On a related note, we should handle correct titlecasing of strings as well, which has its own difficulties.

@6e441f9c

6e441f9c commented May 1, 2012

It looks like the main drawback of ICU is that it converts data to UTF-16 internally by default; there are some UTF-8 friendly interfaces, though.

@StefanKarpinski
Member Author

What is ICU?

@nolta
Member

nolta commented May 1, 2012

IBM's Unicode library: https://site.icu-project.org/

@StefanKarpinski
Member Author

We can maybe just steal the code for case mapping, assuming they handle this issue correctly. We already handle almost all Unicode issues just fine. This is just one that even the standard C Unicode functions fail on.

@JeffBezanson
Member

I'm not even sure what changing letter case is for. We have towupper just because it's there. Case folding or case-insensitive searching seems slightly more useful. One question is whether we want to have our own Unicode tables, which can be pretty massive. People do all kinds of tricks, such as generating code from them.
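
For readers unfamiliar with case folding, the following is a small Python sketch (an illustration, not code from this thread) using the built-in `str.casefold`. The helper names are hypothetical, and the sketch deliberately ignores Unicode normalization, which a robust comparison would also need.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after case folding both sides."""
    return a.casefold() == b.casefold()


def contains_ignoring_case(haystack: str, needle: str) -> bool:
    """Case-insensitive substring search via case folding."""
    return needle.casefold() in haystack.casefold()


# Lowercasing alone does not make these equal, because "ß" stays "ß".
print("Stra\u00dfe".lower() == "STRASSE".lower())               # False
# Case folding maps "ß" to "ss", so the comparison succeeds.
print(equal_ignoring_case("Stra\u00dfe", "STRASSE"))            # True
print(contains_ignoring_case("Hauptstra\u00dfe 5", "STRASSE"))  # True
```

Folding is designed for comparison rather than display, which is why it can be useful even when changing the visible case of text is not.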

@StefanKarpinski
Member Author

> I'm not even sure what changing letter case is for.

Really? You are clearly not a Perl programmer :-P

@nolta
Member

nolta commented Sep 14, 2012

> It looks like the main drawback of ICU is that it converts data to UTF-16 internally by default; there are some UTF-8 friendly interfaces, though

Fortunately, these include the case mapping functions. I've wrapped them up in extras/icu.jl:

julia> uppercase("testingß")
"TESTINGß"

julia> load("icu.jl"); import ICU.*;

julia> uppercase("testingß")
"TESTINGSS"

@aviks
Member

aviks commented Jan 25, 2013

I think the ICU package works well for this. Do we want to have the Unicode tables in Base at any point? Should this issue be closed?

@JeffBezanson
Member

Are we planning to add this to Base?

@StefanKarpinski
Member Author

It seems like a lot of stuff to put into Base. The ICU library is kind of huge for just this one tiny and not-very-common piece of functionality. It would be nice to be able to pull the correct uppercasing logic out of ICU – or get it from somewhere else.

@JeffBezanson
Member

Even just that is quite big: you need the full Unicode tables.

@StefanKarpinski
Member Author

Then I'd say the current situation is a local optimum: we have mostly correct basic uppercasing and lowercasing in Base and if you need the full fancy version, then you use ICU and get it.

@aviks
Member

aviks commented Mar 9, 2013

+1

@hayd
Member

hayd commented Jan 7, 2016

Did titlecase drop off the radar here? (I thought you could do it in the past, but can't find any related issues)

@nalimilan
Member

@hayd At least that's supported by https://github.com/nolta/UnicodeExtras.jl.

@stevengj
Member

stevengj commented Mar 4, 2016

Titlecase info is provided by UTF8proc, but it would be nice to have a little wrapper routine like utf8proc_uppercase to make it easier to access.

@stevengj
Member

Note that utf8proc 2.0 added utf8proc_totitle, so it would now be trivial to add a titlecase function in Julia.
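
As a rough illustration of why titlecase is a separate mapping from uppercase, here is a Python sketch (not Julia, and not a binding to utf8proc_totitle). It relies on Python's built-in `str.title`, which applies the Unicode titlecase mapping to the first cased character; `titlecase_first` is a hypothetical helper introduced only for this example.

```python
def titlecase_first(word: str) -> str:
    """Titlecase only the first character and leave the rest untouched."""
    return word[:1].title() + word[1:]


dz = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON, a single character

# The uppercase and titlecase forms of this digraph are distinct characters.
print(dz.upper() == "\u01c4")   # True: LATIN CAPITAL LETTER DZ WITH CARON
print(dz.title() == "\u01c5")   # True: capital D with small z with caron
print(titlecase_first(dz + "ungla") == "\u01c5ungla")  # True
```

A wrapper around a per-character titlecase function would follow the same shape: map the first character of each word with the titlecase table and the remaining characters with the lowercase table.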
