handle Unicode uppercase/lowercase conversions correctly #774

StefanKarpinski · 2012-04-29T17:30:07Z

As pointed out here, there are cases where a single Unicode character needs to be split into multiple characters when converted to uppercase: e.g. ß to SS and ﬄ to FFL (it's possible that similar cases exist for conversion from uppercase to lowercase as well). The interface of the towupper and towlower functions which we use for general Unicode case conversion can't handle such transformations since the signature is Char to Char. Despite this, we should handle these conversions correctly, although I'm not sure where to get code that does this.
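To make the limitation concrete, here is a minimal sketch of a string-level uppercase that can expand one character into several. It assumes a recent Julia (1.x). `SPECIAL_UPPER` and `full_uppercase` are hypothetical names, and the table holds only three entries copied from Unicode's SpecialCasing.txt, not the full data set.

```julia
# Hypothetical, deliberately tiny table of one-to-many uppercase mappings.
# A real implementation would generate this from Unicode's SpecialCasing.txt.
const SPECIAL_UPPER = Dict{Char,String}(
    'ß' => "SS",
    'ﬄ' => "FFL",
    'ﬁ' => "FI",
)

# Map each character to a String rather than a Char, so that a single
# input character may produce several output characters.
function full_uppercase(s::AbstractString)
    return join(get(SPECIAL_UPPER, c, string(uppercase(c))) for c in s)
end

println(full_uppercase("testingß"))
```

The key design point is the return type of the per-character step: once it is a string rather than a character, the expansion cases stop being special.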


StefanKarpinski · 2012-04-29T17:54:41Z

More information can be found here: https://www.unicode.org/faq/casemap_charprop.html. On a related note, we should handle correct titlecasing of strings as well, which has its own difficulties.

6e441f9c · 2012-05-01T09:35:38Z

It looks like the main drawback of ICU is that it converts data to UTF-16 internally by default; there are some UTF-8 friendly interfaces, though.

StefanKarpinski · 2012-05-01T15:00:13Z

What is ICU?

nolta · 2012-05-01T15:04:20Z

IBM's Unicode library: https://site.icu-project.org/

StefanKarpinski · 2012-05-01T15:21:11Z

We can maybe just steal the code for case mapping, assuming they handle this issue correctly. We already handle almost all Unicode issues just fine. This is just one that even the standard C Unicode functions fail on.

JeffBezanson · 2012-05-01T21:44:06Z

I'm not even sure what changing letter case is for. We have towupper just because it's there. Case folding or case-insensitive searching seems slightly more useful. One question is whether we want to have our own Unicode tables, which can be pretty massive. People use all kinds of tricks, such as generating code from them.
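For comparison, a case-insensitive match built on case folding might look like the sketch below. It assumes a recent Julia (1.x), where the `Unicode` standard library exposes `Unicode.normalize` with a `casefold` keyword; `fold`, `caseless_equal`, and `caseless_occursin` are hypothetical helper names and were not part of Julia when this comment was written.

```julia
using Unicode

# Fold case (and apply the default canonical composition) so that strings
# differing only in letter case compare equal.
fold(s::AbstractString) = Unicode.normalize(s, casefold=true)

# Case-insensitive equality: compare the folded forms.
caseless_equal(a::AbstractString, b::AbstractString) = fold(a) == fold(b)

# Case-insensitive substring search: fold both needle and haystack first.
caseless_occursin(needle::AbstractString, haystack::AbstractString) =
    occursin(fold(needle), fold(haystack))

println(caseless_equal("Julia", "JULIA"))
println(caseless_occursin("unicode", "Handle UNICODE correctly"))
```

Folding both sides avoids having to choose between uppercase and lowercase as the comparison form, which is where one-to-many mappings otherwise cause trouble.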

StefanKarpinski · 2012-05-02T18:42:26Z

> I'm not even sure what changing letter case is for.

Really? You are clearly not a Perl programmer :-P

nolta · 2012-09-14T19:25:05Z

> It looks like the main drawback of ICU is that it converts data to UTF-16 internally by default; there are some UTF-8 friendly interfaces, though

Fortunately, these include the case mapping functions. I've wrapped them up in extras/icu.jl:

julia> uppercase("testingß")
"TESTINGß"

julia> load("icu.jl"); import ICU.*;

julia> uppercase("testingß")
"TESTINGSS"

aviks · 2013-01-25T17:01:12Z

I think the ICU package works well for this. Do we want to have the Unicode tables in Base at any point? Should this issue be closed?

JeffBezanson · 2013-03-09T00:01:32Z

Are we planning to add this to Base?

StefanKarpinski · 2013-03-09T00:19:13Z

It seems like a lot of stuff to put into Base. The ICU library is kind of huge for just this one tiny and not-very-common piece of functionality. It would be nice to be able to pull the correct uppercasing logic out of ICU – or get it from somewhere else.

JeffBezanson · 2013-03-09T00:23:22Z

Even just that is quite big: you need the full Unicode tables.

StefanKarpinski · 2013-03-09T00:25:44Z

Then I'd say the current situation is a local optimum: we have mostly correct basic uppercasing and lowercasing in Base and if you need the full fancy version, then you use ICU and get it.

aviks · 2013-03-09T10:34:09Z

+1

hayd · 2016-01-07T07:01:02Z

Did titlecase drop off the radar here? (I thought you could do it in the past, but can't find any related issues)

nalimilan · 2016-01-07T14:22:15Z

@hayd At least that's supported by https://github.com/nolta/UnicodeExtras.jl.

stevengj · 2016-03-04T17:40:13Z

Titlecase info is provided by utf8proc, but it would be nice to have a little wrapper routine like utf8proc_uppercase to make it easier to access.

stevengj · 2016-11-30T15:49:02Z

Note that utf8proc 2.0 added utf8proc_totitle, so it would now be trivial to add a titlecase function in Julia.
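As a rough illustration of how little is needed once a per-character titlecase mapping exists, here is a minimal sketch. It assumes a recent Julia (1.x), where `titlecase(::Char)` and `isletter` are available in Base; `simple_titlecase` is a hypothetical name, the word-boundary rule (any non-letter starts a new word) is a simplification, and this is not necessarily how the function was eventually implemented.

```julia
# Titlecase the first letter of each word and lowercase the rest.
# A "word" here is simply a maximal run of letters.
function simple_titlecase(s::AbstractString)
    out = IOBuffer()
    at_word_start = true
    for c in s
        if isletter(c)
            # Use the titlecase mapping at a word start; for digraph
            # characters this can differ from the uppercase mapping.
            print(out, at_word_start ? titlecase(c) : lowercase(c))
            at_word_start = false
        else
            print(out, c)
            at_word_start = true
        end
    end
    return String(take!(out))
end

println(simple_titlecase("hello wORLD"))
```

Like the Char-to-Char uppercase discussed earlier in the thread, this per-character approach cannot handle mappings that expand one character into several.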

ghost assigned StefanKarpinski Apr 29, 2012

StefanKarpinski closed this as completed Mar 9, 2013

StefanKarpinski mentioned this issue Nov 30, 2016

add titlecase function #19465

Closed
