handle Unicode uppercase/lowercase conversions correctly #774
Comments
More information can be found here: https://www.unicode.org/faq/casemap_charprop.html. On a related note, we should handle correct titlecasing of strings as well, which has its own difficulties.
It looks like the main drawback of ICU is that it converts data to UTF-16 internally by default; there are some UTF-8 friendly interfaces, though.
What is ICU?
IBM's Unicode library: https://site.icu-project.org/
We can maybe just steal the code for case mapping, assuming they handle this issue correctly. We already handle almost all Unicode issues just fine. This is just one that even the standard C Unicode functions fail on.
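To make the titlecasing difficulty concrete, here is a minimal sketch in Python (chosen only for illustration; it says nothing about how Julia or ICU implement this). It relies on Python's built-in `str.upper`, `str.title`, and `str.lower`, which apply the Unicode case mappings; the helper name `case_forms` is hypothetical. The point is that titlecase is a third, distinct mapping: some characters have a titlecase form that is neither their uppercase nor their lowercase form.

```python
def case_forms(ch: str) -> dict:
    """Return the lower, title, and upper forms of a one-character string.

    Hypothetical helper for illustration; it only wraps Python's built-in
    str methods, which apply the Unicode case mappings.
    """
    return {"lower": ch.lower(), "title": ch.title(), "upper": ch.upper()}


# U+01C6 LATIN SMALL LETTER DZ WITH CARON is a digraph with three case forms:
# lowercase U+01C6, titlecase U+01C5, uppercase U+01C4.
dz = case_forms("\u01c6")
print(dz)

# Titlecasing U+00DF (sharp s) expands to two characters, "S" followed by "s",
# while uppercasing it expands to "SS".
sharp_s = case_forms("\u00df")
print(sharp_s)
```

Uppercasing only the first character of a word would therefore give the wrong result for such digraphs, which is why titlecase needs its own mapping data.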
I'm not even sure what changing letter case is for. We have |
Really? You are clearly not a Perl programmer :-P
Fortunately, these include the case mapping functions. I've wrapped them up in extras/icu.jl:
I think the ICU package works well for this. Do we want to have the Unicode tables in Base at any point? Should this issue be closed?
Are we planning to add this to Base?
It seems like a lot of stuff to put into Base. The ICU library is kind of huge for just this one tiny and not-very-common piece of functionality. It would be nice to be able to pull the correct uppercasing logic out of ICU – or get it from somewhere else.
Even just that is quite big --- you need the full Unicode tables.
Then I'd say the current situation is a local optimum: we have mostly correct basic uppercasing and lowercasing in Base, and if you need the full fancy version, then you use ICU and get it.
+1
Did titlecase drop off the radar here? (I thought you could do it in the past, but can't find any related issues)
@hayd At least that's supported by https://github.com/nolta/UnicodeExtras.jl.
Titlecase info is provided by utf8proc, but it would be nice to have a little wrapper routine like
Note that utf8proc 2.0 added |
As pointed out here, there are cases where a single Unicode character needs to be split into multiple characters when converted to uppercase: e.g. ß to SS and the ligature ﬄ to FFL (it's possible that similar cases exist for conversion from uppercase to lowercase as well). The interface of the towupper and towlower functions, which we use for general Unicode case conversion, can't handle such transformations since the signature is Char to Char. Despite this, we should handle these conversions correctly, although I'm not sure where to get code that does this.
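The limitation described in the issue text can be sketched in Python (used here purely for illustration; this is not Julia's code, and `upper_per_char` is a hypothetical stand-in for a Char-to-Char interface such as towupper). Python's built-in `str.upper` and `str.lower` apply the full Unicode mappings, so they can return more characters than they were given, while a one-character-in, one-character-out function has to leave such characters unchanged.

```python
def upper_per_char(s: str) -> str:
    """Hypothetical Char -> Char style uppercasing.

    Each character may only map to exactly one character, so any character
    whose full uppercase form is longer than one code point is left as-is.
    """
    out = []
    for ch in s:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


word = "stra\u00dfe"      # "straße"
ligature = "\ufb04"       # LATIN SMALL LIGATURE FFL, a single code point

full_word = word.upper()              # full mapping: string -> string
limited_word = upper_per_char(word)   # one-to-one mapping only
full_ligature = ligature.upper()

# The reverse direction has such cases too: U+0130 (capital I with dot above)
# lowercases to "i" followed by U+0307 COMBINING DOT ABOVE.
lowered_dotted_i = "\u0130".lower()

print(full_word, limited_word, full_ligature, len(lowered_dotted_i))
```

The design consequence is that a correct case-conversion routine has to operate on strings (or return a sequence of characters per input character) rather than map one character to one character.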