make isuppercase and islowercase agree with Unicode standard #36618

Closed
stevengj opened this issue Jul 11, 2020 · 0 comments · Fixed by #38574
Labels
domain:unicode Related to unicode characters and encodings

stevengj commented Jul 11, 2020

Currently, islowercase checks whether a character is in category Ll, Letter: Lowercase, and isuppercase checks for category Lu, Letter: Uppercase or Lt, Letter: Titlecase.
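For concreteness, the category-based rules described above can be sketched in Python with the standard unicodedata module. This is only an illustration of the rule, not Julia's implementation; the two function names are hypothetical stand-ins.

```python
import unicodedata

def islowercase_by_category(c: str) -> bool:
    # Hypothetical stand-in for the category-based rule: Ll only.
    return unicodedata.category(c) == "Ll"

def isuppercase_by_category(c: str) -> bool:
    # Hypothetical stand-in for the category-based rule: Lu or Lt.
    return unicodedata.category(c) in ("Lu", "Lt")

print(isuppercase_by_category("A"))       # category Lu
print(isuppercase_by_category("\u01c5"))  # Dž, category Lt (titlecase)
print(islowercase_by_category("\u00aa"))  # ª, category Lo
```

Under this rule the titlecase digraph counts as uppercase and ª does not count as lowercase, because only the general category is consulted.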

However, it was recently brought to my attention that there are actually official Unicode derived properties called Lowercase and Uppercase which differ from these definitions.

  • Titlecase characters like Dž (U+01C5) are not considered uppercase. (Note that uppercase('Dž') yields a different character, 'DŽ' (U+01C4), so this makes a certain sense.)
  • Some Lo, Letter: Other characters like ª are included as Lowercase (or Uppercase in other cases like ).

The next version of utf8proc will provide islower and isupper functions compliant with these definitions (JuliaStrings/utf8proc#196), so we may want to switch to them.

(My guess is that it makes little difference in practice — I'm not clear how useful these functions are for general Unicode strings — but the standard here seems fairly sensible. Apparently this is what Python's isupper/islower functions do.)
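The Python behavior mentioned above can be checked directly. The following is a small sketch, assuming a CPython 3 interpreter; describe is a hypothetical helper that reports the general category next to the results of str.isupper and str.islower for the titlecase and Lo examples discussed in this issue.

```python
import unicodedata

def describe(c: str) -> tuple:
    # (general category, str.isupper(), str.islower()) for one character
    return (unicodedata.category(c), c.isupper(), c.islower())

for c in ("\u01c5", "\u00aa"):  # Dž and ª
    print(f"U+{ord(c):04X}", describe(c))

# The uppercase mapping of the titlecase digraph is a different character.
print("\u01c5".upper() == "\u01c4")
```

Here the Lt character is reported as neither upper nor lower, while the Lo character ª is reported as lowercase, which matches the derived-property definitions rather than the category-based ones.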

stevengj added the domain:unicode label Jul 11, 2020