make isuppercase and islowercase agree with Unicode standard #36618

stevengj · 2020-07-11T12:56:26Z

Currently, islowercase checks whether a character is in category Ll, Letter: Lowercase, and isuppercase checks for category Lu, Letter: Uppercase or Lt, Letter: Titlecase.
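The category-based definitions described above can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the rule as stated in this issue, not Julia's actual implementation; the function names `category_islower` and `category_isupper` are hypothetical.

```python
import unicodedata


def category_islower(ch: str) -> bool:
    """Hypothetical helper: lowercase means general category Ll."""
    return unicodedata.category(ch) == "Ll"


def category_isupper(ch: str) -> bool:
    """Hypothetical helper: uppercase means general category Lu or Lt."""
    return unicodedata.category(ch) in ("Lu", "Lt")
```

Under this rule a titlecase character such as U+01C5 counts as uppercase, because it falls in Lt.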

However, it was recently brought to my attention that there are actually official Unicode derived properties called Lowercase and Uppercase which differ from these definitions.

Titlecase characters like ǅ (U+01C5) are not considered uppercase. (Note that uppercase('ǅ') yields a different character 'Ǆ', so this makes a certain sense.)
Some characters outside Ll and Lu are included: ª (Lo, Letter: Other) is Lowercase, and Ⓐ (So, Symbol: Other) is Uppercase.

The next version of utf8proc will provide islower and isupper functions compliant with these definitions (JuliaStrings/utf8proc#196), so we may want to switch to them.

(My guess is that it makes little difference in practice — I'm not clear how useful these functions are for general Unicode strings — but the standard here seems fairly sensible. Apparently this is what Python's isupper/islower functions do.)
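Since Python's methods are mentioned as following the derived properties, a small Python sketch can show how the general category and the derived-property answers diverge for the characters discussed above. The helper name `describe` is hypothetical, and the results depend on the Unicode version bundled with the Python build (this assumes a reasonably recent Python 3).

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (general category, str.isupper result, str.islower result)."""
    return (unicodedata.category(ch), ch.isupper(), ch.islower())


# Characters mentioned in this issue:
#   U+01C5  titlecase digraph
#   U+00AA  feminine ordinal indicator
#   U+24B6  circled Latin capital letter A
for ch in ("\u01c5", "\u00aa", "\u24b6"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

The point of the comparison is that a check based only on the general category cannot reproduce the derived-property answers for these characters.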

stevengj added the domain:unicode label (Related to unicode characters and encodings) Jul 11, 2020

This was referenced Nov 23, 2020

update to utf8proc 2.6 #38551

Merged

Unicode-compliant islower/uppercase #38574

Merged

stevengj closed this as completed in #38574 Dec 18, 2020
