`isalpha` should use Unicode property `Alphabetic`; rename to `isletter` #26932

digital-carver · 2018-04-29T22:02:25Z

Right now, isalpha simply checks whether the given character is in one of the L categories (isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO). This is almost correct, except that the Unicode Alphabetic property covers not only these categories but also the Nl category (letter numbers, e.g. Roman numerals) and, crucially, a set of characters defined to be Other_Alphabetic that live in Mc and Mn (spacing and non-spacing marks). Many code points in Indic texts, e.g. most occurrences of vowels in Tamil texts, are characters found in this Other_Alphabetic list.

Among the few other (programming) languages I tried this check on, Ruby (\p{Alpha}) and Java (Character.isAlphabetic) get this right (the Java documentation explicitly explains the Alphabetic property), while Python 2 and 3 both ("அதிகாலை".isalpha()) seem to get it wrong. Perl also correctly identifies the Other_Alphabetic characters under \p{Alpha} (though it also seems to have additional magic on top).
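To make the gap concrete, here is a minimal Python sketch that mirrors the L-category-only check and lists the code points of the Tamil word that fall outside it. The helper names `is_letter_category` and `failing_code_points` are hypothetical, introduced only for illustration. Python's `unicodedata` module exposes general categories but not the Alphabetic or Other_Alphabetic properties, so this only shows which characters are rejected; it does not implement the full property.

```python
import unicodedata

# The five general categories making up the "L" (Letter) group.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}


def is_letter_category(ch):
    """Mirror of the L-category-only check quoted above (hypothetical helper)."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


def failing_code_points(text):
    """Return (code point, category, name) for characters outside the L categories."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not is_letter_category(ch)
    ]


# The Tamil word from the report, "அதிகாலை", spelled out as code points.
word = "\u0b85\u0ba4\u0bbf\u0b95\u0bbe\u0bb2\u0bc8"

for entry in failing_code_points(word):
    print(entry)

# str.isalpha() is documented to use the same L categories.
print(word.isalpha())
```

The entries printed are the dependent vowel signs, which carry mark categories rather than letter categories, so any check limited to the L categories rejects the whole word.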

Other_Alphabetic apparently covers 1300 code points according to the Unicode PropList, so letters from quite a few scripts currently fail isalpha.

I'm not sure if utf8proc supports querying for either the Alphabetic or the Other_Alphabetic property (the utf8proc_property_struct doesn't seem to have either property), so this might have to be implemented there first. Also, possibly related to #25653 with regards to implementation.


stevengj · 2018-04-30T15:14:43Z

There are a whole bunch of Unicode character properties that aren't currently in utf8proc, e.g. Other_Alphabetic and Sentence_Terminal and Quotation_Mark and ...

I suspect that, rather than cramming all of these into utf8proc, it would be better to keep utf8proc focused mainly on normalization and have a separate package of UnicodeProperties with a bunch of optimized 2-stage tables (exposed as e.g. a new AbstractSet type) for different character properties.
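As a rough illustration of the two-stage table idea, the sketch below builds a set-like lookup in Python. The class name `TwoStageSet`, the block size of 256, and the use of general category Mc as sample data are all assumptions made for this example; none of them comes from utf8proc or from any existing UnicodeProperties package.

```python
import unicodedata


class TwoStageSet:
    """Set of code points backed by a two-stage lookup table (illustrative sketch)."""

    BLOCK_BITS = 8                      # 256 code points per block (arbitrary choice)
    BLOCK_SIZE = 1 << BLOCK_BITS
    MAX_CODE_POINT = 0x10FFFF

    def __init__(self, code_points):
        members = set(code_points)
        unique_blocks = {}              # block contents -> index into self.stage2
        self.stage1 = []                # block number -> index into self.stage2
        self.stage2 = []                # deduplicated blocks of booleans
        for start in range(0, self.MAX_CODE_POINT + 1, self.BLOCK_SIZE):
            block = tuple((start + i) in members for i in range(self.BLOCK_SIZE))
            if block not in unique_blocks:
                unique_blocks[block] = len(self.stage2)
                self.stage2.append(block)
            self.stage1.append(unique_blocks[block])

    def __contains__(self, ch):
        cp = ord(ch)
        # Stage 1 selects the shared block, stage 2 holds the per-code-point bit.
        block = self.stage2[self.stage1[cp >> self.BLOCK_BITS]]
        return block[cp & (self.BLOCK_SIZE - 1)]


# Sample data: every code point whose general category is Mc (spacing mark).
spacing_marks = TwoStageSet(
    cp for cp in range(0x110000) if unicodedata.category(chr(cp)) == "Mc"
)

print("\u0bbe" in spacing_marks)        # TAMIL VOWEL SIGN AA
print("a" in spacing_marks)
print(len(spacing_marks.stage1), len(spacing_marks.stage2))
```

Because most 256-code-point blocks contain no members at all, they collapse into a single shared block in the second stage, which is what keeps such tables compact while membership stays at two index operations.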

stevengj · 2018-04-30T15:15:34Z

In the meantime, maybe isalpha should be renamed to isletter, analogous to GoLang.

ararslan · 2018-04-30T16:39:40Z

+1 for isletter; I can never remember whether isalpha is "is alphabetic" or "is alphanumeric."

JeffBezanson · 2018-05-10T18:54:57Z

Triage is ok with renaming to isletter.

StefanKarpinski · 2018-05-11T17:57:22Z

If someone wants to make a PR doing this rename, that would be good; I don't think it's going to happen otherwise, though. @digital-carver? (or @ararslan if you feel like it)

ararslan added the domain:unicode Related to unicode characters and encodings label Apr 29, 2018

StefanKarpinski added the status:triage This should be discussed on a triage call label May 8, 2018

JeffBezanson added this to the 1.0 milestone May 10, 2018

JeffBezanson removed the status:triage This should be discussed on a triage call label May 10, 2018

JeffBezanson changed the title ~~isalpha should use Unicode property Alphabetic~~ isalpha should use Unicode property Alphabetic; rename to isletter May 10, 2018

ararslan mentioned this issue May 11, 2018

Rename isalpha to isletter #27077

Merged

JeffBezanson modified the milestones: 1.0, 1.x May 12, 2018

MichaelChirico mentioned this issue May 19, 2020

[BUGZILLA #17798] Update/Determine What to Do With Rlocale Character Determination Tables MichaelChirico/r-bugs#6972

Open

DilumAluthge removed this from the 1.x milestone Mar 13, 2022
