isalpha should use Unicode property Alphabetic; rename to isletter #26932

Open
digital-carver opened this issue Apr 29, 2018 · 5 comments
Labels
domain:unicode Related to unicode characters and encodings

Comments

@digital-carver
Contributor

Right now, isalpha simply checks whether the given character is in one of the L categories (isalpha(c::AbstractChar) = UTF8PROC_CATEGORY_LU <= category_code(c) <= UTF8PROC_CATEGORY_LO). This is almost correct, except that the Unicode Alphabetic property covers these categories, the Nl category (letter numbers, e.g. Roman numerals), and crucially a set of characters defined as Other_Alphabetic that live in Mc and Mn (spacing and non-spacing marks). Many code points in Indic texts, for example most occurrences of vowels in Tamil texts, are characters found in this Other_Alphabetic list.
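
As a rough illustration of the gap described above, the following Python sketch mimics an L-category-only check using the standard unicodedata module. The helper name is_letter_category is hypothetical and this is not Julia's implementation; it only shows that an Nl character and a Tamil vowel sign fall outside the L categories.

```python
import unicodedata

# The five "L" general categories that an L-only check accepts.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

def is_letter_category(ch):
    """Hypothetical helper: True only if ch is in one of the L categories."""
    return unicodedata.category(ch) in LETTER_CATEGORIES

# A few sample characters picked for illustration.
samples = {
    "LATIN SMALL LETTER A": "a",
    "ROMAN NUMERAL EIGHT": "\u2167",   # general category Nl
    "TAMIL VOWEL SIGN AA": "\u0bbe",   # general category Mc
}

for name, ch in samples.items():
    # Print the name, the general category, and the result of the L-only check.
    print(name, unicodedata.category(ch), is_letter_category(ch))
```

Both the Roman numeral and the vowel sign are rejected by the L-only check, even though the Alphabetic property is described above as including them.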

Among the few other (programming) languages I tried this check on, Ruby (\p{Alpha}) and Java (Character.isAlphabetic) get this right (the Java documentation explicitly explains the Alphabetic property), while Python 2 and 3 ("அதிகாலை".isalpha()) both seem to get it wrong. Perl also correctly identifies the Other_Alphabetic characters under \p{Alpha} (though it also seems to have additional magic on top).
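
The Python behaviour mentioned above can be reproduced with a short script. This is only a sketch for Python 3; the variable names are made up for illustration, and the exact categories reported depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

word = "அதிகாலை"

# str.isalpha() is True only if every character in the string passes.
whole_word_is_alpha = word.isalpha()

# Collect the characters that make the whole-word check fail,
# together with their Unicode general categories.
rejected = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch))
    for ch in word
    if not ch.isalpha()
]

print(whole_word_is_alpha)
for code_point, category in rejected:
    print(code_point, category)
```

The rejected characters are the vowel signs, which sit in the mark categories rather than in any L category.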

Other_Alphabetic apparently applies to 1300 code points according to the Unicode PropList, so letters from quite a few scripts currently fail isalpha.

I'm not sure if utf8proc supports querying for either the Alphabetic or the Other_Alphabetic property (the utf8proc_property_struct doesn't seem to have either property), so this might have to be implemented there first. Also, this is possibly related to #25653 with regard to implementation.

@ararslan ararslan added the domain:unicode Related to unicode characters and encodings label Apr 29, 2018
@stevengj
Member

stevengj commented Apr 30, 2018

There are a whole bunch of Unicode character properties that aren't currently in utf8proc, e.g. Other_Alphabetic and Sentence_Terminal and Quotation_Mark and ...

I suspect that, rather than cramming all of these into utf8proc, it would be better to keep utf8proc focused mainly on normalization and have a separate package of UnicodeProperties with a bunch of optimized 2-stage tables (exposed as e.g. a new AbstractSet type) for different character properties.
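
To make the two-stage table idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class name TwoStageSet, the 256-code-point block size, and the tiny hand-picked sample set (three Tamil vowel signs, not data loaded from PropList). It is not the design of any existing package.

```python
BLOCK_BITS = 8                  # assumed block size: 256 code points
BLOCK_SIZE = 1 << BLOCK_BITS
CODE_SPACE = 0x110000           # number of Unicode code points

class TwoStageSet:
    """Hypothetical set-like wrapper around a two-stage lookup table."""

    def __init__(self, members):
        members = frozenset(members)
        self.stage1 = []        # block number -> index into self.blocks
        self.blocks = []        # deduplicated blocks of booleans
        seen = {}
        for start in range(0, CODE_SPACE, BLOCK_SIZE):
            block = tuple(cp in members
                          for cp in range(start, start + BLOCK_SIZE))
            # Identical blocks (e.g. the all-False one) are stored once.
            if block not in seen:
                seen[block] = len(self.blocks)
                self.blocks.append(block)
            self.stage1.append(seen[block])

    def __contains__(self, ch):
        cp = ord(ch)
        block = self.blocks[self.stage1[cp >> BLOCK_BITS]]
        return block[cp & (BLOCK_SIZE - 1)]

# Hand-picked sample for illustration only, not the real property data.
SAMPLE = {0x0BBE, 0x0BBF, 0x0BC8}
sample_set = TwoStageSet(SAMPLE)

print("\u0bbe" in sample_set, "a" in sample_set)
```

Because most blocks of the code space are identical, the second stage stays small, and a membership test costs two index operations regardless of how many code points carry the property.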

@stevengj
Member

In the meantime, maybe isalpha should be renamed to isletter, analogous to Go's unicode.IsLetter.

@ararslan
Member

+1 for isletter; I can never remember whether isalpha is "is alphabetic" or "is alphanumeric."

@StefanKarpinski StefanKarpinski added the status:triage This should be discussed on a triage call label May 8, 2018
@JeffBezanson
Sponsor Member

Triage is ok with renaming to isletter.

@JeffBezanson JeffBezanson added this to the 1.0 milestone May 10, 2018
@JeffBezanson JeffBezanson removed the status:triage This should be discussed on a triage call label May 10, 2018
@JeffBezanson JeffBezanson changed the title from "isalpha should use Unicode property Alphabetic" to "isalpha should use Unicode property Alphabetic; rename to isletter" May 10, 2018
@StefanKarpinski
Sponsor Member

If someone wants to make a PR doing this rename, that would be good; I don't think it's going to happen otherwise, though. @digital-carver? (or @ararslan if you feel like it)
