add Unicode.isequal_normalized function #42493

Merged on Oct 13, 2021 (7 commits)

Conversation

stevengj
Member

@stevengj stevengj commented Oct 4, 2021

This PR adds a function isequal_normalized (originally named isequivalent) to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks).

Previously, the only way to do this was to call Unicode.normalize on both strings and compare the normalized copies, which seemed a bit wasteful. The new isequal_normalized function instead calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays. It seems to be about 2x faster than calling normalize in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings, for which the loop can break early. (If we could stack-allocate small arrays it might get faster.)
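
As a rough illustration of the difference described above, here is a minimal sketch comparing the two approaches. It assumes Julia 1.8 or later (where isequal_normalized is available); the sample strings and variable names are made up for the example and are not taken from the PR's tests.

```julia
using Unicode  # stdlib; isequal_normalized requires Julia 1.8 or later

# Two spellings of "noël": precomposed ë (U+00EB) vs. e + combining diaeresis (U+0308).
s1 = "no\u00EBl"
s2 = "noe\u0308l"

# A plain comparison sees different code points.
plain = (s1 == s2)

# Older approach: build normalized copies of both strings, then compare them.
via_normalize = Unicode.normalize(s1, :NFC) == Unicode.normalize(s2, :NFC)

# New approach: compare without constructing the normalized strings.
direct = Unicode.isequal_normalized(s1, s2)

# Optional keywords: ignore combining marks, or ignore case.
nomarks = Unicode.isequal_normalized(s2, "noel"; stripmark=true)
nocase = Unicode.isequal_normalized(s1, "NO\u00CBL"; casefold=true)

println((plain, via_normalize, direct, nomarks, nocase))
```

Both approaches should agree on the result; the difference is only in how much temporary storage is allocated along the way.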

(In the future, we might also want to add Unicode.isless_normalized and Unicode.cmp_normalized functions for comparing Unicode strings, but isequal_normalized seemed like a good start.)

@stevengj stevengj added the domain:unicode Related to unicode characters and encodings label Oct 4, 2021
@simeonschaub
Member

It's not really clear to me what isequivalent does just from the name. Perhaps something more verbose like is_normalized_equal would be slightly clearer?

@dkarrasch
Member

Nitpick: indentation should be 4 spaces, I guess.

@stevengj
Member Author

stevengj commented Oct 5, 2021

Indentation should be fixed now. (For some reason, vscode was detecting Unicode.jl as using 2-space indentation and adjusted my code accordingly.)

@stevengj
Member Author

stevengj commented Oct 5, 2021

Maybe isequal_normalized?

Alternatively, we could just call it Unicode.isequal and not export it?

@simeonschaub
Member

> Maybe isequal_normalized?

That sounds good to me. I don't want to hold this up any further though if others don't mind the name.

@stevengj
Member Author

Renamed to isequal_normalized.

@stevengj stevengj changed the title add Unicode.isequivalent function add Unicode.isequal_normalized function Oct 13, 2021
@vtjnash vtjnash merged commit 7e81414 into master Oct 13, 2021
@vtjnash vtjnash deleted the sgj/unicode-equivalence branch October 13, 2021 19:42
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
This adds a function `isequal_normalized` to the Unicode stdlib to check whether two strings are canonically equivalent (optionally casefolding and/or stripping combining marks).

Previously, the only way to do this was to call `Unicode.normalize` on the two strings, to construct normalized versions, but this seemed a bit wasteful — the new `isequal_normalized` function calls lower-level functions in utf8proc to accomplish the same task while only allocating 4-codepoint (16-byte) temporary arrays.  It seems to be about 2x faster than calling `normalize` in the expensive case where the strings are equivalent, and is potentially much faster for inequivalent strings for which the loop can break early.  (If we could stack-allocate small arrays it might get faster.)

(In the future, we might also want to add `Unicode.isless_normalized` and `Unicode.cmp_normalized` functions for comparing Unicode strings, but `isequal_normalized` seemed like a good start.)
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
StefanKarpinski pushed a commit that referenced this pull request Dec 19, 2023
Fixes #52408.

(Note that this function was added in Julia 1.8, in #42493.)

In the future it would be good to further optimize this function by
adding a fast path for the common case of strings that are mostly ASCII
characters. Perhaps simply skip ahead to the first byte that doesn't
match before we begin doing decomposition etcetera.
KristofferC pushed a commit that referenced this pull request Dec 23, 2023
(cherry picked from commit 3b250c7)