UTF-8 combining characters and normalization in reverse() #6165

andrioni · 2014-03-14T04:47:25Z

julia> "noe\u0308l"
"noël"

julia> reverse("noe\u0308l")
"l̈eon"

julia> length("noe\u0308l")
5

julia> uppercase("baﬄe")
"BAﬄE"

(it appears that Github also doesn't like them, examples taken from here)

Backspace support is also broken on the REPL: if you press backspace twice from "noël", it shows you "no", but actually represents "noe".

What should be our approach in this case? NFKC does solve both issues above, but I'm not sure how good it is to normalize every string before operating on them:

julia> length(normalize_string("noe\u0308l", :NFKC))
4

julia> reverse(normalize_string("noe\u0308l", :NFKC))
"lëon"

julia> uppercase(normalize_string("baﬄe", :NFKC))
"BAFFLE"

(related to #5434, #5903)

jiahao · 2014-03-14T04:49:37Z

I think we decided not to normalize strings by default. Strings are user data.

andrioni · 2014-03-14T04:55:43Z

Yeah, but the problem is how should be string operations like length and reverse be defined: on codepoints (as AFAIK it currently is) or on characters. If the latter, they probably should normalize their input.

JeffBezanson · 2014-03-14T06:32:17Z

I know reversing codepoints is not what you always want, but at least it's a primitive. Having only a NFKC-normalize-and-reverse function can't be right.

StefanKarpinski · 2014-03-14T13:51:14Z

I suspect that doing NFC normalization during reversal is the right default. There can be a normalize::Bool=true option so if you really want to just reverse the codepoints, you can do reverse(str, normalize=false). You want this as an option to the reverse function since otherwise you end up needing normalize and then reverse, creating an unnecessary intermediate copy of the string. Normalizing to NFKC rather than NFC, strikes me as bad idea even though it happens to work for the ligature since it will also change things like 𝕞 to m, which no one would expect the reverse function to do. The ligature example does show that there are cases where you might want to do this, of course. Perhaps the normalize keyword should take a symbol that can be one of :none, :nfc (default), or :nfkc.

nalimilan · 2014-03-14T22:44:25Z

+1

StefanKarpinski · 2014-03-14T23:08:43Z

We might want UnicodeNormalization objects with NFC <: UnicodeNormalization, etc. That way we could pass the normalization as an argument and dispatch on it, which could be quite handy.

stevengj · 2014-03-15T14:29:47Z

What is the use case for reversing strings? The intended application should determine functionality here.

If you want to reverse graphemes, utf8proc can identify the graphemes for you. (Normalization won't necessarily eliminate all combining characters.)

Note that we already have Unicode case folding.

StefanKarpinski · 2014-03-15T16:23:01Z

That's the problem – we don't know, since we're writing a standard library, not an application. You're right that grapheme reversal is probably the better way to do this. It would be good to come up with a reverse function that generally does what you want. It should generally operate at the level of graphemes, but cases like ligatures would, ideally, be split before reversing. Ligature splitting would imply that reverse is not its own inverse, which sucks, but I feel like that's unavoidable with Unicode.

pao · 2014-03-15T16:43:43Z

Silly question: if the xs == reverse(reverse(xs)) invariant is broken, are we dealing with a function that is semantically different than reverse()? Should we be giving this operation a different name?

StefanKarpinski · 2014-03-15T16:52:19Z

Very possible. Maybe reverse should just throw an error if it can't reverse a string in such a way that that identity holds, suggesting that you use a different function.

stevengj · 2014-03-15T17:22:46Z

@StefanKarpinski, even library writers should have some kind of practical application in mind. Otherwise, why include a reverse(::String) function at all?

StefanKarpinski · 2014-03-15T17:25:24Z

Largely because every language standard library seems to have one. I'm not entirely sure what it's useful for except for cute examples with sorting dictionary words. If someone has real world examples of using the reverse function on strings, I'm all ears.

JeffBezanson · 2014-03-15T18:34:31Z

I agree it's not so important. One reason to have it is that reversing UTF-8 code points efficiently is difficult, so on the off chance you need this it's good to have built in. I prefer not to guess what people want, and instead do something primitive so that you can understand what it does and build what you want on top.

stevengj · 2014-03-15T20:28:20Z

See e.g. this discussion on supposed uses for string reversal. It is telling that all of the answers seem to be bogus (there always seems to be a much better way to solve the problem that string reversal supposedly solves). My favorite answer is:

Reversing a string (in place or not) is a very common interview question for basic programming knowledge. A language lacking these built in functions would be difficult to interview for. The candidate would actually have to know something.

I vote that we just remove reverse(::String) function from Base unless a real use can be found. That's the normal criterion for including a function, after all, and it is impossible to sensibly design this function unless we know what it is for.

nalimilan · 2014-03-15T23:03:34Z

This function is probably most useful as a showcase of how smart Julia is when it comes to handling Unicode strings. Not sure that justifies its existence... ;-)

stevengj · 2014-03-15T23:38:34Z

I'm not sure whether including a useless function shows off how smart we are...

JeffBezanson · 2014-03-19T21:47:51Z

It's possible that RevString is more useful than reverse; it's used in a few places to operate on the ends of strings, since otherwise iteration only moves forwards. Perhaps we should keep that and remove reverse(::String).

StefanKarpinski · 2014-03-19T22:06:37Z

We could also have a reverse iteration protocol, which strings already kind of have with prevind.

JeffBezanson · 2014-03-19T22:11:24Z

Adding that (maybe endof which we already have, prev, and isstart) and getting rid of both RevString and reverse(String) would be fine by me.

JeffBezanson · 2014-04-04T03:16:50Z

So shall we deprecate reverse(::String)?

carlobaldassi · 2014-04-04T03:43:23Z

I actually just used it (related: #6276).

StefanKarpinski · 2014-04-04T19:30:21Z

Yeah, I really don't think that removing the reverse function is reasonable.

nalimilan · 2014-04-04T21:00:58Z

@carlobaldassi Wouldn't it be better to use reverse search rather than reverting the whole string just to match a few characters?

StefanKarpinski · 2014-04-04T21:47:59Z

It would but reverse string search has just as many difficulties as reversing a string does if you think about it. And PCRE doesn't support it.

On Apr 4, 2014, at 5:01 PM, Milan Bouchet-Valat [email protected] wrote:

@carlobaldassi Wouldn't it be better to use reverse search rather than reverting the whole string just to match a few characters?

—
Reply to this email directly or view it on GitHub.

nalimilan · 2014-04-04T21:54:03Z

Well, if I want to find the last occurrence of noël in a string, I don't need to reverse both the string and noël; this is just like standard regexp matching, but starting from the end. It looks like the code @carlobaldassi linked to just does that.

carlobaldassi · 2014-04-04T22:57:56Z

It could almost be reworked fairly easily to use rsearch with strings, but more verbosely. I needed to match either of two strings, so I would need 2 searches, and then compare the ranges (with if/else code to account for all cases). Plus, replicating the word-boundary token \b without regexes would be non-trivial.
(That said, that code is not super-critical.)

In any case, I was so annoyed by the experience that I started writing a ReverseRegex type. I have a reasonably complete working prototype now (warning: ugly code, and likely to have horrible performance).

carlobaldassi · 2014-04-04T22:59:32Z

this is just like standard regexp matching, but starting from the end

@nalimilan see also #6276

StefanKarpinski · 2014-04-05T03:07:56Z

Well, if I want to find the last occurrence of noël in a string, I don't need to reverse both the string and noël; this is just like standard regexp matching, but starting from the end.

Are you looking for \u0308 then e or the reverse? If you reverse both the regex and the string, sure it's pretty straightforward, but if you reverse only one, then you still have to answer all the questions.

nalimilan · 2014-04-05T10:07:58Z

@carlobaldassi Yeah, but your solution isn't perfect either (is gnisu a word? :-), and it's reliable only because you know you deal with ASCII characters. If you were looking for a user-provided string, then the Unicode normalization issue would make it quite fragile. Another solution, since you're dealing with short strings anyway, is to get all the matches and take the last one (which was suggested in #6276). Anyway, my point is just that reverse is not super useful for pattern matching, or at least not really necessary (and potentially dangerous).

@StefanKarpinski The point is, you don't need to reverse any of them to find the last occurrence of noël.

carlobaldassi · 2014-04-09T23:48:54Z

If you were looking for a user-provided string, then the Unicode normalization issue would make it quite fragile

I think that @StefanKarpinski's proposal of an argument to reverse would mostly solve this issue.

Another solution, since you're dealing with short strings anyway, is to get all the matches and take the last one

Indeed in this case it would, it'd just be more verbose (not a particularly compelling argument for keeping reverse in Base, I'll concede). But in general, besides performance considerations, reverse-searching with a regex is not exactly equivalent to doing this if it involves quantifiers (e.g. *, *? etc.). In order to express those, you really need to start matching from the end of the string, and match regex atoms from the end of the regex as you go. In other words, reverse both.

stevengj · 2014-12-07T03:19:18Z

With #9261, join(reverse(collect(graphemes("noe\u0308l")))) == "lëon" will reverse the graphemes, doing the right thing with all combining characters regardless of normalization, if that is what you want.

(The collect step could be eliminated by implementing a reverse iterator, but I don't know if anyone actually cares about this functionality enough to make it worthwhile.)

StefanKarpinski · 2014-12-07T03:47:07Z

Perhaps grapheme reversal is the correct default behavior.

stevengj · 2014-12-07T04:05:38Z

If the application of reverse(s::String) is mostly for reversed regex searching (as seems to be the case), it would be better to leave as-is.

ivarne · 2014-12-07T04:35:19Z

Since we have two possible behaviors, we should document which we have picked.

nalimilan · 2014-12-07T11:56:05Z

@stevengj You scared me! join(reverse(collect(graphemes("noe\u0308l")))) == "lëon", not "leön". :-)

This indeed sounds like a very cool feature, and I agree with @StefanKarpinski that it should be the default. I think it would be more intuitive, as it is based on what the standard reference you quoted (http:https://www.unicode.org/reports/tr29/) calls "user-perceived characters". OTC the current behavior only makes sense when using regexps and it requires a deep understanding of Unicode.

I'd actually argue that iterating over strings should also go over graphemes, and that calling reverse on a string should continue to be equivalent to calling reverse on the array obtained via a string iterator. The current situation is quite confusing since it varies depending on how characters are formed:

# ë as a single codepoint, OK
julia> [c for c in "noël"]
4-element Array{Char,1}:
 'n'
 'o'
 'ë'
 'l'

julia> join(reverse([c for c in "noël"]))
"lëon"

julia> reverse("noël")
"lëon"

# ë as two codepoints, weird
julia> [c for c in "noe\u0308l"]
5-element Array{Char,1}:
 'n'
 'o'
 'e'
 '̈'
 'l'

# These really don't make any sense to 80% of users IMHO
julia> join(reverse([c for c in "noe\u0308l"]))
"l̈eon"

julia> reverse("noe\u0308l")
"l̈eon"

If the application of reverse(s::String) is mostly for reversed regex searching (as seems to be the case), it would be better to leave as-is.

I'd rather offer a function to to reverse regex matching (as I suggested at #9249 (comment)), and make reverse act on graphemes.

stevengj · 2014-12-07T15:03:35Z

@nalimilan, making iteration go over graphemes by default would be a huge change because graphemes are substrings, not codepoints (Chars). I wouldn't be in favor of this.

I continue to think that the behavior of any function, reverse included, should be dictated by how it is used. So far, the only practical use for reverse has been for reverse regex searches (and possibly reversed string processing with other external C libraries). And if you are going to regex search for a unicode substring with combining characters, even if you are searching forward, you have to understand combining characters and normalizations.

A reverse-regex function should be discussed in #6276.

My preference would be to simply document this in reverse: say that it reverses codepoints, and is mainly useful for reverse string processing (e.g. regex), but mention the grapheme reversal as an alternative.

nalimilan · 2014-12-07T15:15:27Z

@nalimilan, making iteration go over graphemes by default would be a huge change because graphemes are substrings, not characters. I wouldn't be in favor of this.

Ah, right. Unicode is so complex. Maybe we just need some amount of normalization when iterating and reversing, so that they give the same result for both "noe\u0308l" and "noël" . Keeping iteration and reversal in sync looks like a good idea to me, an currently they fail the "noël" test.

I don't really like the idea that reverse is only useful for regex. If that's the case, then let's just provide a function for that, but let's not have a reverse function which looks like it's generally useful while it isn't.

stevengj · 2014-12-07T15:24:22Z

Problems with just exposing a function for rsearch with regex:

The UI would be confusing because the caller must manually reverse the regex — we can't automate this without implementing our own regex processing.
If we ever do implement pure-Julia regex processing, it will be able to implement rsearch without a reversed regex and we will be forced to make a breaking change.
There are many regex functions in Julia, not just search: you would need reversed versions of ismatch, eachmatch, etcetera.
There might be other external C string-processing libraries where you want to process in reverse, and hence you would still want reverse.

Normalizing on iteration (and reversal) would (a) greatly slow down those functions, (b) involve more memory allocation, and (c) wouldn't fix the problem anyway because NFC eliminates some but not all combining characters.

PallHaraldsson · 2015-07-23T13:54:05Z

@stevengj: "I vote that we just remove reverse(::String) function from Base". I may not have thought this through, but doesn't reverse imply indexing must be well defined (and it isn't - see #9297 (comment))

If/when indexing is gone and you can reintroduce it explicitly, say you want codepoint-indexing or whatever (byte, grapheme), does reverse get trivial then?

PallHaraldsson · 2015-07-23T14:24:51Z

Add breaking label? Appropriate if changing from codepoint-meaning (or just dropping function..). Seems I just do not have the right.

stevengj · 2015-07-23T14:44:43Z

I think this issue can be closed.

Per the above discussion, it seems to have become clear that we need a reverse function for only one reason: efficient reversed-order string processing with external libraries, especially PCRE. And for this purpose, you want to reverse codepoints, not graphemes.

No one has come up with any practical use for reversing graphemes. This is possible in Julia as noted above, via the graphemes iterator, but with no real application it doesn't make sense to make it the default or even to provide a special function in Base.

ScottPJones · 2015-07-23T15:09:28Z

👏 Good to see this closed!

In looking back at #6165 (due to #23612), I realized that we never clearly documented the behavior or rationale.

jiahao added the unicode label Mar 14, 2014

StefanKarpinski mentioned this issue Mar 28, 2014

rsearch doesn't take a Regex as 2nd argument #6276

Closed

JeffBezanson added the decision label Apr 4, 2014

StefanKarpinski mentioned this issue Apr 17, 2014

Introduce splice_buffer! abstraction for IOBuffer in the REPL #6429

Merged

stevengj mentioned this issue Dec 4, 2014

add reverseind(s, i): convert indices in reverse(s) to indices in s #9249

Merged

nalimilan mentioned this issue Dec 8, 2014

add graphemes(s) function to iterate over string graphemes #9261

Merged

ivarne added the kind:breaking This change will break code label Jul 23, 2015

stevengj closed this as completed Jul 23, 2015

stevengj mentioned this issue Sep 19, 2017

WIP: Remove RevString #23612

Closed

stevengj added a commit that referenced this issue Sep 20, 2017

better documentation for string reversal

b0103e9

In looking back at #6165 (due to #23612), I realized that we never clearly documented the behavior or rationale.

stevengj mentioned this issue Sep 20, 2017

better documentation for string reversal #23793

Merged

This was referenced Oct 2, 2017

Adding number of elements to return for first and last #23969

Closed

reverse iteration protocol #23972

Closed

stevengj mentioned this issue Nov 20, 2017

make reverseind generic #24613

Closed

UTF-8 combining characters and normalization in reverse() #6165

UTF-8 combining characters and normalization in reverse() #6165

Comments

andrioni commented Mar 14, 2014

jiahao commented Mar 14, 2014

andrioni commented Mar 14, 2014

JeffBezanson commented Mar 14, 2014

StefanKarpinski commented Mar 14, 2014

nalimilan commented Mar 14, 2014

StefanKarpinski commented Mar 14, 2014

stevengj commented Mar 15, 2014

StefanKarpinski commented Mar 15, 2014

pao commented Mar 15, 2014

StefanKarpinski commented Mar 15, 2014

stevengj commented Mar 15, 2014

StefanKarpinski commented Mar 15, 2014

JeffBezanson commented Mar 15, 2014

stevengj commented Mar 15, 2014

nalimilan commented Mar 15, 2014

stevengj commented Mar 15, 2014

JeffBezanson commented Mar 19, 2014

StefanKarpinski commented Mar 19, 2014

JeffBezanson commented Mar 19, 2014

JeffBezanson commented Apr 4, 2014

carlobaldassi commented Apr 4, 2014

StefanKarpinski commented Apr 4, 2014

nalimilan commented Apr 4, 2014

StefanKarpinski commented Apr 4, 2014

nalimilan commented Apr 4, 2014

carlobaldassi commented Apr 4, 2014

carlobaldassi commented Apr 4, 2014

StefanKarpinski commented Apr 5, 2014

nalimilan commented Apr 5, 2014

carlobaldassi commented Apr 9, 2014

stevengj commented Dec 7, 2014

StefanKarpinski commented Dec 7, 2014

stevengj commented Dec 7, 2014

ivarne commented Dec 7, 2014

nalimilan commented Dec 7, 2014

stevengj commented Dec 7, 2014

nalimilan commented Dec 7, 2014

stevengj commented Dec 7, 2014

PallHaraldsson commented Jul 23, 2015

PallHaraldsson commented Jul 23, 2015

stevengj commented Jul 23, 2015

ScottPJones commented Jul 23, 2015