
wip: string overhaul #24439

Closed
wants to merge 39 commits into from

Conversation

StefanKarpinski
Sponsor Member

Making a PR to run CI.

base/iostream.jl Outdated
@@ -336,12 +336,12 @@ function read(s::IOStream, nb::Integer; all::Bool=true)
end

## Character streams ##
-const _chtmp = Ref{Char}()
+const _chtmp = Ref{UInt32}()
Member

Unrelated, but this should now be inside the function :)

@bkamins
Member

bkamins commented Nov 2, 2017

Nice! For testing purposes I have written a verbose version of the decoder here https://gist.github.com/bkamins/5a2f89cab14d434e3f3e4a23c121a8a2 that checks for UTF-8 validity (it is slower, of course).
I have checked that both implementations return the same values for all 1112064 valid UTF-8 sequences.
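Schematically, the exhaustive comparison amounts to a loop like the one below (fast_decode and reference_decode are placeholder names for the two decoders, not functions from Base or the gist):

# Hypothetical harness: check that both decoders agree on every Unicode scalar value.
for u in 0x000000:0x10FFFF
    0xD800 <= u <= 0xDFFF && continue          # skip surrogates (not scalar values)
    bytes = Vector{UInt8}(string(Char(u)))     # canonical UTF-8 encoding of u
    @assert fast_decode(bytes) == reference_decode(bytes)
end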

base/char.jl Outdated
((u & 0x00ff0000) >> 4) ⊻ ((u & 0xff000000) >> 6)
end

function convert(::Type{Char}, u::UInt32)
Member

Tested, and this works OK for all valid Unicode (up to the last code point, U+10FFFF). The conversion breaks for invalid Unicode as it overflows at 0x00400000.
I understand that we want to avoid branching here, so adding a test that u is not too large is not an option? If so, this should probably be documented somewhere.
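For reference, a plain branching encoder that produces the standard UTF-8 byte layout for a code point would look roughly like the sketch below (just an illustration of the target bit pattern with right-aligned bytes, not the code in this PR):

# Illustration only: pack a code point into right-aligned UTF-8 bytes in a UInt32.
# Only u <= 0x10FFFF is valid Unicode; a code point needing more than 21 bits
# does not fit the 4-byte pattern at all, which is where a non-branching
# conversion has to overflow.
function utf8_bytes(u::UInt32)
    if u < 0x80
        return u                                          # 0xxxxxxx
    elseif u < 0x800
        return 0x0000c080 | ((u >> 6) << 8) | (u & 0x3f)  # 110xxxxx 10xxxxxx
    elseif u < 0x10000
        return 0x00e08080 | ((u >> 12) << 16) |
               (((u >> 6) & 0x3f) << 8) | (u & 0x3f)      # 1110xxxx 10xxxxxx 10xxxxxx
    else
        return 0xf0808080 | ((u >> 18) << 24) | (((u >> 12) & 0x3f) << 16) |
               (((u >> 6) & 0x3f) << 8) | (u & 0x3f)      # 11110xxx then three 10xxxxxx
    end
end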

Sponsor Member Author

I'm still contemplating how best to handle invalid code points. There was previously an error check here but it caused a massive build time regression (it was also calling a deprecated error type), so I just took it out for now. Will have to figure out what to do here.

@bkamins
Member

bkamins commented Nov 2, 2017

With respect to validity checking:
Neither convert(::Type{UInt32}, c::Char) nor convert(::Type{Char}, u::UInt32) is injective: different arguments (fortunately invalid ones, for both functions) are mapped to the same value.
Thus convert(UInt32, convert(Char, u)) does not have to equal u if u is invalid, and similarly convert(Char, convert(UInt32, c)) does not have to equal c if c is invalid.
In this light, maybe something similar to the @inbounds macro would be a good idea (for turning validity checking of the argument on and off)?

@StefanKarpinski
Sponsor Member Author

Either that or change the transformations to be bijective. I’m considering how to do that.

@bkamins
Member

bkamins commented Nov 2, 2017

My thoughts on the bijection (I hope there is no error in this 😄, but I have tested it not only in my head but also by implementation).

The best I could come up with is doable but a bit messy (fortunately only for invalid values). Here is an outline of how it can be done (describing the mapping from UTF-8 to UInt32):

  • assume that all 3-byte ED sequences are valid (only for the bijection in conversion, not for validity testing; this simplifies things);
  • decode all valid UTF-8 byte sequences in Char to their UInt32 representation;
  • leave all invalid UTF-8 byte sequences in Char whose representation is larger than 0x10FFFF unaltered (so we apply the identity function to them: they were invalid in UTF-8 and remain invalid in the UInt32 representation);
  • map all invalid UTF-8 byte sequences in Char whose representation is less than 0x10FFFF to the representations of valid UTF-8 byte sequences in Char that are larger than 0x10FFFF.

There are exactly 1112064 valid Chars greater than 0x10FFFF (call them large valid) and the same number of invalid Chars less than 0x10FFFF (call them small invalid). The trick is that most small invalid Chars are equal to decoded large valid Chars, so the conversion is simple.
Only 1920 small invalid values, in the range from 128 to 2047, are not directly mapped to large valid Chars (as those code points are encoded by two-byte small valid Chars). They have to be mapped to the large valid Chars that, when decoded, produce two-byte small valid Chars (again there are 1920 such values and they are pretty simple to identify).
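For concreteness, the counts used above fall out of simple arithmetic:

# Unicode scalar values: all code points up to U+10FFFF minus the 2048 surrogates.
n_scalars  = 0x110000 - 0x800    # 1_112_064
# Two-byte UTF-8 sequences cover exactly U+0080 through U+07FF.
n_two_byte = 0x800 - 0x80        # 1_920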

while 1 < n
    p = position(f)
    b = read(f, UInt8)
    if b & 0xc0 != 0x80
Member

Here we should also handle sequences starting with 0xC0, 0xC1, 0xE0, or 0xF0 that would be overlong, starting with 0xED that would be reserved (surrogates), and starting with 0xF4 that would encode a too-large code point.
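Roughly, the extra per-lead-byte checks being suggested would look something like this (a sketch of the Unicode 10 rules, not proposed code for this PR):

# Sketch: reject sequences that are invalid from the lead byte plus the first
# continuation byte alone, per Unicode 10.
function strict_lead_ok(lead::UInt8, next::UInt8)
    (lead == 0xc0 || lead == 0xc1) && return false   # always overlong 2-byte
    lead == 0xe0 && next < 0xa0    && return false   # overlong 3-byte
    lead == 0xed && next >= 0xa0   && return false   # encodes a surrogate
    lead == 0xf0 && next < 0x90    && return false   # overlong 4-byte
    lead == 0xf4 && next >= 0x90   && return false   # above U+10FFFF
    return true
end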

Sponsor Member Author

I'm not sure about that. I think parsing overlong encodings as a single char is much friendlier behavior. They occur fairly often in real data; many systems (e.g. Java) use overlong encodings to avoid embedded NUL bytes.

base/io.jl Outdated
c = UInt32(b0)
if n <= 4
    while 1 < n
        peek(s) & 0xc0 == 0x80 || break
Member

The same comment as above regarding handling invalid sequences.

function peekchar(s::IOStream)
-    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{Char}), s, _chtmp) < 0
+    chref = Ref{UInt32}()
+    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{UInt32}), s, chref) < 0
Member

The current implementation of ios_peekutf8 also does not check for invalid sequences.

Member
@stevengj Nov 6, 2017

Seems a bit wasteful for ios_peekutf8 to convert the UTF-8 encoding to UInt32, and then to convert it back when returning Char(chref[]).

Since this seems to be the only function in all of Julia that calls ios_peekutf8, we could just re-write ios_peekutf8 to return Char. However, ios_peekutf8 is also used in src/flisp, so we may need to keep that as-is and instead write an ios_peekrawutf8 function to get Char.

Sponsor Member Author

Agree, this stuff should be optimized to avoid converting back and forth. This PR is WIP and the first step is just to get everything working and all tests passing.

Sponsor Member Author
@StefanKarpinski Dec 9, 2017

This is still in my new PR, which is annoying, but I think all the ios_ support code needs an overhaul in the 1.x timeframe; ideally we would move most of the buffering code etc. into Julia, where it can understand the new Char representation and avoid the inefficiency here. Until then I think it's OK to just leave this as it is.

@bkamins
Member

bkamins commented Nov 3, 2017

In the comments above I have followed a strict, Unicode 10 compliant approach to reading UTF-8.
However, it could be OK to create invalid Chars starting with 0xC0, 0xC1, 0xE0, 0xF0, 0xED, or 0xF4 by trying to read the expected number of bytes, provided that we later check that they are invalid.

The reason is that if such Chars were created, they would have the leading bits 10 in all continuation bytes, so there is no risk in doing this: any other following byte would be invalid anyway.

This will probably speed up Char creation. Of course, it then becomes even more important to precisely check Chars for validity.

If this approach is taken, the only issue to fix is ios_peekutf8, which sometimes tries to read 4 or 5 continuation bytes (as specified in trailingBytesForUTF8); this should not be allowed (we should read at most 3 continuation bytes, as otherwise they will not fit into a UInt32).

@bkamins
Member

bkamins commented Nov 3, 2017

As an additional thing to keep in mind: if we use the strict parsing rule, we are then sure that e.g. string length is consistent with Unicode 10 compliant parsers (as @StefanKarpinski mentioned earlier). If we use the simplified Char creation, we lose this property.

@StefanKarpinski
Sponsor Member Author

We should definitely not ever read 4-5 continuation bytes; that's from a very old UTF-8 specification and hasn't been allowed for a long time. It's also not a thing that happens much in real life (aside from CESU-8 maybe?), and would require a larger Char type, which is undesirable.

@StefanKarpinski
Sponsor Member Author

See also WTF-8: https://en.wikipedia.org/wiki/UTF-8#WTF-8, which ignores invalidity of surrogate pairs, since UTF-8 has no intrinsic issue with them – they are purely invalid because of how UTF-16 works (and it only works that way because of its legacy history with UCS-2).

@StefanKarpinski
Sponsor Member Author

StefanKarpinski commented Nov 3, 2017

Some cases I can see for invalid UTF-8 data, together with the most useful "character" parsing:

  1. Binary data: Char = byte
  2. Latin-1 data: Char = byte
  3. Modified UTF-8: Char = loosely decoded UTF-8 characters
  4. WTF-8: Char = loosely decoded UTF-8 characters
  5. Mangled UTF-8 data: Char = partial characters?

It's hard to imagine a case where the Unicode 10 partial-character decoding business is actually a useful interpretation. Of course, the byte interpretation for binary and Latin-1 data isn't really very useful either, since you don't get the right characters out, but at least the number and size of the units are correct. And yes, there's a chance that binary data or Latin-1 will look like valid UTF-8, but that's very improbable, especially if you require a complete UTF-8 character to group bytes. So I guess the most useful Char-splitting approach would be:

  • Split anything that has the basic form of a UTF-8 character into a Char: i.e. an ASCII byte, or a lead byte with 2-4 leading ones followed by the corresponding number of continuation bytes. This covers ASCII, valid UTF-8 (which includes ASCII), Modified UTF-8, and WTF-8. I suspect we should also just allow converting these to integer code point values in the straightforward way. Some of these will produce surrogate code points (which one can later check for validity, if one even cares) and some will map to the same code points non-injectively (i.e. overlong encodings). On the other hand, two different encodings of the same code point should not be equal as characters. So, for example, c = '\xC0\x80' would be the overlong encoding of the code point U+0, and you would have c != '\0' but Int(c) == 0 (see the sketch at the end of this comment). So if you're looking for a literal NUL byte, comparing characters against '\0' will not accidentally find overlong encodings of U+0, but if you're explicitly searching for characters with code point 0, then you will find it. That seems like a sane and useful distinction which allows people to work with common UTF-8 variants without getting into trouble.

  • Split anything else into Char values according to the Unicode 10 standard. This is mainly so that we only ever have to do a single byte of look-ahead, instead of having to look ahead three bytes. I actually think that splitting everything else as single-byte invalid Chars is probably a more useful splitting, but this is easier to implement and it's closer to the standard, so ¯\_(ツ)_/¯.

This doesn't give 100% Unicode 10 conformance, but as I said above, I'm not entirely sure the standard really applies. The only observable discrepancy we might encounter is counting the number of characters in invalid strings differently, and in those cases, I think our answers will be what people actually expect/want. That seems like a pretty acceptable tradeoff between handling common UTF-8 variants smoothly and being approximately Unicode 10 conformant.
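To make the '\xC0\x80' example above concrete (assuming the loose decoding described in the first bullet):

c = '\xc0\x80'    # overlong two-byte encoding of code point U+0
c == '\0'         # false: a different byte sequence is a different Char
Int(c) == 0       # true under the proposed loose conversion to code points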

@bkamins
Member

bkamins commented Nov 3, 2017

First, about reading 4-5 byte sequences: as I have shown in #24420, Julia currently does this, e.g. run collect(String([0xff, 0xc0, 0xc0, 0xc0, 0xc0, 0xc0, 0x41, 0x42, 0x43])) (this is just for reference).

I am OK with parsing overlong encodings and reserved surrogates as if they were allowed, and with handling all other invalid UTF-8 following Unicode 10. I understand this is what you propose, and it will lead to the fastest implementation of splitting. We should simply document that we work like this.

However, the other issue is conversion from Char to UInt32. In my bijection proposal above, reserved surrogates would be decoded as if they were valid, exactly as you propose. However, if we want a bijection between Char and UInt32, we cannot convert overlong encodings to their natural representation.

So the question is whether we want this bijection or not.

And even if we do not want a full bijection, I propose to decode all invalid Chars other than overlong encodings and surrogates to code points larger than U+10FFFF, so that such an invalid Char does not decode to a valid code point (and if we drop the bijection, we can apply some simple conversion rule without caring whether they are all different).

Apart from that, the isvalid(c::Char) function should allow for both Unicode 10 compliant checking and Julia compliant checking (if we end up deciding that they do not have to be identical). Then some additional argument like strict::Bool would probably be good.

@StefanKarpinski
Sponsor Member Author

I think I've changed my mind on the bijection business, after contemplating it. Converting from Char to integer seems to be the natural point to get an error if the Char does not naturally give you a code point, so I just need to make that conversion throw the error in a way that doesn't slow everything down so much. I think a more general character classification scheme may be better than the current isvalid function, which can only give two possible answers: classify(c::Char) and classify(s::String), or something like that. It could maybe be rolled in with the other isalpha, etc. stuff which we've kind of wanted to clean up for some time now.

@bkamins
Member

bkamins commented Nov 3, 2017

Maybe I am stating the obvious, but I have just realized that internally Julia will not actually need to convert from Char to UInt32 at all.
One can work with UTF-8 encoded Chars everywhere (i.e. even if some internal functions wanted a code point parameter, they can be rewritten to take a UTF-8 encoded parameter without converting to a code point); it might even be faster in some cases (e.g. next should be faster).
And in what cases does a programmer need the exact code point value? I suppose very rarely, mostly when some external library wants its input as code points.

@StefanKarpinski
Sponsor Member Author

Yes, that's exactly the idea – you can do most things with Char values without ever converting to code points. Actually wanting the code point for a character is quite rare! Why would you care? You can do almost everything by equality and inequality comparisons with other characters. And most of the time you don't even need to do that since you can just pass string data all the way through.
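For instance, typical character-level code never needs the code point at all:

count(c -> 'a' <= c <= 'z', "hello, world")   # 10, using only Char comparisons
all(c -> c != '\0', "hello")                  # true, using only Char equality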

@StefanKarpinski
Sponsor Member Author

StefanKarpinski commented Nov 3, 2017

I'm puzzling through ordering here and thought maybe you have some thoughts, @bkamins. What ordering should the entire space of characters – valid and invalid – have? It's fairly common to write code like 'a' ≤ c ≤ 'z' to test if some character is in a range, so it seems somewhat dangerous for invalid characters to ever have the property that valid₁ ≤ invalid ≤ valid₂.

The other consideration is that we might want sorting lexicographically by character to match sorting lexicographically by bytes – which is a property that valid UTF-8 is designed to have. However, I don't think that's actually possible for any ordering of partial characters, so perhaps that shouldn't be a consideration after all. Any thoughts?

@bkamins
Member

bkamins commented Nov 3, 2017

I think that we can order Char by bytes and inform users that:

  • Char is UTF-8 encoded, so it is only safe to write ranges of ASCII characters (I would say that this is more or less the only case where character comparisons are made in practice);
  • if they have only valid UTF-8, it will be ordered by code point (by valid I mean strictly valid, without overlong sequences);
  • if they want ordering by code point and have no guarantee that the UTF-8 is valid, then they should convert Char to a code point and compare that (and here I would say again that invalid UTF-8 should always be converted to some value larger than U+10FFFF) -> currently your code does not do that;
  • add a general remark that comparing non-ASCII Chars is not a good idea (so we are back to point 1 above), as in general the right order of Char and String is locale dependent, e.g. in Polish a<ą<b<c<ć<d<e<ę<f (see the example below).

In short: outside of the ASCII range I would not worry too much about comparing Chars, as there is no single valid rule anyway and it should be discouraged.
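For example, ordering by code point disagrees with Polish collation:

'a' < 'b' < 'ą'   # true, since 'ą' is U+0105, above both 'a' and 'b'
# whereas Polish collation expects a < ą < b, which no code point order gives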

As a side comment, supporting https://en.wikipedia.org/wiki/Unicode_collation_algorithm somehow in Julia would be great.
In general I believe that if people start using Julia for more text processing, a package like stringi in R will be crucial to have.
Are you aware of any such efforts?

@StefanKarpinski
Sponsor Member Author

I think we're in agreement here. The conclusion I came to is that the invariant we want is that c₁ < c₂ if and only if "$c₁" < "$c₂". This is naturally the case for valid UTF-8 lexically sorted by bytes, and it extends to invalid characters as well by byte ordering – which is what I think you're saying too. That's a nice property that doesn't depend on the representation of Char. However, it does imply that I haven't quite represented characters correctly here: the zero padding should be least significant, not most significant. I'll adjust the PR accordingly.
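A quick property check of that invariant over a few valid characters (illustrative only; the interesting cases are of course the invalid ones):

for a in ('a', 'é', '€', '\U1F604'), b in ('a', 'é', '€', '\U1F604')
    @assert (a < b) == (string(a) < string(b))
end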

@StefanKarpinski
Sponsor Member Author

Also: I think that all fancy Unicode stuff like the collation algorithm can be implemented in packages.

@bkamins
Member

bkamins commented Nov 4, 2017

The property you have indicated would be nice to have.

Two things to consider though:

  • I had been thinking of a right-padded representation earlier, and the problem I saw was that it would be more expensive to work with such a Char than in the current model. Re-implementing the convert, read and write methods will show whether there is a penalty.
  • The property extends to "$c₁[anything 1 here]" < "$c₂[anything 2 here]" only for valid UTF-8. Below I am using the right-padding notation. Consider the valid character 0xF0908D88, which is greater than the invalid character 0xF0908D00 that uses three bytes. Now compare the following 8-element byte sequences:
    [0xF0, 0x90, 0x8D, 0x88, 0xF0, 0x90, 0x8D, 0x88]
    and
    [0xF0, 0x90, 0x8D, 0xF0, 0x90, 0x8D, 0x88, 0x00]
    The first one is less than the second, although the first one decoded to characters is
    [0xF0908D88, 0xF0908D88]
    and the second is
    [0xF0908D00, 0xF0908D88, 0x00000000]

@nalimilan
Member

nalimilan commented Nov 4, 2017

Very nice! I agree that consistency between c₁ < c₂ and "$c₁" < "$c₂" sounds essential.

Regarding overlong character encodings, I'd say it's fine to treat them as different from their correct counterparts for < as long as we do that consistently, i.e. they should never be considered equal, even after conversion to UInt32. Is that the plan?

@bkamins
Member

bkamins commented Nov 4, 2017

Just to highlight the issue I see with comparison, using analogous notation:
it is impossible to guarantee, under any conversion rule, that s₁[1]<s₂[1] and s₁<s₂ are consistent if the latter uses memcmp.
Thus consistency between c₁ < c₂ and "$c₁" < "$c₂" is a nice feature, but it will have limited usefulness in practice.

The consistency between s₁[1]<s₂[1] and s₁<s₂ is guaranteed for valid UTF-8 strings no matter whether we use a character encoding with left or right padding; both will work.

So I would say that a major factor in the decision between left and right padding should be how they compare in performance.

As for conversion from Char to UInt32:

  • the current implementation in this PR maps overlong sequences to their code point representation (this does not follow what you have stated); it also maps invalid UTF-8 to valid code points;
  • in the discussion above, the idea was to make the conversion bijective (thus ensuring what you have said);
  • the other approach would be to strictly encode all invalid UTF-8 sequences as U+FFFD and all valid UTF-8 to its code point value (this would also ensure that overlong sequences are not considered equal to their valid counterparts).

If I understand correctly, @StefanKarpinski wants to implement something else still. Correct?

As another thought, maybe it would be useful to define a CodePoint type that would provide a different conversion behavior (i.e. strict conversion following Unicode 10, always using U+FFFD to represent invalid UTF-8) than conversion to UInt32 (which could be less strict)?

Labels
domain:strings "Strings!"