
wip: string overhaul #24439

Closed
wants to merge 39 commits into from

Conversation

StefanKarpinski
Sponsor Member

Making a PR to run CI.

base/iostream.jl Outdated
@@ -336,12 +336,12 @@ function read(s::IOStream, nb::Integer; all::Bool=true)
end

## Character streams ##
-const _chtmp = Ref{Char}()
+const _chtmp = Ref{UInt32}()
Member

Unrelated, but this should now be inside the function :)

@bkamins
Member

bkamins commented Nov 2, 2017

Nice! For testing purposes I have written a verbose version of the decoder here https://gist.github.com/bkamins/5a2f89cab14d434e3f3e4a23c121a8a2 that checks for UTF-8 validity (it is slower, of course).
I have checked that both implementations return the same values for all 1112064 valid UTF-8 sequences.
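Schematically, the exhaustive comparison amounts to a loop like the one below (fast_decode and reference_decode are placeholder names for the two decoders, not functions from Base or the gist):

# Hypothetical harness: check that both decoders agree on every Unicode scalar value.
for u in 0x000000:0x10FFFF
    0xD800 <= u <= 0xDFFF && continue          # skip surrogates (not scalar values)
    bytes = Vector{UInt8}(string(Char(u)))     # canonical UTF-8 encoding of u
    @assert fast_decode(bytes) == reference_decode(bytes)
end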

base/char.jl Outdated
((u & 0x00ff0000) >> 4) ⊻ ((u & 0xff000000) >> 6)
end

function convert(::Type{Char}, u::UInt32)
Member

Tested, and this works OK for all valid Unicode (up to the last code point, U+10FFFF). The conversion breaks for invalid Unicode as it overflows at 0x00400000.
I understand that we want to avoid branching here, so adding a test that u is not too large is not an option? If so, this should probably be documented somewhere.
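For reference, a plain branching encoder that produces the standard UTF-8 byte layout for a code point would look roughly like the sketch below (just an illustration of the target bit pattern with right-aligned bytes, not the code in this PR):

# Illustration only: pack a code point into right-aligned UTF-8 bytes in a UInt32.
# Only u <= 0x10FFFF is valid Unicode; a code point needing more than 21 bits
# does not fit the 4-byte pattern at all, which is where a non-branching
# conversion has to overflow.
function utf8_bytes(u::UInt32)
    if u < 0x80
        return u                                          # 0xxxxxxx
    elseif u < 0x800
        return 0x0000c080 | ((u >> 6) << 8) | (u & 0x3f)  # 110xxxxx 10xxxxxx
    elseif u < 0x10000
        return 0x00e08080 | ((u >> 12) << 16) |
               (((u >> 6) & 0x3f) << 8) | (u & 0x3f)      # 1110xxxx 10xxxxxx 10xxxxxx
    else
        return 0xf0808080 | ((u >> 18) << 24) | (((u >> 12) & 0x3f) << 16) |
               (((u >> 6) & 0x3f) << 8) | (u & 0x3f)      # 11110xxx then three 10xxxxxx
    end
end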

Sponsor Member Author

I'm still contemplating how best to handle invalid code points. There was previously an error check here but it caused a massive build time regression (it was also calling a deprecated error type), so I just took it out for now. Will have to figure out what to do here.

@bkamins
Member

bkamins commented Nov 2, 2017

With respect to validity checking:
Neither convert(::Type{UInt32}, c::Char) nor convert(::Type{Char}, u::UInt32) is injective: different arguments (fortunately invalid ones, for both functions) are mapped to the same value.
Thus convert(UInt32, convert(Char, u)) does not have to equal u if u is invalid, and similarly convert(Char, convert(UInt32, c)) does not have to equal c if c is invalid.
In this light, maybe something similar to the @inbounds macro would be a good idea (for turning validity checking of the argument on and off)?

@StefanKarpinski
Sponsor Member Author

Either that or change the transformations to be bijective. I’m considering how to do that.

@bkamins
Member

bkamins commented Nov 2, 2017

My thoughts on the bijection (I hope there is no error in this 😄, but I have tested it not only in my head but also by implementation).

The best I could come up with is doable but a bit messy (fortunately only for invalid values). Here is an outline of how it can be done (describing the mapping from UTF-8 to UInt32):

  • assume that all 3-byte ED sequences are valid (only for the bijection in conversion, not for validity testing; this simplifies things);
  • decode all valid UTF-8 byte sequences in Char to their UInt32 representation;
  • leave all invalid UTF-8 byte sequences in Char whose representation is larger than 0x10FFFF unaltered (so we apply the identity function to them: they were invalid in UTF-8 and remain invalid in the UInt32 representation);
  • map all invalid UTF-8 byte sequences in Char whose representation is less than 0x10FFFF to the representations of valid UTF-8 byte sequences in Char that are larger than 0x10FFFF.

There are exactly 1112064 valid Chars greater than 0x10FFFF (call them large valid) and the same number of invalid Chars less than 0x10FFFF (call them small invalid). The trick is that most small invalid Chars are equal to decoded large valid Chars, so the conversion is simple.
Only 1920 small invalid values, in the range from 128 to 2047, are not directly mapped to large valid Chars (as those code points are encoded by two-byte small valid Chars). They have to be mapped to the large valid Chars that, when decoded, produce two-byte small valid Chars (again there are 1920 such values and they are pretty simple to identify).
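For concreteness, the counts used above fall out of simple arithmetic:

# Unicode scalar values: all code points up to U+10FFFF minus the 2048 surrogates.
n_scalars  = 0x110000 - 0x800    # 1_112_064
# Two-byte UTF-8 sequences cover exactly U+0080 through U+07FF.
n_two_byte = 0x800 - 0x80        # 1_920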

while 1 < n
    p = position(f)
    b = read(f, UInt8)
    if b & 0xc0 != 0x80
Member

Here we should also handle sequences starting with 0xC0, 0xC1, 0xE0, or 0xF0 that would be overlong, starting with 0xED that would be reserved (surrogates), and starting with 0xF4 that would encode a too-large code point.
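Roughly, the extra per-lead-byte checks being suggested would look something like this (a sketch of the Unicode 10 rules, not proposed code for this PR):

# Sketch: reject sequences that are invalid from the lead byte plus the first
# continuation byte alone, per Unicode 10.
function strict_lead_ok(lead::UInt8, next::UInt8)
    (lead == 0xc0 || lead == 0xc1) && return false   # always overlong 2-byte
    lead == 0xe0 && next < 0xa0    && return false   # overlong 3-byte
    lead == 0xed && next >= 0xa0   && return false   # encodes a surrogate
    lead == 0xf0 && next < 0x90    && return false   # overlong 4-byte
    lead == 0xf4 && next >= 0x90   && return false   # above U+10FFFF
    return true
end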

Sponsor Member Author

I'm not sure about that. I think parsing overlong encodings as a single char is much friendlier behavior. They occur fairly often in real data; many systems (e.g. Java) use overlong encodings to avoid embedded NUL bytes.

base/io.jl Outdated
c = UInt32(b0)
if n <= 4
    while 1 < n
        peek(s) & 0xc0 == 0x80 || break
Member

The same comment as above regarding handling invalid sequences.

function peekchar(s::IOStream)
-    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{Char}), s, _chtmp) < 0
+    chref = Ref{UInt32}()
+    if ccall(:ios_peekutf8, Cint, (Ptr{Void}, Ptr{UInt32}), s, chref) < 0
Member

The current implementation of ios_peekutf8 also does not check for invalid sequences.

Member
@stevengj Nov 6, 2017

Seems a bit wasteful for ios_peekutf8 to convert the UTF-8 encoding to UInt32, and then to convert it back when returning Char(chref[]).

Since this seems to be the only function in all of Julia that calls ios_peekutf8, we could just re-write ios_peekutf8 to return Char. However, ios_peekutf8 is also used in src/flisp, so we may need to keep that as-is and instead write an ios_peekrawutf8 function to get Char.

Sponsor Member Author

Agree, this stuff should be optimized to avoid converting back and forth. This PR is WIP and the first step is just to get everything working and all tests passing.

Sponsor Member Author
@StefanKarpinski Dec 9, 2017

This is still in my new PR, which is annoying, but I think all the ios_ support code needs an overhaul in the 1.x timeframe; ideally we would move most of the buffering code etc. into Julia, where it can understand the new Char representation and avoid the inefficiency here. Until then I think it's OK to just leave this as it is.

@bkamins
Member

bkamins commented Nov 3, 2017

In the comments above I have followed a strict, Unicode 10 compliant approach to reading UTF-8.
However, it could be OK to create invalid Chars starting with 0xC0, 0xC1, 0xE0, 0xF0, 0xED, or 0xF4 by trying to read the expected number of bytes, provided that we later check that they are invalid.

The reason is that if such Chars were created, they would have the leading bits 10 in all continuation bytes, so there is no risk in doing this: any other following byte would be invalid anyway.

This will probably speed up Char creation. Of course, it then becomes even more important to precisely check Chars for validity.

If this approach is taken, the only issue to fix is ios_peekutf8, which sometimes tries to read 4 or 5 continuation bytes (as specified in trailingBytesForUTF8); this should not be allowed (we should read at most 3 continuation bytes, as otherwise they will not fit into a UInt32).

@bkamins
Member

bkamins commented Nov 3, 2017

As an additional thing to keep in mind: if we use the strict parsing rule, we are then sure that e.g. string length is consistent with Unicode 10 compliant parsers (as @StefanKarpinski mentioned earlier). If we use the simplified Char creation, we lose this property.

@StefanKarpinski
Sponsor Member Author

We should definitely not ever read 4-5 continuation bytes; that's from a very old UTF-8 specification and hasn't been allowed for a long time. It's also not a thing that happens much in real life (aside from CESU-8 maybe?), and would require a larger Char type, which is undesirable.

@StefanKarpinski
Sponsor Member Author

See also WTF-8: https://en.wikipedia.org/wiki/UTF-8#WTF-8, which ignores invalidity of surrogate pairs, since UTF-8 has no intrinsic issue with them – they are purely invalid because of how UTF-16 works (and it only works that way because of its legacy history with UCS-2).

@StefanKarpinski
Sponsor Member Author

StefanKarpinski commented Nov 3, 2017

Some cases I can see for invalid UTF-8 data, together with the most useful "character" parsing:

  1. Binary data: Char = byte
  2. Latin-1 data: Char = byte
  3. Modified UTF-8: Char = loosely decoded UTF-8 characters
  4. WTF-8: Char = loosely decoded UTF-8 characters
  5. Mangled UTF-8 data: Char = partial characters?

It's hard to imagine a case where the Unicode 10 partial-character decoding business is actually a useful interpretation. Of course, the byte interpretation for binary and Latin-1 data isn't really very useful either, since you don't get the right characters out, but at least the number and size of the units are correct. And yes, there's a chance that binary data or Latin-1 will look like valid UTF-8, but that's very improbable, especially if you require a complete UTF-8 character to group bytes. So I guess the most useful Char-splitting approach would be:

  • Split anything that has the basic form of a UTF-8 character into a Char: i.e. an ASCII byte, or a lead byte with 2-4 leading ones followed by the corresponding number of continuation bytes. This covers ASCII, valid UTF-8 (which includes ASCII), Modified UTF-8, and WTF-8. I suspect we should also just allow converting these to integer code point values in the straightforward way. Some of these will produce surrogate code points (which one can later check for validity, if one even cares) and some will map to the same code points non-injectively (i.e. overlong encodings). On the other hand, two different encodings of the same code point should not be equal as characters. So, for example, c = '\xC0\x80' would be the overlong encoding of the code point U+0, and you would have c != '\0' but Int(c) == 0 (see the sketch at the end of this comment). So if you're looking for a literal NUL byte, comparing characters against '\0' will not accidentally find overlong encodings of U+0, but if you're explicitly searching for characters with code point 0, then you will find it. That seems like a sane and useful distinction which allows people to work with common UTF-8 variants without getting into trouble.

  • Split anything else into Char values according to the Unicode 10 standard. This is mainly so that we only ever have to do a single byte of look-ahead, instead of having to look ahead three bytes. I actually think that splitting everything else as single-byte invalid Chars is probably a more useful splitting, but this is easier to implement and it's closer to the standard, so ¯\_(ツ)_/¯.

This doesn't give 100% Unicode 10 conformance, but as I said above, I'm not entirely sure the standard really applies. The only observable discrepancy we might encounter is counting the number of characters in invalid strings differently, and in those cases, I think our answers will be what people actually expect/want. That seems like a pretty acceptable tradeoff between handling common UTF-8 variants smoothly and being approximately Unicode 10 conformant.
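To make the '\xC0\x80' example above concrete (assuming the loose decoding described in the first bullet):

c = '\xc0\x80'    # overlong two-byte encoding of code point U+0
c == '\0'         # false: a different byte sequence is a different Char
Int(c) == 0       # true under the proposed loose conversion to code points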

@bkamins
Member

bkamins commented Nov 3, 2017

First, about reading 4-5 byte sequences: as I have shown in #24420, Julia currently does this, e.g. run collect(String([0xff, 0xc0, 0xc0, 0xc0, 0xc0, 0xc0, 0x41, 0x42, 0x43])) (this is just for reference).

I am OK with parsing overlong encodings and reserved surrogates as if they were allowed, and with handling all other invalid UTF-8 following Unicode 10. I understand this is what you propose, and it will lead to the fastest implementation of splitting. We should simply document that we work like this.

However, the other issue is conversion from Char to UInt32. In my bijection proposal above, reserved surrogates would be decoded as if they were valid, exactly as you propose. However, if we want a bijection between Char and UInt32, we cannot convert overlong encodings to their natural representation.

So the question is whether we want this bijection or not.

And even if we do not want a full bijection, I propose to decode all invalid Chars other than overlong encodings and surrogates to code points larger than U+10FFFF, so that such an invalid Char does not decode to a valid code point (and if we drop the bijection, we can apply some simple conversion rule without caring whether they are all different).

Apart from that, the isvalid(c::Char) function should allow for both Unicode 10 compliant checking and Julia compliant checking (if we end up deciding that they do not have to be identical). Then some additional argument like strict::Bool would probably be good.

@StefanKarpinski
Sponsor Member Author

I think I've changed my mind on the bijection business, after contemplating it. Converting from Char to integer seems to be the natural point to get an error if the Char does not naturally give you a code point, so I just need to make that conversion throw the error in a way that doesn't slow everything down so much. I think a more general character classification scheme may be better than the current isvalid function, which can only give two possible answers: classify(c::Char) and classify(s::String), or something like that. It could maybe be rolled in with the other isalpha, etc. stuff which we've kind of wanted to clean up for some time now.

@bkamins
Member

bkamins commented Nov 3, 2017

Maybe I am stating the obvious, but I have just realized that internally Julia will not actually need to convert from Char to UInt32 at all.
One can work with UTF-8 encoded Chars everywhere (i.e. even if some internal functions wanted a code point parameter, they can be rewritten to take a UTF-8 encoded parameter without converting to a code point); it might even be faster in some cases (e.g. next should be faster).
And in what cases does a programmer need the exact code point value? I suppose very rarely, mostly when some external library wants its input as code points.

@StefanKarpinski
Sponsor Member Author

Yes, that's exactly the idea – you can do most things with Char values without ever converting to code points. Actually wanting the code point for a character is quite rare! Why would you care? You can do almost everything by equality and inequality comparisons with other characters. And most of the time you don't even need to do that since you can just pass string data all the way through.
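For instance, typical character-level code never needs the code point at all:

count(c -> 'a' <= c <= 'z', "hello, world")   # 10, using only Char comparisons
all(c -> c != '\0', "hello")                  # true, using only Char equality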

@StefanKarpinski
Sponsor Member Author

StefanKarpinski commented Nov 3, 2017

I'm puzzling through ordering here and thought maybe you have some thoughts, @bkamins. What ordering should the entire space of characters – valid and invalid – have? It's fairly common to write code like 'a' ≤ c ≤ 'z' to test if some character is in a range, so it seems somewhat dangerous for invalid characters to ever have the property that valid₁ ≤ invalid ≤ valid₂.

The other consideration is that we might want sorting lexicographically by character to match sorting lexicographically by bytes – which is a property that valid UTF-8 is designed to have. However, I don't think that's actually possible for any ordering of partial characters, so perhaps that shouldn't be a consideration after all. Any thoughts?

@bkamins
Member

bkamins commented Nov 3, 2017

I think that we can order Char by bytes and inform users that:

  • Char is UTF-8 encoded, so it is only safe to write ranges of ASCII characters (I would say that this is more or less the only case where character comparisons are made in practice);
  • if they have only valid UTF-8, it will be ordered by code point (by valid I mean strictly valid, without overlong sequences);
  • if they want ordering by code point and have no guarantee that the UTF-8 is valid, then they should convert Char to a code point and compare that (and here I would say again that invalid UTF-8 should always be converted to some value larger than U+10FFFF) -> currently your code does not do that;
  • add a general remark that comparing non-ASCII Chars is not a good idea (so we are back to point 1 above), as in general the right order of Char and String is locale dependent, e.g. in Polish a<ą<b<c<ć<d<e<ę<f (see the example below).

In short: outside of the ASCII range I would not worry too much about comparing Chars, as there is no single valid rule anyway and it should be discouraged.
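For example, ordering by code point disagrees with Polish collation:

'a' < 'b' < 'ą'   # true, since 'ą' is U+0105, above both 'a' and 'b'
# whereas Polish collation expects a < ą < b, which no code point order gives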

As a side comment, supporting https://en.wikipedia.org/wiki/Unicode_collation_algorithm somehow in Julia would be great.
In general I believe that if people start using Julia for more text processing, a package like stringi in R will be crucial to have.
Are you aware of any such efforts?

@StefanKarpinski
Sponsor Member Author

I think we're in agreement here. The conclusion I came to is that the invariant we want is that c₁ < c₂ if and only if "$c₁" < "$c₂". This is naturally the case for valid UTF-8 lexically sorted by bytes, and it extends to invalid characters as well by byte ordering – which is what I think you're saying too. That's a nice property that doesn't depend on the representation of Char. However, it does imply that I haven't quite represented characters correctly here: the zero padding should be least significant, not most significant. I'll adjust the PR accordingly.
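A quick property check of that invariant over a few valid characters (illustrative only; the interesting cases are of course the invalid ones):

for a in ('a', 'é', '€', '\U1F604'), b in ('a', 'é', '€', '\U1F604')
    @assert (a < b) == (string(a) < string(b))
end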

@StefanKarpinski
Sponsor Member Author

Also: I think that all fancy Unicode stuff like the collation algorithm can be implemented in packages.

@bkamins
Member

bkamins commented Nov 4, 2017

The property you have indicated would be nice to have.

Two things to consider though:

  • I had been thinking of a right-padded representation earlier, and the problem I saw was that it would be more expensive to work with such a Char than in the current model. Re-implementing the convert, read and write methods will show whether there is a penalty.
  • The property extends to "$c₁[anything 1 here]" < "$c₂[anything 2 here]" only for valid UTF-8. Below I am using the right-padding notation. Consider the valid character 0xF0908D88, which is greater than the invalid character 0xF0908D00 that uses three bytes. Now compare the following 8-element byte sequences:
    [0xF0, 0x90, 0x8D, 0x88, 0xF0, 0x90, 0x8D, 0x88]
    and
    [0xF0, 0x90, 0x8D, 0xF0, 0x90, 0x8D, 0x88, 0x00]
    The first one is less than the second, although the first one decoded to characters is
    [0xF0908D88, 0xF0908D88]
    and the second is
    [0xF0908D00, 0xF0908D88, 0x00000000]

@nalimilan
Member

nalimilan commented Nov 4, 2017

Very nice! I agree that consistency between c₁ < c₂ and "$c₁" < "$c₂" sounds essential.

Regarding overlong character encodings, I'd say it's fine to treat them as different from their correct counterparts for < as long as we do that consistently, i.e. they should never be considered equal, even after conversion to UInt32. Is that the plan?

@bkamins
Member

bkamins commented Nov 4, 2017

Just to highlight the issue I see with comparison, using analogous notation:
it is impossible to guarantee, under any conversion rule, that s₁[1]<s₂[1] and s₁<s₂ are consistent if the latter uses memcmp.
Thus consistency between c₁ < c₂ and "$c₁" < "$c₂" is a nice feature, but it will have limited usefulness in practice.

The consistency between s₁[1]<s₂[1] and s₁<s₂ is guaranteed for valid UTF-8 strings no matter whether we use a character encoding with left or right padding; both will work.

So I would say that a major factor in the decision between left and right padding should be how they compare in performance.

As for conversion from Char to UInt32:

  • the current implementation in this PR maps overlong sequences to their code point representation (this does not follow what you have stated); it also maps invalid UTF-8 to valid code points;
  • in the discussion above, the idea was to make the conversion bijective (thus ensuring what you have said);
  • the other approach would be to strictly encode all invalid UTF-8 sequences as U+FFFD and all valid UTF-8 to its code point value (this would also ensure that overlong sequences are not considered equal to their valid counterparts).

If I understand correctly, @StefanKarpinski wants to implement something else still. Correct?

As another thought, maybe it would be useful to define a CodePoint type that would provide a different conversion behavior (i.e. strict conversion following Unicode 10, always using U+FFFD to represent invalid UTF-8) than conversion to UInt32 (which could be less strict)?

Labels
domain:strings "Strings!"