RFC, WIP: highlander #14383

Closed
wants to merge 12 commits into from

Conversation

StefanKarpinski
Sponsor Member

This branch is a preview of where I'd like to go with built-in string types, except that I'd like to take it even further. So far this branch collapses ASCIIString and UTF8String into a single, concrete UTF-8 string type called String. In addition to this, I'd like to make the representation of String and SubString{String} the same, removing that distinction as well. I'd also like to move all non-String string types out of Base and into a StringEncodings package (or something like that). We can keep simple utility functions to transcode String to UTF-16 on Windows for system calls, but otherwise, no non-UTF-8 functionality would exist in Base. Then there would truly be only one (string type in base, that is).

Additionally, I'd like to replace the current Char type with a new Char type that leaves UTF-8-like data as-is, allowing character indexing and iteration to do far less work in the common case where you don't actually care about the values of code points (you can still check equality and ordering because of the cleverness of UTF-8). That would reduce the performance penalty of using UTF-8 everywhere.
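As a rough illustration of why that works: UTF-8 byte sequences sort in the same order as the code points they encode, so equality and ordering can be checked on the raw bytes without decoding. A minimal check of that property (illustrative only, not code from this branch):

# cmp on String compares the underlying UTF-8 bytes (essentially a memcmp),
# yet it agrees with comparing the decoded code points -- the property an
# undecoded Char representation relies on.
for (a, b) in (('A', 'é'), ('é', '∀'), ('∀', '😃'))
    @assert cmp(string(a), string(b)) == cmp(UInt32(a), UInt32(b))
end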

Comments and thoughts welcomed.

@ScottPJones
Contributor

Sorry, but I think the part about changing Char would be a disaster. It might help performance in a few small places, but would be very bad in many others.
I really don't see this as a good path forward for string handling.
I think something like @quinnj's ideas, combined with the changes to the way arrays are handled that I've heard discussed, some of the ideas about better storage of small strings from the bytevec code I think you were working on, and traits for string encodings, would be much more powerful and useful for those of us doing a lot of string processing.

@eschnett
Contributor

If substrings are very efficient, to the point that a one-character (read: one-grapheme) string is as efficient as a scalar variable, is there then still a need for a separate "Character" type? On a historical note -- Fortran makes no distinction between characters and strings except via their length.

I'm sure there can be various ways to iterate over strings, yielding either grapheme clusters, graphemes, codepoints, or bytes, normalized or not. The default way might be bytes. Or not. I hear that iterating over codepoints is almost never what one wants.

@ScottPJones
Contributor

Another thing to look at is the string support in Swift 2.0 (which is now open source, right here on GitHub)

@simonbyrne
Contributor

While I'm broadly in favour, I think it would be worthwhile getting some good benchmarks together so we have a rough idea of what the cost is going to be in terms of performance. For example, the following function

function hasspace(s)
    for c in s
        isspace(c) && return true
    end
    false
end

is 6x faster for an ASCIIString vs the same data in a UTF8String.
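A sketch of the kind of micro-benchmark that could quantify this, using the BenchmarkTools package (the test strings and sizes here are arbitrary):

using BenchmarkTools                  # assumes the BenchmarkTools package is installed

hasspace(s) = any(isspace, s)         # same check as the loop above, as a one-liner

ascii_data = "no_spaces_just_plain_ascii_text_"^100     # ASCII-only content
mixed_data = "pas_d_espaces_mais_des_accents_éàü_"^100  # forces multi-byte decoding

@btime hasspace($ascii_data)
@btime hasspace($mixed_data)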

@nalimilan
Member

@simonbyrne I suspect we should have a function to check whether a character is an ASCII space, which is much faster to check than Unicode categories. I think it is quite common to have to look for a space in a non-ASCII string, and we want this to be fast too.
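For instance, something along these lines (the name isasciispace is hypothetical, not an existing Base function):

# Hypothetical helper: test only the ASCII whitespace characters, which takes a
# handful of comparisons instead of a Unicode category lookup.
isasciispace(c::Char) = c == ' ' || c == '\t' || c == '\n' ||
                        c == '\v' || c == '\f' || c == '\r'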

@StefanKarpinski A big +1 for the first part of the proposal. The second part about Char is promising, but I don't understand how it would work. Would Char be a simple view on the string data? In that case, @eschnett is right that the distinction between char and string might not make sense. Treating everything as a string would have some advantages, like allowing codepoints and grapheme clusters to be represented the same way. The latter is the best definition of "character" in many cases, and it is indeed what you get in Swift when iterating over a string.

@StefanKarpinski
Sponsor Member Author

@simonbyrne: benchmarks are definitely necessary before this change can go forward. I think the new character representation can reduce that performance gap considerably.

@nalimilan: the Char trick is basically to represent characters as a 32-bit immutable value like now, but as undecoded UTF-8 bytes instead of as the Unicode code point. This allows you to get a character value from a UTF-8 string using just a few integer ops and a table lookup. It still needs a little work, but when I tried it out on the sk/newchar branch – see this commit – it was remarkably non-disruptive. Basically, the only code that has any real trouble with the change is code that assumes that UTF-32 strings and arrays of Char have the same in-memory representation. Since using UTF-32 is rare and making that very low-level assumption is even rarer, this doesn't cause much trouble. The other place there can be trouble is code that converts between Char and UInt32 using reinterpret – but you really shouldn't do that, and using convert seems much more common.
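To make the "few integer ops and a table lookup" concrete, here is a rough illustrative sketch of decode-free extraction; the names and layout are made up for illustration and are not the actual sk/newchar code:

# Table mapping a leading byte to the length of its UTF-8 sequence.
# Continuation/invalid leading bytes are treated as length 1 so malformed data
# still advances.
const UTF8_SEQ_LEN = [b < 0xc0 ? 1 : b < 0xe0 ? 2 : b < 0xf0 ? 3 : 4
                      for b in 0x00:0xff]

# Pack the undecoded UTF-8 bytes starting at byte index i into a UInt32,
# first byte in the most significant position; no code point decoding.
function rawchar(s::String, i::Int)
    cu = codeunits(s)
    n = UTF8_SEQ_LEN[Int(cu[i]) + 1]
    x = UInt32(0)
    for k in 0:n-1
        x = (x << 8) | cu[i + k]
    end
    return x << (8 * (4 - n))   # left-align so integer comparison matches UTF-8 order
end

rawchar("héllo", 2)   # 0xc3a90000 -- the two bytes of 'é', still undecoded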

@nalimilan
Member

I see, makes sense. Sounds much easier than making char and string the same thing. We could imagine having AbstractChar to allow for efficient UTF-16 or UTF-32 implementations in packages.

@StefanKarpinski
Sponsor Member Author

Yes, that's a possibility.

@samoconnor
Contributor

+1
Sounds all good to me.

@StefanKarpinski force-pushed the sk/highlander branch 2 times, most recently from a5f275c to aeebea9 on January 11, 2016 17:07
@samoconnor
Contributor

@eschnett and @nalimilan have a good point about "... iterating over codepoints is almost never what one wants", etc.

I like the idea of a 32-bit Char type with some UTF-8 implementation magic.

I think that String should not be indexable at all (or should only be byte-indexable, i.e. same as Vector{UInt8}).

There are so many ways one might want to iterate/index the content of a string (a sketch of a few of these views follows at the end of this comment)...

bytes(s)[7]
for byte in eachbyte(s)

chars(s)[7]
for char in eachchar(s) ...

clusters(s)[7]
for cluster in eachcluster(s) ...

lines(s)[7]
for line in eachline(s)

words(s)[7]
for word in eachword(s)

jlstatement(s)[7]
for stmt in eachjlstatement(s)

I think there are three use case classes:

  1. the code knows about the high level abstractions represented in the String and knows how to deal with them (lines, modified emojis, words, rows, statements, verses, paragraphs, tags, whatever).
  2. the code treats the String as an opaque value to be compared and/or passed around, but has no concept of indexing or iteration.
  3. low level code that has to deal with the String as a byte array in order to pass it to/from some external system (network, serialisation format, etc).
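Picking up the names from the list above, a rough sketch of how a few of these explicit views could be layered on top of existing machinery (the each* names are hypothetical; graphemes lives in the Unicode standard library on current Julia):

using Unicode: graphemes   # grapheme-cluster iteration

# Hypothetical explicit views over a string, so the caller always states which
# unit is being indexed or iterated.
eachbyte(s::AbstractString)    = codeunits(s)      # raw UTF-8 code units
eachchar(s::AbstractString)    = (c for c in s)    # code points
eachcluster(s::AbstractString) = graphemes(s)      # user-perceived characters
eachword(s::AbstractString)    = split(s)          # whitespace-delimited words

collect(eachbyte("héllo"))     # 6 bytes
collect(eachchar("héllo"))     # 5 chars
collect(eachcluster("héllo"))  # 5 grapheme clusters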

@yurivish
Contributor

Perl 6's string implementation allows constant-time indexing into graphemes. Here are slides from a talk about Unicode in Perl 6 by one of the primary contributors to the implementation: https://jnthn.net/papers/2015-spw-nfg.pdf

@simonbyrne
Contributor

I couldn't find a formal description, but from what I can gather from here, Perl 6 uses an array of 32-bit signed integers, with negative numbers corresponding to "synthetic codepoints", via some sort of lookup table.

Update: here is a more formal spec of their grapheme normalization form (NFG).

@diakopter

@simonbyrne pdd28 was always aspirational; the rakudo/moarvm implementation was only peripherally influenced by the parrot writeups

@nalimilan
Member

Their approach is interesting, but it only makes sense when starting from UTF-32, and certainly not from UTF-8 (as in the case of Julia). That is, they already have O(1) codepoint indexing, what their trick adds is O(1) grapheme indexing. But this is terribly wasteful as regards memory use, and it forces them to not only validate, but also normalize all strings on input. I guess this is a good option for a language which considers that manipulating Unicode strings should be fast, at the expense of significant overhead when your needs are more basic (like parsing a CSV file or computing stats from a database). Indeed, as they note themselves:

This means that although Parrot will use 32-bit NFG strings for optimizations within operations, for the most part individual users should use the native character set and encoding of their data, rather than using NFG strings directly.

I'm not sure what "native" charset means, but it sounds weird that their Unicode support is so good that they advise moving away from Unicode altogether. Kind of counter-productive...

That said, it looks like there's a pattern in new languages for string iteration to go over graphemes rather than codepoints (which are mostly an implementation detail). I'm not sure how feasible it would be to get an acceptable performance for that so as to make graphemes the default.

@samoconnor See #9297.

@StefanKarpinski force-pushed the sk/highlander branch 2 times, most recently from da86d36 to 6e99e2b on February 25, 2016 18:22
@dcarrera

dcarrera commented Mar 8, 2016

I think this is a great idea. Are you also thinking of deprecating AbstractString?

@nalimilan
Member

@dcarrera What would be the point of deprecating AbstractString? That would make it impossible to create custom string types in packages and have them work with Base.

@dcarrera

dcarrera commented Mar 8, 2016

@nalimilan I see. My bad.

@samoconnor
Contributor

Perhaps this is relevant... ?

I think the example below suggests that String should not be treated as a collection/iterable.
At present some things treat *String as a collection and some do not.

julia> collect(Base.flatten(Vector[[1,2], [3,4]]))
4-element Array{Any,1}:
 1
 2
 3
 4

julia> vcat(Vector[[1,2], [3,4]]...)
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> collect(Base.flatten(["12", "34"]))
4-element Array{Char,1}:
 '1'
 '2'
 '3'
 '4'

julia> vcat(["12", "34"]...)
2-element Array{ASCIIString,1}:
 "12"
 "34"

@StefanKarpinski
Sponsor Member Author

That's certainly a direction that the string API could take. I think it's separate from this work, however.

@samoconnor
Contributor

Fair enough. Your discussion of "allowing character indexing and iteration" for the new Char type made me think that it might be a good opportunity to make the relationship between String and Char explicit at the same time (i.e. remove the implicit treatment of String as a collection of Char):

for c in "hello" println(c) end
ERROR: MethodError: no method matching start(::String)

"hello"[1]
ERROR: MethodError: no method matching getindex(::String, ::Int64)

for c in chars("hello") println(c) end

first(chars("hello"))

@nalimilan
Member

@samoconnor See also #9261 and #9297.

@samoconnor
Contributor

thx @nalimilan
