Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stringapalooza #16107

Closed
26 of 32 tasks
StefanKarpinski opened this issue Apr 28, 2016 · 64 comments
Closed
26 of 32 tasks

Stringapalooza #16107

StefanKarpinski opened this issue Apr 28, 2016 · 64 comments
Assignees
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings kind:excision Removal of code from Base or the repository needs docs Documentation for this change is required
Milestone

Comments

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Apr 28, 2016

0.5 Major tasks

Round 1

Round 2

Round 3

Cleanup tasks


0.6 Major tasks

Round 4

  • Base: change Char representation (allow lossless string processing of any data)
  • Base: remove RepString (moved to LegacyStrings)
  • Base: remove RevString (move to package?)
  • Base: merge SubString and String (add offset field to String)

Cleanup tasks

@StefanKarpinski StefanKarpinski self-assigned this Apr 28, 2016
@StefanKarpinski StefanKarpinski added domain:unicode Related to unicode characters and encodings domain:strings "Strings!" labels Apr 28, 2016
@StefanKarpinski StefanKarpinski added this to the 0.5.0 milestone Apr 28, 2016
@JeffBezanson
Copy link
Sponsor Member

Great list.

What about windows APIs that use utf-16?

@StefanKarpinski
Copy link
Sponsor Member Author

What about windows APIs that use utf-16?

Already taken care of: #15033.

Two more potential rounds:

  • removing RepString and RevString
  • merging SubString and String

Might not get to those until the next release though.

@quinnj
Copy link
Member

quinnj commented Apr 28, 2016

What about:

I definitely think there's a lot to play with around merging Substring and String, but it certainly feels like 0.6 material.

@tkelman
Copy link
Contributor

tkelman commented Apr 29, 2016

It was just added, but now that readstring(io) is just String(read(io)) it maybe doesn't even need a separate name forever

@lobingera
Copy link

just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?

@nalimilan
Copy link
Member

nalimilan commented Apr 29, 2016

+1 for including as much of this plan as possible into 0.5. Breakage better happen soon.

As regards moving ASCIIString, etc. to a package (round 3, bullet 2), see discussion on an implementation for any encoding in StringEncodings.jl here.

As regards changing Char's underlying representation(round 4, bullet 1), another step would also be to introduce AbstractChar and use it in method signatures to allow e.g. the ASCIIString replacement to implement ASCIIChar <: AbstractChar as a UInt8. This would allow people working with ASCII data to actually enjoy higher performance than before.

EDIT: Finally, I'd like to see a discussion regarding the opportunity of ensuring that String only holds valid UTF-8 data or not, and how to handle file paths (special string type or not). But I don't want to derail this already rich thread, so maybe better open a separate issue for that?

@StefanKarpinski
Copy link
Sponsor Member Author

just for my curiosity: Why should this be part of 0.5? The title of the release had something to do with Arrays ... not Strings?

Because I've been working on this for months and it's ready to go. It won't hold up the release anyway.

@lobingera
Copy link

I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe i'm just wrong that this causes effort ...

@nalimilan
Copy link
Member

nalimilan commented Apr 29, 2016

I somehow agree, it will not hold the release of the language. But more syntax changes in 0.5 give some impact in porting packages to 0.5; maybe i'm just wrong that this causes effort ...

And more syntax changes in 0.6 will cause effort for even more packages. And given the growth rate of the package ecosystem...

@stevengj
Copy link
Member

#15033 didn't fully provide a path for external packages to call Windows APIs, without calling internal functions.

@StefanKarpinski
Copy link
Sponsor Member Author

True – added to the roadmap.

@StefanKarpinski
Copy link
Sponsor Member Author

Actually, question here: do people think we should keep ascii and have it convert strings to standard String type and error if the content is not plain ASCII? It's kind of a useful function to have.

@StefanKarpinski
Copy link
Sponsor Member Author

Another question about behavior. String and string don't actually behave in the same manner always:

julia> String(UInt8[97,98,99])
"abc"

julia> string(UInt8[97,98,99])
"UInt8[97,98,99]"

Any thoughts on resolving this? Currently utf8 and ascii behave like String not like string.

@stevengj
Copy link
Member

stevengj commented May 6, 2016

Maybe String! if it takes ownership of the array, i.e. is "in-place"?

There is also the bytestring function, whose name seems like a holdover from ByteString, but whose function is essential.

@StefanKarpinski
Copy link
Sponsor Member Author

I'm pretty sure that usage of ! does not have the @JeffBezanson seal of approval.

@stevengj
Copy link
Member

stevengj commented May 6, 2016

String¡, then.

@JeffBezanson
Copy link
Sponsor Member

This case doesn't bother me that much --- unlike utf8 and ascii, string is not an encoding.

@stevengj
Copy link
Member

stevengj commented May 6, 2016

The point is, we need some function to replace bytestring(::Vector{UInt8}) and bytestring(::Ptr{UInt8}, [len]) (makes a copy of bytes), but also UTF8String(::Vector{UInt8}) and pointer_to_string (doesn't make a copy).

It makes sense to me to name them all the same thing, with ! for the "in-place" versions. But I don't know what that name should be. Maybe bytestring and bytestring!, where byte refers to the encoding? Or utf8 and utf8! to be more explicit about the encoding, with ascii doing an additional assert?

@JeffBezanson
Copy link
Sponsor Member

Can we just keep bytestring?

@stevengj
Copy link
Member

stevengj commented May 6, 2016

@JeffBezanson, bytestring is fine if a bit vague, but it always makes a copy. Would you be okay with bytestring! for the non-copying version?

@JeffBezanson
Copy link
Sponsor Member

I'm not prepared to make this the first ever case where ! means something other than mutation.

@stevengj
Copy link
Member

stevengj commented May 6, 2016

Then what do we call the non-copying version(s) of bytestring?

@JeffBezanson
Copy link
Sponsor Member

I've probably missed some of the discussion here, but can String have the same constructors that UTF8String had? Plus one with a Ptr argument?

@stevengj
Copy link
Member

stevengj commented May 6, 2016

@JeffBezanson, the problem is that then it can't replace string.

@JeffBezanson
Copy link
Sponsor Member

That's fine with me. Somehow we need one function that wraps a UInt8 vector as a string, and another that gives you the output of print as a string. We could rename string to something like sprint, except that's taken.

@StefanKarpinski
Copy link
Sponsor Member Author

TODO: Cstring and Cwstring should ensure that string data is NUL terminated (as well as NUL free).

@StefanKarpinski
Copy link
Sponsor Member Author

Ref #16499. Also need a way to express conversion of String to non-NUL-terminated UTF-16 data with known length. Perhaps convert(Vector{UInt16}, s)?

@stevengj
Copy link
Member

stevengj commented Jun 2, 2016

@StefanKarpinski, in previous incarnations, any UInt8 array allocated by Julia was automatically NUL-terminated internally; is this no longer the case? And UTF16String and UTF32String were NUL-terminated, so convert(Cwstring, string) was also.

@JeffBezanson
Copy link
Sponsor Member

Which items are slated for 0.5? Through round 3, IIRC?

@StefanKarpinski
Copy link
Sponsor Member Author

Yes, that's correct. Tomorrow/Wednesday I need to create a LegacyStrings package, put all the Unicode stuff in it and then merge my PR that removes all of that stuff with deprecations that point at it.

@JeffBezanson
Copy link
Sponsor Member

PR is up to remove RepString; it has already been added to LegacyStrings. RevString is used by some Base functions so we might want to leave it for now. Anything else here planned for 0.6?

@StefanKarpinski
Copy link
Sponsor Member Author

While you're doing string stuff, it's probably not too hard to just actually do utf-8 reversal on strings instead of using the RevString type – that would advance the highlander agenda just a bit further. (If you feel like it and have some spare type while waiting for type revamp test to run or something.)

@StefanKarpinski
Copy link
Sponsor Member Author

Other than removing RevString everything that's likely to be done is already done.

@StefanKarpinski StefanKarpinski added the kind:excision Removal of code from Base or the repository label Sep 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings kind:excision Removal of code from Base or the repository needs docs Documentation for this change is required
Projects
None yet
Development

No branches or pull requests

8 participants