Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace ASCIIString & UTF8String with String #16058

Merged
merged 5 commits into from
May 4, 2016
Merged

Conversation

StefanKarpinski
Copy link
Sponsor Member

@StefanKarpinski StefanKarpinski commented Apr 26, 2016

It's definitely well past time to get moving on this change, and I've tried bigger PRs but I think I need to start smaller. Here I'm separating out just the part that merges ASCIIString and UTF8String and replaces them with a single String type, which is basically just a rename of UTF8String. The deprecation is pretty minimal, we'll have to run PkgEval and figure out what breaks out there and try to mitigate.

This is kind of an awkward middle state since UTF8String no longer exists but UTF16String and UTF32String still do. Eventually, these should all move out into a package: if you want fully validated UTF8-encoded strings, then you should use a UTF8String type, etc. The Base String type will then be the somewhat permissive default string type that can handle any UTF8-like data, valid or not.

@quinnj
Copy link
Member

quinnj commented Apr 26, 2016

145 files changed == smaller change, eh? :trollface:

convert(::Type{ASCIIString}, a::Array{UInt8,1}, invalids_as::AbstractString) =
convert(ASCIIString, a, ascii(invalids_as))
convert(::Type{ASCIIString}, s::AbstractString) = ascii(bytestring(s))
ascii(x) = ascii(convert(String, x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So are we going to keep the ascii methods here around instead of deprecating? I guess there's still a use-case for when you want to ensure your string is only ASCII characters?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the first stage, I figured it would be less disruptive to keep it. We should probably get rid of ascii and utf8 later.

Copy link
Member

@nalimilan nalimilan Apr 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not deprecate them at the same time as the type renaming? It's easier to fix everything at the same time for users.

Also, ascii should definitely continue checking that the input is ASCII, for both safety and reliability. EDIT: I misread the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I say "at the same time", I mean that it should be done soon, but not necessarily in this PR, which is already big enough.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan: that's the idea.

Copy link
Contributor

@tkelman tkelman Apr 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this ascii(x) method verify its input is ascii? nevermind it does so, got confused

@iamed2
Copy link
Contributor

iamed2 commented Apr 26, 2016

The downside of losing an ASCIIString type is that wrapping an API that only accepts ASCII becomes harder as you can't rely on the type system to help you. Perhaps there could be an ASCIIString type that just wraps String but ensures all content isascii?

@StefanKarpinski
Copy link
Sponsor Member Author

Yeah, this change is a beast but there's even more to come...

@@ -82,7 +82,7 @@ finalize(o::ANY) = ccall(:jl_finalize, Void, (Any,), o)
gc(full::Bool=true) = ccall(:jl_gc_collect, Void, (Cint,), full)
gc_enable(on::Bool) = ccall(:jl_gc_enable, Cint, (Cint,), on)!=0

bytestring(str::ByteString) = str
bytestring(str::String) = str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my question above about ascii; is there a reason to keep bytestring around as opposed to deprecating? I also understand that we want to keep things minimal, but it might be nice to just mark the places we want to follow up on later.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review, @quinnj. Yes, we should deprecate all of ascii, utf8, bytestring, maybe string and replace them all with String, which will take any number of values, convert each to a string and produce a single String object for the concatenation of all of those.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started down that road, but that's a very big sweater once you start pulling the strings, and this change plus that one were just too much when combined.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, understood. Definitely excited for this and I'll try to take some more time to review/test things out to help out.

@StefanKarpinski
Copy link
Sponsor Member Author

@iamed2: that can be provided by an ASCII string type in an external package.

@@ -277,8 +273,6 @@ function readdlm_string(sbuff::ByteString, dlm::Char, T::Type, eol::Char, auto::
catch ex
if isa(ex, TypeError) && (ex.func == :store_cell)
T = ex.expected
elseif get(optsd, :ignore_invalid_chars, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this option need to be removed from docs and elsewhere in the code?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll just reinstate it instead.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this whole option doesn't really make sense, which is why I deleted it. The transformation is now a no-op.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I hate this file. It should not be in Base, imo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OT: Judging by the frequent complaints about performance we get, I also think these functions should be deprecated in favor of CSV.jl as soon as it's ready. CSV.jl+DataStreams.jl will also unify reading data into different structures like Array, DataFrame, TimeSeries, data bases, etc.

@ivarne ivarne added this to the 0.6.0 milestone Apr 27, 2016
data::Array{UInt8,1}
ASCIIString(d::Array{UInt8,1}) = new(d)
String(d::Array{UInt8,1}) = new(d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this inner constructor useless?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is. Of course, the old code had it too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not remove it then, we're likely to forget about it after merging.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I'm trying to get this PR merged before someone introduces more conflicts and/or breaks the tests. You have no idea how many times I've rebased this and how often it breaks when someone changes something seemingly unrelated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was indeed my concern too (hence the quick review). Go ahead then.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is no longer fresh and I'm going to have to rebase and rerun CI anyway, I'll include this change (and a few others) when I do.

@StefanKarpinski StefanKarpinski force-pushed the sk/highlander1 branch 2 times, most recently from dac2e0c to d74cbe7 Compare April 28, 2016 19:41
@StefanKarpinski StefanKarpinski mentioned this pull request Apr 28, 2016
32 tasks
@StefanKarpinski
Copy link
Sponsor Member Author

I'm inclined to merge this but it's going to wreak havoc on packages. Thoughts?

@StefanKarpinski
Copy link
Sponsor Member Author

I guess I should add String to Compat first so that people can fix their packages.

@@ -3728,7 +3728,7 @@ second variant.
popdisplay

"""
readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, ignore_invalid_chars=false, quotes=true, dims, comments=true, comment_char='#')
readdlm(source, delim::Char, T::Type, eol::Char; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=true, comment_char='#')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's another mention of ignore_invalid_chars below

@StefanKarpinski StefanKarpinski merged commit 5de52cf into master May 4, 2016
@StefanKarpinski StefanKarpinski deleted the sk/highlander1 branch May 4, 2016 16:44
@quinnj
Copy link
Member

quinnj commented May 4, 2016

Nice!

@Keno
Copy link
Member

Keno commented May 4, 2016

Lineedit change is due to prevind behavior change:

1|debug > n
In REPL.jl:538
568           if match != 0:-1 && h != response_str && haskey(hist.mode_mapping, hist.modes[idx])
569               truncate(response_buffer, 0)
570               write(response_buffer, h)
571               seek(response_buffer, prevind(response_str, first(match)))
572               hist.cur_idx = idx
573               return true

Before:

1|julia > prevind("ll", first(5:5))
4

After:

1|julia > prevind("ll", first(5:5))
2

I believe the following is the correct patch:

diff --git a/base/REPL.jl b/base/REPL.jl
index bc3e4f9..90929b0 100644
--- a/base/REPL.jl
+++ b/base/REPL.jl
@@ -568,7 +568,7 @@ function history_search(hist::REPLHistoryProvider, query_buffer::IOBuffer, respo
         if match != 0:-1 && h != response_str && haskey(hist.mode_mapping, hist.modes[idx])
             truncate(response_buffer, 0)
             write(response_buffer, h)
-            seek(response_buffer, prevind(response_str, first(match)))
+            seek(response_buffer, prevind(h, first(match)))
             hist.cur_idx = idx
             return true
         end

@StefanKarpinski
Copy link
Sponsor Member Author

Ah, I see. That's pretty subtle. Thanks for figuring it out.

@nalimilan
Copy link
Member

Since "ll"[1:5] throws a BoundsError, shouldn't prevind("ll", 5) do the same? That would catch this kind of weird behavior.

@vtjnash
Copy link
Sponsor Member

vtjnash commented May 5, 2016

This PR seems to have dropped code coverage by nearly 10% (https://codecov.io/gh/JuliaLang/julia/commit/5de52cf9c9343cfcf50be4c7c736290d3f985961/changes).

@yuyichao
Copy link
Contributor

yuyichao commented May 5, 2016

Is this expected?

julia> String("1")
ERROR: MethodError: no method matching convert(::Type{Array{UInt8,1}}, ::String)
you may have intended to import Base.convert
Closest candidates are:
  convert(::Type{Any}, ::ANY)
  convert{T}(::Type{T}, ::T)
 in String(::String) at ./boot.jl:222
 in eval(::Module, ::Any) at ./boot.jl:228

@StefanKarpinski
Copy link
Sponsor Member Author

StefanKarpinski commented May 5, 2016

Yeah, turns out there was a reason to leave this "useless" inner constructor in and it's the problem @yuyichao just found. I think I discovered that months ago and had left it in for a reason. That's the problem with really long-running PRs. Fixed here: #16212.

@tkelman
Copy link
Contributor

tkelman commented May 5, 2016

Hard to believe that wasn't tested.

@StefanKarpinski
Copy link
Sponsor Member Author

StefanKarpinski commented May 5, 2016

@tkelman: Yeah, I made a PR but I'm also adding tests to it for that.

@vtjnash: I'm not sure what's up with that coverage drop. Some of it is legit in that we relied on ACSIIString and UTF8String to exercise different code paths. But most of it seems totally unrelated – if you look at the actual changes, they look like code paths that don't depend on string types at all. I wonder if the code coverage machinery broke somehow?

@tkelman
Copy link
Contributor

tkelman commented May 5, 2016

The coverage tests run with inlining off. There are occasionally tests that fail in that situation but pass when inlining is on. https://build.julialang.org/builders/coverage_ubuntu14.04-x64/builds/216/steps/Run%20non-inlined%20tests/logs/stdio

read
ERROR: LoadError: MethodError: no method matching convert(::Type{Array{UInt8,1}}, ::Array{Char,1})
you may have intended to import Base.convert
Closest candidates are:
  convert(!Matched::Type{Any}, ::ANY)
  convert{T}(::Type{T}, !Matched::T)
 in String(::Array{Char,1}) at ./boot.jl:222
 in (::##895#909)(::String) at /home/ubuntu/buildbot/slave/coverage_ubuntu14_04-x64/build/julia-6c67baee1a/share/julia/test/read.jl:179
 in mktempdir(::##895#909, ::String) at ./file.jl:272
 in mktempdir(::Function) at ./file.jl:270
 in include_from_node1(::String) at ./loading.jl:426
 in (::CoverageBase.##13#14{Array{String,1}})() at /home/ubuntu/.julia/v0.5/CoverageBase/src/CoverageBase.jl:41
 in cd(::CoverageBase.##13#14{Array{String,1}}, ::String) at ./file.jl:48
 in runtests(::Array{String,1}) at /home/ubuntu/.julia/v0.5/CoverageBase/src/CoverageBase.jl:38
 in eval(::Module, ::Any) at ./boot.jl:228
 [inlined code] from ./sysimg.jl:11
 in process_options(::Base.JLOptions) at ./client.jl:240
 in _start() at ./client.jl:319
while loading /home/ubuntu/buildbot/slave/coverage_ubuntu14_04-x64/build/julia-6c67baee1a/share/julia/test/read.jl, in expression starting on line 3

@StefanKarpinski
Copy link
Sponsor Member Author

Ah, I guess I should make sure that tests pass with inlining off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants