Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify WeakRefStrings purpose #21

Merged
merged 6 commits into from
Jan 5, 2018
Merged

Clarify WeakRefStrings purpose #21

merged 6 commits into from
Jan 5, 2018

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Dec 24, 2017

Remove a few pitfall functions that allowed direct processing of WeakRefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors

cc: @nalimilan @alyst: I see this as step one to the plan in JuliaData/CSV.jl#140

Should resolve #19 (@andreasnoack).

…RefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors
@codecov
Copy link

codecov bot commented Dec 24, 2017

Codecov Report

Merging #21 into master will increase coverage by 0.86%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #21      +/-   ##
==========================================
+ Coverage   78.72%   79.59%   +0.86%     
==========================================
  Files           1        1              
  Lines          47       49       +2     
==========================================
+ Hits           37       39       +2     
  Misses         10       10
Impacted Files Coverage Δ
src/WeakRefStrings.jl 79.59% <85.71%> (+0.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69d5ff8...a6c3eb3. Read the comment docs.

@alyst
Copy link
Contributor

alyst commented Dec 24, 2017

The key difference w.r.t #20 is that here WeakRefStringsArray{T} <: AbstractArray{T}.
I.e. for dispatch it's still an array of WeakRefString (or Union{WeakRefString, Missing}) elements, and only eltype(A) says it's String (or Union{String, Missing}).
So if e.g A::WeakRefStringsArray{WeakRefString{UInt8},1}, then B = A[1:2] yields B::Array{WeakRefString{UInt8},1}, so if A is gc-ed, the elements in B will point to invalid memory, and we are back to dataframe corruption problems of JuliaData/CSV.jl#132.

@nalimilan
Copy link
Member

nalimilan commented Dec 24, 2017

Yeah, I think this is mostly orthogonal to #20. I wonder whether it's a good idea to remove features, though. As long as CSV doesn't export WeakRefString and that indexing a WeakRefStringArray only returns String, there's no risk people will use them without being aware of the risks. EDIT: after reading #19, I understand the removed functions were broken anyway, so that's fine with me.

BTW, I wonder whether we could find a type which would be easier to understand, especially for WeakRefStringArray, which does not actually use weak references AFAICT since it ensures the backing memory isn't freed, and doesn't return WeakRefString entries now.

WeakRefStringArray(data::Vector{UInt8}, ::Type{Union{Missing, T}}, rows::Integer) where {T} = WeakRefStringArray(Any[data], Vector{Union{Missing, T}}(rows))
WeakRefStringArray(data::Vector{UInt8}, A::Array{T}) where {T <: Union{WeakRefString, Missing}} = WeakRefStringArray(Any[data], A)
init(::Type{T}, rows) where {T} = fill(zero(T), rows)
if !isdefined(Base, :uninitialized)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though Compat provided this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only compat definition needed currently, so easier to just provide the single definition.


wk(w::WeakRefString) = string(w)
wk(::Missing) = missing

Base.size(A::WeakRefStringArray) = size(A.elements)
Base.eltype(A::WeakRefStringArray{T}) where {T <: WeakRefString} = String
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These shouldn't be needed if WeakRefStringArray correctly inherited from AbstractArray{String} or AbstractArray{Union{String, Missing}}.

@alyst
Copy link
Contributor

alyst commented Dec 25, 2017

BTW, I wonder whether we could find a type which would be easier to understand, especially for WeakRefStringArray, which does not actually use weak references AFAICT since it ensures the backing memory isn't freed, and doesn't return WeakRefString entries now.

What's worrysome to me with the current implementation is that upon any string-like operation WeakRefString calls string() each time creating a copy of the string. So if WeakRefStringArray stays longer than just for CSV parsing, it might actually be less effective than Array{String}. OTOH in my experience many dataframe columns are either not used at all or dropped immediately after the frame is read. So it still makes sense to keep WeakRefStringArray in the resulting dataframe to avoid unnecessary allocation of strings. I was just wondering whether WeakRefStringArray could be transformed into something like:

struct LazyStringArray{T <: LazyString, N} <: AbstractArray{String, N}
    data::Vector{Any}
    elements::Array{Union{String, T}, N}
end

The idea is that if elements[i] isa LazyString, it's substituted with string(elements[i]), and all subsequent accesses return that same string. It could also be extended to handle missing data (if 0.7 doesn't support efficient storage of elements::Array{Union{String, T, Missing}, N}, it should be possible to use 2 arrays as a workaround). With this design LazyString (aka WeakRefString) is just a tuple of string start and length, so it could be kept completely internal to the package.

Actually, this could be generalized to other lazy element allocation usecases, and I wonder whether something like that already exist (ala MappedArrays.jl, Lazy.jl)

@alyst
Copy link
Contributor

alyst commented Dec 25, 2017

The other question I had is how much

struct WeakRefString
    ptr::Ptr{UInt8}
    len::Int
end

is faster/smaller than

struct LazyString
    data::Array{Uint8}
    start::Int
    len::Int
end

that really references the data.

@nalimilan
Copy link
Member

I think we could also have the data field be an Array{String}/Array{Union{String,Missing}, but leave entries that have never been accessed as undef. You'd just need to call isassigned before accessing entries.

Though that LazyStringArray approach is really equivalent to filling an Array{String} progressively as entries are accessed. So maybe we should just return an Array{String}, and leave WeakRefStringArray opt-in, for cases where you know it's more efficient to recreate String objects on every getindex call.

@alyst
Copy link
Contributor

alyst commented Dec 25, 2017

Though that LazyStringArray approach is really equivalent to filling an Array{String} progressively as entries are accessed. So maybe we should just return an Array{String}, and leave WeakRefStringArray opt-in, for cases where you know it's more efficient to recreate String objects on every getindex call.

If all columns/rows of the dataframe created by CSV.read() are used, then IIUC it's certainly better/easier to return Array{String} right away. But that's not always the case. Suppose you are loading the table of all protein sequences of a certain organism (ID column + sequence string column), but then you keep (by looking at the IDs) only those that are observed in your experiment -- with the lazy approach you just never access and never allocate unobserved sequences.

@quinnj
Copy link
Member Author

quinnj commented Dec 26, 2017

On the question of storing data::Vector{UInt8} in a LazyString directly, it makes a huge difference. What you describe is essentially the current definition of SubString, and will always be heap allocated because Array is a mutable object. By only storing a pointer and length, it's a true immutable object that can be stack allocated. That's a big part of the entire efficiencies WeakRefStrings provide.

I'm also not entirely sure on the need to inherit from AbstractArray{String, N} here; it feels weird to "spoof" our element type on a type definition level as opposed to just overloading eltype. Like, what if someone defined a method specifically for AbstractVector{String}, assuming String layout, yet a WeakRefStringArray would get dispatched to it? The "GC-corruption bug" @alyst mentioned in JuliaData/CSV#132 is just a simple bug in how I understood the AbstractArray interface to work. I thought if you implemented the scalar getindex, the multidimensional would utilize the scalar, but that doesn't quite seem to be the case. I've fixed that issue, however, in my latest commit.

@nalimilan
Copy link
Member

I'm also not entirely sure on the need to inherit from AbstractArray{String, N} here; it feels weird to "spoof" our element type on a type definition level as opposed to just overloading eltype. Like, what if someone defined a method specifically for AbstractVector{String}, assuming String layout, yet a WeakRefStringArray would get dispatched to it? The "GC-corruption bug" @alyst mentioned in JuliaData/CSV#132 is just a simple bug in how I understood the AbstractArray interface to work. I thought if you implemented the scalar getindex, the multidimensional would utilize the scalar, but that doesn't quite seem to be the case. I've fixed that issue, however, in my latest commit.

AFAIK the T in AbstractArray{T} is supposed to reflect the result of eltype. It doesn't say anything about the storage type -- and how could it do that, given that there's no defined interface other than getindex to access the elements of an AbstractArray? CategoricalArray{T} is an AbstractArray{CategoricalValue{T}}, yet it doesn't actually store the entries as CategoricalValue.

And why would you want dispatch not to accept a WeakRefStringArray as an AbstractArray{String}? It behaves like an array of strings, so it should definitely be considered as such.

@quinnj
Copy link
Member Author

quinnj commented Dec 26, 2017

It will be considered as AbstractArray{<:AbstractString}, which seems good enough to me? I like how simple the type definition and methods are right now for the type, and I feel like the extra type parameter gymnastics in order to subtype AbstractArray{String} seem not worth it.

@nalimilan
Copy link
Member

"Good enough" doesn't seem enough to me when we can easily implement the correct definition (and already have code by @alyst for that). We've been that way with DataArray already, and in the end it created small problems in many cases which we would better have avoided from the start, rather than having to break people's code later.

@quinnj
Copy link
Member Author

quinnj commented Dec 26, 2017

But what if we end up implementing a proper AbstractString interface for WeakRefString directly? And all the sudden we actually want the element type to be WeakRefString? All that changes in my current definition here is we change what eltype returns, as opposed to more (breaking) type parameter gymnastics to undo a hard-coded String eltype.

If there are actual cases where subtyping AbstractArray{AbstractString} instead of AbstractArray{String} will be problematic, I'm all ears; but I've yet to see a compelling case here that would be worth the extra code complexity.

@alyst
Copy link
Contributor

alyst commented Dec 26, 2017

By only storing a pointer and length, it's a true immutable object that can be stack allocated. That's a big part of the entire efficiencies WeakRefStrings provide.

👍

as opposed to more (breaking) type parameter gymnastics

Nah, it was not a "type parameter gymnastics". We can do much more overcomplicated things than that ;)

Actually, I agree with @nalimilan. I think the issue of WeakRefStringsArray <: AbstractArray{T} inheritance is straightforward:

  • I always thought that eltype() is just a convenience method to get T
  • T should be the type returned by simple getindex() (see What does T mean in AbstractArray{T}? JuliaLang/julia#9586). If it's not the case, then WeakRefStringsArray breaks AbstractArray concept and will have to override (and maintain) a lot of Base AbstractArray methods (similar(), copy() etc).
  • T should be concrete and immutable for performance reasons
  • T should not be WeakRefString, because if it leaks outside of WeakRefStringsArray, there would be GC-related memory corruption problems. Also, CategoricalArray is an example of a container whose elements (by design) don't have sense without the container. That introduces a lot of issues, yet CategoricalValue at least has a proper reference to the pool.

I think that leaves us with T === String.

@nalimilan
Copy link
Member

See what the manual says:

Note that it's very important to specify the two parameters of the AbstractArray; the first defines the eltype(), and the second defines the ndims().

Anyway, moving from AbstractArray{String} to AbstractArray{WeakRefString} isn't more breaking than changing the result of eltype.

@andreasnoack
Copy link
Member

Thanks for making these changes

What you describe is essentially the current definition of SubString, and will always be heap allocated because Array is a mutable object.

I don't think this is so simple anymore. @yuyichao has improved the compiler such that some of these allocations can be avoided. I also think that Base.String is quite a bit faster than when WeakRefStrings was created. Switching from WeakRefString so Base.String in TextParse had a quite limited effect on the performance when reading a very large and text-heavy dataset so it might be worth doing some benchmarks to figure out exactly when unsafe optimizations are really worth it.

@quinnj
Copy link
Member Author

quinnj commented Jan 4, 2018

Alright, thanks for the encouragement @alyst and @nalimilan; I've updated the type parameters here to better reflect the AbstractArray{T} eltype.

I've also done a few rounds of benchmarking w/ parsing a single string up to reading a million strings w/ CSV; while the improvements are minor on the small scale, WeakRefStrings still really provide huge performance benefits when reading lots and lots of string values.

@alyst
Copy link
Contributor

alyst commented Jan 4, 2018

Great! Do you mean CSV.jl benchmarks or some internal ones? Would be nice to have some benchmark scripts in WeakRefString.jl.

Also CI is green on 0.7, but I guess if eltype() tests from #20 would be readded, it would still fail due to JuliaLang/julia#25240

@quinnj quinnj merged commit b5d2543 into master Jan 5, 2018
@quinnj quinnj deleted the jq/updates branch January 5, 2018 04:02
@quinnj
Copy link
Member Author

quinnj commented Jan 5, 2018

@alyst, I didn't see any issues adding the eltypes from the other PR 🤷‍♂️ .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Correctness and performance issues for iteration
4 participants