Clarify WeakRefStrings purpose #21

quinnj · 2017-12-24T06:11:37Z

Remove a few pitfall functions that allowed direct processing of WeakRefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors

cc: @nalimilan @alyst: I see this as step one to the plan in JuliaData/CSV.jl#140

Should resolve #19 (@andreasnoack).

…RefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors

codecov · 2017-12-24T06:23:08Z

Codecov Report

Merging #21 into master will increase coverage by 0.86%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master      #21      +/-   ##
==========================================
+ Coverage   78.72%   79.59%   +0.86%     
==========================================
  Files           1        1              
  Lines          47       49       +2     
==========================================
+ Hits           37       39       +2     
  Misses         10       10

Impacted Files	Coverage Δ
src/WeakRefStrings.jl	`79.59% <85.71%> (+0.86%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 69d5ff8...a6c3eb3. Read the comment docs.

alyst · 2017-12-24T10:07:08Z

The key difference w.r.t #20 is that here WeakRefStringsArray{T} <: AbstractArray{T}.
I.e. for dispatch it's still an array of WeakRefString (or Union{WeakRefString, Missing}) elements, and only eltype(A) says it's String (or Union{String, Missing}).
So if e.g A::WeakRefStringsArray{WeakRefString{UInt8},1}, then B = A[1:2] yields B::Array{WeakRefString{UInt8},1}, so if A is gc-ed, the elements in B will point to invalid memory, and we are back to dataframe corruption problems of JuliaData/CSV.jl#132.

nalimilan · 2017-12-24T11:51:34Z

Yeah, I think this is mostly orthogonal to #20. I wonder whether it's a good idea to remove features, though. As long as CSV doesn't export WeakRefString and that indexing a WeakRefStringArray only returns String, there's no risk people will use them without being aware of the risks. EDIT: after reading #19, I understand the removed functions were broken anyway, so that's fine with me.

BTW, I wonder whether we could find a type which would be easier to understand, especially for WeakRefStringArray, which does not actually use weak references AFAICT since it ensures the backing memory isn't freed, and doesn't return WeakRefString entries now.

nalimilan · 2017-12-24T11:46:38Z

src/WeakRefStrings.jl

-WeakRefStringArray(data::Vector{UInt8}, ::Type{Union{Missing, T}}, rows::Integer) where {T} = WeakRefStringArray(Any[data], Vector{Union{Missing, T}}(rows))
-WeakRefStringArray(data::Vector{UInt8}, A::Array{T}) where {T <: Union{WeakRefString, Missing}} = WeakRefStringArray(Any[data], A)
+init(::Type{T}, rows) where {T} = fill(zero(T), rows)
+if !isdefined(Base, :uninitialized)


I though Compat provided this?

This is the only compat definition needed currently, so easier to just provide the single definition.

nalimilan · 2017-12-24T11:47:20Z

src/WeakRefStrings.jl


 wk(w::WeakRefString) = string(w)
 wk(::Missing) = missing

 Base.size(A::WeakRefStringArray) = size(A.elements)
+Base.eltype(A::WeakRefStringArray{T}) where {T <: WeakRefString} = String


These shouldn't be needed if WeakRefStringArray correctly inherited from AbstractArray{String} or AbstractArray{Union{String, Missing}}.

alyst · 2017-12-25T22:08:43Z

BTW, I wonder whether we could find a type which would be easier to understand, especially for WeakRefStringArray, which does not actually use weak references AFAICT since it ensures the backing memory isn't freed, and doesn't return WeakRefString entries now.

What's worrysome to me with the current implementation is that upon any string-like operation WeakRefString calls string() each time creating a copy of the string. So if WeakRefStringArray stays longer than just for CSV parsing, it might actually be less effective than Array{String}. OTOH in my experience many dataframe columns are either not used at all or dropped immediately after the frame is read. So it still makes sense to keep WeakRefStringArray in the resulting dataframe to avoid unnecessary allocation of strings. I was just wondering whether WeakRefStringArray could be transformed into something like:

struct LazyStringArray{T <: LazyString, N} <: AbstractArray{String, N}
    data::Vector{Any}
    elements::Array{Union{String, T}, N}
end

The idea is that if elements[i] isa LazyString, it's substituted with string(elements[i]), and all subsequent accesses return that same string. It could also be extended to handle missing data (if 0.7 doesn't support efficient storage of elements::Array{Union{String, T, Missing}, N}, it should be possible to use 2 arrays as a workaround). With this design LazyString (aka WeakRefString) is just a tuple of string start and length, so it could be kept completely internal to the package.

Actually, this could be generalized to other lazy element allocation usecases, and I wonder whether something like that already exist (ala MappedArrays.jl, Lazy.jl)

alyst · 2017-12-25T22:17:11Z

The other question I had is how much

struct WeakRefString
    ptr::Ptr{UInt8}
    len::Int
end

is faster/smaller than

struct LazyString
    data::Array{Uint8}
    start::Int
    len::Int
end

that really references the data.

nalimilan · 2017-12-25T22:29:46Z

I think we could also have the data field be an Array{String}/Array{Union{String,Missing}, but leave entries that have never been accessed as undef. You'd just need to call isassigned before accessing entries.

Though that LazyStringArray approach is really equivalent to filling an Array{String} progressively as entries are accessed. So maybe we should just return an Array{String}, and leave WeakRefStringArray opt-in, for cases where you know it's more efficient to recreate String objects on every getindex call.

alyst · 2017-12-25T22:56:09Z

Though that LazyStringArray approach is really equivalent to filling an Array{String} progressively as entries are accessed. So maybe we should just return an Array{String}, and leave WeakRefStringArray opt-in, for cases where you know it's more efficient to recreate String objects on every getindex call.

If all columns/rows of the dataframe created by CSV.read() are used, then IIUC it's certainly better/easier to return Array{String} right away. But that's not always the case. Suppose you are loading the table of all protein sequences of a certain organism (ID column + sequence string column), but then you keep (by looking at the IDs) only those that are observed in your experiment -- with the lazy approach you just never access and never allocate unobserved sequences.

quinnj · 2017-12-26T14:11:28Z

On the question of storing data::Vector{UInt8} in a LazyString directly, it makes a huge difference. What you describe is essentially the current definition of SubString, and will always be heap allocated because Array is a mutable object. By only storing a pointer and length, it's a true immutable object that can be stack allocated. That's a big part of the entire efficiencies WeakRefStrings provide.

I'm also not entirely sure on the need to inherit from AbstractArray{String, N} here; it feels weird to "spoof" our element type on a type definition level as opposed to just overloading eltype. Like, what if someone defined a method specifically for AbstractVector{String}, assuming String layout, yet a WeakRefStringArray would get dispatched to it? The "GC-corruption bug" @alyst mentioned in JuliaData/CSV#132 is just a simple bug in how I understood the AbstractArray interface to work. I thought if you implemented the scalar getindex, the multidimensional would utilize the scalar, but that doesn't quite seem to be the case. I've fixed that issue, however, in my latest commit.

nalimilan · 2017-12-26T14:23:51Z

I'm also not entirely sure on the need to inherit from AbstractArray{String, N} here; it feels weird to "spoof" our element type on a type definition level as opposed to just overloading eltype. Like, what if someone defined a method specifically for AbstractVector{String}, assuming String layout, yet a WeakRefStringArray would get dispatched to it? The "GC-corruption bug" @alyst mentioned in JuliaData/CSV#132 is just a simple bug in how I understood the AbstractArray interface to work. I thought if you implemented the scalar getindex, the multidimensional would utilize the scalar, but that doesn't quite seem to be the case. I've fixed that issue, however, in my latest commit.

AFAIK the T in AbstractArray{T} is supposed to reflect the result of eltype. It doesn't say anything about the storage type -- and how could it do that, given that there's no defined interface other than getindex to access the elements of an AbstractArray? CategoricalArray{T} is an AbstractArray{CategoricalValue{T}}, yet it doesn't actually store the entries as CategoricalValue.

And why would you want dispatch not to accept a WeakRefStringArray as an AbstractArray{String}? It behaves like an array of strings, so it should definitely be considered as such.

quinnj · 2017-12-26T14:28:19Z

It will be considered as AbstractArray{<:AbstractString}, which seems good enough to me? I like how simple the type definition and methods are right now for the type, and I feel like the extra type parameter gymnastics in order to subtype AbstractArray{String} seem not worth it.

nalimilan · 2017-12-26T14:48:20Z

"Good enough" doesn't seem enough to me when we can easily implement the correct definition (and already have code by @alyst for that). We've been that way with DataArray already, and in the end it created small problems in many cases which we would better have avoided from the start, rather than having to break people's code later.

quinnj · 2017-12-26T14:57:21Z

But what if we end up implementing a proper AbstractString interface for WeakRefString directly? And all the sudden we actually want the element type to be WeakRefString? All that changes in my current definition here is we change what eltype returns, as opposed to more (breaking) type parameter gymnastics to undo a hard-coded String eltype.

If there are actual cases where subtyping AbstractArray{AbstractString} instead of AbstractArray{String} will be problematic, I'm all ears; but I've yet to see a compelling case here that would be worth the extra code complexity.

alyst · 2017-12-26T21:20:19Z

By only storing a pointer and length, it's a true immutable object that can be stack allocated. That's a big part of the entire efficiencies WeakRefStrings provide.

👍

as opposed to more (breaking) type parameter gymnastics

Nah, it was not a "type parameter gymnastics". We can do much more overcomplicated things than that ;)

Actually, I agree with @nalimilan. I think the issue of WeakRefStringsArray <: AbstractArray{T} inheritance is straightforward:

I always thought that eltype() is just a convenience method to get T
T should be the type returned by simple getindex() (see What does T mean in AbstractArray{T}? JuliaLang/julia#9586). If it's not the case, then WeakRefStringsArray breaks AbstractArray concept and will have to override (and maintain) a lot of Base AbstractArray methods (similar(), copy() etc).
T should be concrete and immutable for performance reasons
T should not be WeakRefString, because if it leaks outside of WeakRefStringsArray, there would be GC-related memory corruption problems. Also, CategoricalArray is an example of a container whose elements (by design) don't have sense without the container. That introduces a lot of issues, yet CategoricalValue at least has a proper reference to the pool.

I think that leaves us with T === String.

nalimilan · 2017-12-26T21:43:13Z

See what the manual says:

Note that it's very important to specify the two parameters of the AbstractArray; the first defines the eltype(), and the second defines the ndims().

Anyway, moving from AbstractArray{String} to AbstractArray{WeakRefString} isn't more breaking than changing the result of eltype.

andreasnoack · 2017-12-28T11:39:54Z

Thanks for making these changes

What you describe is essentially the current definition of SubString, and will always be heap allocated because Array is a mutable object.

I don't think this is so simple anymore. @yuyichao has improved the compiler such that some of these allocations can be avoided. I also think that Base.String is quite a bit faster than when WeakRefStrings was created. Switching from WeakRefString so Base.String in TextParse had a quite limited effect on the performance when reading a very large and text-heavy dataset so it might be worth doing some benchmarks to figure out exactly when unsafe optimizations are really worth it.

quinnj · 2018-01-04T05:26:59Z

Alright, thanks for the encouragement @alyst and @nalimilan; I've updated the type parameters here to better reflect the AbstractArray{T} eltype.

I've also done a few rounds of benchmarking w/ parsing a single string up to reading a million strings w/ CSV; while the improvements are minor on the small scale, WeakRefStrings still really provide huge performance benefits when reading lots and lots of string values.

alyst · 2018-01-04T12:44:09Z

Great! Do you mean CSV.jl benchmarks or some internal ones? Would be nice to have some benchmark scripts in WeakRefString.jl.

Also CI is green on 0.7, but I guess if eltype() tests from #20 would be readded, it would still fail due to JuliaLang/julia#25240

quinnj · 2018-01-05T04:02:53Z

@alyst, I didn't see any issues adding the eltypes from the other PR 🤷‍♂️ .

Remove a few pitfall functions that allowed direct processing of Weak…

b807903

…RefStrings. Clarify in docs that WeakRefStrings are for IO optimization and not actual string processing. Cleanup the WeakRefStringArray type parameters and constructors

quinnj mentioned this pull request Dec 24, 2017

Correctness and performance issues for iteration #19

Closed

quinnj mentioned this pull request Dec 24, 2017

WeakRefStringArray: make eltype respect Missing #20

Closed

nalimilan reviewed Dec 24, 2017

View reviewed changes

fix getindex implementation for WeakRefStringArray

da8e40a

Fix AbstractArray eltype parameter

6304b15

quinnj added 3 commits January 4, 2018 20:25

Add more tests

17c06d7

test updates

39c51b7

Add codeunits compat for 0.7

a6c3eb3

quinnj merged commit b5d2543 into master Jan 5, 2018

quinnj deleted the jq/updates branch January 5, 2018 04:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify WeakRefStrings purpose #21

Clarify WeakRefStrings purpose #21

quinnj commented Dec 24, 2017

codecov bot commented Dec 24, 2017 •

edited

Loading

alyst commented Dec 24, 2017

nalimilan commented Dec 24, 2017 •

edited

Loading

nalimilan Dec 24, 2017

quinnj Dec 26, 2017

nalimilan Dec 24, 2017

alyst commented Dec 25, 2017 •

edited

Loading

alyst commented Dec 25, 2017

nalimilan commented Dec 25, 2017

alyst commented Dec 25, 2017

quinnj commented Dec 26, 2017

nalimilan commented Dec 26, 2017

quinnj commented Dec 26, 2017

nalimilan commented Dec 26, 2017

quinnj commented Dec 26, 2017

alyst commented Dec 26, 2017

nalimilan commented Dec 26, 2017

andreasnoack commented Dec 28, 2017

quinnj commented Jan 4, 2018

alyst commented Jan 4, 2018

quinnj commented Jan 5, 2018

Clarify WeakRefStrings purpose #21

Clarify WeakRefStrings purpose #21

Conversation

quinnj commented Dec 24, 2017

codecov bot commented Dec 24, 2017 • edited Loading

Codecov Report

alyst commented Dec 24, 2017

nalimilan commented Dec 24, 2017 • edited Loading

nalimilan Dec 24, 2017

Choose a reason for hiding this comment

quinnj Dec 26, 2017

Choose a reason for hiding this comment

nalimilan Dec 24, 2017

Choose a reason for hiding this comment

alyst commented Dec 25, 2017 • edited Loading

alyst commented Dec 25, 2017

nalimilan commented Dec 25, 2017

alyst commented Dec 25, 2017

quinnj commented Dec 26, 2017

nalimilan commented Dec 26, 2017

quinnj commented Dec 26, 2017

nalimilan commented Dec 26, 2017

quinnj commented Dec 26, 2017

alyst commented Dec 26, 2017

nalimilan commented Dec 26, 2017

andreasnoack commented Dec 28, 2017

quinnj commented Jan 4, 2018

alyst commented Jan 4, 2018

quinnj commented Jan 5, 2018

codecov bot commented Dec 24, 2017 •

edited

Loading

nalimilan commented Dec 24, 2017 •

edited

Loading

alyst commented Dec 25, 2017 •

edited

Loading