Improve copy of SubDataFrame and DataFrameRow #1564

bkamins · 2018-10-12T20:11:01Z

@nalimilan This is a follow up to #1534, and should be merged if the other is merged (I separate it as we have enough of changes there).

What it does:

change copy for SubDataFrame to ensure that it returns a DataFrame in the future (this is the same functionality as we have but with a syntax taking into account future changes to getindex that are currently in deprecation period)
introduce copy for DataFrameRow

Actually the second point is crucial as it is a simple change that is consistent with #1534 logic but shows one important thing. Should returned NamedTuple have minimal (narrowest) second parametric argument (type specification). This is what is implemented now.
Or it should have a wide type specification consistent with eltype of DataFrame column.

The difference is that either:

narrow type specification will be possibly faster when NamedTuple is passed to a function and it might be a bit faster to create (not 100% sure though);
wider type specification will mean that each NamedTuple created from a row of a given DataFrame will have the same concrete type.

I am not fully clear what is better (I have implemented first, but I have the feeling that maybe the second approach is better).

nalimilan · 2018-10-13T10:06:59Z

Thanks. I'd be inclined to use wider types instead, with the idea that it preserves more information and it makes things more "type-stable" in the sense that the type does not depend on the particular row (but of course inference doesn't know the columns types of a DataFrameRow). This is useful if you need to know what type to expect from remaining rows when allocating a column. It's possible that the wider type would be slightly less efficient (e.g. Union{T,Missing} rather than T), but taking a row in isolation doesn't make much sense.

Cc: @quinnj

nalimilan · 2018-10-16T08:06:41Z

See discussion at JuliaData/Tables.jl#28.

bkamins · 2018-10-26T20:04:13Z

@nalimilan
Given the decision in #1534 what copy on DataFrameRow should produce? A DataFrame or something else?

Also:

Base.convert(::Type{Array}, r::DataFrameRow) = convert(Array, parent(r)[row(r),:])

method seems strange given our current approach. Maybe we should deprecate it and only provide support for Vector only and then also add Vector(dfr) and Vector{T}(dfr) methods?

nalimilan · 2018-10-26T20:15:30Z

Maybe a NamedTuple would be better, as it is more similar to a DataFrameRow than a DataFrame (e.g. indexing gives a scalar).

The convert methods indeed need adjusting, but I think we'd better keep them as there's no reason to support only the constructor.

bkamins · 2018-10-26T20:29:25Z

OK.

With the convert methods the problem is that only convert(Array, dfr) now works and it produces a matrix not a vector. In particular convert(Vector, dfr) or convert(Matrix, dfr) currently fail.

nalimilan · 2018-10-26T20:45:53Z

OK. I guess we should at least support convert(Vector, dfr); it's a bit weird to see DataFrameRows as matrices.

Does this PR need anything more?

bkamins · 2018-10-26T20:58:01Z

I have added a doc string and it should be good.
I will handle conversions in a separate PR.

improve copy of SubDataFrame and DataFrameRow

011f03b

bkamins mentioned this pull request Oct 13, 2018

WIP: clean up getindex: part 2 #1534

Merged

add docstring

6075996

bkamins force-pushed the copy_cleanup branch from fd71fc4 to 6075996 Compare October 26, 2018 20:57

bkamins mentioned this pull request Oct 26, 2018

convert for DataFrameRow #1575

Merged

Add "a"

bbb272f

nalimilan approved these changes Oct 27, 2018

View reviewed changes

bkamins merged commit 10eb8d2 into JuliaData:master Nov 1, 2018

bkamins deleted the copy_cleanup branch November 1, 2018 12:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve copy of SubDataFrame and DataFrameRow #1564

Improve copy of SubDataFrame and DataFrameRow #1564

bkamins commented Oct 12, 2018

nalimilan commented Oct 13, 2018

nalimilan commented Oct 16, 2018

bkamins commented Oct 26, 2018

nalimilan commented Oct 26, 2018

bkamins commented Oct 26, 2018

nalimilan commented Oct 26, 2018 •

edited

Loading

bkamins commented Oct 26, 2018

Improve copy of SubDataFrame and DataFrameRow #1564

Improve copy of SubDataFrame and DataFrameRow #1564

Conversation

bkamins commented Oct 12, 2018

nalimilan commented Oct 13, 2018

nalimilan commented Oct 16, 2018

bkamins commented Oct 26, 2018

nalimilan commented Oct 26, 2018

bkamins commented Oct 26, 2018

nalimilan commented Oct 26, 2018 • edited Loading

bkamins commented Oct 26, 2018

nalimilan commented Oct 26, 2018 •

edited

Loading