Column types get obliterated by Query.jl #313

Open
samuela opened this issue Aug 7, 2020 · 7 comments

Comments

@samuela

samuela commented Aug 7, 2020

I have a DataFrame df with correct column types (String, Float64, etc.). However, after processing it with Query.jl, every column comes back typed as Any. Here's the suspect code snippet:

brand_code_df =
    df |>
    @groupby((_.Brand, _.Product_Code)) |>
    @map({
        Brand = key(_)[1],
        Product_Code = key(_)[2],
        WAP = sum(_.Unit_Price .* _.Units) / sum(_.Units),
        WAC = sum(_.Unit_Cost .* _.Units) / sum(_.Units),
        total_sales = sum(_.Sales),
        gross_margin = sum(_.Margin),
        GMP = sum(_.Margin) / sum(_.Sales) * 100,
        total_code_units = sum(_.Units),
        weight = sum(_.Units) / brand_total_units[key(_)[1]],
        unique_prices = unique_prices(_),
    }) |>
    DataFrame

Now, brand_code_df will have only Any column types.

OTOH I've found that doing ... |> collect |> DataFrame does in fact retain the correct column types.

@davidanthoff
Member

Is there a chance that you could a) post a short snippet that creates a DataFrame with the correct columns and just 1-2 rows of sample data, literally something like DataFrame(Brand=["asdf", "lij"], Product_Code=[3, 4]), so that I can reproduce this, and b) post the code for unique_prices?

@samuela
Author

samuela commented Aug 7, 2020

Hey @davidanthoff ! Yeah, let me see if I can come up with some mock data that have the same effect...

@extradosages

I've observed this when using @mutate.

@davidanthoff
Member

@extradosages, @mutate uses @map under the hood. Any additional data you could provide to reproduce this would be helpful.

@i-aki-y

i-aki-y commented Jan 13, 2021

Hi @davidanthoff, I encountered the same problem.
I hope this small example helps.

julia> using DataFrames
julia> using Query

julia> struct Item
           value::Union{Missing, Float64}
       end

julia> df = DataFrame(:x => [Item(1.0)])
1×1 DataFrame
 Row │ x
     │ Item
─────┼───────────
   1 │ Item(1.0)

julia> df |> @mutate(y = _.x.value) |> DataFrame
1×2 DataFrame
 Row │ x          y
     │ Any        Any
─────┼────────────────
   1 │ Item(1.0)  1.0

@i-aki-y

i-aki-y commented Feb 3, 2021

I have examined this problem further.

The problem seems to arise in the return-type estimation of the map function defined in QueryOperators.

function map(source::Enumerable, f::Function, f_expr::Expr)
    TS = eltype(source)
    T = Base._return_type(f, Tuple{TS,})
    S = typeof(source)
    Q = typeof(f)
    return EnumerableMap{T,S,Q}(source, f)
end

cf. https://github.com/queryverse/QueryOperators.jl/blob/fd7534405a5f2db2d555f4dd9e796205d7711cde/src/enumerable/enumerable_map.jl#L12

Although I'm not sure exactly what Base._return_type does, since it is undocumented, it seems to estimate the return type of the function f that builds a NamedTuple from the argument of @mutate.
That estimate fails for some kinds of input.

This is an example.

using DataFrames
using QueryOperators

struct Item1
    value::Union{Missing, Int64}
end

struct Item2
    value::Int64
end

QueryOperators.map(QueryOperators.query([Item1(1.0)]), item -> (v = item.value, ), :()) |> DataFrame |> println
#1×1 DataFrame
# Row │ v   
#     │ Any 
#─────┼─────
#   1 │ 1


QueryOperators.map(QueryOperators.query([Item2(1.0)]), item -> (v = item.value, ), :()) |> DataFrame |> println
#1×1 DataFrame
# Row │ v     
#     │ Int64 
#─────┼───────
#   1 │     1

Sorry, I'm not sure why this happens: whether it is a limitation of the language's type inference or some kind of bug. It will be difficult for me to investigate the cause any further with my current knowledge.

Anyway, I hope this will help.
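
The difference between the two structs can also be looked at without QueryOperators, by asking the compiler what it infers for the row-building function. This is only a minimal sketch, not a diagnosis of the bug. `MaybeInt`, `PlainInt`, and `to_row` are hypothetical names mirroring `Item1`, `Item2`, and the anonymous function above, and `Base.return_types` is swapped in here as a reflection helper in place of the internal `Base._return_type`.

```julia
# Hypothetical stand-ins for Item1 / Item2 from the example above.
struct MaybeInt
    value::Union{Missing, Int64}
end

struct PlainInt
    value::Int64
end

# Same shape as the function handed to QueryOperators.map above:
# it wraps the field in a one-element NamedTuple.
to_row(item) = (v = item.value,)

# Base.return_types gives one inferred return type per matching method;
# to_row has a single method, so take the first entry.
inferred_maybe = first(Base.return_types(to_row, (MaybeInt,)))
inferred_plain = first(Base.return_types(to_row, (PlainInt,)))

println("MaybeInt -> ", inferred_maybe, "  (concrete: ", isconcretetype(inferred_maybe), ")")
println("PlainInt -> ", inferred_plain, "  (concrete: ", isconcretetype(inferred_plain), ")")

# At run time every individual row still has a concrete type.
println("runtime row type: ", typeof(to_row(MaybeInt(1))))
```

With the `Union{Missing, Int64}` field, the row can be either a NamedTuple holding an `Int64` or one holding a `Missing`, so no single concrete NamedTuple type describes every row ahead of time. That would be consistent with the `Any` column seen above, although the sketch does not show how QueryOperators turns the inferred type into a column type.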

@tlamadon

tlamadon commented Feb 9, 2021

I am having the same problem when grouping on multiple columns. However, ... |> collect |> DataFrame works, so thanks for that suggestion!
