
allow creating TypedDataset from DFs with different column order #259

Merged
merged 2 commits into from
Feb 21, 2018

Conversation

mfelsche
Contributor

Such DataFrames can appear when loading partitioned datasets from e.g. Parquet.

Vanilla Spark is able to deserialize from DataFrames where the fields are in a different order.
The reshaping in TypedDataset.createUnsafe just renamed columns, so the types no longer aligned, although beforehand, in this particular case, it actually should have worked.

Given a DataFrame of ("a": A, "b": B, "c": C), we want a Dataset of case class X(b: B, c: C, a: A) from it. When using TypedDataset.createUnsafe, we get an encoder which will try to read the underlying relation/DataFrame as ("b": A, "c": B, "a": C), which will clearly fail to serialize/deserialize to X.

To work around this issue, we now check whether the target columns are a subset of the source DataFrame's columns and, if so, do a select instead of a .toDF(names), which only renames.
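The difference between renaming and selecting can be sketched without Spark. Below is a minimal, hypothetical Scala model (the names `ReorderSketch`, `selectColumns`, and `renameColumns` are illustrative, not part of frameless): a row is a name-to-value map; a select-style reorder looks each target column up by name, while a toDF-style rename keeps values in positional order and merely relabels them.

```scala
// Hypothetical sketch, not frameless code: models why a name-based select
// preserves column/value associations while a positional rename does not.
object ReorderSketch {
  type Row = Map[String, Any]

  // select-style reorder: each target column is looked up by name,
  // so values stay attached to their original columns
  def selectColumns(row: Row, targetColNames: Seq[String]): Seq[Any] =
    targetColNames.map(row)

  // toDF-style rename: values keep their positions and only the labels change,
  // so a value can end up under the wrong column name
  def renameColumns(valuesInSourceOrder: Seq[Any], targetColNames: Seq[String]): Row =
    targetColNames.zip(valuesInSourceOrder).toMap
}
```

With a row ("a" -> 1, "b" -> "x"), selecting Seq("b", "a") yields Seq("x", 1), whereas renaming the positional values Seq(1, "x") to ("b", "a") mislabels them as Map("b" -> 1, "a" -> "x"): the name/type mismatch this PR works around.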

as they can appear when loading partitioned datasets from e.g. parquet
@codecov-io

codecov-io commented Feb 19, 2018

Codecov Report

Merging #259 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #259      +/-   ##
==========================================
+ Coverage   96.56%   96.57%   +<.01%     
==========================================
  Files          51       51              
  Lines         874      876       +2     
  Branches       11       10       -1     
==========================================
+ Hits          844      846       +2     
  Misses         30       30
Impacted Files Coverage Δ
...ataset/src/main/scala/frameless/TypedDataset.scala 100% <ø> (ø) ⬆️
...ain/scala/frameless/functions/UnaryFunctions.scala 100% <0%> (ø) ⬆️
...ore/src/main/scala/frameless/CatalystOrdered.scala 100% <0%> (ø) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6aca77e...5bb8162.

@@ -1036,8 +1036,15 @@ object TypedDataset {
val shouldReshape = output.zip(targetColNames).exists {
case (expr, colName) => expr.name != colName
}

val reshaped = if (shouldReshape) df.toDF(targetColNames: _*) else df
val canSelect = targetColNames.toSet.subsetOf(output.map(_.name).toSet)
Contributor

Can you give an example where this would be false?

Contributor Author

If you removed the .toDF("b", "a") call from the test case, this would be such a case: the column names would still be _1 and _2 from the original DataFrame.
Frameless would succeed here if the types aligned.
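That scenario can be checked directly with the subsetOf condition from the diff. A tiny illustration (the object name `CanSelectCheck` is made up): with the tuple-default column names _1/_2 as the source, the target names b/a are not a subset, so canSelect is false and the select path cannot be taken.

```scala
// Illustrative check (object name is made up): the subsetOf condition from the
// diff, evaluated for the case where .toDF("b", "a") was never called.
object CanSelectCheck {
  val sourceCols = Set("_1", "_2") // tuple-default names from the original DataFrame
  val targetCols = Set("b", "a")   // names expected by the target case class
  val canSelect  = targetCols.subsetOf(sourceCols) // false: select by name impossible
}
```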

val canSelect = targetColNames.toSet.subsetOf(output.map(_.name).toSet)

val reshaped = if (shouldReshape && canSelect) {
df.select(targetColNames.head, targetColNames.tail:_*)
Contributor

It's not totally clear whether or not systematically doing a select would create a lot of additional work at runtime. What do you think?

Contributor Author

The thing is, I don't know how else to reshape the DataFrame into the correct form. I would suspect this is a rather cheap operation, as it only shifts references and is guaranteed to fetch only a subset of the existing fields.

Actually, the plan for the test case's collect() is:

Project [_1#2 AS b#11, _2#3 AS a#12]
+- LocalRelation [_1#2, _2#3]

which is just shifting the fields.

And the select is only done if the column names are not already in the right order.
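The two predicates from the diff can be exercised on plain name lists, no Spark required. This is a sketch (`ReshapeDecision` is a made-up name; the bodies mirror the shouldReshape and canSelect expressions from the diff):

```scala
// Sketch mirroring the diff's two conditions, applied to plain column-name lists.
object ReshapeDecision {
  // true when any column name differs positionally from the target
  def shouldReshape(outputNames: Seq[String], targetColNames: Seq[String]): Boolean =
    outputNames.zip(targetColNames).exists { case (out, target) => out != target }

  // true when every target name exists among the source columns,
  // i.e. a name-based select is possible
  def canSelect(outputNames: Seq[String], targetColNames: Seq[String]): Boolean =
    targetColNames.toSet.subsetOf(outputNames.toSet)
}
```

For source columns a, b, c and target b, c, a both predicates hold, so the new code takes the select branch; when the names already match positionally, shouldReshape is false and the DataFrame is left untouched.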

(a1: A, b1: B) => {
val ds = TypedDataset.create(
Vector((b1, a1))
).dataset.toDF("b", "a").as[X2[A, B]](TypedExpressionEncoder[X2[A, B]])
Contributor

Why do you have to pass this TypedExpressionEncoder explicitly?

Contributor Author

Otherwise DataFrame.as is going to use Spark's encoders.

TypedDataset.create(ds).collect().run().head ?= X2(a1, b1)
}
}
check(prop[Double, Double])
Contributor

Maybe change it to prop[X1[Double], X1[X1[Double]]] or something, because having both as Double does not check for proper column ordering :)

Contributor Author

Totally true, gonna change that.

@OlivierBlanvillain
Contributor

LGTM, thanks!

@OlivierBlanvillain OlivierBlanvillain merged commit 53fcf79 into typelevel:master Feb 21, 2018