Store logical type values in Row instead of base values #11074

reuvenlax · 2020-03-09T04:30:35Z

@kanterov @alexvanboxel I believe this PR will fix the issues you've both had with logical types.

After some thought, I think we're better off storing the logical type value in the Row object. It potentially makes SchemaCoder a tiny bit slower, but this is unlikely to be noticeable. One bonus: storage for OneOf types becomes far more memory efficient, as we no longer need to store the entire row.

reuvenlax · 2020-03-09T04:43:17Z

R: @kanterov

alexvanboxel · 2020-03-09T07:23:28Z

I can only review the PR this evening, but the first question that comes to mind is:
What traits does the logical type need in order to play nicely with Beam?

Does it need to be comparable?
Does equals need to behave properly?
We have control over these things when storing and working with the base type.

kanterov

Overall, the idea looks good. We have room to keep using base values in some implementations of Row, which would have a specialized Coder implementation. getValue being "cheap" is what I would expect from such an API as a user, and that is what makes it better than the existing approach.

I've left a few comments, but I didn't have time to go through some pieces (e.g. the ByteBuddy changes).

kanterov · 2020-03-09T12:56:05Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoderHelpers.java

+ .put(TypeName.BOOLEAN, BooleanCoder.of())
+ .build();
+
+ private static class LogicalTypeCoder extends Coder {


There is similar code in cb3dce0#diff-7ed3da1c0f656d0645423b9dab4da881R31

There are a few differences that might make sense to incorporate:

don't use raw types, for better type safety

override structuralValue

explicitly override consistentWithEquals

I wish we could do something better for consistentWithEquals, but off the top of my head, it would be unexpected to add any of this information to LogicalType

having registerByteSizeObserver and isRegisterByteSizeObserverCheap might make sense

getCoderArguments could return baseCoder

Made these changes.

We should probably update SchemaCoder as well to do this properly, but that can be in a different PR IMO.
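
For illustration, the suggestions above can be sketched as a coder that converts to the base type and then delegates to the base coder. This is a minimal sketch and not the code in this PR: `SketchCoder`, `SketchLogicalType`, `SketchLongCoder`, and `SketchCoderDemo` are hypothetical stand-ins for Beam's `Coder` and `LogicalType`, and returning `false` from `consistentWithEquals` is an assumed conservative choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; these are NOT Beam's Coder or LogicalType.
interface SketchCoder<T> {
  void encode(T value, OutputStream out) throws IOException;
  T decode(InputStream in) throws IOException;
  boolean consistentWithEquals();
  Object structuralValue(T value);
  List<? extends SketchCoder<?>> getCoderArguments();
}

interface SketchLogicalType<InputT, BaseT> {
  BaseT toBaseType(InputT input);
  InputT toInputType(BaseT base);
}

// Base coder: writes a long as 8 big-endian bytes.
final class SketchLongCoder implements SketchCoder<Long> {
  @Override public void encode(Long value, OutputStream out) throws IOException {
    new DataOutputStream(out).writeLong(value);
  }
  @Override public Long decode(InputStream in) throws IOException {
    return new DataInputStream(in).readLong();
  }
  @Override public boolean consistentWithEquals() { return true; }
  @Override public Object structuralValue(Long value) { return value; }
  @Override public List<? extends SketchCoder<?>> getCoderArguments() {
    return Collections.emptyList();
  }
}

// Fully parameterized (no raw types): converts to the base type, then delegates.
final class SketchLogicalTypeCoder<InputT, BaseT> implements SketchCoder<InputT> {
  private final SketchLogicalType<InputT, BaseT> logicalType;
  private final SketchCoder<BaseT> baseCoder;

  SketchLogicalTypeCoder(
      SketchLogicalType<InputT, BaseT> logicalType, SketchCoder<BaseT> baseCoder) {
    this.logicalType = logicalType;
    this.baseCoder = baseCoder;
  }
  @Override public void encode(InputT value, OutputStream out) throws IOException {
    baseCoder.encode(logicalType.toBaseType(value), out);
  }
  @Override public InputT decode(InputStream in) throws IOException {
    return logicalType.toInputType(baseCoder.decode(in));
  }
  // Assumption: nothing is known about InputT.equals, so answer conservatively.
  @Override public boolean consistentWithEquals() { return false; }
  // Structural comparison goes through the base value and its coder.
  @Override public Object structuralValue(InputT value) {
    return baseCoder.structuralValue(logicalType.toBaseType(value));
  }
  @Override public List<? extends SketchCoder<?>> getCoderArguments() {
    return Collections.singletonList(baseCoder);
  }
}

final class SketchCoderDemo {
  // Example logical type: a Duration stored as milliseconds.
  static SketchLogicalTypeCoder<Duration, Long> durationCoder() {
    SketchLogicalType<Duration, Long> millis = new SketchLogicalType<Duration, Long>() {
      @Override public Long toBaseType(Duration input) { return input.toMillis(); }
      @Override public Duration toInputType(Long base) { return Duration.ofMillis(base); }
    };
    return new SketchLogicalTypeCoder<>(millis, new SketchLongCoder());
  }

  // Encodes and then decodes a value, wrapping checked I/O errors.
  static <T> T roundTrip(SketchCoder<T> coder, T value) {
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      coder.encode(value, out);
      return coder.decode(new ByteArrayInputStream(out.toByteArray()));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, `SketchCoderDemo.roundTrip(SketchCoderDemo.durationCoder(), Duration.ofMillis(1500))` encodes the base value and rebuilds an equal `Duration` on decode.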

reuvenlax · 2020-03-09T22:53:11Z

@alexvanboxel Yes, that is exactly true. This means that the logical type value must now implement equals and hashCode, whereas before it didn't (because we stored only the base value).

We could work around this by always converting to the base type inside of equals and hashCode, though that might be a bit expensive. However, value types without a proper equals are generally discouraged in Java, so I think we can simply document that equals is required.

alexvanboxel · 2020-03-10T07:46:27Z

> We could work around this by always converting to the base type inside of equals and hashCode, though that might be a bit expensive. However, value types without a proper equals are generally discouraged in Java, so I think we can simply document that equals is required.

Documenting the requirement seems reasonable.
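
As a rough illustration of that requirement, the sketch below shows a value class with `equals` and `hashCode`, and why a container that stores the value directly depends on them. `SketchMoney` and `SketchRowEquality` are hypothetical names, and a plain `List` stands in for a Row's field values; none of this is Beam API.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical logical-type value class, for illustration only.
final class SketchMoney {
  private final String currency;
  private final long cents;

  SketchMoney(String currency, long cents) {
    this.currency = Objects.requireNonNull(currency);
    this.cents = cents;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof SketchMoney)) {
      return false;
    }
    SketchMoney that = (SketchMoney) other;
    return cents == that.cents && currency.equals(that.currency);
  }

  @Override
  public int hashCode() {
    return Objects.hash(currency, cents);
  }
}

final class SketchRowEquality {
  // A plain List stands in for a Row's field values. List.equals compares
  // elements with equals(), so it only gives value semantics if the stored
  // objects implement equals and hashCode themselves.
  static boolean sameFields(List<Object> left, List<Object> right) {
    return left.equals(right) && left.hashCode() == right.hashCode();
  }
}
```

Without the two overrides, two separately constructed `SketchMoney("EUR", 250)` instances would compare by identity and the lists holding them would not be equal.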

reuvenlax · 2020-03-11T16:10:55Z

Run Java PreCommit

reuvenlax · 2020-03-19T04:28:42Z

@alexvanboxel @kanterov any more comments?

@alexvanboxel this will conflict badly with your PR I think (as you moved code into a new file, which makes merges tricky) so we need to be quite careful with how we merge these PRs.

alexvanboxel · 2020-03-19T10:45:50Z

> @alexvanboxel this will conflict badly with your PR I think (as you moved code into a new file, which makes merges tricky) so we need to be quite careful with how we merge these PRs.

No more comments; I like the consistency. We're using some logical types in our pipelines, so I will make a custom build before the release to see if all goes well.

I merged my PR into master as soon as I saw the LGTM, so I think the easiest thing is to rebase this branch onto master and handle the schema conflicts. I'll review as soon as this is done.

reuvenlax · 2020-03-19T22:49:40Z

@alexvanboxel I've rebased on top of your PR now. The only conflicts were in Row.java, which I've now handled.

alexvanboxel

LGTM except for a small comment, but we need to realize this is a breaking change for pipelines that use logical types. Some schema-aware IOs will probably break as well.

alexvanboxel · 2020-03-22T09:54:51Z

...ava/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamComplexTypeTest.java

- .apply(Create.of(row))
- .setRowSchema(outputRowSchema)
+ .apply(Create.of(row).withRowSchema(inputRowSchema))
+ // .setRowSchema(outputRowSchema)


Remove the commented-out line.

reuvenlax · 2020-03-23T05:43:23Z

Run SQL Postcommit

reuvenlax · 2020-03-23T05:55:19Z

Run SQL Postcommit

reuvenlax · 2020-03-23T06:54:42Z

Run Java PreCommit

reuvenlax · 2020-03-23T06:58:57Z

Run Java PreCommit

kanterov · 2020-03-23T12:25:02Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/SchemaCoderHelpers.java

+ public void encode(InputT value, OutputStream outStream) throws CoderException, IOException {
+ BaseT baseType = logicalType.toBaseType(value);
+ if (isDateTime) {
+ baseType = (BaseT) ((ReadableInstant) baseType).toInstant();


Why is this check needed?

It's to maintain the current invariant that any ReadableInstant can be passed in, while the current InstantCoder requires an Instant. This used to be enforced in the Row builder because we would call toBaseType there.

We need to redo DateTime types I think (really we have a timestamp type, not a datetime type), but until then I wanted to maintain the existing behavior.

Thanks. Makes sense. Agree that DateTime types need to be redone.
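
The normalization step discussed here can be sketched roughly as follows. This is an analogy only: it uses `java.time` in place of Joda-Time, with `TemporalAccessor` playing the role of `ReadableInstant`, and `SketchInstantNormalizer` is a hypothetical helper, not part of Beam.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Instant;
import java.time.temporal.TemporalAccessor;

final class SketchInstantNormalizer {
  // Accepts any temporal value that carries an instant (Instant, ZonedDateTime,
  // OffsetDateTime, ...) and normalizes it before handing it to an encoder that
  // only understands Instant. Instant.from throws DateTimeException otherwise.
  static byte[] encodeMillis(TemporalAccessor value) {
    Instant instant = Instant.from(value);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeLong(instant.toEpochMilli());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

Under these assumptions, an `Instant` and a `ZonedDateTime` that denote the same point in time produce identical bytes, which mirrors the invariant that callers may pass any instant-like value.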

reuvenlax · 2020-03-23T16:20:43Z

Run Java PreCommit

reuvenlax · 2020-03-23T17:26:59Z

Run Java PreCommit

reuvenlax · 2020-03-23T18:43:41Z

Run Java PreCommit

reuvenlax · 2020-03-23T20:00:28Z

Run Java PreCommit

reuvenlax · 2020-03-23T21:25:39Z

Run Java PreCommit

reuvenlax · 2020-03-23T22:19:45Z

Run Java PreCommit

reuvenlax · 2020-03-24T00:38:08Z

After 8 runs, the only Java Precommit failures have been random flakes (e.g. in Flink tests).

mosche · 2022-03-03T14:50:41Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/RowWithGetters.java

+ return (T) oneOfType.createValue(oneOfValue.getCaseType(), convertedOneOfField);
+ } else if (type.getTypeName().isLogicalType()) {
+ // Getters are assumed to return the base type.
+ return (T) type.getLogicalType().toInputType(fieldValue);


@reuvenlax Wondering about handling of logical types here: I don't think it matters too much as I couldn't find any usage of logical types in the various GetterBasedSchemaProviders. But I ran into this when testing various approaches.

When using a logical type with RowWithGetters, getterTarget will contain a corresponding field of input type. The getter will have to convert that field value to base type to match this assumption, just to convert it back to input type here. And vice versa for the setter here.

Wouldn't it be more performant to expect getters to return the input type?

Alternatively, I wonder if we could add a way to register getters that are assumed to return the input type?

There was a past world where the getters supported returning either the logical type or the base type. This was supposed to be whatever the previous transform provided (or the base type if coming from a coder). The goal was to save a conversion from base type to logical type for values that were just passed along. I believe the only use case was SQL, and I removed that in #13930 because it was broken and, as you notice here, more expensive.

It should be possible to do more simplification such that we always work with logical types instead of base types. Eventually we will want a pass-through optimization again, but it will need to be at a lower level than logical types. (The only expensive type today is String, which isn't a logical type.)
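
The round trip described in this exchange can be sketched with a counter. This is a simplified model under stated assumptions, not the RowWithGetters code: `SketchGetterRoundTrip`, `Settings`, and the two read methods are hypothetical, and a `Duration` stored as milliseconds stands in for an arbitrary logical type.

```java
import java.time.Duration;
import java.util.function.Function;

final class SketchGetterRoundTrip {
  // Counts conversions so the two getter contracts can be compared.
  static int conversions = 0;

  static Long toBaseType(Duration input) {
    conversions++;
    return input.toMillis();
  }

  static Duration toInputType(Long base) {
    conversions++;
    return Duration.ofMillis(base);
  }

  // Hypothetical user object whose field already holds the input type.
  static final class Settings {
    final Duration timeout;

    Settings(Duration timeout) {
      this.timeout = timeout;
    }
  }

  // Contract A: the getter must return the base type, so the caller converts back.
  static Duration readViaBaseTypeGetter(Settings target) {
    Function<Settings, Long> getter = s -> toBaseType(s.timeout);
    return toInputType(getter.apply(target));
  }

  // Contract B: the getter returns the input type and no conversion is needed.
  static Duration readViaInputTypeGetter(Settings target) {
    Function<Settings, Duration> getter = s -> s.timeout;
    return getter.apply(target);
  }
}
```

In this model a read under contract A costs two conversions per field access, while contract B costs none, which is the overhead the question above points at.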

probot-autolabeler bot added the java label Mar 9, 2020

reuvenlax requested a review from alexvanboxel March 9, 2020 04:43

kanterov reviewed Mar 9, 2020

probot-autolabeler bot added extensions protobuf sql labels Mar 10, 2020

reuvenlax force-pushed the redo_logical_type_storage branch from 40b9b65 to 71ea5fc Compare March 19, 2020 22:48

alexvanboxel approved these changes Mar 22, 2020

reuvenlax force-pushed the redo_logical_type_storage branch from 71ea5fc to 4d27173 Compare March 23, 2020 05:43

Reuven Lax added 2 commits March 22, 2020 22:46

store logical type values in Row instead of base values

f11a6b8

fix spotbugs

6e77596

reuvenlax force-pushed the redo_logical_type_storage branch from 4d27173 to 6e77596 Compare March 23, 2020 05:55

kanterov reviewed Mar 23, 2020

reuvenlax merged commit fc6cef9 into apache:master Mar 24, 2020

robinyqiu mentioned this pull request Jun 29, 2020

[BEAM-10224] Test group by and aggregation on DATE and TIME type #12120

Merged

apilloud mentioned this pull request Apr 6, 2021

[BEAM-9379] Simplify BeamCalcRel inputs #13930

Merged

mosche reviewed Mar 3, 2022

mosche mentioned this pull request Mar 23, 2022

[BEAM-13416] Introduce Schema provider for AWS model classes extending SdkPojo #16947

Merged
