[BEAM-7174] Add schema modification transforms #8425

reuvenlax · 2019-04-29T15:44:52Z

Add three transforms for modifying schemas:
AddFields: Adds new fields to a schema. Existing rows are padded with null values in the positions of the new fields (alternatively, the user can specify a default value).
DropFields: Drops fields from a schema.
RenameFields: Renames schema fields.

reuvenlax · 2019-05-03T20:16:15Z

R: @robinyqiu

robinyqiu · 2019-05-07T18:46:13Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/FieldAccessDescriptor.java

+ .collect(Collectors.toList());
+ List<String> nestedSelectors =
+ getNestedFieldsAccessed().entrySet().stream()
+ .map(e -> e.getKey().getFieldName() + "." + e.getValue().toString())


If there are multiple sub-descriptors under the same field (e.g. a.b and a.c), won't this return something like "a.b, c"?
To make the string look prettier, how about we simply add brackets around a sub-descriptor if it does not satisfy referencesSingleField (i.e. return something like "a.[b, c]")?

Currently, if you have a.b and a.c, this will return a.b, a.c.
We could make it return a.[b,c], but that might get confusing with multiple nested descriptors. For example, a.b.c.d, a.b.c.e, a.c.f, a.g

would be printed as a.[b.[c.[d, e] , f], g]

whereas it's probably more readable as simply a.b.c.d, a.b.c.e, a.c.f, a.g
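To make the two candidate formats concrete, here is a minimal, self-contained sketch. It does not use Beam's FieldAccessDescriptor: the `Node` class and both render methods are hypothetical stand-ins. The bracketed renderer is only one possible rendering (it collapses single-child chains), so its output is not identical to the string quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a tree of accessed fields; not Beam's FieldAccessDescriptor. */
public class SelectorFormats {
  static final class Node {
    // Insertion-ordered so output follows the order in which paths were added.
    final Map<String, Node> children = new LinkedHashMap<>();

    /** Registers a dotted path such as "a.b.c.d". */
    void add(String path) {
      Node current = this;
      for (String part : path.split("\\.")) {
        current = current.children.computeIfAbsent(part, k -> new Node());
      }
    }
  }

  /** Flat format: one fully qualified path per leaf, e.g. "a.b, a.c". */
  static String flat(Node root) {
    List<String> out = new ArrayList<>();
    collect(root, "", out);
    return String.join(", ", out);
  }

  private static void collect(Node node, String prefix, List<String> out) {
    for (Map.Entry<String, Node> e : node.children.entrySet()) {
      String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
      if (e.getValue().children.isEmpty()) {
        out.add(path);
      } else {
        collect(e.getValue(), path, out);
      }
    }
  }

  /** Bracketed format: siblings grouped under their parent, e.g. "a.[b, c]". */
  static String bracketed(Node root) {
    return root.children.entrySet().stream()
        .map(e -> e.getValue().children.isEmpty()
            ? e.getKey()
            : e.getKey() + "." + group(e.getValue()))
        .collect(Collectors.joining(", "));
  }

  /** Wraps the children in brackets only when there is more than one of them. */
  private static String group(Node node) {
    String inner = bracketed(node);
    return node.children.size() == 1 ? inner : "[" + inner + "]";
  }

  public static void main(String[] args) {
    Node root = new Node();
    for (String p : new String[] {"a.b.c.d", "a.b.c.e", "a.c.f", "a.g"}) {
      root.add(p);
    }
    System.out.println(flat(root));      // a.b.c.d, a.b.c.e, a.c.f, a.g
    System.out.println(bracketed(root)); // a.[b.c.[d, e], c.f, g]
  }
}
```

The flat form repeats prefixes but each entry can be read on its own; the bracketed form is shorter but requires the reader to track nesting depth.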

robinyqiu · 2019-05-07T18:49:49Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/FieldAccessDescriptor.java

+ /**
+ * Return the field names accessed. Should not be called until after {@link #resolve} is called.
+ */
+ public Set<String> fieldNamesAccessed() {


Consider adding some unit tests for these new functions?

robinyqiu · 2019-05-07T23:49:38Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/AddFields.java

+ * .field("userDetails.isSpecialUser", "FieldType.BOOLEAN", false));
+ * }</pre>
+ */
+public class AddFields {


Add @Experimental (same for the other transforms).

Good catch. Done.

robinyqiu · 2019-05-07T23:55:09Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/DropFields.java

+ FieldAccessDescriptor selectDescriptor =
+ complement(inputSchema, fieldsToDrop.resolve(inputSchema));
+
+ return Select.<T>fieldAccess(selectDescriptor).expand(input);


Is there a reason why we call the expand function here explicitly, instead of doing
return input.apply(Select.<T>fieldAccess(selectDescriptor));
as usual?

No, input.apply is the preferred way. Changed.

robinyqiu · 2019-05-09T18:15:08Z

sdks/java/core/src/test/java/org/apache/beam/sdk/schemas/transforms/DropFieldsTest.java

+ @Category(NeedsRunner.class)
+ public void testDropNestedField() {
+ Schema expectedSchema =
+ Schema.builder().addStringField("string").addStringField("field2").build();


The original schema here is:
nested: ROW[field1: int, field2: string], string: string
After dropping nested.field1, the output schema becomes:
string: string, field2: string.

I have two questions here:

1. The order of the fields has changed (string now comes before field2). Is this the intended behavior? I see this is an implementation detail of the SelectHelpers.union function (it merges the directly accessed fields first and then the nested fields accessed).

2. nested.field2 is "unwrapped" to field2. I agree with making nested.field2 a top-level field in the output schema, but I think we need to do some work to name the field properly (e.g. name it nested.field2 instead of only field2). Otherwise this may result in unexpected behavior: e.g. if the second top-level field in the example schema were also named field2, there would be a naming conflict and the output schema construction would fail.

Either way, I think we need clear documentation (and better unit tests) of the intended behavior in the javadoc on Select, in addition to SelectHelpers.

You're absolutely correct. The problem actually is that FieldAccessDescriptor stores top-level fields and nested fields separately so the transforms can't tell the original order. This is something we need to fix, and will require redoing FieldAccessDescriptor a bit. I'll file a JIRA as it's out of scope of this PR, and also affects the Select transform.

This is a current weakness of the Select transform. select("a.b", "c.b") currently doesn't work. We need to add select(x).as(y) functionality. I think there might already be a JIRA for this, if not I'll add one.
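The naming conflict discussed in this thread can be illustrated with a small, self-contained sketch. It only models the idea of naming each selected field by its last path segment; the class and method names are hypothetical, and this is not the code in Select or SelectHelpers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of leaf-name flattening; not Beam's Select implementation. */
public class FlattenedNames {
  /** Names each selected path by its last segment only and rejects duplicate names. */
  static List<String> leafNames(List<String> selectedPaths) {
    List<String> names = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String path : selectedPaths) {
      // lastIndexOf returns -1 for a top-level field, so substring(0) keeps the whole name.
      String leaf = path.substring(path.lastIndexOf('.') + 1);
      if (!seen.add(leaf)) {
        throw new IllegalArgumentException("Duplicate output field name: " + leaf);
      }
      names.add(leaf);
    }
    return names;
  }

  public static void main(String[] args) {
    // Mirrors the test under review: nested.field2 surfaces as field2.
    System.out.println(leafNames(Arrays.asList("string", "nested.field2")));
    try {
      // Two different nested fields that share the leaf name "b".
      leafNames(Arrays.asList("a.b", "c.b"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A select(x).as(y) style alias, as proposed above, would let the caller supply distinct output names for the second case.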

robinyqiu · 2019-05-10T18:39:44Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/RenameFields.java

+ * PCollection<Row> renamedEvents =
+ * events.apply(RenameFields.<Event>create()
+ * .rename("userName", "userId")
+ * .rename("location.country", "location.countryCode"));


The implementation seems to differ from what is described here. I think this line should be:
.rename("location.country", "countryCode")

It would be good to add a one-line comment here to make it clear that, when renaming a nested field, users don't need to specify the prefix in the new name.
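A minimal sketch of the semantics described in this comment, assuming the new name is a bare leaf name and the field keeps its position. The schema is modeled as nested insertion-ordered maps; `RenameSketch` and its `rename` method are hypothetical and are not the RenameFields implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of renaming a (possibly nested) field; not Beam's RenameFields. */
public class RenameSketch {
  /**
   * Returns a copy of the schema in which the field at the dotted path carries newName.
   * A value is either a type name (String) or a nested schema (Map); the path must
   * only descend through nested schemas.
   */
  @SuppressWarnings("unchecked")
  static Map<String, Object> rename(Map<String, Object> schema, String path, String newName) {
    int dot = path.indexOf('.');
    String head = dot < 0 ? path : path.substring(0, dot);
    if (!schema.containsKey(head)) {
      throw new IllegalArgumentException("No such field: " + head);
    }
    Map<String, Object> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : schema.entrySet()) {
      if (!e.getKey().equals(head)) {
        result.put(e.getKey(), e.getValue());
      } else if (dot < 0) {
        // Last path segment: swap the name in place so field order is preserved.
        result.put(newName, e.getValue());
      } else {
        // Descend into the nested schema with the remainder of the path.
        Map<String, Object> nested = (Map<String, Object>) e.getValue();
        result.put(head, rename(nested, path.substring(dot + 1), newName));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> location = new LinkedHashMap<>();
    location.put("country", "STRING");
    location.put("city", "STRING");
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("userName", "STRING");
    event.put("location", location);

    Map<String, Object> renamed = rename(event, "userName", "userId");
    // The new name carries no "location." prefix.
    renamed = rename(renamed, "location.country", "countryCode");
    System.out.println(renamed); // {userId=STRING, location={countryCode=STRING, city=STRING}}
  }
}
```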

robinyqiu · 2019-05-10T22:45:03Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/AddFields.java

+ public static class Inner<T> extends PTransform<PCollection<T>, PCollection<Row>> {
+ /** Internal object representing a new field added. */
+ @AutoValue
+ abstract static class NewField implements Serializable {


Why are we putting NewField and AddFieldsInformation inside Inner?

No specific reason. These are effectively private classes, so it doesn't change the interface (we can't make them actually private due to AutoValue).

robinyqiu · 2019-05-12T05:33:30Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/AddFields.java

+ return new Inner<>(fields);
+ }
+
+ private AddFieldsInformation getAddFieldsInformation(


I believe these functions can be static.

robinyqiu · 2019-05-12T06:13:24Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/AddFields.java

+ builder.addField(field);
+ }
+
+ // Add any new fields at this level.


Reading the implementation here, I see that top-level fields are added before nested fields. Can we document this in the class-level javadoc? Otherwise this may be confusing to users, e.g. after calling
pc.apply(AddFields.create().field("new1.nested", FieldType.STRING).field("new2", FieldType.INT32))
users may expect a new schema in which new1 appears before new2, which is not the case.

Will talk about this more generally below.

robinyqiu · 2019-05-12T06:37:21Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/AddFields.java

+ // If there are brand new simple (i.e. have no nested values) fields at this level, then add
+ // the default values for all of them.
+ newValues.addAll(addFieldsInformation.getDefaultValues());
+ // If we are creating new recursive fields, populate new values for the here.


robinyqiu · 2019-05-12T06:57:08Z

Hi Reuven, thank you for your patience. Reviewing this PR took some time. I added some comments in here.

A major problem I see here is that the order of the fields in the output schema of AddFields, DropFields (and Select) is not well specified. It depends on various implementation details, like whether the field is a nested field or a top-level field. Sometimes the order of fields in the output schema could be unexpected. Can we have better-defined behavior for these transforms, and properly document it as well?

reuvenlax · 2019-05-14T14:58:32Z

About your concern w.r.t. field ordering:
In general I agree with you. Transforms like Select etc. should return schemas with the fields in the order selected, and transforms like DropFields and RenameFields should preserve the original field order. This requires redoing FieldAccessDescriptor, as it currently doesn't have enough information to preserve this (since it stores top-level fields and nested fields separately). This is a general bug, and I just filed BEAM-4076 for us to fix it.

For the AddFields transform it requires a bit more thought. If the user calls AddFields("a.b", "c", "a.d"), then we cannot preserve the exact order, as a is specified both as the first and the third field. IMO we should let the first one win and add in the order [a.[b, d], c].
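The "first one wins" rule proposed here can be sketched with an insertion-ordered map. This is only an illustration of the proposed ordering under that assumption, not the AddFields implementation; `FirstOccurrenceOrder` and `group` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of ordering added fields by first occurrence of their top-level name. */
public class FirstOccurrenceOrder {
  /** Groups dotted paths by top-level field, keeping the position of its first mention. */
  static Map<String, List<String>> group(List<String> paths) {
    // LinkedHashMap keeps keys in first-insertion order, which is the "first one wins" rule.
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (String path : paths) {
      int dot = path.indexOf('.');
      String top = dot < 0 ? path : path.substring(0, dot);
      List<String> nested = grouped.computeIfAbsent(top, k -> new ArrayList<>());
      if (dot >= 0) {
        nested.add(path.substring(dot + 1));
      }
    }
    return grouped;
  }

  public static void main(String[] args) {
    // a is mentioned first and third; it keeps its first position and collects b and d.
    System.out.println(group(Arrays.asList("a.b", "c", "a.d"))); // {a=[b, d], c=[]}
  }
}
```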

robinyqiu · 2019-05-15T00:04:38Z

LGTM. Thanks for the fix.

BEAM-7301 is tracking the remaining field ordering issue, and I am happy to work on the fix for that once I can find some time.

Add new schema transforms.

ad66a04

reuvenlax force-pushed the add_schema_modification_transforms branch from ba95f57 to ad66a04 Compare April 29, 2019 21:04

reuvenlax changed the title ~~Add schema modification transforms~~ [BEAM-7174] Add schema modification transforms Apr 29, 2019

Reuven Lax added 2 commits May 2, 2019 11:03

foo

a29efd1

fixes

987382e

Reuven Lax added 2 commits May 3, 2019 13:20

fix imports

b752d6a

fix checkstyle

2823464

robinyqiu reviewed May 12, 2019

Address comments.

b7509ed

reuvenlax merged commit c630217 into apache:master May 15, 2019

charithe pushed a commit to shehzaadn-vd/vend-beam that referenced this pull request May 16, 2019

Merge pull request apache#8425: [BEAM-7174] Add schema modification t…

ab7da29

…ransforms

ajamato pushed a commit to ajamato/beam that referenced this pull request May 18, 2019

Merge pull request apache#8425: [BEAM-7174] Add schema modification t…

96b5a4b

…ransforms
