-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow Schema field selections in DoFn using NewDoFn injection #8311
Allow Schema field selections in DoFn using NewDoFn injection #8311
Conversation
Run Java_Examples_Dataflow PreCommit |
Run Python PreCommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for taking some time to go over this with me in-person. Just a few relatively minor comments.
FieldAccessDescriptor.withFieldNames("integerField"); | ||
|
||
@FieldAccess("intsSelector") | ||
final FieldAccessDescriptor stringsSelector = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should be intsSelector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the variable name - it looks like a copy-paste error since the FieldAccess
string, intsSelector
, doesn't match the variable name, stringSelector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed variable name
@Override | ||
public int hashCode() { | ||
return Objects.hash(integerField, stringField); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use AutoValue here instead of implementing equals and hashCode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good suggestion, done. Note that this uncovered a bug in schema inference that I fixed.
public void process( | ||
@FieldAccess("inner.*") PojoForExtraction pojo, | ||
@FieldAccess("inner") PojoForExtraction pojo2, | ||
@FieldAccess("inner.stringField") String stringField, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if a FieldAccess
has a comma-separated list of selectors? like "inner.stringField, inner.integerField"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the grammar for our expression parser doesn't include commas, so it will fail with parse error.
@@ -136,7 +136,7 @@ | |||
InputT element(DoFn<InputT, OutputT> doFn); | |||
|
|||
/** Provide a link to the input element. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the docstring here to explain what index is? It might also be useful to update the element
docstring as well - "Provide a link to the input element." seems misleading, from what I can tell the implementations should just return a reference to the element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
abstract DoFnSchemaInformation build(); | ||
} | ||
|
||
public abstract Builder toBuilder(); | ||
|
||
public <T> DoFnSchemaInformation withElementParameterSchema(SchemaCoder<T> schemaCoder) { | ||
return toBuilder().setElementParameterSchema(schemaCoder).build(); | ||
DoFnSchemaInformation withElementParameterSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind adding a docstring here and on withUnboxParameter
? I think that would help a lot with readability. It doesn't need to be exhaustive, but there are a couple of questions I had to work through by reading the code that I think could be clarified here:
- Can
elementCoder
be consideredoutputCoder
(i.e. the coder to use for selectOutputSchema), or is it something else? - What's the difference between
withElementParameterSchema
withunbox = true
andwithUnboxParameter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Run Dataflow ValidatesRunner |
Run Dataflow Runner Nexmark Tests |
Run Dataflow ValidatesRunner |
Run Dataflow Runner Nexmark Tests |
Run Dataflow ValidatesRunner |
1 similar comment
Run Dataflow ValidatesRunner |
Run Dataflow Runner Nexmark Tests |
author to select fields as a POJO or a primitive type.
3b6db54
to
bca3798
Compare
Run Dataflow ValidatesRunner |
Run Dataflow ValidatesRunner |
Run Java PreCommit |
2 similar comments
Run Java PreCommit |
Run Java PreCommit |
@TheNeuralBit I believe I've address all comments |
Run Java PreCommit |
Yep! LGTM It looks like the last test failure was a timeout, so I guess this is probably still safe to merge? |
@TheNeuralBit I already got a green Java PreCommit run, and as you said the rerun was a timeout. |
Run Dataflow Runner Nexmark Tests |
This change enables selecting fields out of input schemas directly in a DoFn. Fields can automatically be converted to user types if the types match.
For example, if a input schema contained a latitude and a longitude field, you could write:
ParDo.of(new DoFn<>() {
@ProcessElement public void process(
@FieldAccess("latitude") double latitude, @FieldAccess('longitude") double longitude) {
}
});
And Beam will automatically extract those fields from the input schema and inject just them into the dofn.
Beam will also automatically convert to user types. If the input schema has a nested row that matches a custom user type's schema, the user can use that Java type in their DoFn parameter and Beam will convert the field to that type.
R: @TheNeuralBit