[BEAM-4461] Support inner and outer style joins in CoGroup. #7353

reuvenlax · 2018-12-26T20:53:48Z

Multiple improvements to the schema CoGroup transform:

Allow the user to use strings instead of TupleTags. TupleTags existed to make Java type inference work, and this is not needed with the schema-based join as the types are in the schema. This also allows a simpler builder for PCollectionTuple.
Instead of multiple CoGroup.byFieldNames, byFieldIds, etc. the new syntax is CoGroup.join(By.fieldNames), CoGroup.join(By.fieldIds), etc. This shrinks the API surface area, and also provides a place to provide per-input options (used for outer joins).
Add a .crossProductJoin. This expands the grouped iterables into their cross product, which gives inner-join results by default. For example:
PCollection<Row> innerJoined = inputs.apply(
    CoGroup.join("input1", By.fieldNames("user"))
        .join("input2", By.fieldNames("user"))
        .crossProductJoin());
Each input can be marked for "outer-join" participation. This means that if no records for that input are present for a join key, an output is still generated from the cross product, with the value for that input replaced by a null. This generalizes normal left/right/full outer joins to N inputs. For example, with two inputs:
PCollection<Row> leftOuterJoined = inputs.apply(
    CoGroup.join("input1", By.fieldNames("user").withOuterJoinParticipation())
        .join("input2", By.fieldNames("user"))
        .crossProductJoin());
R: @dpmills
R: @akedin
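
The participation semantics described above can be illustrated without Beam. The following is a minimal sketch in plain Java, assuming the values for a single join key have already been grouped into one list per input. The class and method names (ParticipationSketch, expand) are hypothetical and are not part of this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ParticipationSketch {
  /**
   * Expands the values seen for one join key on two inputs into output pairs.
   * An input flagged for outer-join participation contributes a single null
   * when it has no values for the key; an unflagged empty input yields no output.
   */
  static List<List<String>> expand(
      List<String> input1, boolean outer1, List<String> input2, boolean outer2) {
    List<String> left =
        (input1.isEmpty() && outer1) ? Collections.<String>singletonList(null) : input1;
    List<String> right =
        (input2.isEmpty() && outer2) ? Collections.<String>singletonList(null) : input2;

    List<List<String>> rows = new ArrayList<>();
    for (String l : left) {
      for (String r : right) {
        rows.add(Arrays.asList(l, r)); // one output row per combination
      }
    }
    return rows;
  }

  public static void main(String[] args) {
    List<String> none = Collections.emptyList();
    // No flags set: a key missing from input2 produces nothing.
    System.out.println(expand(Arrays.asList("a1"), false, none, false));
    // input2 flagged: the missing side is filled with null.
    System.out.println(expand(Arrays.asList("a1"), false, none, true));
  }
}
```

Flagging every input gives the N-way analogue of a full outer join, and flagging none gives a plain inner join.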

reuvenlax · 2019-01-16T06:35:51Z

@akedin any comments on this PR?

reuvenlax · 2019-02-13T16:26:13Z

@kanterov do you have any time to help review this PR?

kanterov · 2019-02-13T18:38:03Z

@reuvenlax I'm going on vacation, but I will review it the week of February 25th.

reuvenlax · 2019-02-25T05:16:57Z

@kanterov thank you very much!

kanterov · 2019-02-27T20:54:56Z

I'm looking into it.

kanterov

The code looks good. It would be useful for abstracting over tuple cardinality in SQL transforms and for fixing issues such as BEAM-5049.

I've gone through the Javadoc carefully and it's great; my comments there are mostly nitpicks.

One case I didn't find covered is a PCollectionTuple that doesn't contain all of the tags named in JoinArguments. It would also be interesting to see how performance compares to join-library; however, as I see it, the focus at the moment is on stabilizing the APIs.

I think the TODO about a lazy Row implementation for iterating over CoGbkResult is worth a JIRA ticket.

kanterov · 2019-02-27T20:44:42Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ * identical between input1 and input2.
+ *
+ * <p>This transform also supports outer-join semantics. By default, all input PCollections must
+ * participate fully in the join, providing inner-join semantics. This means that if if all input


nit: s/if if/if/

kanterov · 2019-02-27T20:46:05Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ *
+ * <p>This transform also supports outer-join semantics. By default, all input PCollections must
+ * participate fully in the join, providing inner-join semantics. This means that if if all input
+ * save one have values for a given user "Bob" the join will produce no values for "Bob." However,


nit: looks like it was phrased as "if all inputs have values for a given user" or similar

kanterov · 2019-02-27T20:49:16Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ *
+ * <pre>{@code
+ * PCollection<Row> joined = PCollectionTuple.of("input1", input1, "input2", input2)
+ * .apply(CoGroup.join("input1", By.fieldNames("user").withOuterJoinParticipation()


nit: the code examples are missing a closing bracket in join(. If I understand it correctly, it should be:

PCollection<Row> joined = PCollectionTuple.of("input1", input1, "input2", input2)
    .apply(CoGroup.join("input1", By.fieldNames("user").withOuterJoinParticipation())
        .join("input2", By.fieldNames("user"))
        .crossProductJoin());

kanterov · 2019-02-27T20:51:13Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ *
+ * <pre>{@code
+ * PCollection<Row> joined = PCollectionTuple.of("input1", input1, "input2", input2)
+ * .apply(CoGroup.join("input1", By.fieldNames("user")


nit: missing bracket

kanterov · 2019-02-27T20:51:24Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ *
+ * <pre>{@code
+ * PCollection<Row> joined = PCollectionTuple.of("input1", input1, "input2", input2)
+ * .apply(CoGroup.join("input1", By.fieldNames("user").withOuterJoinParticipation()


nit: missing bracket

kanterov · 2019-02-27T22:42:46Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ String tag = sortedTags.get(tagIndex);
+ SerializableFunction<Object, Row> toRow = toRows.get(tag);
+ Iterable items = gbkResult.getAll(tagToTupleTag.get(tag));
+ if (!items.iterator().hasNext() && joinArgs.getOuterJoinParticipation(tag)) {


I'm wondering if it's worth preserving the iterator for the next loop, because creating a new one and checking hasNext could be expensive in the case of UnionValueIterator.
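
One way to read this suggestion is to call iterator() once, test hasNext() on that iterator, and then keep consuming from the same iterator. The sketch below is plain Java under that assumption; CountingIterable is a hypothetical stand-in for an iterable whose iterator() is costly, and forEachOrNull is an illustrative helper, not code from this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class IteratorReuseSketch {
  /** Stand-in for a costly iterable; it counts how often iterator() is called. */
  static final class CountingIterable<T> implements Iterable<T> {
    private final List<T> items;
    int iteratorCalls = 0;

    CountingIterable(List<T> items) {
      this.items = items;
    }

    @Override
    public Iterator<T> iterator() {
      iteratorCalls++;
      return items.iterator();
    }
  }

  /**
   * Passes every item to the sink, or a single null when the input is empty
   * and is marked for outer-join participation. Only one iterator is created.
   */
  static <T> void forEachOrNull(
      Iterable<T> items, boolean outerJoinParticipation, Consumer<T> sink) {
    Iterator<T> it = items.iterator(); // the only iterator() call
    if (!it.hasNext()) {
      if (outerJoinParticipation) {
        sink.accept(null);
      }
      return;
    }
    while (it.hasNext()) { // reuse the iterator that was just checked
      sink.accept(it.next());
    }
  }

  public static void main(String[] args) {
    CountingIterable<String> input = new CountingIterable<>(Arrays.asList("x", "y"));
    List<String> seen = new ArrayList<>();
    forEachOrNull(input, true, seen::add);
    System.out.println(seen + ", iterator() calls: " + input.iteratorCalls);
  }
}
```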

kanterov · 2019-02-27T22:45:30Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

+ private final Schema outputSchema;
+ private final Map<String, String> tagToTupleTag;
+
+ public ExpandToRows(


nit: What do you think about splitting into two parts:

dealing with tags, gbkResult and creating iterators with a single null element in case of outer join participation

doing cross-product for List<Iterable>

From what I see, it could save us from doing lookups in sortedTags and tagToTupleTag for each output row, and simplify the code.

This is an excellent point. I split this out so that we create the iterators ahead of time. I also changed the maps to key off the tag index so that we don't have to hash the string on every single element.
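
The two-part split discussed in this thread can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the code that was merged: inputs are addressed by index instead of by tag string, each input's values for one key are already in a list, and the names (CrossProductSketch, prepare, crossProduct) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class CrossProductSketch {
  /** Part 1: replace each empty outer-join participant with a single-null list. */
  static List<List<Object>> prepare(List<List<Object>> perInput, boolean[] outer) {
    List<List<Object>> prepared = new ArrayList<>();
    for (int i = 0; i < perInput.size(); i++) {
      List<Object> items = perInput.get(i);
      prepared.add(
          items.isEmpty() && outer[i] ? Collections.<Object>singletonList(null) : items);
    }
    return prepared;
  }

  /** Part 2: cross product over the prepared lists, with no per-row tag lookups. */
  static List<List<Object>> crossProduct(List<List<Object>> inputs) {
    List<List<Object>> rows = new ArrayList<>();
    expand(inputs, 0, new Object[inputs.size()], rows);
    return rows;
  }

  private static void expand(
      List<List<Object>> inputs, int index, Object[] current, List<List<Object>> rows) {
    if (index == inputs.size()) {
      rows.add(Arrays.asList(current.clone())); // copy, since current is reused
      return;
    }
    for (Object item : inputs.get(index)) {
      current[index] = item;
      expand(inputs, index + 1, current, rows);
    }
  }

  public static void main(String[] args) {
    List<List<Object>> perInput =
        Arrays.asList(
            Arrays.<Object>asList("a1", "a2"),
            Collections.<Object>emptyList(),
            Arrays.<Object>asList("c1"));
    boolean[] outer = {false, true, false}; // only the second input is optional
    System.out.println(crossProduct(prepare(perInput, outer)));
  }
}
```

Because the null substitution happens once per key in prepare, the cross-product loop only deals with positional lists.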

kanterov · 2019-02-27T22:46:29Z

sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/join/CoGbkResult.java

@@ -190,6 +201,12 @@ public String toString() {
 return innerGetOnly(tag, defaultValue, true);
 }

+ /** Like {@link #getOnly(TupleTag, Object)} but uisng a String instead of a TupleTag. */


nit: s/uisng/using/

kanterov · 2019-02-27T22:50:05Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java

@@ -123,6 +200,10 @@ public static PCollectionTuple empty(Pipeline pipeline) {
 return pcollectionMap.containsKey(tag);
 }

+ public <T> boolean has(String tag) {


nit: should we copy Javadoc?

kanterov · 2019-02-27T22:50:10Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java

@@ -137,6 +218,10 @@ public static PCollectionTuple empty(Pipeline pipeline) {
 return pcollection;
 }

+ public <T> PCollection<T> get(String tag) {


nit: should we copy Javadoc?

reuvenlax · 2019-02-28T19:10:47Z

Also filed [BEAM-6756]

reuvenlax · 2019-03-01T06:28:42Z

Run Java PreCommit

kanterov · 2019-03-01T11:42:16Z

sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/CoGroup.java

- return byFieldAccessDescriptor(FieldAccessDescriptor.withFieldIds(fieldIds));
+ static {
+ NULL_LIST = Lists.newArrayList();
+ NULL_LIST.add(null);


I'm not sure this is safe publication, because the list is modified after the final field is assigned. What about NULL_LIST = Arrays.asList((Object) null)?
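
A minimal sketch of the suggested alternative, which builds the single-null list in one expression so that the field never refers to a partially populated list. The unmodifiable wrapper is an extra precaution added here as an assumption, not something proposed in the comment, and the class name NullListSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NullListSketch {
  // The (Object) cast makes null a single vararg element rather than a null array.
  static final List<Object> NULL_LIST =
      Collections.unmodifiableList(Arrays.asList((Object) null));

  public static void main(String[] args) {
    System.out.println(NULL_LIST.size() + " element: " + NULL_LIST.get(0));
  }
}
```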

kanterov · 2019-03-01T11:45:14Z

LGTM

kanterov · 2019-03-01T11:45:17Z

Run Java PreCommit

reuvenlax requested a review from akedin December 26, 2018 20:55

reuvenlax added 8 commits February 24, 2019 22:03

Allow string fields to be used instead of TupleTags.

06df2c8

Create a class to encapsulate per-input join participation.

bfc1e91

Update javadoc.

7fc52c4

Add cross-product expansion to schema-based CoGroup transform. This g…

6e5bb0a

…eneralizes inner and outer joins to multiple input PCollections.

Add JavaDoc.

60818d4

Apply spotless.

718bd19

Fix failures.

da4c336

Some fixes.

984ea49

reuvenlax force-pushed the schema_join branch from e2a191a to 984ea49 Compare February 25, 2019 06:13

kanterov reviewed Feb 27, 2019

Address code-review comments.

22e9919

kanterov reviewed Mar 1, 2019

reuvenlax merged commit c41b3c0 into apache:master Mar 1, 2019
