[FLINK-1622][java-api][scala-api] add a GroupCombine operator #466

mxm · 2015-03-09T16:38:02Z

The GroupCombine operator acts like the optional combine step in the
GroupReduceFunction. It is more general because it combines from an
input type to an arbitrary output type. Combining is performed locally
on each partition, with as much data in memory as possible, so a group
may be combined in several pieces. This may lead to partial results.

The operator can be used to pre-combine elements into an intermediate
output format before applying a proper groupReduce to produce the final
output format.
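
To make the combine-then-reduce flow concrete, here is a minimal plain-Java sketch of the semantics. It deliberately uses no Flink classes: the class name `GroupCombineSketch` and the methods `combinePartition` and `reduceAll` are hypothetical names chosen for illustration, and partitions are modelled as in-memory lists. The combine step changes the type (words in, word counts out) and only sees one partition, so its output is partial until the final reduce merges it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of combine followed by a full reduce; not Flink API. */
public class GroupCombineSketch {

    /** Combine step: sees ONE partition only and maps words to partial counts. */
    public static Map<String, Integer> combinePartition(List<String> partition) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String word : partition) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    /** Final reduce: merges the partial counts of all partitions per key. */
    public static Map<String, Integer> reduceAll(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> result.merge(word, count, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical partitions of the same data set.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "a"),
                Arrays.asList("b", "a"));

        List<Map<String, Integer>> partials = new ArrayList<>();
        for (List<String> partition : partitions) {
            partials.add(combinePartition(partition));
        }
        System.out.println("partial results: " + partials);
        System.out.println("final result:    " + reduceAll(partials));
    }
}
```

The key "a" shows up in both partial results, which is why a proper groupReduce is still needed afterwards.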

hsaputra · 2015-03-09T19:00:58Z

...src/main/java/org/apache/flink/api/common/operators/base/GroupReducePartialOperatorBase.java

+import java.util.List;
+
+/**
+ *


@mxm, could you add a class description explaining why the base class needs to be created and how it relates to the existing ones?
We would love to add more documentation to the code to help new contributors get comfortable with the code flow.

Thanks for spotting the missing doc in this class. I'll add some.

aljoscha · 2015-03-10T09:01:25Z

I like the implementation, except for my comments on groupReducePartial() on grouped DataSets. Also, the tests seem a bit shady because of all the grouping and regular reduceGroup operations. I would suggest partitioning the data using a manual partition operation and then applying a GroupReducePartial.

hsaputra · 2015-03-10T14:54:58Z

I assume this will be a new operator, but I do not see any updates to the documentation files.

mxm · 2015-03-10T15:19:40Z

@aljoscha Thanks for the comments. I agree, the tests are a bit shady because they test the operator by first performing a partial, then a full reduce. Using a custom partitioner would make more sense.

@hsaputra Yes, we absolutely have to add docs apart from the Java/Scala docs.

What do you think about the name? @fhueske suggested exposing the operator as combine because it is essentially a user-accessible combiner.

fhueske · 2015-03-10T15:59:06Z

Yes, I've got a couple of comments as well.

First of all, as @mxm said, I would propose to call this operator combine because it is a generalized combiner (output type may be different from input type). The current combiner could be implemented as special case of this one (input = output type).

I think @aljoscha is right. partialGroupReduce (or combine) should be applied on a DataSet, not a GroupedDataSet. Otherwise, you lose most of the benefits (local processing) and would ship all data, which is exactly what you want to avoid when using a combiner. If you want to run a reduce afterwards, you simply do groupBy(x).reduceGroup().

Apart from documentation, also a few Scala tests should be added.

fhueske · 2015-03-10T16:02:29Z

flink-java/src/main/java/org/apache/flink/api/java/operators/GroupReducePartialOperator.java

+ * This operator behaves like the GroupReduceOperator with Combine but only runs the Combine part which reduces all data
+ * locally in their partitions. The combine part can return an arbitrary data type. This is useful to pre-combine values 
+ * into an intermediate representation before applying a proper reduce operation.
+ *


Isn't the partial GroupReduce only working on subsets of a local partition (as much as fits in the sort buffer)?

That is correct.

fhueske · 2015-03-10T16:31:25Z

I have to correct myself. A combiner should of course be called on groups of records. Therefore, calling it on a Grouping makes absolute sense. However, the semantics of the Grouping are different for Combine and Reduce. For Reduce a full Grouping is established with repartitioning and full sort, whereas for Combine only a partial local grouping should be created.
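
The difference between a partial local grouping and a full grouping can be sketched in plain Java. This is an illustration under stated assumptions, not Flink's driver code: `PartialGroupingSketch`, `combineLocally`, and the `bufferSize` parameter are hypothetical, with `bufferSize` standing in for however many records fit into the sort buffer. Each buffer is sorted and grouped on its own, so the same key can produce several partial results within one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java illustration of a partial, buffer-bounded local grouping; not Flink API. */
public class PartialGroupingSketch {

    /**
     * Combines one partition in chunks of at most bufferSize records.
     * Returns one "key=count" entry per group found in each chunk.
     */
    public static List<String> combineLocally(List<String> partition, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < partition.size(); start += bufferSize) {
            int end = Math.min(start + bufferSize, partition.size());
            List<String> buffer = new ArrayList<>(partition.subList(start, end));
            Collections.sort(buffer); // equal keys become adjacent within this buffer only

            int i = 0;
            while (i < buffer.size()) {
                int j = i;
                while (j < buffer.size() && buffer.get(j).equals(buffer.get(i))) {
                    j++;
                }
                out.add(buffer.get(i) + "=" + (j - i)); // one partial result per local group
                i = j;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b", "a", "a");
        System.out.println(combineLocally(partition, 2)); // small buffer: key "a" is split
        System.out.println(combineLocally(partition, 4)); // everything fits: one group per key
    }
}
```

A Reduce, by contrast, would see every record of a key in a single call because of the repartitioning and full sort described above.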

mxm · 2015-03-10T16:37:49Z

@aljoscha @fhueske For a general combine, the operator can be used without grouping. When we want to combine elements before performing a proper groupReduce with a groupBy, we also need to use groupBy for the combine. Otherwise, we wouldn't know in the combiner which keys belong together. However, there are cases where a combiner without a groupBy would be appropriate. That's why the new operator is exposed in DataSet, GroupedDataSet, and SortedDataSet. We currently have the same behavior for a normal GroupReduce, despite the fact that the combine of the GroupReduce cannot change the input type.

aljoscha · 2015-03-10T17:17:59Z

Sorry, I completely blanked. Of course, you still need the grouping; only the shuffle step is unnecessary.

So, I suggest only better tests, using a combination of partitionByHash() and groupReducePartial().
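
A rough plain-Java sketch of why such a test is easier to reason about: after hash partitioning, every record of a key sits in exactly one partition, so the partition-local combine can be checked directly. None of this is Flink API. `HashPartitionSketch` and both of its methods are hypothetical, and the sketch assumes each partition fits in memory, so it ignores the sort-buffer splitting discussed earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of hash partitioning followed by a local combine; not Flink API. */
public class HashPartitionSketch {

    /** Assigns each record to a partition by the hash of its key (here the record itself). */
    public static List<List<String>> partitionByHash(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String record : records) {
            // floorMod keeps the index non-negative even for negative hash codes
            int index = Math.floorMod(record.hashCode(), numPartitions);
            partitions.get(index).add(record);
        }
        return partitions;
    }

    /** Partition-local combine: counts the occurrences of each key in one partition. */
    public static Map<String, Integer> combinePartition(List<String> partition) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String key : partition) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "a", "c", "b", "a");
        for (List<String> partition : partitionByHash(records, 3)) {
            System.out.println(partition + " -> " + combinePartition(partition));
        }
    }
}
```

Because no key is spread over several partitions, a test can assert on the combined output without running a regular reduceGroup afterwards.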

mxm · 2015-03-13T16:22:23Z

@aljoscha @fhueske @hsaputra Thanks for the feedback. Some people suggested that the name is confusing and that my pull request involved too much code duplication. I propose to call the new operator combineGroup because it is a combiner that works on Groups, just like the existing combiners in GroupReduceCombineDriver and AllGroupReduceDriver. I refactored the code to reuse Flink's existing logic for combiners.

@hsaputra I added documentation for the operator in the code and the official documentation. Some Scala tests for the API were added as well.

When merging this pull request, I would squash the two commits and keep the commit message of the latest one. I simply wanted to keep the first one to show the development process.

hsaputra · 2015-03-13T17:53:23Z

flink-compiler/src/main/java/org/apache/flink/compiler/operators/GroupCombineProperties.java

+import java.util.Collections;
+import java.util.List;
+
+public final class GroupCombineProperties extends OperatorDescriptorSingle {


Could you add JavaDoc for this class? I know the other classes derived from OperatorDescriptorSingle do not have JavaDoc yet, but I am trying to get newly added code to have at least a simple explanation of why the class was created.

Totally agree with you. We should not accept undocumented classes. I added some doc, also for the base class.

hsaputra · 2015-03-13T17:55:01Z

Thanks @mxm! I added a couple more comments about documentation. I am trying to promote the habit of adding a JavaDoc comment to each new class to explain why it had to be created. Removing code is harder than adding it.

The partial GroupReduce operator acts like a regular GroupReduce operator but does not perform a full group reduce. Instead, it performs the GroupReduce only on the individual partitions. This may lead to a partial GroupReduce result. The operator can be used to pre-combine elements into an intermediate output format before applying a proper groupReduce to produce the final output format.

The GroupCombine operator acts like the optional combine step in the GroupReduceFunction. It is more general because it combines from an input to an arbitrary output type. Combining is performed on the partitions with as much data in memory as possible. This may lead to partial results. The operator can be used to pre-combine elements into an intermediate output format before applying a proper groupReduce to produce the final output format.

* rename GroupReducePartial to GroupCombine
* make Combine and FlatCombine generic by adding an output type
* add documentation
* Reuse GroupReduceCombineDriver and SynchronousChainedCombineDriver for GroupCombine operator
** make them more generic by specifying input and output type
** implement AllCombineDriver
* add Java tests
* add Scala test

mxm · 2015-03-16T09:57:09Z

@hsaputra Thanks for the feedback. Promoting the habit of documenting every class is a very good thing.

hsaputra · 2015-03-16T16:27:45Z

Hi @mxm, appreciate the ACK. Just trying to keep everyone informed as more code comes in.
It is harder to read other people's code and flow, so I was just trying to ease the process =)

mxm · 2015-03-18T10:21:57Z

Any further comments?

aljoscha · 2015-03-18T10:35:38Z

I would say it's good to go now.

The GroupCombine operator acts like the optional combine step in the GroupReduceFunction. It is more general because it combines from an input to an arbitrary output type. Combining is performed on the partitions with as much data in memory as possible. This may lead to partial results. The operator can be used to pre-combine elements into an intermediate output format before applying a proper groupReduce to produce the final output format.

* make Combine and FlatCombine generic by adding an output type
* add documentation
* Reuse GroupReduceCombineDriver and SynchronousChainedCombineDriver for GroupCombine operator
** make them more generic by specifying input and output type
** implement AllCombineDriver
* add Java tests
* add Scala test

This closes apache#466

hsaputra reviewed Mar 9, 2015

mxm force-pushed the reducePartialOperator branch from 8eb4dfe to e655d22 Compare March 10, 2015 15:20

fhueske reviewed Mar 10, 2015

mxm changed the title ~~[FLINK-1622][java-api][scala-api] add a partial GroupReduce operator~~ [FLINK-1622][java-api][scala-api] add a GroupCombine operator Mar 13, 2015

mxm force-pushed the reducePartialOperator branch from 8b60fbb to 11d0581 Compare March 13, 2015 17:12

hsaputra reviewed Mar 13, 2015

mxm force-pushed the reducePartialOperator branch from 11d0581 to f3aaa31 Compare March 14, 2015 12:49

mxm force-pushed the reducePartialOperator branch from f3aaa31 to 2df89c9 Compare March 16, 2015 09:47

mxm force-pushed the reducePartialOperator branch from 2df89c9 to 5080602 Compare March 16, 2015 09:49

asfgit closed this in e93e0cb Mar 18, 2015

rmetzger added the component=Runtime/Operators label Mar 14, 2019

flinkbot added component=Runtime/Task and removed component=Runtime/Operators labels Mar 17, 2022
