[FLINK-758] Add count operator to DataSet #63

uce · 2014-07-07T14:51:27Z

Adds a count operator to DataSet:

DataSet<String> text = env.fromElements(
    "Who's there?",
    "I think I hear them. Stand, ho! Who's there?");
DataSet<Long> count = text.count(); // 2

Adds an all (ungrouped) reduce variant that allows specifying an initial value for the reduce function
Adds a utility method to InstantiationUtil to serialize a record to a byte array
Fixes some Javadoc warnings in DataSet

I've tested the count operator on a cluster with DOP 1 and DOP 200 and verified the results with cat | wc -l.

hsaputra · 2014-07-08T19:58:04Z

Does it need a corresponding Scala API counterpart?

uce · 2014-07-09T08:41:47Z

It already has the corresponding counterpart. But after your question I realized that the Java API is lacking the count on grouped datasets, which the Scala API supports.

input.count() // works
input.groupBy(...).count() // does not work

I will add it to this PR.

uce · 2014-07-09T10:40:18Z

Should be in line with Scala now.

@aljoscha: Expression key grouping did not work with group reduce. After looking into the reduce operator, where it does work, I figured that the small change in 3ccb500 should be enough. Can you confirm this?

aljoscha · 2014-07-09T14:57:48Z

I think it should work, yes.

Adds an all (ungrouped) reduce variant to DataSet that allows specifying an initial value for the ReduceFunction. The initial value is an extra input element to the ReduceFunction and must be of the same type as the ReduceFunction input type. The initial value is not used with the combiners. If there is no input to the reduce function, for example after a filter operator that filters out all elements, the ReduceFunction is called with the initial value only.
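The semantics described above can be illustrated with a small plain-Java model. This is only a sketch and not the code from this PR: the class InitialValueReduceSketch and the method reduceAll are hypothetical names, and the model ignores combiners and parallelism. It treats the initial value as one extra input element, so an empty input yields the initial value.

```java
import java.util.List;
import java.util.function.BinaryOperator;

/** Plain-Java model of an all-reduce with an initial value (hypothetical, not Flink code). */
final class InitialValueReduceSketch {

    private InitialValueReduceSketch() {}

    /**
     * Reduces all elements into a single value. The initial value is treated
     * as one extra input element and must have the same type as the input.
     */
    static <T> T reduceAll(List<T> input, T initialValue, BinaryOperator<T> reducer) {
        T result = initialValue;
        for (T element : input) {
            // Fold each element into the running result.
            result = reducer.apply(result, element);
        }
        // For an empty input the loop never runs and the initial value is returned.
        return result;
    }
}
```

In this model, summing an empty list of Long values with an initial value of 0 returns 0, which matches the motivation of a 0-valued count for empty data sets.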

Adds a count method to DataSet, which translates to a map-reduce. The map operator maps each element to a 1 and the reduce operator sums up all the 1s, resulting in the total count of elements.
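As a rough illustration of this translation, the following plain-Java sketch maps each element to a 1 and sums the 1s. It uses java.util.stream rather than the Flink operators, and the class CountSketch is a hypothetical name chosen for this example.

```java
import java.util.List;

/** Plain-Java model of count as a map-reduce (hypothetical, not Flink code). */
final class CountSketch {

    private CountSketch() {}

    static <T> long count(List<T> elements) {
        return elements.stream()
                .map(element -> 1L)      // map: every element becomes a 1
                .reduce(0L, Long::sum);  // reduce: sum the 1s, starting from 0
    }
}
```

Starting the reduce from 0 plays the role of the initial value, so an empty input produces a count of 0 instead of no result.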

The GroupReduceOperator did not work with DataSets that have been grouped by expression keys like groupBy("myField").

Adds support for count() on grouped DataSets, which will be translated to a non-combinable GroupReduceFunction. If the grouping is done on an empty DataSet, e.g. after a filter-all, the result of the count will be empty (since there are no groups to count).

uce · 2014-07-21T12:53:06Z

I've rebased this PR on the renamed master, but it is just a single commit (not the most elegant way to do this... if someone complains I'll fix it).

fhueske · 2014-09-07T13:52:02Z

flink-java/src/main/java/org/apache/flink/api/java/operators/CountOperator.java

+ return countReduceOp;
+ }
+ else {
+ return new ReduceGroupOperator<IN, Long>(grouping, new CountingGroupReduceUdf<IN>())


Using a non-combinable GroupReduceFunction for counting is unnecessarily inefficient.
We could extract the key fields using a Mapper, attach a count of 1 to each record, and use a (combinable) ReduceFunction instead.
This requires a few cases due to different key types but should be the way to go.
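A minimal plain-Java sketch of that idea, under the assumption that each partition can pre-aggregate its own (key, 1) pairs before the final reduce. The class GroupedCountSketch and its methods are hypothetical names, and the key extractor stands in for the different key types mentioned above; this is not the Flink implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java model of a combinable grouped count (hypothetical, not Flink code). */
final class GroupedCountSketch {

    private GroupedCountSketch() {}

    /** Combine step: maps each element to (key, 1) and sums per key within one partition. */
    static <T, K> Map<K, Long> combine(List<T> partition, Function<T, K> keyExtractor) {
        Map<K, Long> partial = new HashMap<>();
        for (T element : partition) {
            partial.merge(keyExtractor.apply(element), 1L, Long::sum);
        }
        return partial;
    }

    /** Reduce step: merges the partial counts of all partitions per key. */
    static <K> Map<K, Long> reduce(List<Map<K, Long>> partials) {
        Map<K, Long> total = new HashMap<>();
        for (Map<K, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }
}
```

Because addition is associative and commutative, the partial counts can be merged in any order, so only one (key, count) pair per key and partition needs to be shipped instead of every record. An empty input yields an empty result, since there are no groups to count.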

fhueske · 2014-09-07T14:29:44Z

I had a look at this PR and found a few issues:

- It contains changes for several independent features:
  - Initial value for ReduceFunction
  - Count operator
  - Many cosmetic changes / documentation improvements

  My gut feeling is that rebasing this PR onto the current master will cause many merge conflicts. It might be worthwhile to separate these issues into independent PRs to make the merging easier.
- Counting for grouped datasets is done with a non-combinable GroupReduceFunction, which is not very efficient.
- An initial value for ReduceFunction is only supported for AllReduce. I see that the original motivation for this (a 0-valued count for empty datasets) does not make sense for grouped ReduceFunctions, but this is not the only way an initial value could be used.

uce · 2014-09-07T16:30:13Z

Thanks for the review. The initial value for the reduce function and the count operator are tightly connected. The reduce with initial value is the general solution, of which the count operator is a special case. Therefore, I wouldn't say that these are independent features. The refactorings are also limited to files related to the initial value reduce/count operator.

The counting for grouped data sets was a quick fix after @hsaputra's comment. We can either fix it with this PR or open a separate issue if we want to merge it.

I think the limitation to AllReduce was the result of a discussion with you and @StephanEwen.

All in all, I think that we should wait for the upcoming changes to the runtime and scheduler to support the more intuitive API of simply returning the count to the user program. As you said, we might move some of the changes (like initial value reduce) to a separate issue if we find them useful.

fhueske · 2014-09-08T09:26:13Z

Sure, the count op requires the init value, but not vice versa. I meant we could do the init value change first (which also has a separate JIRA I think) and put the count op on top.

If we wait for the runtime changes, should we just close this PR for now and extract the init value changes for a separate PR?

StephanEwen · 2014-09-08T11:53:11Z

Bear in mind that the count will be given for free also when the first-class handling of intermediate results is merged.

uce · 2014-11-18T11:38:11Z

See #210.

uce · 2014-11-19T10:46:23Z

I think it's best to close this PR. It's superseded by the upcoming intermediate result changes and #210. Nobody complained about not having a fold-style reduce with an initial value, so I think it's OK to discard these features as well. I will only merge the byte array serialization commit (1b893a1) of this PR as I think it might be useful in general.


uce mentioned this pull request Jul 7, 2014

[FLINK-758] Add initial value to GenericReduce and adjust AllReduceDriver #20

Closed

uce added 5 commits July 21, 2014 14:02

Add byte array serialization to InstantiationUtil

1b893a1

[FLINK-758] Add count operator to DataSet

73db1b2

Adds a count method to DataSet, which translates to a map-reduce. The map operator maps each element to a 1 and the reduce operator sums up all the 1s, resulting in the total count of elements.

Add expression key support for GroupReduce operators

71e1ddb

The GroupReduceOperator did not work with DataSets that have been grouped by expression keys like groupBy("myField").

Rebase on renamed master

5db6110

fhueske reviewed Sep 7, 2014

uce closed this Nov 19, 2014

uce deleted the FLINK-758-count_operator branch February 23, 2015 09:18

rmetzger added the component=API/Scala label Mar 14, 2019

tzulitai pushed a commit to tzulitai/flink that referenced this pull request Jan 15, 2021

[FLINK-16557][docs] Document YAML-ized Kafka egresses / ingresses in …

b3a9ea1

…Stateful Functions documentation This closes apache#63.
