
Implement the convenience methods count and collect in DataSet #210

Closed · wants to merge 2 commits

Conversation

@mxm (Contributor) commented Nov 17, 2014

These methods make it convenient for the API user to get intermediate results back into the driver program.
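For illustration, a minimal usage sketch assuming the method shapes described here (count() returning a long, collect() returning a List&lt;T&gt;; the sample data is made up):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

import java.util.List;

public class CountCollectExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5);

        // Each call triggers an execution of the plan and ships the
        // result back to the driver program via an accumulator.
        long howMany = numbers.count();
        List<Integer> elements = numbers.collect();

        System.out.println(howMany + " elements: " + elements);
    }
}
```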

@zentol (Contributor) commented Nov 17, 2014

Does this work when using count() or collect() multiple times in the same plan?

You use this.getExecutionEnvironment().getIdString() to identify the accumulators, and as far as I can see this value is the same for all operators in a plan. As such, multiple usages of count() will return the summed count of all DataSets, and collect() will throw an exception, since the same key is used when calling getRuntimeContext().addAccumulator(id, accumulator).

@zentol (Contributor) commented Nov 17, 2014

Well, never mind: I just saw that you're calling this.getExecutionEnvironment().execute() within these methods, so when using them, count() or collect() would be the last method called. I find that very odd behaviour; I can't think of another method that executes the plan right away.

@zentol (Contributor) commented Nov 17, 2014

Why didn't you implement these as an identity (map) function that saves the information to unique accumulators (maybe identified by a user-supplied ID)? You could insert them into any plan, at any point, any number of times; I can see that being useful.

@mxm (Contributor, Author) commented Nov 18, 2014

> Why didn't you implement these as an identity (map) function that saves the information to unique accumulators (maybe identified by a user-supplied ID)?

Thank you for your feedback. The idea is to have intermediate results available to the user program within the execution of a Flink program. Currently, this is only possible through RichFunctions, which provide access to the RuntimeContext.
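For context, a hedged sketch of that status quo: a RichFunction registers an accumulator in open(), and the driver fetches it from the JobExecutionResult after execute(). The accumulator name "manual-count" is made up for illustration.

```java
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Identity map that only counts: the boilerplate the new convenience
// methods are meant to hide from the API user.
public class CountingMapper<T> extends RichMapFunction<T, T> {
    private final LongCounter counter = new LongCounter();

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator("manual-count", counter);
    }

    @Override
    public T map(T value) {
        counter.add(1L);
        return value;
    }
}

// Driver side, after the job has run:
//   JobExecutionResult result = env.execute();
//   long count = result.getAccumulatorResult("manual-count");
```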

@StephanEwen (Contributor) commented:

I think executing right away makes sense, because we need the long value or the List<T> immediately. This is a new type of method, where you fetch data back to the driver program.

As for IDs, I agree that a randomly generated ID per method call makes sense.
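A hedged sketch of how a per-call ID could be wired in (assumed shape; the name CountHelper and the details are illustrative, not necessarily what the PR does). The driver generates a fresh random ID per count() call and passes it in, so several count()/collect() calls in one plan no longer collide on a shared accumulator key:

```java
import org.apache.flink.api.common.accumulators.LongCounter;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountHelper<T> extends RichFlatMapFunction<T, T> {
    private final String id; // fresh per count() call, e.g. a random UUID string
    private final LongCounter counter = new LongCounter();

    public CountHelper(String id) {
        this.id = id;
    }

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator(id, counter);
    }

    @Override
    public void flatMap(T value, Collector<T> out) {
        counter.add(1L);
        // intentionally no out.collect(value): this operator acts as a sink
    }
}
```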

@uce (Contributor) commented Nov 18, 2014

Very cool! I think this is a great addition, and we will improve on the performance soon. :)

@zentol, as @StephanEwen said, it makes sense to execute right away. You can think of these as "actions" from Spark, which bring results back to the user. He can't use a map function here, because it doesn't emit any records; it just adds them to the accumulator.

I've tried this out locally and it works, but you have to add the random IDs, as you otherwise run into problems with multiple actions in the same program.

With this change, we should also allow programs without sinks. Currently, you get a warning if you only have "actions" like collect/count without a sink.

Before we can merge this, you should also add test cases for both operators and make sure that it works with multiple actions per program.

@zentol (Contributor) commented Nov 18, 2014

Hmm... alright, I can see the point.

Doesn't executing right away carry the risk of being inefficient when these methods are used multiple times, though? Since it effectively means executing multiple jobs within the same program (I think...), any common part of the jobs is executed an extra time. (If I'm wrong here, skip the rest.)

Example:

```
List l1 = A.map(X).map(Y).collect();
List l2 = A.map(X).map(Z).collect();

// <some user code using l1 & l2>
```

This would result in 2 jobs being executed, with map(X) being executed twice, whereas

```
B = A.map(X);
B.map(Y).collect("c1");
B.map(Z).collect("c2");

JobExecutionResult jre = env.execute();
List l1 = jre.getAccumulatorResult("c1");
List l2 = jre.getAccumulatorResult("c2");

// <some user code using l1 & l2>
```

would be only 1 job, with map(X) done only once. It is not as pretty (by a fair margin, I admit), but it is in line with the current API.

@mxm (Contributor, Author) commented Nov 18, 2014

I changed the code to generate a unique identifier for each call of count or collect.

@zentol Executing in a lazy fashion could be implemented, but it would require additional changes to the API. The methods would then have to return something like a LocalDataSet, which lets the user retrieve the accumulator result once the job has been executed.
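Purely as a sketch of that idea (hypothetical type, not part of this PR): a deferred handle that remembers the accumulator ID and resolves it once the job has run.

```java
import org.apache.flink.api.common.JobExecutionResult;

import java.util.List;

// Hypothetical handle for a lazy collect(): the plan has not been
// executed yet; the result is pulled from the JobExecutionResult
// later, using the accumulator ID that collect() registered.
public class LocalDataSet<T> {
    private final String accumulatorId;

    public LocalDataSet(String accumulatorId) {
        this.accumulatorId = accumulatorId;
    }

    public List<T> get(JobExecutionResult result) {
        return result.getAccumulatorResult(accumulatorId);
    }
}
```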

@zentol (Contributor) commented Nov 18, 2014

Why additional changes? You can already retrieve the accumulators by using the JobExecutionResult.

The only necessary changes I see are removing the execute() inside collect() and count(), and, now that you generate the IDs yourself, returning the ID so the accumulators can be accessed later:

```
B = A.map(X);
String id1 = B.map(Y).collect();
String id2 = B.map(Z).collect();

JobExecutionResult jer = env.execute();
List l1 = jer.getAccumulatorResult(id1);
List l2 = jer.getAccumulatorResult(id2);

// <some user code using l1 & l2>
```

Why would this not be sufficient?

```java
public void write(DataOutputView out) throws IOException {
    ObjectOutputStream outStream = new ObjectOutputStream(new DataOutputViewStream(out));
    outStream.writeObject(typeSerializer);
    outStream.flush();
```

Contributor (inline review comment):

Let us change this to close().
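The suggested change, sketched against the snippet above (ObjectOutputStream.close() flushes before closing, per the JDK contract):

```java
public void write(DataOutputView out) throws IOException {
    ObjectOutputStream outStream = new ObjectOutputStream(new DataOutputViewStream(out));
    outStream.writeObject(typeSerializer);
    outStream.close(); // flushes and releases the stream in one call
}
```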

@StephanEwen (Contributor) commented:

With the scheduler and intermediate data set enhancements coming up for 0.9 soon, this is now quite feasible to use. I suggest merging it once the inline comments are addressed.

@mxm (Contributor, Author) commented Jan 19, 2015

I applied the proposed changes and rebased onto the current master. The changes are only reflected in the Java API so far and need to be added to the Scala API as well.

@mxm (Contributor, Author) commented Jan 20, 2015

I've implemented count and collect in the Scala API. There is still a problem with the ListAccumulator for non-primitive objects (e.g. anything other than Integer or Long), probably due to object reuse.
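For illustration, a hedged sketch of the usual defense against object reuse (assumed cause and fix; the actual bug may lie elsewhere): copy each record via its TypeSerializer before storing it, instead of keeping the possibly reused instance.

```java
import org.apache.flink.api.common.accumulators.ListAccumulator;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.typeutils.TypeSerializer;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Illustrative collect() helper: the runtime may hand the same record
// instance to flatMap() on every invocation, so adding the reference
// directly would fill the list with copies of the last element.
public class CollectHelper<T> extends RichFlatMapFunction<T, T> {
    private final String id;
    private final TypeSerializer<T> serializer;
    private final ListAccumulator<T> accumulator = new ListAccumulator<>();

    public CollectHelper(String id, TypeSerializer<T> serializer) {
        this.id = id;
        this.serializer = serializer;
    }

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator(id, accumulator);
    }

    @Override
    public void flatMap(T value, Collector<T> out) {
        accumulator.add(serializer.copy(value)); // defensive copy
    }
}
```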

@mxm (Contributor, Author) commented Jan 27, 2015

Looks like this is now ready to merge.

@zentol I understand your concern. However, I think that it is much easier to execute this way. Most of the time, the user probably wants just one accumulator result, not multiple. This is supposed to be a convenience function.

@StephanEwen (Contributor) commented:

@zentol You are right that, for the time being, this results in parts of the plan being executed repeatedly. While not totally avoidable in all cases, the code going in soon for caching intermediate results will help there big time.

@mxm (Contributor, Author) commented Feb 26, 2015

I squashed the commits and rebased onto the current master. Any objections to merging this?

```
@@ -58,6 +58,12 @@ under the License.
    <version>0.5.1</version>
  </dependency>

  <dependency>
```

Contributor (inline review comment):

Why are you moving netty to flink-core?

@mxm (Contributor, Author) replied:

Simply because AbstractID depends on io.netty.buffer.ByteBuf.

Contributor (inline review comment):

Alright, that makes sense.

Contributor (inline review comment):

Actually, only a single convenience method in AbstractID depends on this. I would propose refactoring that method into one of the existing utility classes in flink-runtime and keeping the Netty dependency in flink-runtime.

@StephanEwen (Contributor) commented:

Let us not move Netty to flink-core. Let us rather pull that method out of AbstractID, or not use AbstractID in your code at all, but the Java UUID.
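A tiny sketch of the java.util.UUID variant (assuming the ID only needs to serve as a unique accumulator name; the class name is made up):

```java
import java.util.UUID;

public final class AccumulatorIds {
    // Unlike new AbstractID().toString(), this pulls in no Netty types,
    // so the netty dependency can stay in flink-runtime.
    public static String next() {
        return UUID.randomUUID().toString();
    }
}
```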

Maximilian Michels and others added 2 commits on February 27, 2015, 13:04:
… DataSet to the client

- this implements two convenience methods on DataSet for the Java and Scala API
- appropriate tests have been added

count(): returns the number of elements in a DataSet
collect(): returns a List<T> with the actual elements of a DataSet<T>

- both methods use accumulators to get the results back to the client
- both methods force an execution of the job to generate the results
@mxm (Contributor, Author) commented Feb 27, 2015

I moved the writeInto(ByteBuf buf) method to the subclasses of AbstractID used by NettyMessage. This is the only context in which the method is used, and those subclasses already have a corresponding fromByteBuf(ByteBuf buf) method.

@mxm (Contributor, Author) commented Feb 27, 2015

If there are no objections, I will merge this once Travis passes.

asfgit closed this in 3dc2fe1 on Feb 27, 2015.