[FLINK-3943] Add support for EXCEPT operator #2169

mushketyk · 2016-06-27T19:47:08Z

Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the How To Contribute guide.
In addition to going through the list, please provide a meaningful description of your changes.

General
- The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
- The pull request addresses only one issue
- Each commit in the PR has a meaningful commit message (including the JIRA id)
Documentation
- Documentation has been added for new functionality
- Old documentation affected by the pull request has been updated
- JavaDoc for public methods has been added
Tests & Build
- Functionality added by the pull request is covered by tests
- mvn clean verify has been executed successfully locally or a Travis build has passed

mushketyk · 2016-06-27T19:52:38Z

Renamed UnionITCase to SetOperationsITCase as suggested here: #2159

wuchong · 2016-06-28T02:02:41Z

.../flink-table/src/main/scala/org/apache/flink/api/table/plan/nodes/dataset/DataSetMinus.scala

+
+ val minusRes = leftDataSet.minus(rightDataSet)
+ if (!all) {
+ minusRes.distinct()


I think it's better to remove duplicate records in the CoGroup: for a minus without all, emit only one record instead of every record from the left data set. Then there is no need for a distinct afterwards. This is the more robust choice because it won't create a huge intermediate result when there are many duplicate records.

Makes sense. Will update.

On second thought, the "union" method in DataSet.java preserves duplicates and is used to implement UNION ALL. UNION is implemented by applying a "distinct" operation on top of UNION ALL.
It seems that what you suggest would only add code duplication, since it would basically implement a specialized version of the "distinct" operator just for the "minus" method.

I have no strong preferences though. @fhueske what do you think?

I would like to move the code of DataSet.minus() here. However, I think the semantics of EXCEPT ALL are a bit different from your implementation. It does not simply check whether there is a match in the second input and forward everything if there is none. Instead, for each match in the second input it removes one matching record from the first input (see also the PostgreSQL docs).

I would be in favor of an implementation that is similar to @wuchong's implementation of INTERSECT / INTERSECT ALL in PR #2159.
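To make the difference concrete, here is a minimal sketch of the per-key decision discussed above. It is not the code from this PR or from #2159. Plain Java collections stand in for the two CoGroup inputs, and the names `ExceptSketch`, `copiesToEmit`, and `except` are hypothetical. It assumes the usual SQL rule that a record occurring m times on the left and n times on the right appears max(m - n, 0) times under EXCEPT ALL, and at most once under EXCEPT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain collections stand in for the two inputs of a CoGroup. */
final class ExceptSketch {

    /**
     * Hypothetical helper: how many copies of one record to emit, given how
     * often it occurs on each side. This mirrors the decision a CoGroup
     * function would make for a single group.
     */
    static int copiesToEmit(int leftCount, int rightCount, boolean all) {
        if (all) {
            // EXCEPT ALL: every match on the right cancels one record on the left.
            return Math.max(0, leftCount - rightCount);
        }
        // EXCEPT: emit the record once, and only if the right side has no match.
        // No separate distinct step is needed afterwards.
        return (leftCount > 0 && rightCount == 0) ? 1 : 0;
    }

    /** Applies copiesToEmit to every distinct record of the left input. */
    static <T> List<T> except(List<T> left, List<T> right, boolean all) {
        // LinkedHashMap keeps first-seen order so the output is deterministic.
        Map<T, Integer> leftCounts = new LinkedHashMap<>();
        for (T record : left) {
            leftCounts.merge(record, 1, Integer::sum);
        }
        Map<T, Integer> rightCounts = new HashMap<>();
        for (T record : right) {
            rightCounts.merge(record, 1, Integer::sum);
        }

        List<T> result = new ArrayList<>();
        for (Map.Entry<T, Integer> entry : leftCounts.entrySet()) {
            int rightCount = rightCounts.getOrDefault(entry.getKey(), 0);
            int copies = copiesToEmit(entry.getValue(), rightCount, all);
            for (int i = 0; i < copies; i++) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "a", "a", "b", "c");
        List<String> right = Arrays.asList("a", "c", "d");
        System.out.println(except(left, right, false)); // [b]
        System.out.println(except(left, right, true));  // [a, a, b]
    }
}
```

With these inputs, "a" occurs three times on the left and once on the right, so EXCEPT ALL keeps two copies while EXCEPT drops it entirely. A forward-everything-if-no-match implementation would agree on EXCEPT but return only [b] for EXCEPT ALL.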

wuchong · 2016-06-28T02:41:56Z

Hi @mushketyk, I think we should remove duplicate records in the CoGroup instead of using distinct. Everything else looks good to me.

mushketyk · 2016-06-29T20:37:05Z

Updated documentation as suggested.

fhueske · 2016-06-29T22:58:23Z

flink-java/src/main/java/org/apache/flink/api/java/DataSet.java

+ * @param other The other DataSet which is set minus with the current DataSet.
+ * @return The resulting DataSet.
+ */
+ public CoGroupOperator<T, T, T> minus(DataSet<T> other){


This issue is about adding EXCEPT to the Table API. The DataSet API, which is touched here, is a rather low-level API and we are quite careful about adding new operators to it. Therefore, changes to the DataSet API should go through a separate JIRA issue. Please move this code to the DataSetMinus class and revert the changes to this file. You can open a JIRA issue to discuss adding a minus operator to the DataSet API. Thank you.

fhueske · 2016-06-29T23:52:37Z

Hi @mushketyk, thanks for the PR! I added a few comments inline.

Best, Fabian
@wuchong, thanks for reviewing!

mushketyk · 2016-06-30T08:27:55Z

@fhueske Thank you for the detailed review! I've updated my code according to your comments.

I noticed that @wuchong is performing some type conversions in his INTERSECT implementation: https://github.com/apache/flink/pull/2159/files#diff-a6c2112ca46d26fcf49f1edba1c73f75R121

Should I do something similar in the EXCEPT case? If yes, does it mean that my test coverage is not sufficient and does not cover some particular case?

mushketyk · 2016-07-07T20:42:12Z

@fhueske Hey. Sorry for bothering you. Do I need to change something else in this PR?

twalthr · 2016-07-08T13:02:59Z

docs/apis/table.md

@@ -536,6 +536,29 @@ Table result = left.unionAll(right);
 </tr>

 <tr>
+ <td><strong>Minus</strong></td>
+ <td>
+ <p>Similar to a SQL EXCEPT clause. Except returns records from the first table that do not exist in the second table. Duplicate records in the first table are returned exactly once, i.e., duplicates are removed. Both tables must have identical schema, i.e., field names and types.</p>


"Except returns records" should be "Minus returns records" to be consistent. I would also use "left/right" table instead of first and second according to your example code.

twalthr · 2016-07-08T13:19:22Z

.../flink-table/src/main/scala/org/apache/flink/api/table/plan/nodes/dataset/DataSetMinus.scala

+import scala.collection.JavaConversions._
+
+/**
+ * Flink RelNode which matches along with DataSetOperator.


This description does not make much sense.

twalthr · 2016-07-08T13:52:30Z

@mushketyk Thanks for the PR.
I also reviewed the current status. I think it's good to compare your code with #2159 before you rework it.

mushketyk · 2016-07-09T21:42:08Z

@twalthr Thank you for your review! I've updated the PR according to your comments.

mushketyk · 2016-07-11T07:06:43Z

@twalthr Updated PR according to your commits and rebased on top of the master branch to avoid merge conflicts.

twalthr · 2016-07-11T11:49:17Z

Thanks for updating the PR. I reviewed the code again. There were still some issues with the documentation and with the expectedType handling in DataSetMinus. I fixed them myself and will merge now.

mushketyk · 2016-07-11T14:33:57Z

@twalthr Thank you for accepting my changes!

This closes apache#2169.

mushketyk force-pushed the except-2 branch from daea09f to 861d83d Compare June 27, 2016 19:51

mushketyk added 6 commits July 10, 2016 22:36

[FLINK-3943] Add EXCEPT operator

b87a176

[FLINK-3943] Add EXCEPT documentation

8e805ce

[FLINK-3943] Update documentation

e770fb5

[FLINK-3943] Fixed according to PR comments

b91fb58

[FLINK-3943] Fixed according to PR comments

0c1d20d

[FLINK-3943] Update Table API documentation

73f3c7b

mushketyk force-pushed the except-2 branch from 17b4238 to 73f3c7b Compare July 11, 2016 07:04

asfgit closed this in 9753393 Jul 11, 2016

hequn8128 pushed a commit to hequn8128/flink that referenced this pull request Jun 22, 2017

[FLINK-3943] [table] Add support for EXCEPT operator

9ec91be

This closes apache#2169.

rmetzger added the component=API/TableSQL label Mar 14, 2019

flinkbot added component=TableSQL/API and removed component=API/TableSQL labels Mar 17, 2022
