
[BEAM-6772] Change Select semantics to match what a user expects #8006

Merged
14 commits merged into apache:master on Apr 12, 2019

Conversation

reuvenlax
Contributor

Details in the JIRA: The current Select transform provides confusing and less-useful semantics. This PR causes the transform to remove extra levels of nesting that are left after a Select.
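
To make the discussion below easier to follow, here is a minimal, hypothetical sketch (not taken from this PR) of the kind of pipeline involved, assuming a schema-aware PCollection<Row> named input whose rows carry a nested location row field; the exact output semantics are what this thread debates:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.transforms.Select;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Input schema: {userId: STRING, location: ROW{latitude: DOUBLE, longitude: DOUBLE}}
Schema locationSchema =
    Schema.builder().addDoubleField("latitude").addDoubleField("longitude").build();
Schema inputSchema =
    Schema.builder().addStringField("userId").addRowField("location", locationSchema).build();

// Selecting a nested field. Per the description above, this PR removes the extra
// nesting levels that the pre-existing Select left in the output, so selecting
// "location.latitude" is expected to yield rows with the single field
// {latitude: DOUBLE} rather than the nested {location: ROW{latitude: DOUBLE}}.
PCollection<Row> latitudes = input.apply(Select.fieldNames("location.latitude"));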

@reuvenlax
Contributor Author

R: @kanterov

@kanterov
Member

I've started looking. The code makes sense; however, I need more time to think about the idea of automatic unnesting. I'm wondering if we can make it less implicit.

As for me, I would expect the previous behavior; that's how Spark DataFrames work, for instance, IIRC.

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

kanterov commented Mar 13, 2019

I checked a similar query in BigQuery:

SELECT location FROM UNNEST([
  STRUCT(STRUCT(1.0 as latitude, 2.0 as longtitude) as location, "abc" as userId)
]);

It returns:

[
  {
    "location": {
      "latitude": "1.0",
      "longtitude": "2.0"
    }
  }
]

As I understand this PR, it makes Beam return:

[{
  "latitude": "1.0",
  "longtitude": "2.0"
}]

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

As I understand it, FieldAccess("location") would work only if we leave an extra layer of nesting.

I did a few BigQuery experiments:

CREATE TABLE IF NOT EXISTS test.test (
  userId STRING,
  location STRUCT<latitude FLOAT64, longtitude FLOAT64>
);

INSERT INTO test.test (userId, location) VALUES ("abc", (123.0, 234.0));

SELECT location FROM test.test;
-- Row  location.latitude   location.longtitude 
-- 1    123.0               234.0

SELECT location.latitude FROM gleb_test.test;
-- Row latitude    
-- 1   123.0

@kanterov
Member

I need to check how FieldAccess works today

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

It created a schema with nested records; to confirm, I did an experiment:

CREATE TABLE `test.test_schema` AS
SELECT location FROM test.test;

And I got:

Field name | Type | Mode
-- | -- | --
location | RECORD | NULLABLE
location.latitude | FLOAT | NULLABLE
location.longtitude | FLOAT | NULLABLE

@kanterov
Member

kanterov commented Mar 13, 2019

This way it's more visible:

$ bq show --schema test.test_schema | jq
[
  {
    "fields": [
      {
        "type": "FLOAT",
        "name": "latitude"
      },
      {
        "type": "FLOAT",
        "name": "longtitude"
      }
    ],
    "type": "RECORD",
    "name": "location"
  }
]

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

I did a similar experiment with Spark:

from pyspark.sql import types as T  # assumed import; `spark` is the notebook's SparkSession

df = spark.createDataFrame(
    [{"userId": "abc", "location": {"longtitude": 1.0, "latitude": 2.0}}],
    T.StructType([
        T.StructField("userId", T.StringType()),
        T.StructField("location", T.StructType([
            T.StructField("longtitude", T.DoubleType()),
            T.StructField("latitude", T.DoubleType()),
        ])),
    ])
)


df
> DataFrame[userId: string, location: struct<longtitude:double,latitude:double>]

df.select("location")
> DataFrame[location: struct<longtitude:double,latitude:double>]

df.select("location.latitude")
> DataFrame[latitude: double]

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

kanterov commented Mar 13, 2019

As I understand, the approach is to take the last element in the field path and use it as the name. As I can see, BigQuery works the same way:

CREATE TABLE `test.test_schema_2` AS
SELECT location.latitude FROM test.test;
$ bq show --schema test.test_schema_2 | jq
[
  {
    "type": "FLOAT",
    "name": "latitude"
  }
]

@kanterov
Member

It seems that Spark allows duplicate column names:

df.select("location.latitude", "location.latitude").printSchema()
root
 |-- latitude: double (nullable = true)
 |-- latitude: double (nullable = true)

And BigQuery doesn't:

Duplicate column names in the result are not supported. Found duplicate(s): latitude

@kennknowles
Member

I don't actually see a contradiction here. When you SELECT location or df.select("location"), the output is a row with a single column containing a location struct. This is because the value of the selected field is a struct and it gets a default name. That's what this is:

[
  {
    "location": {
      "latitude": "1.0",
      "longtitude": "2.0"
    }
  }
]

It isn't maintaining the original structure of the input row; it is a list of rows, where each row has the column you selected, and that column contains a struct.

@kennknowles
Member

I'm curious about this one:

CREATE TABLE IF NOT EXISTS test.test (
  userId STRING,
  location STRUCT<latitude FLOAT64, longtitude FLOAT64>
);

INSERT INTO test.test (userId, location) VALUES ("abc", (123.0, 234.0));

SELECT location FROM test.test;
-- Row  location.latitude   location.longtitude 
-- 1    123.0               234.0

SELECT location.latitude FROM gleb_test.test;
-- Row latitude    
-- 1   123.0

The output when you SELECT location is flattened. It should be one column containing a row with the two fields. Is it just for display purposes?

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@kanterov
Member

@kennknowles it isn't flattened, you are right:

SELECT location FROM test.test;

Gives

[
  {
    "fields": [
      {
        "type": "FLOAT",
        "name": "latitude"
      },
      {
        "type": "FLOAT",
        "name": "longtitude"
      }
    ],
    "type": "RECORD",
    "name": "location"
  }
]

@kanterov
Member

Makes sense. I tried it, and both BigQuery and Spark behave consistently when selecting rows nested in rows:

df.printSchema()
root
 |-- userId: string (nullable = true)
 |-- position: struct (nullable = true)
 |    |-- location: struct (nullable = true)
 |    |    |-- longtitude: double (nullable = true)
 |    |    |-- latitude: double (nullable = true)

df.select("position.location").printSchema()
root
 |-- location: struct (nullable = true)
 |    |-- longtitude: double (nullable = true)
 |    |-- latitude: double (nullable = true)

@reuvenlax
Contributor Author

reuvenlax commented Mar 13, 2019 via email

@reuvenlax
Contributor Author

@kanterov I'm changing the semantics here to always return a new row, matching BigQuery and Spark. Will ping once that's done.

@kanterov
Member

@reuvenlax To avoid miscommunication: I didn't reply, not because I was uncertain about the solution or because something was unclear, but because I didn't have time to continue the review.

@kanterov
Member

kanterov commented Mar 22, 2019

Let me summarize my understanding of Spark's properties (a small Beam-side sketch follows below):

  • the resulting schema of select(<xs : x>) always has a field named x, and its schema matches the schema at the <xs : x> path in the input schema
  • the resulting schema of select(<path1>, <path2>) is the union of select(<path1>) and select(<path2>)
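
A Beam-side sketch of the second property, using the SelectHelpers and FieldAccessDescriptor API that appears in the test further down (the field names and the exact field order in the output schema are my assumptions):

import org.apache.beam.sdk.schemas.FieldAccessDescriptor;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.utils.SelectHelpers;

// Input schema: {location: ROW{latitude: DOUBLE, longitude: DOUBLE}, userId: STRING}
Schema location =
    Schema.builder().addDoubleField("latitude").addDoubleField("longitude").build();
Schema input =
    Schema.builder().addRowField("location", location).addStringField("userId").build();

FieldAccessDescriptor fd =
    FieldAccessDescriptor.withFieldNames("location.latitude", "userId").resolve(input);

// Under the two properties above, the output schema should be the union of the
// per-path selects, each field named after the last path element:
// {latitude: DOUBLE, userId: STRING}
Schema output = SelectHelpers.getOutputSchema(input, fd);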

@kanterov
Member

I also uploaded the notebook I used for Spark: https://gist.github.com/kanterov/78972d40ebe5b6f5c553b5a018dbb0bf

You can get a local Spark instance with Jupyter with:

$ docker run -p8888:8888 -it jupyter/pyspark-notebook

Or use Dataproc.

@kanterov
Member

Spark has an implementation of schema pruning that behaves similarly to the code in this PR (it preserves schema structure). It was introduced in apache/spark#21320; there is a test suite demonstrating its behavior: https://github.com/apache/spark/pull/21320/files#diff-3131013c95ca682e798070a2d50d6896

As I understand, it isn't used in the user-facing API, only for pushing down projections into the Parquet data source.

@kennknowles
Member

Oh, I see now. The issue is actually eliding the array layers.

@kennknowles
Member

I think the summary by @kanterov is super clear and is the right approach, but I don't know if I would call that "union" so much as row building.

For SQL I would state this as SELECT <expr1 : name1>, <expr2 : name2>, ...: each result is a new row containing the columns name1, name2, etc., and the type of each column is the type of the corresponding expression.

Now, for raw Beam, if you select just a single column and the type of that column can convert to a Java type, you should be able to convert (a sketch follows after the bullets below). Specifically:

  • row with one int column convertible to int
  • row with one row column matching a POJO can convert directly

This is a step towards a thing mentioned in the Go SDK coders thread:

  • a row with one bytes column plus coder metadata is equivalent to today's approach of coders
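
As an illustration of those bullets (hypothetical, not part of this PR), assuming a POJO registered with Beam's schema inference; the Location class and the selected PCollection names are my own:

import org.apache.beam.sdk.schemas.JavaFieldSchema;
import org.apache.beam.sdk.schemas.annotations.DefaultSchema;

// A POJO whose inferred schema matches the selected "location" row.
@DefaultSchema(JavaFieldSchema.class)
public class Location {
  public double latitude;
  public double longitude;
}

// After selecting the single row-typed column "location", each output row has one
// ROW field whose schema matches Location, so converting PCollection<Row> to
// PCollection<Location> should be possible, e.g. via the schema Convert transform:
//
//   PCollection<Location> locations = selected.apply(Convert.to(Location.class));
//
// Similarly, after selecting "location.latitude" the output has a single DOUBLE
// field, which could convert directly to a PCollection<Double>.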

@reuvenlax
Contributor Author

reuvenlax commented Mar 22, 2019 via email

@reuvenlax
Contributor Author

@kanterov I'm surprised that Spark cannot resolve select("a.c.d"). Is there a reason for that?

@kanterov
Member

@reuvenlax I was surprised to see that Spark didn't resolve select("a.c.d").

Agree that we should keep iterations quick, and given the API is heavily experimental we can move forward in increments. I think we did a good analysis and now understand the problem better. However, we still don't have a good formalization of the desired semantics.

I'm going to revisit the code again; I didn't have time to check it today.

@kanterov
Member

@reuvenlax I've tried the following test:

  @Test
  public void testSelectFieldOfRecord() {
    Schema f1 = Schema.builder().addInt64Field("f0").build();
    Schema f2 = Schema.builder().addRowField("f1", f1).build();
    Schema f3 = Schema.builder().addRowField("f2", f2).build();
    
    Row r1 = Row.withSchema(f1).addValue(42L).build(); // {"f0": 42}
    Row r2 = Row.withSchema(f2).addValue(r1).build();  // {"f1": {"f0": 42}}
    Row r3 = Row.withSchema(f3).addValue(r2).build();  // {"f2": {"f1": {"f0": 42}}}

    FieldAccessDescriptor fieldAccessDescriptor = FieldAccessDescriptor
        .withFieldNames("f2.f1")
        .resolve(f3);

    Schema outputSchema = SelectHelpers.getOutputSchema(f3, fieldAccessDescriptor);

    Row out = SelectHelpers.selectRow(r3, fieldAccessDescriptor, r3.getSchema(), outputSchema);

    assertEquals(outputSchema, f2);
    assertEquals(out, r2);
  }

And it fails with an exception:

java.lang.IllegalArgumentException: For field name f0 and type INT64 found incorrect class type class org.apache.beam.sdk.values.RowWithStorage
	at org.apache.beam.sdk.values.Row$Builder.verifyPrimitiveType(Row.java:724)
	at org.apache.beam.sdk.values.Row$Builder.verify(Row.java:587)
	at org.apache.beam.sdk.values.Row$Builder.verify(Row.java:571)
	at org.apache.beam.sdk.values.Row$Builder.build(Row.java:748)
	at org.apache.beam.sdk.schemas.utils.SelectHelpers.selectIntoRow(SelectHelpers.java:204)
	at org.apache.beam.sdk.schemas.utils.SelectHelpers.selectRow(SelectHelpers.java:167)
	

I don't see anything wrong with the test itself.

@reuvenlax
Contributor Author

Good catch on that test. This was a bug in my code. I've fixed that bug and added your test.

@reuvenlax
Contributor Author

@kennknowles and I discussed the array issue. We both think that the best approach is probably to distribute out array selects. However, there are other options that should be discussed (e.g. completely unnesting arrays whenever a select happens), so we think this discussion should be brought to the dev list. I believe this PR is generally an improvement, so (assuming other comments are resolved) I think we should merge it and address array select semantics separately.

@reuvenlax
Contributor Author

FYI by "distribute array selects" I mean the following:

If you have { a: {b: int, c: int}[] } and someone selects "a.b", "a.c", the result should be {b: int[], c: int[]}. The same should be true if someone selects a.*. However, if someone selects just a, you get the original schema back.
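
A hedged sketch of that shape in Beam's schema terms (the schema-building calls here are my illustration, not code from the PR):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// Input: { a: ARRAY<ROW{b: INT32, c: INT32}> }
Schema element = Schema.builder().addInt32Field("b").addInt32Field("c").build();
Schema input = Schema.builder().addArrayField("a", FieldType.row(element)).build();

// Selecting "a.b", "a.c" (or "a.*") distributes the select over the array, so the
// expected output schema is { b: ARRAY<INT32>, c: ARRAY<INT32> }.
// Selecting just "a" leaves the original { a: ARRAY<ROW{b, c}> } unchanged.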

@kanterov
Member

@reuvenlax sorry for the long iterations.

I've tried the following test, and it didn't work:

// expected {"f2": {"f1": {"f0": 42}}}
// got {"f1": {"f0": 42}}
  @Test
  public void testSelectFieldOfRecordOrRecord() {
    Schema f1 = Schema.builder().addInt64Field("f0").build();
    Schema f2 = Schema.builder().addRowField("f1", f1).build();
    Schema f3 = Schema.builder().addRowField("f2", f2).build();
    Schema f4 = Schema.builder().addRowField("f3", f3).build();

    Row r1 = Row.withSchema(f1).addValue(42L).build(); // {"f0": 42}
    Row r2 = Row.withSchema(f2).addValue(r1).build(); // {"f1": {"f0": 42}}
    Row r3 = Row.withSchema(f3).addValue(r2).build(); // {"f2": {"f1": {"f0": 42}}}
    Row r4 = Row.withSchema(f4).addValue(r3).build(); // {"f3": {"f2": {"f1": {"f0": 42}}}}

    FieldAccessDescriptor fieldAccessDescriptor =
        FieldAccessDescriptor.withFieldNames("f3.f2").resolve(f4);

    Schema outputSchema = SelectHelpers.getOutputSchema(f3, fieldAccessDescriptor);

    Row out = SelectHelpers.selectRow(r4, fieldAccessDescriptor, r4.getSchema(), outputSchema);

    assertEquals(f3, outputSchema);
    assertEquals(r3, out);
  }

Row out = SelectHelpers.selectRow(r3, fieldAccessDescriptor, r3.getSchema(), outputSchema);

assertEquals(outputSchema, f2);
assertEquals(out, r2);

Member (inline review comment on the lines above)

Parameters should be in a different order: expected, actual

Contributor Author

done

@kanterov
Member

kanterov commented Apr 3, 2019

I think we can merge and iterate from there, but there are still issues with semantics that can be discussed separately.

@reuvenlax
Contributor Author

@kanterov looking. I suspect my fix to the last issue you found is what caused this use case to break.

@reuvenlax
Contributor Author

@kanterov I believe the bug is actually in your test. You have:
Schema outputSchema = SelectHelpers.getOutputSchema(f3, fieldAccessDescriptor);
when it should be
Schema outputSchema = SelectHelpers.getOutputSchema(f4, fieldAccessDescriptor);

When I fix this, the test passes.

@reuvenlax
Contributor Author

Run Java PreCommit

@reuvenlax
Contributor Author

Run Java PreCommit

@reuvenlax
Contributor Author

Run Java PreCommit

@reuvenlax
Contributor Author

@kanterov I went ahead and changed array and map selects to distribute the select. This preserves the invariant that the selected field name is always the one that appears in the resulting schema, and I believe it is closer to what Spark does.

I still think we should have a thread on the dev list to discuss the detailed semantics, as we might still want to change them. However, I think that with this last commit we are probably closer to where we want to end up.
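
For completeness, a hypothetical sketch of the analogous map case (my example, not taken from the commit):

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;

// Input: { m: MAP<STRING, ROW{x: INT64, y: INT64}> }
Schema value = Schema.builder().addInt64Field("x").addInt64Field("y").build();
Schema input = Schema.builder()
    .addMapField("m", FieldType.STRING, FieldType.row(value))
    .build();

// Distributing the select over the map: selecting "m.x" keeps the selected field
// name in the output, with the expected schema { x: MAP<STRING, INT64> }.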

@reuvenlax
Contributor Author

@kanterov did you have any further comments on this PR?

@kanterov
Member

@reuvenlax sorry for the delay. LGTM 👍 Please feel free to merge when you are comfortable with it.

@reuvenlax reuvenlax merged commit 323fc9e into apache:master Apr 12, 2019