AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase. #805

rumeshkrish · 2020-02-07T14:37:43Z

Jira

This is new feature in JAVA component to normalise the Avro schema using canonical order and get the avro schema with reserved properties. The canonical form of schema should follow the below rules,

Canonical normaliser should filter and order the reserved as well as user given properties.
User able to normalise schema with additional user defined logical types.
reserved avro property ordering as below, followed by user given properties.
JIRA issue : https://issues.apache.org/jira/browse/AVRO-2299
No additional dependency changes.
Normalisation new feature for Java only.

Tests

New unit test included in the file lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java

Documentation

Updated spec.xml
Updated java documentation.

Reference discussion:

Sending to Review : @cutting , @tjwp , @Fokko

…ache/avro.

prestona · 2020-03-03T23:36:13Z

@rumeshkrish I stumbled across AVRO-2299 when I realised that parsing canonical form strips too many properties from the schema to be useful for my use-case. I'm interested to try standard canonical form, so I pulled in these changes and tried generating toCanonicalForm on a few schemas. Overall everything worked as expected, but I did hit a couple of problems:

It looks like these changes introduce duplicate classes called TestStandardCanonicalSchema and TestCustomCanonicalSchema into lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java. Eyeballing these duplicates they appear to be identical to each other - so I just deleted one copy of each, and that seemed to resolve the problem.
Generating the standard canonical form for an enum schema doesn't appear to preserve the default property as I had expected. Here's the test code I wrote:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class App {
    public static void main(String[] args) throws IOException {
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(new File("./test.avro"));
        System.out.println(SchemaNormalization.toCanonicalForm(schema));
    }
}

And here's the contents of ./test.avro:

{
  "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
  "default": "HEARTS"
}

This generates the following output, which does not include the default:

{"name":"Suit","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"]}

It looks like the arm of the code that handles ENUM's is missing a check to see if the enum has a default value. By adding the following code to the end of this block, I was able to generate a default property in the output:

if (s.getEnumDefault() != null) {
    o.append(",\"default\":").append(JacksonUtils.toJsonNode(s.getEnumDefault()).toString());
}

prestona · 2020-03-04T00:18:54Z

Having played a little more, there also appears to be a discrepancy between the specification for standard canonical form and the handling of the decimal logical type. If I generate the standard canonical form for this schema:

{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"type":"bytes""logicalType":"decimal","precision":4,"scale":2}

The most obvious problem is that this is not valid JSON. However, if I fix this then there is still the problem that the spec. does not include precision or scale in the list of preserved attributes.

In this case, I think the current implementation is correct, and the specification needs to be amended. This is because stripping the precision attribute results in an invalid logical type (as it is a required field). Stripping the scale attribute could change the way the schema would de-serialize data as if this attribute is omitted then a default of 0 is assumed.

Experimenting with decimal logical types that are based on a fixed type also has some inconsistencies. Specifically, if I generate the standard canonical form for this schema:

{
  "type": "fixed",
  "size": 16,
  "name": "x",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"name":"x","type":"fixed","size":16}

This has discarded the logicalType attribute (which the definition of standard canonical form indicates should be preserved) as well as the other attributes of decimal which, arguably, should be present to avoid changing the interpretation of the schema.

cutting

This currently fails to compile.

It would be great if someone could carefully read the proposed changes to the specification. I don't have time for that today. Thanks!

cutting · 2020-04-07T17:33:51Z

lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java

+ }
+ }
+
+ @RunWith(Parameterized.class)


This class is a duplicate of the above and causes a compilation error.

duplicate class removed. Thanks @cutting .

cutting · 2020-04-07T17:34:14Z

lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java

+ }
+ }
+
+ @RunWith(Parameterized.class)


This is also a duplicate.

duplicate class removed. Thanks @cutting .

rumeshkrish · 2020-06-14T09:42:22Z

@prestona Thanks for your time to do some more testing. I have update the specification with following test cases.

<<INPUT {"namespace":"x.y.z", "type":"enum", "name":"foo", "doc":"foo bar", "symbols":["SPADES", "HEARTS", "DIAMONDS", "CLUBS"], "default": "HEARTS"}
<<canonical {"name":"x.y.z.foo","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"],"doc":"foo bar","default":"HEARTS"}

// 035
<<INPUT {"type":"bytes", "logicalType":"decimal", "precision": 4, "scale": 2}
<<canonical {"type":"bytes","logicalType":"decimal","precision":4,"scale":2}

// 036
<<INPUT {"type": "fixed", "size": 16, "name": "x", "logicalType": "decimal", "precision": 4, "scale": 2}
<<canonical {"name":"x","type":"fixed","size":16,"logicalType":"decimal","precision":4,"scale":2}

…logical type standard canonical form changes along with additional test cases.

rumeshkrish · 2020-06-14T09:49:23Z

@cutting @iemejia @Fokko It would be great to have this feature as part of 1.10.0 release. Kindly review this changes, if any support for fixing the bug or refactor, i am happy to contribute.

> mvn spotless:apply

RyanSkraba · 2020-06-15T10:59:18Z

Hello! My apologies for the late, late reply -- I haven't been following this issue very well! I guess my big question is whether this really needs to be part of the specification and SchemaNormalization.

There are lots of reasons we might want useful, custom code that do schema transformations, including customized normalizations. Stripping custom user properties and "non-essential" metadata is certainly one of those! But is the specification necessary or useful for portability between languages or versions?

What do you think about just adding the code and tools to do "custom normalizations" into the Java SDK and others without defining a new canonical form?

rumeshkrish · 2020-06-15T15:11:13Z

@RyanSkraba It would be great, if we could add Standard Normalizations into other languages or versions.

RyanSkraba

(More "Request Discussion" than "Request Changes"!)

This could be really important as a tool for working with schemas that have a longer lifetime (especially with versioning, evolution and registries), so I think it's worth getting it right.

Thanks for keeping on top of this feature -- I appreciate the work you've put into it!

RyanSkraba · 2020-06-17T15:51:50Z