Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase. #805

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

rumeshkrish
Copy link

Jira

This is new feature in JAVA component to normalise the Avro schema using canonical order and get the avro schema with reserved properties. The canonical form of schema should follow the below rules,

  • Canonical normaliser should filter and order the reserved as well as user given properties.
    User able to normalise schema with additional user defined logical types.
    reserved avro property ordering as below, followed by user given properties.
  • JIRA issue : https://issues.apache.org/jira/browse/AVRO-2299
  • No additional dependency changes.
  • Normalisation new feature for Java only.

Tests

  • New unit test included in the file lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java

Documentation

  • Updated spec.xml
  • Updated java documentation.

Reference discussion:

Sending to Review : @cutting , @tjwp , @Fokko

@prestona
Copy link

prestona commented Mar 3, 2020

@rumeshkrish I stumbled across AVRO-2299 when I realised that parsing canonical form strips too many properties from the schema to be useful for my use-case. I'm interested to try standard canonical form, so I pulled in these changes and tried generating toCanonicalForm on a few schemas. Overall everything worked as expected, but I did hit a couple of problems:

  1. It looks like these changes introduce duplicate classes called TestStandardCanonicalSchema and TestCustomCanonicalSchema into lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java. Eyeballing these duplicates they appear to be identical to each other - so I just deleted one copy of each, and that seemed to resolve the problem.

  2. Generating the standard canonical form for an enum schema doesn't appear to preserve the default property as I had expected. Here's the test code I wrote:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class App {
    public static void main(String[] args) throws IOException {
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(new File("./test.avro"));
        System.out.println(SchemaNormalization.toCanonicalForm(schema));
    }
}

And here's the contents of ./test.avro:

{
  "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
  "default": "HEARTS"
}

This generates the following output, which does not include the default:

{"name":"Suit","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"]}

It looks like the arm of the code that handles ENUM's is missing a check to see if the enum has a default value. By adding the following code to the end of this block, I was able to generate a default property in the output:

if (s.getEnumDefault() != null) {
    o.append(",\"default\":").append(JacksonUtils.toJsonNode(s.getEnumDefault()).toString());
}

@prestona
Copy link

prestona commented Mar 4, 2020

Having played a little more, there also appears to be a discrepancy between the specification for standard canonical form and the handling of the decimal logical type. If I generate the standard canonical form for this schema:

{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"type":"bytes""logicalType":"decimal","precision":4,"scale":2}

The most obvious problem is that this is not valid JSON. However, if I fix this then there is still the problem that the spec. does not include precision or scale in the list of preserved attributes.

In this case, I think the current implementation is correct, and the specification needs to be amended. This is because stripping the precision attribute results in an invalid logical type (as it is a required field). Stripping the scale attribute could change the way the schema would de-serialize data as if this attribute is omitted then a default of 0 is assumed.

Experimenting with decimal logical types that are based on a fixed type also has some inconsistencies. Specifically, if I generate the standard canonical form for this schema:

{
  "type": "fixed",
  "size": 16,
  "name": "x",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}

Then I get this:

{"name":"x","type":"fixed","size":16}

This has discarded the logicalType attribute (which the definition of standard canonical form indicates should be preserved) as well as the other attributes of decimal which, arguably, should be present to avoid changing the interpretation of the schema.

Copy link
Contributor

@cutting cutting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently fails to compile.

It would be great if someone could carefully read the proposed changes to the specification. I don't have time for that today. Thanks!

}
}

@RunWith(Parameterized.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is a duplicate of the above and causes a compilation error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate class removed. Thanks @cutting .

}
}

@RunWith(Parameterized.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also a duplicate.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate class removed. Thanks @cutting .

@rumeshkrish
Copy link
Author

@prestona Thanks for your time to do some more testing. I have update the specification with following test cases.

<<INPUT {"namespace":"x.y.z", "type":"enum", "name":"foo", "doc":"foo bar", "symbols":["SPADES", "HEARTS", "DIAMONDS", "CLUBS"], "default": "HEARTS"}
<<canonical {"name":"x.y.z.foo","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"],"doc":"foo bar","default":"HEARTS"}

// 035
<<INPUT {"type":"bytes", "logicalType":"decimal", "precision": 4, "scale": 2}
<<canonical {"type":"bytes","logicalType":"decimal","precision":4,"scale":2}

// 036
<<INPUT {"type": "fixed", "size": 16, "name": "x", "logicalType": "decimal", "precision": 4, "scale": 2}
<<canonical {"name":"x","type":"fixed","size":16,"logicalType":"decimal","precision":4,"scale":2}

…logical type standard canonical form changes along with additional test cases.
@rumeshkrish rumeshkrish requested a review from cutting June 14, 2020 09:46
@rumeshkrish
Copy link
Author

@cutting @iemejia @Fokko It would be great to have this feature as part of 1.10.0 release. Kindly review this changes, if any support for fixing the bug or refactor, i am happy to contribute.

@RyanSkraba
Copy link
Contributor

Hello! My apologies for the late, late reply -- I haven't been following this issue very well! I guess my big question is whether this really needs to be part of the specification and SchemaNormalization.

There are lots of reasons we might want useful, custom code that do schema transformations, including customized normalizations. Stripping custom user properties and "non-essential" metadata is certainly one of those! But is the specification necessary or useful for portability between languages or versions?

What do you think about just adding the code and tools to do "custom normalizations" into the Java SDK and others without defining a new canonical form?

@rumeshkrish
Copy link
Author

rumeshkrish commented Jun 15, 2020

@RyanSkraba It would be great, if we could add Standard Normalizations into other languages or versions.

Copy link
Contributor

@RyanSkraba RyanSkraba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(More "Request Discussion" than "Request Changes"!)

This could be really important as a tool for working with schemas that have a longer lifetime (especially with versioning, evolution and registries), so I think it's worth getting it right.

Thanks for keeping on top of this feature -- I appreciate the work you've put into it!

@@ -1310,6 +1310,92 @@
</ul>
</section>

<section>
<title>Standard Canonical Form for Schemas</title>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I proposed in the JIRA that we don't create a new specification for this form, and just consider getting the "plain" schema as an SDK tool issue.

If we go that way, you can ignore the comments on this file :D

as follows: <code>name</code>, <code>type</code>,
<code>fields</code>, <code>symbols</code>,
<code>items</code>, <code>values</code>,
<code>logicalType</code>, <code>size</code>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you switch these to keep the initial attributes the same as for Parsing Canonical Form: name, type, fields, symbols, items, values, size ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to define the order for user/custom properties? I'm tempted to say keep the 7 attributes from Parsing Canonical form in their order, followed by all other kept attributes in alphabetical order...


<li> [STRIP] Keep only attributes that are relevant to
reserved properties, which are:
<code>type</code>, <code>name</code>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd that the order specified here is different than below. It's also the case for Parsing Canonical Form documentation though...

A good question -- size is "sometimes" relevant, and sometimes ignored, for example. In the Java SDK we can add it as an attribute to a field but not an enum, and it's stripped from the field despite being a "kept" attribute. Is this a bug with canonical form in general? What do we want to happen with the Standard Canonical Form?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Size is only relevant for fixed, it should not be present in any other type.


<li> [INTEGERS] Eliminate quotes around and any leading
zeros in front of JSON integer literals (which appear in the
<code>size</code> attributes of <code>fixed</code> schemas).</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also appear in default attributes now. We also have floating point numbers that should be normalized.

JSON arrays should be fine, but JSON objects in defaults should probably have their fields ordered alphabetically.

<p><em>Standard Canonical Form</em> is a transformation of a schema
into standard canonical ordered. It contains only avro reserved
properties <code>"name", "type", "fields", "symbols", "items", "values",
"logicalType", "size", "order", "doc", "aliases", "default"</code>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth mentioning that any properties the are used to configure a logical type should also be kept ("scale", "precision" and user-defined logical types.)

As a consequence, when generating a canonical form that includes a user-defined LogicalType, all languages should have the same defined attributes for that logical type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we're keeping default, we should keep all of the attributes that might appear in a JSON object in the default.

@@ -17,17 +17,22 @@
*/
package org.apache.avro;

import org.apache.avro.util.internal.JacksonUtils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is currently Jackson-independent... I think the JSON is being written manually here to ensure that it's isolated from any specific JSON implementations, and the trend of the last few versions has been to reduce the dependency on Jackson in particular.

Is it possible to avoid introducing this class?

}

private static void setLogicalProps(Appendable o, LogicalType lt) throws IOException {
o.append(",\"").append(LogicalType.LOGICAL_TYPE_PROP).append("\":\"").append(lt.getName()).append("\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect user-defined logical type properties to work here! (But that could be left as a known issue as well...)


private static void setComplexProps(Appendable o, Schema s) throws IOException {
if (s.getDoc() != null && !s.getDoc().isEmpty())
o.append(",\"doc\":\"").append(s.getDoc()).append("\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be o.append(",\"doc\":").append(toJsonNode(s.getDoc()).toString());

if (f.order() != null)
o.append(",\"order\":\"").append(f.order().toString()).append("\"");
if (f.doc() != null)
o.append(",\"doc\":\"").append(f.doc()).append("\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be o.append(",\"doc\":").append(toJsonNode(f.doc()).toString());

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding website
Projects
None yet
6 participants