Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-2299 Normalize Avro Standard Canonical Schema. #452

Closed
wants to merge 24 commits into from
Closed

AVRO-2299 Normalize Avro Standard Canonical Schema. #452

wants to merge 24 commits into from

Conversation

rumeshkrish
Copy link

JIRA: https://issues.apache.org/jira/browse/AVRO-2299

This is new feature in JAVA component to normalise the Avro schema using canonical order and get the avro schema with reserved properties. The canonical form of schema should follow the below rules,

  • Canonical normaliser should filter and order the reserved as well as user given properties.
  • User able to normalise schema with additional user defined logical types.
  • reserved avro property ordering as below, followed by user given properties.

Avro Schema Reserved Keys:
"name", "type", "fields", "symbols", "items", "values", "logicalType", "size", "order", "doc", "aliases", "default"

As mentioned canonical form for avro types as followed by :-
Input :
{"type":"<primitive_types>", "x":"y"}
Output:
"<primitive_types>" //boolean, bytes, double, float, int, long, null, string
Canonical form for Union:
[ <avro Types>, ... ]
Canonical form for Record:

{
  "name": "---",
  "namespace": "---", // optional if input contain then present
  "type": "record",
  "fields": [
    {
      "name": "---",
      "types": ...,
      "order": "---", 
      "doc": "---", // optional if input contain then present
      "aliases": ["---", ...], // optional if input contain then present
      "default": ... // optional if input contain then present
    }
    ...
  ],
  "doc": "---", // optional if input contain then present
  "aliases": ["---", ...] // optional if input contain then present
}

Canonical form for Enum:

{
  "name": "---",
  "namespace": "---", // optional if input contain then present
  "type": "enum",
  "symbols": [ ... ],
  "doc": "---", // optional if input contain then present
  "aliases": ["---", ...] // optional if input contain then present
}

Canonical form for Fixed:

{
  "name": "---",
  "namespace": "---", // optional if input contain then present
  "type": "fixed",
  "size": ...,
  "doc": "---", // optional if input contain then present
  "aliases": ["---", ...] // optional if input contain then present
}

Canonical form for Logical Types:

{
  "type": "...", // <primitive types>
  "logicalType": "...." // reserved avro logical types or user registered logical types
  "precision": ..., // if logical type is decimal
  "scale": ... // if logical type is decimal then optional if input contain then present
}

@probot-autolabeler probot-autolabeler bot added the Java Pull Requests for Java binding label Feb 11, 2019
Copy link
Contributor

@cutting cutting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally had a chance to review this. Thanks for your patience.

@rumeshkrish
Copy link
Author

rumeshkrish commented Mar 5, 2019

Hi @doug Cutting, I have modified the code based on your review comments. Thanks a lot for your commitments. Updated the Spec for properties. The idea is to pass user defined properties for adding canonical form, user defined property should not be avro reserved properties.

@cutting
Copy link
Contributor

cutting commented Mar 5, 2019

There are a lot of whitespace changes in this which makes it difficult to review. Maybe revert and make only the required changes? If your editor makes whitespace changes automatically then you should disable that feature.

@cutting
Copy link
Contributor

cutting commented Mar 5, 2019 via email

@rumeshkrish
Copy link
Author

Thank you so much @cutting for your review comments. I refactor the changes with minimal modification and what is required without formatting. Hope this help to make review better way. Kindly let me know, if any information required from my end. I have provided the Specs for properties user defined.

The properties parameter used by user to defined property required as part of canonical normalization excluding avro reserved properties "name", "type", "fields", "symbols", "items", "values", "logicalType", "size", "order", "doc", "aliases", "default", so we remove avro specific property from user given properties. For example: size, format two properties given by user than size is avro reserved field, so we consider format is used specified property and then used to create canonical form for complex structure included this property.

Input Schema:
{ "type": "record", "name": "sample", "fields": [ {"name": "name", "type": "string", "format": "[a-zA-Z]", "xxx": "xzy"}, {"name": "id", "type": "int" } ] }
Canonical Form of Schema:
{"type":"record","name":"sample","fields": [{"name":"name","type":"string","format":"[a-zA-Z]"},{"name":"id","type":"int"}]}

Copy link
Contributor

@cutting cutting left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good except for the two javadoc issues below.

@tjwp
Copy link
Contributor

tjwp commented Mar 7, 2019

@rumeshkrish What do you think about changing the order of the properties to:

"name", "type", "fields", "symbols", "items", "values",  "size", "default", "aliases", "logicalType", "order", "doc"

This has the nice behavior that if you stop at "size" you have the Parsing Canonical Form. And if you stop at "logicalType" you have the previously proposed Resolution Canonical Form.

If this is conceptually extending the list of properties for the parsing canonical form, instead of mixing other properties in to the list, maybe it helps to simplify the implementation for other languages?

I'd also be in favor (https://issues.apache.org/jira/browse/AVRO-2299?focusedCommentId=16755375&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16755375) of adding a named Resolution Canonical Form to the implementation and the spec, but that could happen separately.

@rumeshkrish
Copy link
Author

rumeshkrish commented Mar 13, 2019

@cutting @tjwp i will work on Spec, kindly let me know , if you have any other review comments.

rumeshkrishnan added 4 commits March 14, 2019 10:02
AVRO-2299 Get Avro Standard Canonical Schema.
removed namespace from the avro reserved list.

AVRO-2299 Get Avro Standard Canonical Schema.
1. additional spec for used defined properties
2. toCanonicalForm function code fix for avoid mutating static states.
2. Code changes done based on review comments.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 3d8adfb.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit e0c6249.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f93aafe.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 4fe04d4.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 6326880.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 2f790f0.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f18a773.

AVRO-2299 Canonical Normalization code changes based on review comments and spec for properties.

AVRO-2299 fully qualified named aliases with sorted order  in Schema/Field.
AVRO-2299 Get Avro Standard Canonical Schema.
removed namespace from the avro reserved list.

AVRO-2299 Get Avro Standard Canonical Schema.
1. additional spec for used defined properties
2. toCanonicalForm function code fix for avoid mutating static states.
2. Code changes done based on review comments.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 3d8adfb.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit e0c6249.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f93aafe.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 4fe04d4.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 6326880.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 2f790f0.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f18a773.

AVRO-2299 Canonical Normalization code changes based on review comments and spec for properties.

AVRO-2299 fully qualified named aliases with sorted order  in Schema/Field.
# Conflicts:
#	lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
#	lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java
Copy link
Author

@rumeshkrish rumeshkrish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standard Canonical Form Spec added. Kindly review this @cutting .

<section>
  <title>Standard Canonical Form for Schemas</title>

  <p>One of defined way to normalize the avro schema using
    <em>Standard Canonical Form Transformation</em>. This involves
    stripping unwanted properties and maintain same canonical
    ordering. The canonical ordering involves ordering avro
    reserved properties followed by custom properties if mentioned while
    transforming. Normalization schema which helps to reduce the
    total memory size of schema (removed unwanted properties and whitespace)
    while transfer avro schema between two system and also reduce the parsing
    time for compatibility check and schema evolution.
  </p>

  <p><em>Standard Canonical Form</em> is a transformation of a schema
    into standard canonical ordered. It contains avro reserved properties 
    <code>"name", "type", "fields", "symbols", "items", "values",
      "logicalType", "size", "order", "doc", "aliases", "default"</code>
    and <em>other (custom properties)</em> schema properties.
  </p>

  <section>
    <title>Transforming into Standard Canonical Form</title>

    <p>Assuming an input schema (in JSON form) that's already
      UTF-8 text for a <em>valid</em> Avro schema (including all
      quotes as required by JSON), the following transformations
      will produce its Standard Canonical Form:</p>
    <ul>
      <li> [PRIMITIVES] Convert primitive schemas to their simple
        form (e.g., <code>int</code> instead of
        <code>{"type":"int"}</code>).</li>

      <li> [FULLNAMES] Replace short names with fullnames, using
        applicable namespaces to do so.  Then eliminate
        <code>namespace</code> attributes, which are now redundant.</li>

      <li> [STRIP] Keep only attributes that are relevant to
        reserved properties, which are:
        <code>type</code>, <code>name</code>,
        <code>fields</code>, <code>symbols</code>,
        <code>items</code>, <code>values</code>,
        <code>logicalType</code>, <code>size</code>,
        <code>order</code>, <code>doc</code>
        <code>aliases</code> and <code>default</code>.
        Strip all others user defined properties (e.g., <code>format</code>).</li>

      <li> [ORDER] Order the appearance of fields of JSON objects
        as follows: <code>name</code>, <code>type</code>,
        <code>fields</code>, <code>symbols</code>,
        <code>items</code>, <code>values</code>,
        <code>logicalType</code>, <code>size</code>,
        <code>order</code>, <code>doc</code>,
        <code>aliases</code>, <code>default</code>.
        For example, if an object has <code>type</code>,
        <code>name</code>, and <code>size</code> fields, then the
        <code>name</code> field should appear first, followed by the
        <code>type</code> and then the <code>size</code> fields.</li>

      <li> [STRINGS] For all JSON string literals in the schema
        text, replace any escaped characters (e.g., \uXXXX escapes)
        with their UTF-8 equivalents.</li>

      <li> [INTEGERS] Eliminate quotes around and any leading
        zeros in front of JSON integer literals (which appear in the
        <code>size</code> attributes of <code>fixed</code> schemas).</li>

      <li> [WHITESPACE] Eliminate all whitespace in JSON outside of string literals.</li>
    </ul>
  </section>

  <section>
    <title>Transforming with Custom Properties</title>

    <p>In addition to the standard canonical form transformation, including
      <em>custom</em> <code>Schema</code> or <code>Field</code> properties by
      passing the properties names while transforming.
      For example, if an object has <code>format</code>, <code>type</code>,
      <code>name</code>, and <code>size</code> fields, then the
      <code>name</code> field should appear first, followed by the
      <code>type</code>, <code>size</code> and then <code>format</code>
      (custom properties) fields.
    </p>
  </section>
</section>

@rumeshkrish
Copy link
Author

rumeshkrish commented Mar 15, 2019

what this mean ?

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#Build-times-out-because-no-output-was-received

The build has been terminated

@rumeshkrish
Copy link
Author

@cutting is it possible to add this new feature in 1.9.0 version ?

…va code (#487)

* AVRO-2353: Configure spotless-maven-plugin to auto format/validate Java code

* AVRO-2353: Adapt code formatter to Avro default style (2 spaces)

* AVRO-2353: Remove trailing spaces check from checkstyle

This is a bug of eclipse-jdt-formatter so we should disable the trailing
validation temporally. For more info see https://bugs.eclipse.org/bugs/show_bug.cgi?id=488229

* AVRO-2353: Auto-format existing code
@iemejia
Copy link
Member

iemejia commented Mar 26, 2019

We introduced automatic code formatting. For more info see the "How to contribute" page. This probably affected this PR can you please rebase it.

@cutting
Copy link
Contributor

cutting commented Mar 26, 2019 via email

dkulp and others added 10 commits March 28, 2019 14:01
AVRO-2299 Get Avro Standard Canonical Schema.
removed namespace from the avro reserved list.

AVRO-2299 Get Avro Standard Canonical Schema.
1. additional spec for used defined properties
2. toCanonicalForm function code fix for avoid mutating static states.
2. Code changes done based on review comments.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 3d8adfb.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit e0c6249.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f93aafe.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 4fe04d4.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 6326880.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 2f790f0.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f18a773.

AVRO-2299 Canonical Normalization code changes based on review comments and spec for properties.

AVRO-2299 fully qualified named aliases with sorted order  in Schema/Field.
AVRO-2299 Get Avro Standard Canonical Schema.
removed namespace from the avro reserved list.

AVRO-2299 Get Avro Standard Canonical Schema.
1. additional spec for used defined properties
2. toCanonicalForm function code fix for avoid mutating static states.
2. Code changes done based on review comments.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts based on https://github.com/apache/avro/pull/452/conflicts.

AVRO-2299 Get Avro Standard Canonical Schema.
resolved code based on git master conflicts.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 3d8adfb.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit e0c6249.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f93aafe.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 4fe04d4.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 6326880.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit 2f790f0.

Revert "AVRO-2299 Get Avro Standard Canonical Schema."

This reverts commit f18a773.

AVRO-2299 Canonical Normalization code changes based on review comments and spec for properties.

AVRO-2299 fully qualified named aliases with sorted order  in Schema/Field.
# Conflicts:
#	lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
#	lang/java/avro/src/test/java/org/apache/avro/TestSchemaNormalization.java
@rumeshkrish
Copy link
Author

Hi, Did any one have some time for reviewing the documentation ?

@Fokko
Copy link
Contributor

Fokko commented Oct 22, 2019

@rumeshkrish Something looks off with the PR. It looks like you're changing 631 files, can you fix this? Right now we're running into the same issue, and I believe that the default value should be part of the fingerprint as well.

@rumeshkrish
Copy link
Author

rumeshkrish commented Jan 20, 2020 via email

@rumeshkrish
Copy link
Author

@Fokko please find another pull request for the same #805

@rumeshkrish rumeshkrish closed this Feb 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Java Pull Requests for Java binding website
Projects
None yet
9 participants