Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase. #805

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 86 additions & 0 deletions doc/src/content/xdocs/spec.xml
Original file line number Diff line number Diff line change
Expand Up @@ -1310,6 +1310,92 @@
</ul>
</section>

<section>
<title>Standard Canonical Form for Schemas</title>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I proposed in the JIRA that we don't create a new specification for this form, and just consider getting the "plain" schema as an SDK tool issue.

If we go that way, you can ignore the comments on this file :D


<p>One of defined way to normalize the avro schema using
<em>Standard Canonical Form Transformation</em>. This involves
stripping unwanted properties and maintain same canonical
ordering. The canonical ordering involves ordering avro
reserved properties followed by custom properties if mentioned while
transforming. Normalization schema which helps to reduce the
total memory size of schema (removed unwanted properties and whitespace)
while transfer avro schema between two system and also reduce the parsing
time for compatibility check and schema evolution.
</p>

<p><em>Standard Canonical Form</em> is a transformation of a schema
into standard canonical ordered. It contains only avro reserved
properties <code>"name", "type", "fields", "symbols", "items", "values",
"logicalType", "size", "order", "doc", "aliases", "default"</code>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth mentioning that any properties the are used to configure a logical type should also be kept ("scale", "precision" and user-defined logical types.)

As a consequence, when generating a canonical form that includes a user-defined LogicalType, all languages should have the same defined attributes for that logical type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if we're keeping default, we should keep all of the attributes that might appear in a JSON object in the default.

and <em>other (custom properties)</em> schema properties.
</p>

<section>
<title>Transforming into Standard Canonical Form</title>

<p>Assuming an input schema (in JSON form) that's already
UTF-8 text for a <em>valid</em> Avro schema (including all
quotes as required by JSON), the following transformations
will produce its Standard Canonical Form:</p>
<ul>
<li> [PRIMITIVES] Convert primitive schemas to their simple
form (e.g., <code>int</code> instead of
<code>{"type":"int"}</code>).</li>

<li> [FULLNAMES] Replace short names with fullnames, using
applicable namespaces to do so. Then eliminate
<code>namespace</code> attributes, which are now redundant.</li>

<li> [STRIP] Keep only attributes that are relevant to
reserved properties, which are:
<code>type</code>, <code>name</code>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's odd that the order specified here is different than below. It's also the case for Parsing Canonical Form documentation though...

A good question -- size is "sometimes" relevant, and sometimes ignored, for example. In the Java SDK we can add it as an attribute to a field but not an enum, and it's stripped from the field despite being a "kept" attribute. Is this a bug with canonical form in general? What do we want to happen with the Standard Canonical Form?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Size is only relevant for fixed, it should not be present in any other type.

<code>fields</code>, <code>symbols</code>,
<code>items</code>, <code>values</code>,
<code>logicalType</code>, <code>size</code>,
<code>order</code>, <code>doc</code>
<code>aliases</code> and <code>default</code>.
Strip all others user defined properties (e.g., <code>format</code>).</li>

<li> [ORDER] Order the appearance of fields of JSON objects
as follows: <code>name</code>, <code>type</code>,
<code>fields</code>, <code>symbols</code>,
<code>items</code>, <code>values</code>,
<code>logicalType</code>, <code>size</code>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you switch these to keep the initial attributes the same as for Parsing Canonical Form: name, type, fields, symbols, items, values, size ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to define the order for user/custom properties? I'm tempted to say keep the 7 attributes from Parsing Canonical form in their order, followed by all other kept attributes in alphabetical order...

<code>order</code>, <code>doc</code>,
<code>aliases</code>, <code>default</code>.
For example, if an object has <code>type</code>,
<code>name</code>, and <code>size</code> fields, then the
<code>name</code> field should appear first, followed by the
<code>type</code> and then the <code>size</code> fields.</li>

<li> [STRINGS] For all JSON string literals in the schema
text, replace any escaped characters (e.g., \uXXXX escapes)
with their UTF-8 equivalents.</li>

<li> [INTEGERS] Eliminate quotes around and any leading
zeros in front of JSON integer literals (which appear in the
<code>size</code> attributes of <code>fixed</code> schemas).</li>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also appear in default attributes now. We also have floating point numbers that should be normalized.

JSON arrays should be fine, but JSON objects in defaults should probably have their fields ordered alphabetically.


<li> [WHITESPACE] Eliminate all whitespace in JSON outside of string literals.</li>
</ul>
</section>

<section>
<title>Transforming with Custom Properties</title>

<p>In addition to the standard canonical form transformation, including
<em>custom</em> <code>Schema</code> or <code>Field</code> properties by
passing the properties names while transforming.
For example, if an object has <code>format</code>, <code>type</code>,
<code>name</code>, and <code>size</code> fields, then the
<code>name</code> field should appear first, followed by the
<code>type</code>, <code>size</code> and then <code>format</code>
(custom properties) fields.
</p>
</section>
</section>

<section id="schema_fingerprints">
<title>Schema Fingerprints</title>

Expand Down
107 changes: 97 additions & 10 deletions lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java
Original file line number Diff line number Diff line change
Expand Up @@ -17,17 +17,22 @@
*/
package org.apache.avro;

import org.apache.avro.util.internal.JacksonUtils;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is currently Jackson-independent... I think the JSON is being written manually here to ensure that it's isolated from any specific JSON implementations, and the trend of the last few versions has been to reduce the dependency on Jackson in particular.

Is it possible to avoid introducing this class?

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.HashMap;
import java.util.TreeSet;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
* Collection of static methods for generating the canonical form of schemas
* (see {@link #toParsingForm}) -- and fingerprints of canonical forms
* ({@link #fingerprint}).
* Collection of static methods for generating the parser canonical form of
* schemas (see {@link #toParsingForm}), standard canonical form of schemas (see
* {@link #toCanonicalForm}) with user defined properties and fingerprints of
* canonical forms ({@link #fingerprint}).
*/
public class SchemaNormalization {

Expand All @@ -38,9 +43,31 @@ private SchemaNormalization() {
* Returns "Parsing Canonical Form" of a schema as defined by Avro spec.
*/
public static String toParsingForm(Schema s) {
return toNormalizedForm(s, true, new LinkedHashSet<>());
}

/**
* Returns "Standard Canonical Form" of a schema as defined by Avro spec.
*/
public static String toCanonicalForm(Schema s) {
return toCanonicalForm(s, new LinkedHashSet<>());
}

/**
* Returns "Standard Canonical Form" of a schema as defined by Avro spec with
* additional user standard properties.
*/
public static String toCanonicalForm(Schema s, LinkedHashSet<String> properties) {
LinkedHashSet<String> reservedProperties = new LinkedHashSet<>(Arrays.asList("name", "type", "fields", "symbols",
"items", "values", "logicalType", "size", "order", "doc", "aliases", "default"));
properties.removeAll(reservedProperties);
return toNormalizedForm(s, false, properties);
}

private static String toNormalizedForm(Schema s, Boolean ps, LinkedHashSet<String> aps) {
try {
Map<String, String> env = new HashMap<>();
return build(env, s, new StringBuilder()).toString();
return build(env, s, new StringBuilder(), ps, aps).toString();
} catch (IOException e) {
// Shouldn't happen, b/c StringBuilder can't throw IOException
throw new RuntimeException(e);
Expand Down Expand Up @@ -103,12 +130,19 @@ public static long parsingFingerprint64(Schema s) {
return fingerprint64(toParsingForm(s).getBytes(StandardCharsets.UTF_8));
}

private static Appendable build(Map<String, String> env, Schema s, Appendable o) throws IOException {
private static Appendable build(Map<String, String> env, Schema s, Appendable o, Boolean ps,
LinkedHashSet<String> aps) throws IOException {
boolean firstTime = true;
Schema.Type st = s.getType();
LogicalType lt = null;
if (!ps)
lt = s.getLogicalType();
switch (st) {
default: // boolean, bytes, double, float, int, long, null, string
return o.append('"').append(st.getName()).append('"');
if (!ps && lt != null)
return writeLogicalType(s, lt, o, aps);
else
return o.append('"').append(st.getName()).append('"');

case UNION:
o.append('[');
Expand All @@ -117,17 +151,19 @@ private static Appendable build(Map<String, String> env, Schema s, Appendable o)
o.append(',');
else
firstTime = false;
build(env, b, o);
build(env, b, o, ps, aps);
}
return o.append(']');

case ARRAY:
case MAP:
o.append("{\"type\":\"").append(st.getName()).append("\"");
if (st == Schema.Type.ARRAY)
build(env, s.getElementType(), o.append(",\"items\":"));
build(env, s.getElementType(), o.append(",\"items\":"), ps, aps);
else
build(env, s.getValueType(), o.append(",\"values\":"));
build(env, s.getValueType(), o.append(",\"values\":"), ps, aps);
if (!ps)
setSimpleProps(o, s.getObjectProps(), aps); // adding the reserved property if not parser canonical schema
return o.append("}");

case ENUM:
Expand Down Expand Up @@ -160,14 +196,65 @@ private static Appendable build(Map<String, String> env, Schema s, Appendable o)
else
firstTime = false;
o.append("{\"name\":\"").append(f.name()).append("\"");
build(env, f.schema(), o.append(",\"type\":")).append("}");
build(env, f.schema(), o.append(",\"type\":"), ps, aps);
if (!ps)
setFieldProps(o, f, aps); // if standard canonical form then add reserved properties
o.append("}");
}
o.append("]");
}
if (!ps) {
setComplexProps(o, s);
setSimpleProps(o, s.getObjectProps(), aps);
} // adding the reserved property if not parser canonical schema
return o.append("}");
}
}

private static Appendable writeLogicalType(Schema s, LogicalType lt, Appendable o, LinkedHashSet<String> aps)
throws IOException {
o.append("{\"type\":\"").append(s.getType().getName()).append("\"");
o.append("\"").append(LogicalType.LOGICAL_TYPE_PROP).append("\":\"").append(lt.getName()).append("\"");
if (lt.getName().equals("decimal")) {
LogicalTypes.Decimal dlt = (LogicalTypes.Decimal) lt;
o.append(",\"precision\":").append(Integer.toString(dlt.getPrecision()));
if (dlt.getScale() != 0)
o.append(",\"scale\":").append(Integer.toString(dlt.getScale()));
}
// adding the reserved property
setSimpleProps(o, s.getObjectProps(), aps);
return o.append("}");
}

private static void setSimpleProps(Appendable o, Map<String, Object> schemaProps, LinkedHashSet<String> aps)
throws IOException {
for (String propKey : aps) {
if (schemaProps.containsKey(propKey)) {
String propValue = JacksonUtils.toJsonNode(schemaProps.get(propKey)).toString();
o.append(",\"").append(propKey).append("\":").append(propValue);
}
}
}

private static void setComplexProps(Appendable o, Schema s) throws IOException {
if (s.getDoc() != null && !s.getDoc().isEmpty())
o.append(",\"doc\":\"").append(s.getDoc()).append("\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be o.append(",\"doc\":").append(toJsonNode(s.getDoc()).toString());

if (s.getAliases() != null && !s.getAliases().isEmpty())
o.append(",\"aliases\":").append(JacksonUtils.toJsonNode(new TreeSet<String>(s.getAliases())).toString());
}

private static void setFieldProps(Appendable o, Schema.Field f, LinkedHashSet<String> aps) throws IOException {
if (f.order() != null)
o.append(",\"order\":\"").append(f.order().toString()).append("\"");
if (f.doc() != null)
o.append(",\"doc\":\"").append(f.doc()).append("\"");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be o.append(",\"doc\":").append(toJsonNode(f.doc()).toString());

if (!f.aliases().isEmpty())
o.append(",\"aliases\":").append(JacksonUtils.toJsonNode(new TreeSet<String>(f.aliases())).toString());
if (f.defaultVal() != null)
o.append(",\"default\":").append(JacksonUtils.toJsonNode(f.defaultVal()).toString());
setSimpleProps(o, f.getObjectProps(), aps);
}

final static long EMPTY64 = 0xc15d213aa4d7a795L;

/* An inner class ensures that FP_TABLE initialized only when needed. */
Expand Down
Loading