doc/src/content/xdocs/spec.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      https://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "https://forrest.apache.org/dtd/document-v20.dtd" [
  <!ENTITY % avro-entities PUBLIC "-//Apache//ENTITIES Avro//EN"
	   "../../../../build/avro.ent">
  %avro-entities;
]>
<document>
  <header>
    <title>Apache Avro&#8482; &AvroVersion; Specification</title>
  </header>
  <body>

    <section id="preamble">
      <title>Introduction</title>

      <p>This document defines Apache Avro.  It is intended to be the
        authoritative specification. Implementations of Avro must
        adhere to this document.
      </p>

    </section>

    <section id="schemas">
      <title>Schema Declaration</title>
      <p>A Schema is represented in <a href="ext:json">JSON</a> by one of:</p>
      <ul>
        <li>A JSON string, naming a defined type.</li>

        <li>A JSON object, of the form:

          <source>{"type": "<em>typeName</em>" ...<em>attributes</em>...}</source>

          where <em>typeName</em> is either a primitive or derived
          type name, as defined below.  Attributes not defined in this
          document are permitted as metadata, but must not affect
          the format of serialized data.
          </li>
        <li>A JSON array, representing a union of embedded types.</li>
      </ul>

      <section id="schema_primitive">
        <title>Primitive Types</title>
        <p>The set of primitive type names is:</p>
        <ul>
          <li><code>null</code>: no value</li>
          <li><code>boolean</code>: a binary value</li>
          <li><code>int</code>: 32-bit signed integer</li>
          <li><code>long</code>: 64-bit signed integer</li>
          <li><code>float</code>: single precision (32-bit) IEEE 754 floating-point number</li>
          <li><code>double</code>: double precision (64-bit) IEEE 754 floating-point number</li>
          <li><code>bytes</code>: sequence of 8-bit unsigned bytes</li>
          <li><code>string</code>: unicode character sequence</li>
        </ul>

        <p>Primitive types have no specified attributes.</p>

        <p>Primitive type names are also defined type names.  Thus, for
          example, the schema "string" is equivalent to:</p>

        <source>{"type": "string"}</source>

      </section>

      <section id="schema_complex">
        <title>Complex Types</title>

        <p>Avro supports six kinds of complex types: records, enums,
        arrays, maps, unions and fixed.</p>

        <section id="schema_record">
          <title>Records</title>

	  <p>Records use the type name "record" and support the following attributes:</p>
	  <ul>
	    <li><code>name</code>: a JSON string providing the name
	    of the record (required).</li>
	    <li><code>namespace</code>: a JSON string that qualifies the name (optional).</li>
	    <li><code>doc</code>: a JSON string providing documentation to the
	    user of this schema (optional).</li>
	    <li><code>aliases:</code> a JSON array of strings, providing
	      alternate names for this record (optional).</li>
	    <li><code>fields</code>: a JSON array, listing fields (required).
	    Each field is a JSON object with the following attributes:
	      <ul>
		<li><code>name</code>: a JSON string providing the name
		  of the field (required).</li>
		<li><code>doc</code>: a JSON string describing this field
                  for users (optional).</li>
		<li><code>type:</code> a <a href="#schemas">schema</a>, as defined above (required).</li>
		<li><code>default:</code> A default value for this
		  field, used when reading instances that lack this
		  field (optional).  Permitted values depend on the
		  field's schema type, according to the table below.
		  Default values for union fields correspond to the
		  first schema in the union. Default values for bytes
		  and fixed fields are JSON strings, where Unicode
		  code points 0-255 are mapped to unsigned 8-bit byte
		  values 0-255.
		  <table class="right">
		    <caption>field default values</caption>
		    <tr><th>avro type</th><th>json type</th><th>example</th></tr>
		    <tr><td>null</td><td>null</td><td>null</td></tr>
		    <tr><td>boolean</td><td>boolean</td><td>true</td></tr>
		    <tr><td>int,long</td><td>integer</td><td>1</td></tr>
		    <tr><td>float,double</td><td>number</td><td>1.1</td></tr>
		    <tr><td>bytes</td><td>string</td><td>"\u00FF"</td></tr>
		    <tr><td>string</td><td>string</td><td>"foo"</td></tr>
		    <tr><td>record</td><td>object</td><td>{"a": 1}</td></tr>
		    <tr><td>enum</td><td>string</td><td>"FOO"</td></tr>
		    <tr><td>array</td><td>array</td><td>[1]</td></tr>
		    <tr><td>map</td><td>object</td><td>{"a": 1}</td></tr>
		    <tr><td>fixed</td><td>string</td><td>"\u00ff"</td></tr>
		  </table>
		</li>
		<li><code>order:</code> specifies how this field
		  impacts sort ordering of this record (optional).
		  Valid values are "ascending" (the default),
		  "descending", or "ignore".  For more details on how
		  this is used, see the <a href="#order">sort
		  order</a> section below.</li>
		<li><code>aliases:</code> a JSON array of strings, providing
		  alternate names for this field (optional).</li>
	      </ul>
	    </li>
	  </ul>

	  <p>For example, a linked-list of 64-bit values may be defined with:</p>
	  <source>
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      // old name for this
  "fields" : [
    {"name": "value", "type": "long"},             // each element has a long
    {"name": "next", "type": ["null", "LongList"]} // optional next element
  ]
}
	  </source>
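	  <p>To illustrate the rule above for <code>bytes</code> and
	  <code>fixed</code> defaults, the following Python sketch maps a JSON
	  default string to byte values.  The helper name
	  <code>default_to_bytes</code> is hypothetical and not part of any Avro
	  library; the sketch relies on Python's ISO-8859-1 (latin-1) codec, which
	  maps code points 0-255 to the same byte values.</p>
	  <source><![CDATA[
```python
import json

def default_to_bytes(json_text):
    """Decode a JSON string default for a bytes/fixed field into raw bytes.

    Hypothetical helper for illustration only.
    """
    text = json.loads(json_text)
    if not isinstance(text, str):
        raise ValueError("bytes and fixed defaults must be JSON strings")
    # latin-1 maps code points 0-255 one-to-one onto byte values 0-255;
    # any code point above 255 raises UnicodeEncodeError.
    return text.encode("latin-1")

print(default_to_bytes('"\\u00FF"'))  # b'\xff'
```
]]></source>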
	</section>

        <section>
          <title>Enums</title>

	  <p>Enums use the type name "enum" and support the following
	  attributes:</p>
	  <ul>
	    <li><code>name</code>: a JSON string providing the name
	    of the enum (required).</li>
	    <li><code>namespace</code>: a JSON string that qualifies the name (optional).</li>
	    <li><code>aliases:</code> a JSON array of strings, providing
	      alternate names for this enum (optional).</li>
	    <li><code>doc</code>: a JSON string providing documentation to the
	    user of this schema (optional).</li>
	    <li><code>symbols</code>: a JSON array, listing symbols,
	    as JSON strings (required).  All symbols in an enum must
	    be unique; duplicates are prohibited.  Every symbol must
	    match the regular expression <code>[A-Za-z_][A-Za-z0-9_]*</code>
	    (the same requirement as for <a href="#names">names</a>).</li>
	    <li><code>default</code>: A default value for this
	      enumeration, used during resolution when the reader
	      encounters a symbol from the writer that isn't defined
	      in the reader's schema (optional).  The value provided
	      here must be a JSON string that's a member of
	      the <code>symbols</code> array.
	      See documentation on schema resolution for how this gets
	      used.</li>
	  </ul>
	  <p>For example, playing card suits might be defined with:</p>
	  <source>
{
  "type": "enum",
  "name": "Suit",
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}
	  </source>
	</section>

        <section>
          <title>Arrays</title>
          <p>Arrays use the type name <code>"array"</code> and support
          a single attribute:</p>
	  <ul>
            <li><code>items</code>: the schema of the array's items.</li>
	  </ul>
	  <p>For example, an array of strings is declared
	  with:</p>
    <source>
{
  "type": "array",
  "items" : "string",
  "default": []
}
    </source>
	</section>

        <section>
          <title>Maps</title>
          <p>Maps use the type name <code>"map"</code> and support
          one attribute:</p>
	  <ul>
            <li><code>values</code>: the schema of the map's values.</li>
	  </ul>
	  <p>Map keys are assumed to be strings.</p>
	  <p>For example, a map from string to long is declared
	  with:</p>
    <source>
{
  "type": "map",
  "values" : "long",
  "default": {}
}
    </source>
	</section>

        <section>
          <title>Unions</title>
          <p>Unions, as mentioned above, are represented using JSON
          arrays.  For example, <code>["null", "string"]</code>
          declares a schema which may be either a null or string.</p>
          <p>(Note that when a <a href="#schema_record">default
          value</a> is specified for a record field whose type is a
          union, the type of the default value must match the
          <em>first</em> element of the union.  Thus, for unions
          containing "null", the "null" is usually listed first, since
          the default value of such unions is typically null.)</p>
	  <p>Unions may not contain more than one schema with the same
	  type, except for the named types record, fixed and enum.  For
	  example, unions containing two array types or two map types
	  are not permitted, but two types with different names are
	  permitted.  (Names permit efficient resolution when reading
	  and writing unions.)</p>
	  <p>Unions may not immediately contain other unions.</p>
        </section>

        <section>
          <title>Fixed</title>
          <p>Fixed uses the type name <code>"fixed"</code> and supports
          the following attributes:</p>
	  <ul>
	    <li><code>name</code>: a string naming this fixed (required).</li>
	    <li><code>namespace</code>: a string that qualifies the name (optional).</li>
	    <li><code>aliases:</code> a JSON array of strings, providing
	      alternate names for this fixed (optional).</li>
            <li><code>size</code>: an integer, specifying the number
            of bytes per value (required).</li>
	  </ul>
	  <p>For example, a 16-byte quantity may be declared with:</p>
	  <source>{"type": "fixed", "size": 16, "name": "md5"}</source>
	</section>


      </section> <!-- end complex types -->

      <section id="names">
	<title>Names</title>
        <p>Records, enums and fixed are named types.  Each has
          a <em>fullname</em> that is composed of two parts:
          a <em>name</em> and a <em>namespace</em>.  Equality of names
          is defined on the fullname.</p>
	<p>The name portion of a fullname, record field names, and
	  enum symbols must:</p>
	<ul>
          <li>start with <code>[A-Za-z_]</code></li>
          <li>subsequently contain only <code>[A-Za-z0-9_]</code></li>
	</ul>
        <p>A namespace is a dot-separated sequence of such names.
        The empty string may also be used as a namespace to indicate the
        null namespace.
        Equality of names (including field names and enum symbols)
        as well as fullnames is case-sensitive.</p>
        <p>In record, enum and fixed definitions, the fullname is
        determined in one of the following ways:</p>
	<ul>
	  <li>A name and namespace are both specified.  For example,
	  one might use <code>"name": "X", "namespace":
	  "org.foo"</code> to indicate the
	  fullname <code>org.foo.X</code>.</li>
	  <li>A fullname is specified.  If the name specified contains
	  a dot, then it is assumed to be a fullname, and any
	  namespace also specified is ignored.  For example,
	  use <code>"name": "org.foo.X"</code> to indicate the
	  fullname <code>org.foo.X</code>.</li>
	  <li>A name only is specified, i.e., a name that contains no
	  dots.  In this case the namespace is taken from the most
	  tightly enclosing schema or protocol.  For example,
	  if <code>"name": "X"</code> is specified, and this occurs
	  within a field of the record definition
	  of <code>org.foo.Y</code>, then the fullname
	  is <code>org.foo.X</code>. If there is no enclosing
	  namespace then the null namespace is used.</li>
	</ul>
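	<p>The three cases above can be sketched in Python as follows.  The
	function name <code>fullname</code> and its parameters are
	illustrative assumptions, not an API of any Avro implementation;
	<code>enclosing_namespace</code> stands for the namespace of the most
	tightly enclosing schema or protocol, if any.</p>
	<source><![CDATA[
```python
def fullname(name, namespace=None, enclosing_namespace=None):
    """Return the fullname of a record, enum or fixed definition.

    Illustrative helper; None and "" both denote the null namespace.
    """
    if "." in name:
        # A dotted name is already a fullname; any namespace is ignored.
        return name
    # Prefer the explicitly specified namespace, else the enclosing one.
    ns = namespace if namespace is not None else enclosing_namespace
    return f"{ns}.{name}" if ns else name

print(fullname("X", namespace="org.foo"))            # org.foo.X
print(fullname("org.foo.X", namespace="ignored"))    # org.foo.X
print(fullname("X", enclosing_namespace="org.foo"))  # org.foo.X
print(fullname("X"))                                 # X
```
]]></source>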
	<p>References to previously defined names are as in the latter
	two cases above: if they contain a dot they are a fullname, if
	they do not contain a dot, the namespace is the namespace of
	the enclosing definition.</p>
	<p>Primitive type names have no namespace and their names may
	not be defined in any namespace.</p>
	<p> A schema or protocol may not contain multiple definitions
	of a fullname.  Further, a name must be defined before it is
	used ("before" in the depth-first, left-to-right traversal of
	the JSON parse tree, where the <code>types</code> attribute of
	a protocol is always deemed to come "before" the
	<code>messages</code> attribute.)
	</p>
      </section>

      <section>
	<title>Aliases</title>
	<p>Named types and fields may have aliases.  An implementation
        may optionally use aliases to map a writer's schema to the
        reader's.  This facilitates both schema evolution and
        the processing of disparate datasets.</p>
	<p>Aliases function by re-writing the writer's schema using
        aliases from the reader's schema.  For example, if the
        writer's schema was named "Foo" and the reader's schema is
        named "Bar" and has an alias of "Foo", then the implementation
        would act as though "Foo" were named "Bar" when reading.
        Similarly, if data was written as a record with a field named
        "x" and is read as a record with a field named "y" with alias
        "x", then the implementation would act as though "x" were
        named "y" when reading.</p>
	<p>A type alias may be specified either as a fully
        namespace-qualified name, or relative to the namespace of the
        name it is an alias for.  For example, if a type named "a.b" has
        aliases of "c" and "x.y", then the fully qualified names of
        its aliases are "a.c" and "x.y".</p>
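	<p>A minimal Python sketch of this qualification rule is shown below.
	The helper <code>resolve_alias</code> is a hypothetical name used only
	for illustration.</p>
	<source><![CDATA[
```python
def resolve_alias(alias, type_fullname):
    """Return the fully qualified form of a type alias (illustrative helper)."""
    if "." in alias:
        # Already namespace-qualified.
        return alias
    # Otherwise the alias is relative to the namespace of the aliased type.
    namespace, _, _ = type_fullname.rpartition(".")
    return f"{namespace}.{alias}" if namespace else alias

print(resolve_alias("c", "a.b"))    # a.c
print(resolve_alias("x.y", "a.b"))  # x.y
```
]]></source>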
      </section>

    </section> <!-- end schemas -->

    <section>
      <title>Data Serialization and Deserialization</title>

      <p>Binary encoded Avro data does not include type information or
      field names.  The benefit is that the serialized data is small, but
      as a result a schema must always be used in order to read Avro data
      correctly.  The best way to ensure that the schema is structurally
      identical to the one used to write the data is to use the exact same
      schema.</p>

      <p>Therefore, files or systems that store Avro data should always
      include the writer's schema for that data.  Avro-based remote procedure
      call (RPC) systems must also guarantee that remote recipients of data
      have a copy of the schema used to write that data.  In general, it is
      advisable that any reader of Avro data should use a schema that is
      the same (as defined more fully in
      <a href="#Parsing+Canonical+Form+for+Schemas">Parsing Canonical Form for
      Schemas</a>) as the schema that was used to write the data in order to
      deserialize it correctly. Deserializing data into a newer schema is
      accomplished by specifying an additional schema, the results of which are
      described in <a href="#Schema+Resolution">Schema Resolution</a>.</p>

      <p>In general, both serialization and deserialization proceed as a
      depth-first, left-to-right traversal of the schema, serializing or
      deserializing primitive types as they are encountered. Therefore, it is
      possible, though not advisable, to read Avro data with a schema that
      does not have the same Parsing Canonical Form as the schema with which
      the data was written. In order for this to work, the serialized primitive
      values must be compatible, in order value by value, with the items in the
      deserialization schema. For example, int and long are always serialized
      the same way, so an int could be deserialized as a long.  Since the
      compatibility of two schemas depends on both the data and the
      serialization format (e.g., binary is more permissive than JSON because
      JSON includes field names, and a long that is too large will overflow an
      int), it is simpler and more reliable to use schemas with identical
      Parsing Canonical Form.</p>

      <section>
	<title>Encodings</title>
	<p>Avro specifies two serialization encodings: binary and
	  JSON.  Most applications will use the binary encoding, as it
	  is smaller and faster.  But, for debugging and web-based
	  applications, the JSON encoding may sometimes be
	  appropriate.</p>
      </section>

      <section id="binary_encoding">
        <title>Binary Encoding</title>
        <p>Binary encoding does not include field names, self-contained
          information about the types of individual bytes, nor field or
          record separators. Therefore readers are wholly reliant on
          the schema used when the data was encoded.</p>

	<section id="binary_encode_primitive">
          <title>Primitive Types</title>
          <p>Primitive types are encoded in binary as follows:</p>
          <ul>
            <li><code>null</code> is written as zero bytes.</li>
            <li>a <code>boolean</code> is written as a single byte whose
              value is either <code>0</code> (false) or <code>1</code>
              (true).</li>
            <li><code>int</code> and <code>long</code> values are written
              using <a href="ext:vint">variable-length</a>
	      <a href="ext:zigzag">zig-zag</a> coding.  Some examples:
	      <table class="right">
		<tr><th>value</th><th>hex</th></tr>
		<tr><td><code> 0</code></td><td><code>00</code></td></tr>
		<tr><td><code>-1</code></td><td><code>01</code></td></tr>
		<tr><td><code> 1</code></td><td><code>02</code></td></tr>
		<tr><td><code>-2</code></td><td><code>03</code></td></tr>
		<tr><td><code> 2</code></td><td><code>04</code></td></tr>
		<tr><td colspan="2"><code>...</code></td></tr>
		<tr><td><code>-64</code></td><td><code>7f</code></td></tr>
		<tr><td><code> 64</code></td><td><code>&nbsp;80 01</code></td></tr>
		<tr><td colspan="2"><code>...</code></td></tr>
	      </table>
	    </li>
            <li>a <code>float</code> is written as 4 bytes. The float is
              converted into a 32-bit integer using a method equivalent
              to <a href="https://java.sun.com/javase/6/docs/api/java/lang/Float.html#floatToIntBits%28float%29">Java's floatToIntBits</a> and then encoded
              in little-endian format.</li>
            <li>a <code>double</code> is written as 8 bytes. The double
              is converted into a 64-bit integer using a method equivalent
              to <a href="https://java.sun.com/javase/6/docs/api/java/lang/Double.html#doubleToLongBits%28double%29">Java's
		doubleToLongBits</a> and then encoded in little-endian
              format.</li>
            <li><code>bytes</code> are encoded as
              a <code>long</code> followed by that many bytes of data.
            </li>
            <li>a <code>string</code> is encoded as
              a <code>long</code> followed by that many bytes of UTF-8
              encoded character data.
              <p>For example, the three-character string "foo" would
              be encoded as the long value 3 (encoded as
              hex <code>06</code>) followed by the UTF-8 encoding of
              'f', 'o', and 'o' (the hex bytes <code>66 6f
              6f</code>):
              </p>
              <source>06 66 6f 6f</source>
            </li>
          </ul>
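          <p>The primitive encodings listed above can be sketched in Python
          as follows.  This is a minimal illustration under stated
          assumptions, not a reference implementation: the function names are
          hypothetical, <code>encode_long</code> assumes its argument fits in
          64 signed bits, and <code>struct.pack</code> is used for the IEEE
          754 conversion (it does not necessarily canonicalize NaN values the
          way Java's <code>floatToIntBits</code> does).</p>
          <source><![CDATA[
```python
import struct

def encode_long(n):
    """Encode an int or long using zig-zag, variable-length coding."""
    # Zig-zag maps 0, -1, 1, -2, 2, ... onto 0, 1, 2, 3, 4, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte, low-order group first; a set high bit
    # means that more bytes follow.
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_boolean(value):
    return b"\x01" if value else b"\x00"

def encode_float(value):
    return struct.pack("<f", value)   # 4 bytes, little-endian

def encode_double(value):
    return struct.pack("<d", value)   # 8 bytes, little-endian

def encode_bytes(data):
    # Length as a long, followed by that many bytes of data.
    return encode_long(len(data)) + data

def encode_string(text):
    # The length counts UTF-8 bytes, not characters.
    return encode_bytes(text.encode("utf-8"))

print(encode_long(64).hex(" "))        # 80 01
print(encode_string("foo").hex(" "))   # 06 66 6f 6f
```
]]></source>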

	</section>


	<section id="binary_encode_complex">
          <title>Complex Types</title>
          <p>Complex types are encoded in binary as follows:</p>

          <section id="record_encoding">
            <title>Records</title>
	    <p>A record is encoded by encoding the values of its
	      fields in the order that they are declared.  In other
	      words, a record is encoded as just the concatenation of
	      the encodings of its fields.  Field values are encoded per
	      their schema.</p>
	    <p>For example, given the record schema</p>
	    <source>
{
  "type": "record",
  "name": "test",
  "fields" : [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"}
  ]
}
	    </source>
	    <p>an instance of this record whose <code>a</code> field has
	      value 27 (encoded as hex <code>36</code>) and
	      whose <code>b</code> field has value "foo" (encoded as hex
	      bytes <code>06 66 6f 6f</code>), would be encoded simply
	      as the concatenation of these, namely the hex byte
	      sequence:</p>
	    <source>36 06 66 6f 6f</source>
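	    <p>The concatenation rule can be sketched in Python as follows.
	    The helper names are hypothetical, and the sketch only handles
	    fields of type <code>long</code> and <code>string</code>, which is
	    enough for the example record above.</p>
	    <source><![CDATA[
```python
def encode_long(n):
    """Zig-zag, variable-length encoding of a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(text):
    data = text.encode("utf-8")
    return encode_long(len(data)) + data

# Only the primitive types needed by this example are supported.
ENCODERS = {"long": encode_long, "string": encode_string}

def encode_record(schema, value):
    """Concatenate the encodings of the fields in declaration order."""
    return b"".join(
        ENCODERS[field["type"]](value[field["name"]])
        for field in schema["fields"]
    )

schema = {
    "type": "record",
    "name": "test",
    "fields": [
        {"name": "a", "type": "long"},
        {"name": "b", "type": "string"},
    ],
}
print(encode_record(schema, {"a": 27, "b": "foo"}).hex(" "))  # 36 06 66 6f 6f
```
]]></source>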
	  </section>

          <section id="enum_encoding">
            <title>Enums</title>
            <p>An enum is encoded by an <code>int</code>, representing
              the zero-based position of the symbol in the schema.</p>
	    <p>For example, consider the enum:</p>
	    <source>
	      {"type": "enum", "name": "Foo", "symbols": ["A", "B", "C", "D"] }
	    </source>
	    <p>This would be encoded by an <code>int</code> between
	      zero and three, with zero indicating "A", and 3 indicating
	      "D".</p>
	  </section>


          <section id="array_encoding">
            <title>Arrays</title>
            <p>Arrays are encoded as a series of <em>blocks</em>.
              Each block consists of a <code>long</code> <em>count</em>
              value, followed by that many array items.  A block with
              count zero indicates the end of the array.  Each item is
              encoded per the array's item schema.</p>

            <p>If a block's count is negative, its absolute value is used,
              and the count is followed immediately by a <code>long</code>
              block <em>size</em> indicating the number of bytes in the
              block.  This block size permits fast skipping through data,
              e.g., when projecting a record to a subset of its fields.</p>

            <p>For example, given the array schema</p>
            <source>{"type": "array", "items": "long"}</source>
            <p>an array containing the items 3 and 27 could be encoded
              as the long value 2 (encoded as hex 04) followed by long
              values 3 and 27 (encoded as hex <code>06 36</code>)
              terminated by zero:</p>
            <source>04 06 36 00</source>

            <p>The blocked representation permits one to read and write
              arrays larger than can be buffered in memory, since one can
              start writing items without knowing the full length of the
              array.</p>
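            <p>The block structure can be sketched in Python for an array of
              longs as shown below.  The function names are hypothetical.  The
              encoder writes a single block with a positive count; the decoder
              also accepts multiple blocks and negative counts, in which case
              it reads and discards the block size, since that value is only
              needed when skipping.</p>
            <source><![CDATA[
```python
import io

def encode_long(n):
    """Zig-zag, variable-length encoding of a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def decode_long(stream):
    """Read one zig-zag, variable-length long from a binary stream."""
    z = 0
    shift = 0
    while True:
        byte = stream.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def encode_long_array(items):
    """Encode a list of longs as one block plus the zero terminator."""
    out = bytearray()
    if items:
        out += encode_long(len(items))
        for item in items:
            out += encode_long(item)
    out += encode_long(0)  # a count of zero ends the array
    return bytes(out)

def decode_long_array(stream):
    items = []
    while True:
        count = decode_long(stream)
        if count == 0:
            return items
        if count < 0:
            count = -count
            decode_long(stream)  # block size in bytes, unused when not skipping
        for _ in range(count):
            items.append(decode_long(stream))

print(encode_long_array([3, 27]).hex(" "))  # 04 06 36 00
```
]]></source>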

          </section>

	  <section id="map_encoding">
            <title>Maps</title>
            <p>Maps are encoded as a series of <em>blocks</em>.  Each
              block consists of a <code>long</code> <em>count</em>
              value, followed by that many key/value pairs.  A block
              with count zero indicates the end of the map.  Each key
              is encoded as a <code>string</code>, and each value is
              encoded per the map's value schema.</p>

            <p>If a block's count is negative, its absolute value is used,
              and the count is followed immediately by a <code>long</code>
              block <em>size</em> indicating the number of bytes in the
              block.  This block size permits fast skipping through data,
              e.g., when projecting a record to a subset of its fields.</p>

            <p>The blocked representation permits one to read and write
              maps larger than can be buffered in memory, since one can
              start writing items without knowing the full length of the
              map.</p>
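            <p>As an illustration, the following Python sketch encodes a map
              from string to long as a single block followed by the zero
              terminator.  The function names are hypothetical, and keys are
              written as Avro strings ahead of each value.</p>
            <source><![CDATA[
```python
def encode_long(n):
    """Zig-zag, variable-length encoding of a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(text):
    data = text.encode("utf-8")
    return encode_long(len(data)) + data

def encode_long_map(mapping):
    """Encode a dict of str -> int using one block and a terminator."""
    out = bytearray()
    if mapping:
        out += encode_long(len(mapping))      # count of key/value pairs
        for key, value in mapping.items():
            out += encode_string(key)         # keys are always strings
            out += encode_long(value)         # values follow the value schema
    out += encode_long(0)                     # a count of zero ends the map
    return bytes(out)

print(encode_long_map({"a": 1}).hex(" "))  # 02 02 61 02 00
```
]]></source>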

	  </section>

          <section id="union_encoding">
            <title>Unions</title>
            <p>A union is encoded by first writing a <code>long</code>
              value indicating the zero-based position within the
              union of the schema of its value.  The value is then
              encoded per the indicated schema within the union.</p>
            <p>For example, the union
              schema <code>["null","string"]</code> would encode:</p>
            <ul>
              <li><code>null</code> as zero (the index of "null" in the union):
                <source>00</source></li>
              <li>the string <code>"a"</code> as one (the index of
                "string" in the union, encoded as hex <code>02</code>),
                followed by the serialized string:
                <source>02 02 61</source></li>
            </ul>
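            <p>The two cases above can be reproduced with the following
              Python sketch.  The helper <code>encode_union</code> is
              hypothetical and only supports the <code>null</code> and
              <code>string</code> branches used in this example.</p>
            <source><![CDATA[
```python
def encode_long(n):
    """Zig-zag, variable-length encoding of a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(text):
    data = text.encode("utf-8")
    return encode_long(len(data)) + data

def encode_union(branches, value):
    """Write the branch index, then the value per that branch's schema."""
    if value is None:
        index, payload = branches.index("null"), b""  # null is zero bytes
    elif isinstance(value, str):
        index, payload = branches.index("string"), encode_string(value)
    else:
        raise TypeError("this sketch only supports null and string branches")
    return encode_long(index) + payload

branches = ["null", "string"]
print(encode_union(branches, None).hex(" "))  # 00
print(encode_union(branches, "a").hex(" "))   # 02 02 61
```
]]></source>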
          </section>

          <section id="fixed_encoding">
            <title>Fixed</title>
            <p>Fixed instances are encoded using the number of bytes
              declared in the schema.</p>
          </section>

        </section> <!-- end complex types -->

      </section>

      <section id="json_encoding">
        <title>JSON Encoding</title>

        <p>Except for unions, the JSON encoding is the same as is used
        to encode <a href="#schema_record">field default
        values</a>.</p>

        <p>The value of a union is encoded in JSON as follows:</p>

        <ul>
          <li>if its type is <code>null</code>, then it is encoded as
          a JSON null;</li>
          <li>otherwise it is encoded as a JSON object with one
          name/value pair whose name is the type's name and whose
          value is the recursively encoded value.  For Avro's named
          types (record, fixed or enum) the user-specified name is
          used, for other types the type name is used.</li>
        </ul>

        <p>For example, the union
          schema <code>["null","string","Foo"]</code>, where Foo is a
          record name, would encode:</p>
        <ul>
          <li><code>null</code> as <code>null</code>;</li>
          <li>the string <code>"a"</code> as
            <code>{"string": "a"}</code>; and</li>
          <li>a Foo instance as <code>{"Foo": {...}}</code>,
          where <code>{...}</code> indicates the JSON encoding of a
          Foo instance.</li>
        </ul>
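        <p>A small Python sketch of the union rule follows.  The helper
          <code>json_encode_union</code> is hypothetical; it assumes the
          caller already knows the name of the selected branch and has
          already JSON-encoded the value itself.</p>
        <source><![CDATA[
```python
import json

def json_encode_union(branch_name, encoded_value):
    """Wrap an already-encoded value according to the union rule."""
    if branch_name == "null":
        return None                       # null is encoded as a JSON null
    # Any other branch becomes a single-pair object keyed by the type's name.
    return {branch_name: encoded_value}

print(json.dumps(json_encode_union("null", None)))   # null
print(json.dumps(json_encode_union("string", "a")))  # {"string": "a"}
```
]]></source>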

        <p>Note that the original schema is still required to correctly
        process JSON-encoded data.  For example, the JSON encoding does not
        distinguish between <code>int</code>
        and <code>long</code>, <code>float</code>
        and <code>double</code>, records and maps, enums and strings,
        etc.</p>

      </section>

      <section id="single_object_encoding">
        <title>Single-object encoding</title>

        <p>In some situations a single Avro serialized object is to be stored for a
        longer period of time. One very common example is storing Avro records
        for several weeks in an <a href="https://kafka.apache.org/">Apache Kafka</a> topic.</p>
        <p>In the period after a schema change this persistence system will contain records
        that have been written with different schemas. So the need arises to know which schema
        was used to write a record to support schema evolution correctly.
        In most cases the schema itself is too large to include in the message,
        so this binary wrapper format supports the use case more effectively.</p>

        <section id="single_object_encoding_spec">
          <title>Single object encoding specification</title>
          <p>Single Avro objects are encoded as follows:</p>
          <ol>
            <li>A two-byte marker, <code>C3 01</code>, to show that the message is Avro and uses this single-record format (version 1).</li>
            <li>The 8-byte little-endian CRC-64-AVRO <a href="#schema_fingerprints">fingerprint</a> of the object's schema.</li>
            <li>The Avro object encoded using <a href="#binary_encoding">Avro's binary encoding</a>.</li>
          </ol>
        </section>

        <p>Implementations use the 2-byte marker to determine whether a payload is Avro.
          This check helps avoid expensive lookups that resolve the schema from a
          fingerprint, when the message is not an encoded Avro payload.</p>
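        <p>The framing can be sketched in Python as shown below.  The function
          names are hypothetical, the fingerprint is taken as an
          already-computed unsigned 64-bit integer (its computation is not
          shown here), and the fingerprint value used in the usage lines is
          made up for illustration.</p>
        <source><![CDATA[
```python
MARKER = b"\xc3\x01"  # two-byte marker, format version 1

def encode_single_object(fingerprint, payload):
    """Prefix a binary-encoded object with the marker and fingerprint."""
    return MARKER + fingerprint.to_bytes(8, "little") + payload

def split_single_object(message):
    """Return (fingerprint, payload), or raise if the marker is absent."""
    if len(message) < 10 or message[:2] != MARKER:
        # Cheap check that avoids a schema lookup for non-Avro payloads.
        raise ValueError("not an Avro single-object encoded message")
    return int.from_bytes(message[2:10], "little"), message[10:]

# 0x0102030405060708 is a made-up fingerprint; 36 is the long value 27.
message = encode_single_object(0x0102030405060708, b"\x36")
print(message.hex(" "))  # c3 01 08 07 06 05 04 03 02 01 36
```
]]></source>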

      </section>

    </section>

    <section id="order">
      <title>Sort Order</title>

      <p>Avro defines a standard sort order for data.  This permits
        data written by one system to be efficiently sorted by another
        system.  This can be an important optimization, as sort order
        comparisons are sometimes the most frequent per-object
        operation.  Note also that Avro binary-encoded data can be
        efficiently ordered without deserializing it to objects.</p>

      <p>Data items may only be compared if they have identical
        schemas.  Pairwise comparisons are implemented recursively
        with a depth-first, left-to-right traversal of the schema.
        The first mismatch encountered determines the order of the
        items.</p>

      <p>Two items with the same schema are compared according to the
        following rules.</p>
      <ul>
        <li><code>null</code> data is always equal.</li>
        <li><code>boolean</code> data is ordered with false before true.</li>
        <li><code>int</code>, <code>long</code>, <code>float</code>
          and <code>double</code> data is ordered by ascending numeric
          value.</li>
        <li><code>bytes</code> and <code>fixed</code> data are
          compared lexicographically by unsigned 8-bit values.</li>
        <li><code>string</code> data is compared lexicographically by
          Unicode code point.  Note that since UTF-8 is used as the
          binary encoding for strings, sorting of bytes and string
          binary data is identical.</li>
        <li><code>array</code> data is compared lexicographically by
          element.</li>
        <li><code>enum</code> data is ordered by the symbol's position
          in the enum schema.  For example, an enum whose symbols are
          <code>["z", "a"]</code> would sort <code>"z"</code> values
          before <code>"a"</code> values.</li>
        <li><code>union</code> data is first ordered by the branch
          within the union, and, within that, by the type of the
          branch.  For example, an <code>["int", "string"]</code>
          union would order all int values before all string values,
          with the ints and strings themselves ordered as defined
          above.</li>
        <li><code>record</code> data is ordered lexicographically by
          field.  If a field specifies that its order is:
          <ul>
            <li><code>"ascending"</code>, then the order of its values
              is unaltered.</li>
            <li><code>"descending"</code>, then the order of its values
              is reversed.</li>
            <li><code>"ignore"</code>, then its values are ignored
              when sorting.</li>
          </ul>
        </li>
        <li><code>map</code> data may not be compared.  It is an error
          to attempt to compare data containing maps unless those maps
          are in an <code>"order":"ignore"</code> record field.
        </li>
      </ul>
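      <p>Some of the rules above can be sketched in Python over decoded
        values, as shown below.  This is an illustrative subset only: the
        function name <code>compare</code> is hypothetical, schemas are plain
        Python dicts or type-name strings, unions are omitted because the
        sketch does not track the selected branch, and maps are rejected.  It
        relies on Python comparing <code>str</code> values by code point and
        <code>bytes</code> values by unsigned byte.</p>
      <source><![CDATA[
```python
SCALARS = ("boolean", "int", "long", "float", "double",
           "bytes", "fixed", "string")

def compare(schema, a, b):
    """Return a negative, zero or positive number ordering a against b."""
    kind = schema if isinstance(schema, str) else schema["type"]
    if kind == "null":
        return 0                              # null data is always equal
    if kind in SCALARS:
        return (a > b) - (a < b)              # False sorts before True
    if kind == "enum":
        symbols = schema["symbols"]           # order by position, not by name
        return symbols.index(a) - symbols.index(b)
    if kind == "array":
        for x, y in zip(a, b):
            result = compare(schema["items"], x, y)
            if result:
                return result
        return len(a) - len(b)                # a proper prefix sorts first
    if kind == "record":
        for field in schema["fields"]:
            order = field.get("order", "ascending")
            if order == "ignore":
                continue
            name = field["name"]
            result = compare(field["type"], a[name], b[name])
            if result:
                return -result if order == "descending" else result
        return 0
    raise TypeError("cannot compare data of type %r in this sketch" % kind)

enum_schema = {"type": "enum", "name": "E", "symbols": ["z", "a"]}
print(compare(enum_schema, "z", "a") < 0)  # True
```
]]></source>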
    </section>

    <section>
      <title>Object Container Files</title>
      <p>Avro includes a simple object container file format.  A file
      has a schema, and all objects stored in the file must be written
      according to that schema, using binary encoding.  Objects are
      stored in blocks that may be compressed.  Synchronization markers
      are used between blocks to permit efficient splitting of files
      for MapReduce processing.</p>

      <p>Files may include arbitrary user-specified metadata.</p>

      <p>A file consists of:</p>
      <ul>
        <li>A <em>file header</em>, followed by</li>
        <li>one or more <em>file data blocks</em>.</li>
      </ul>

      <p>A file header consists of:</p>
      <ul>
        <li>Four bytes, ASCII 'O', 'b', 'j', followed by 1.</li>
        <li><em>file metadata</em>, including the schema.</li>
        <li>The 16-byte, randomly-generated sync marker for this file.</li>
      </ul>

      <p>File metadata is written as if defined by the following <a
      href="#map_encoding">map</a> schema:</p>
      <source>{"type": "map", "values": "bytes"}</source>

      <p>All metadata properties that start with "avro." are reserved.
      The following file metadata properties are currently used:</p>
      <ul>
        <li><strong>avro.schema</strong> contains the schema of objects
        stored in the file, as JSON data (required).</li>
        <li><strong>avro.codec</strong> the name of the compression codec
        used to compress blocks, as a string.  Implementations
        are required to support the following codecs: "null" and "deflate".
        If codec is absent, it is assumed to be "null".  The codecs
        are described with more detail below.</li>
      </ul>

      <p>A file header is thus described by the following schema:</p>
      <source>
{"type": "record", "name": "org.apache.avro.file.Header",
 "fields" : [
   {"name": "magic", "type": {"type": "fixed", "name": "Magic", "size": 4}},
   {"name": "meta", "type": {"type": "map", "values": "bytes"}},
   {"name": "sync", "type": {"type": "fixed", "name": "Sync", "size": 16}}
  ]
}
      </source>
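
      <p>As a rough sketch of how such a header could be serialized, the
        following Python code writes the magic bytes, the metadata map and
        the sync marker.  The helper names (<code>encode_long</code>,
        <code>encode_meta</code>, <code>encode_header</code>) are
        illustrative only and not part of any Avro API; the sketch writes
        the metadata as a single map block followed by the zero-count end
        marker.</p>
      <source><![CDATA[
```python
import os


def encode_long(n):
    """Avro long: zig-zag mapping, then variable-length base-128 bytes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(data):
    """Avro bytes (and string): a long length followed by the raw bytes."""
    return encode_long(len(data)) + data


def encode_meta(meta):
    """Encode a str -> bytes map as one block plus the zero-count end marker."""
    out = bytearray()
    if meta:
        out += encode_long(len(meta))
        for key, value in meta.items():
            out += encode_bytes(key.encode("utf-8"))
            out += encode_bytes(value)
    out += encode_long(0)
    return bytes(out)


def encode_header(schema_json, codec="null", sync=None):
    """Build a file header: magic, metadata map, 16-byte sync marker."""
    sync = sync if sync is not None else os.urandom(16)
    meta = {
        "avro.schema": schema_json.encode("utf-8"),
        "avro.codec": codec.encode("utf-8"),
    }
    return b"Obj\x01" + encode_meta(meta) + sync


# A fixed all-zero sync marker is used here only to make the result repeatable.
header = encode_header('"string"', sync=bytes(16))
```
]]></source>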

      <p>A file data block consists of:</p>
      <ul>
        <li>A long indicating the count of objects in this block.</li>
        <li>A long indicating the size in bytes of the serialized objects
        in the current block, after any codec is applied.</li>
        <li>The serialized objects.  If a codec is specified, this is
        compressed by that codec.</li>
        <li>The file's 16-byte sync marker.</li>
      </ul>
      <p>Thus, each block's binary data can be efficiently extracted or skipped without
      deserializing the contents.  The combination of block size, object counts, and
      sync markers enables detection of corrupt blocks and helps ensure data integrity.</p>
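      <p>A minimal sketch of this skipping behaviour, assuming the
        "null" codec and hypothetical helper names
        (<code>write_block</code>, <code>skip_block</code>): the reader
        decodes only the two longs, seeks past the payload, and checks
        the sync marker.</p>
      <source><![CDATA[
```python
import io


def encode_long(n):
    """Avro long: zig-zag mapping, then variable-length base-128 bytes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream):
    """Read one zig-zag, variable-length long from a binary stream."""
    shift = 0
    z = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated long")
        z |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)


def write_block(out, count, payload, sync):
    """Write object count, payload size, payload and sync marker."""
    out.write(encode_long(count))
    out.write(encode_long(len(payload)))
    out.write(payload)
    out.write(sync)


def skip_block(stream, sync):
    """Skip one block without decoding its objects; return the object count."""
    count = decode_long(stream)
    size = decode_long(stream)
    stream.seek(size, io.SEEK_CUR)
    if stream.read(16) != sync:
        raise ValueError("sync marker mismatch: block is corrupt")
    return count


sync = bytes(range(16))
buf = io.BytesIO()
write_block(buf, 3, b"abcdef", sync)
write_block(buf, 1, b"z", sync)
buf.seek(0)
counts = [skip_block(buf, sync), skip_block(buf, sync)]
```
]]></source>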
      <section>
      <title>Required Codecs</title>
        <section>
        <title>null</title>
        <p>The "null" codec simply passes through data uncompressed.</p>
        </section>

        <section>
        <title>deflate</title>
        <p>The "deflate" codec writes the data block using the
        deflate algorithm as specified in
        <a href="https://www.isi.edu/in-notes/rfc1951.txt">RFC 1951</a>,
        and typically implemented using the zlib library.  Note that this
        format (unlike the "zlib format" in RFC 1950) does not have a
        checksum.
        </p>
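        <p>For illustration, Python's <code>zlib</code> module can produce
        and consume a raw RFC 1951 stream when given a negative
        <code>wbits</code> value.  The function names below are
        hypothetical; this is a sketch rather than a reference
        implementation of the codec.</p>
        <source><![CDATA[
```python
import zlib


def deflate_raw(data, level=6):
    """Compress to a raw deflate stream (no zlib header, no checksum)."""
    # A negative wbits value selects raw RFC 1951 output.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


def inflate_raw(data):
    """Decompress a raw deflate stream."""
    return zlib.decompress(data, -15)


payload = b"avro " * 100
packed = deflate_raw(payload)
```
]]></source>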
        </section>
      </section>
      <section>
	<title>Optional Codecs</title>
        <section>
          <title>snappy</title>
          <p>The "snappy" codec uses
            Google's <a href="https://code.google.com/p/snappy/">Snappy</a>
            compression library.  Each compressed block is followed
            by the 4-byte, big-endian CRC32 checksum of the
            uncompressed data in the block.</p>
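          <p>The checksum suffix can be sketched as follows.  Snappy
            itself is not in the Python standard library, so this
            example assumes the compressed bytes were produced
            elsewhere and only shows how the trailing checksum is
            formed; <code>append_checksum</code> is a hypothetical
            name.</p>
          <source><![CDATA[
```python
import struct
import zlib


def append_checksum(compressed, uncompressed):
    """Append the 4-byte, big-endian CRC32 of the uncompressed data."""
    crc = zlib.crc32(uncompressed) & 0xFFFFFFFF
    return compressed + struct.pack(">I", crc)


# With an empty "compressed" part, only the checksum bytes remain.
suffix = append_checksum(b"", b"123456789")
```
]]></source>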
        </section>
      </section>
    </section>

    <section>
      <title>Protocol Declaration</title>
      <p>Avro protocols describe RPC interfaces.  Like schemas, they are
      defined with JSON text.</p>

      <p>A protocol is a JSON object with the following attributes:</p>
      <ul>
        <li><em>protocol</em>, a string, the name of the protocol
        (required);</li>
        <li><em>namespace</em>, an optional string that qualifies the name;</li>
        <li><em>doc</em>, an optional string describing this protocol;</li>
        <li><em>types</em>, an optional list of definitions of named types
          (records, enums, fixed and errors).  An error definition is
          just like a record definition except it uses "error" instead
          of "record".  Note that forward references to named types
          are not permitted.</li>
        <li><em>messages</em>, an optional JSON object whose keys are
          message names and whose values are objects whose attributes
          are described below.  No two messages may have the same
          name.</li>
      </ul>
      <p>The name and namespace qualification rules defined for schema objects
	apply to protocols as well.</p>

      <section>
        <title>Messages</title>
        <p>A message has attributes:</p>
        <ul>
          <li>a <em>doc</em>, an optional description of the message;</li>
          <li>a <em>request</em>, a list of named,
            typed <em>parameter</em> schemas (this has the same form
            as the fields of a record declaration);</li>
          <li>a <em>response</em> schema; </li>
          <li>an optional union of declared <em>error</em> schemas.
	    The <em>effective</em> union has <code>"string"</code>
	    prepended to the declared union, to permit transmission of
	    undeclared "system" errors.  For example, if the declared
	    error union is <code>["AccessError"]</code>, then the
	    effective union is <code>["string", "AccessError"]</code>.
	    When no errors are declared, the effective error union
	    is <code>["string"]</code>.  Errors are serialized using
	    the effective union; however, a protocol's JSON
	    declaration contains only the declared union.
	  </li>
          <li>an optional <em>one-way</em> boolean parameter.</li>
        </ul>
        <p>A request parameter list is processed equivalently to an
          anonymous record.  Since record field lists may vary between
          reader and writer, request parameters may also differ
          between the caller and responder, and such differences are
          resolved in the same manner as record field differences.</p>
	<p>The one-way parameter may only be true when the response type
	  is <code>"null"</code> and no errors are listed.</p>
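        <p>As a small illustration of these rules, the following sketch
          derives the effective error union and checks the one-way
          constraint for a message given as a parsed JSON object.  The
          function names are hypothetical.</p>
        <source><![CDATA[
```python
def effective_errors(message):
    """Return the effective error union: "string" plus the declared errors."""
    return ["string"] + list(message.get("errors", []))


def check_one_way(message):
    """Raise ValueError if "one-way" is true but the message could reply."""
    if message.get("one-way", False):
        if message.get("response") != "null" or message.get("errors"):
            raise ValueError("one-way requires a null response and no errors")


declared = {"request": [], "response": "null", "errors": ["AccessError"]}
union = effective_errors(declared)
```
]]></source>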
      </section>
      <section>
        <title>Sample Protocol</title>
        <p>For example, one may define a simple HelloWorld protocol with:</p>
        <source>
{
  "namespace": "com.acme",
  "protocol": "HelloWorld",
  "doc": "Protocol Greetings",

  "types": [
    {"name": "Greeting", "type": "record", "fields": [
      {"name": "message", "type": "string"}]},
    {"name": "Curse", "type": "error", "fields": [
      {"name": "message", "type": "string"}]}
  ],

  "messages": {
    "hello": {
      "doc": "Say hello.",
      "request": [{"name": "greeting", "type": "Greeting" }],
      "response": "Greeting",
      "errors": ["Curse"]
    }
  }
}
        </source>
      </section>
    </section>

    <section>
      <title>Protocol Wire Format</title>

      <section>
        <title>Message Transport</title>
        <p>Messages may be transmitted via
        different <em>transport</em> mechanisms.</p>

        <p>To the transport, a <em>message</em> is an opaque byte sequence.</p>

        <p>A transport is a system that supports:</p>
        <ul>
          <li><strong>transmission of request messages</strong>
          </li>
          <li><strong>receipt of corresponding response messages</strong>
            <p>Servers may send a response message back to the client
            corresponding to a request message.  The mechanism of
            correspondence is transport-specific.  For example, in
            HTTP it is implicit, since HTTP directly supports requests
            and responses.  But a transport that multiplexes many
            client threads over a single socket would need to tag
            messages with unique identifiers.</p>
          </li>
        </ul>

	<p>Transports may be either <em>stateless</em>
        or <em>stateful</em>.  In a stateless transport, messaging
        assumes no established connection state, while stateful
        transports establish connections that may be used for multiple
        messages.  This distinction is discussed further in
        the <a href="#handshake">handshake</a> section below.</p>

        <section>
          <title>HTTP as Transport</title>
          <p>When
            <a href="https://www.w3.org/Protocols/rfc2616/rfc2616.html">HTTP</a>
            is used as a transport, each Avro message exchange is an
            HTTP request/response pair.  All messages of an Avro
            protocol should share a single URL at an HTTP server.
            Other protocols may also use that URL.  Both normal and
            error Avro response messages should use the 200 (OK)
            response code.  The chunked encoding may be used for
            requests and responses, but, regardless, the Avro request
            and response are the entire content of an HTTP request and
            response.  The HTTP Content-Type of requests and responses
            should be specified as "avro/binary".  Requests should be
            made using the POST method.</p>
	  <p>HTTP is used by Avro as a stateless transport.</p>
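          <p>A minimal sketch of such a request, built with Python's
            <code>urllib</code> but not sent.  The URL is a placeholder,
            and the body is assumed to be an already framed Avro
            message.</p>
          <source><![CDATA[
```python
import urllib.request


def build_avro_request(url, framed_message):
    """Build (without sending) an HTTP POST carrying one Avro message."""
    return urllib.request.Request(
        url,
        data=framed_message,
        headers={"Content-Type": "avro/binary"},
        method="POST",
    )


# Hypothetical endpoint; the body here is just an empty framed message.
request = build_avro_request("http://localhost:8080/avro", b"\x00\x00\x00\x00")
```
]]></source>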
        </section>
      </section>

      <section>
        <title>Message Framing</title>
        <p>Avro messages are <em>framed</em> as a list of buffers.</p>
        <p>Framing is a layer between messages and the transport.
        It exists to optimize certain operations.</p>

        <p>The format of framed message data is:</p>
        <ul>
          <li>a series of <em>buffers</em>, where each buffer consists of:
            <ul>
              <li>a four-byte, big-endian <em>buffer length</em>, followed by</li>
              <li>that many bytes of <em>buffer data</em>.</li>
            </ul>
          </li>
          <li>A message is always terminated by a zero-length buffer.</li>
        </ul>
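
        <p>The framing format can be sketched in a few lines.  The
        function names and the default buffer size below are
        illustrative choices, not values required by this
        specification.</p>
        <source><![CDATA[
```python
import io
import struct


def frame(message, buffer_size=8192):
    """Split a message into length-prefixed buffers plus a zero-length end."""
    out = bytearray()
    for start in range(0, len(message), buffer_size):
        chunk = message[start:start + buffer_size]
        out += struct.pack(">I", len(chunk)) + chunk
    out += struct.pack(">I", 0)  # terminating zero-length buffer
    return bytes(out)


def unframe(stream):
    """Read buffers until the zero-length terminator; return the message."""
    parts = []
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("truncated buffer length")
        (length,) = struct.unpack(">I", header)
        if length == 0:
            return b"".join(parts)
        data = stream.read(length)
        if len(data) != length:
            raise EOFError("truncated buffer data")
        parts.append(data)


framed = frame(b"hello", buffer_size=2)
restored = unframe(io.BytesIO(framed))
```
]]></source>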

        <p>Framing is transparent to request and response message
        formats (described below).  Any message may be presented as a
        single or multiple buffers.</p>

        <p>Framing can permit readers to more efficiently get
        different buffers from different sources and writers to
        more efficiently store different buffers to different
        destinations.  In particular, it can reduce the number of
        times large binary objects are copied.  For example, if an RPC
        parameter consists of a megabyte of file data, that data can
        be copied directly to a socket from a file descriptor, and, on
        the other end, it could be written directly to a file
        descriptor, never entering user space.</p>

        <p>A simple, recommended framing policy is for writers to
        start a new buffer whenever a single binary object is
        written that is larger than a normal output buffer.  Small
        objects are then appended in buffers, while larger objects are
        written as their own buffers.  When a reader then tries to
        read a large object, the runtime can hand it an entire buffer
        directly, without having to copy it.</p>
      </section>

      <section id="handshake">
        <title>Handshake</title>

	<p>The purpose of the handshake is to ensure that the client
        and the server have each other's protocol definition, so that
        the client can correctly deserialize responses, and the server
        can correctly deserialize requests.  Both clients and servers
        should maintain a cache of recently seen protocols, so that,
        in most cases, a handshake will be completed without extra
        round-trip network exchanges or the transmission of full
        protocol text.</p>

        <p>RPC requests and responses may not be processed until a
        handshake has been completed.  With a stateless transport, all
        requests and responses are prefixed by handshakes.  With a
        stateful transport, handshakes are only attached to requests
        and responses until a successful handshake response has been
        returned over a connection.  After this, request and response
        payloads are sent without handshakes for the lifetime of that
        connection.</p>

        <p>The handshake process uses the following record schemas:</p>

        <source>
{
  "type": "record",
  "name": "HandshakeRequest", "namespace":"org.apache.avro.ipc",
  "fields": [
    {"name": "clientHash",
     "type": {"type": "fixed", "name": "MD5", "size": 16}},
    {"name": "clientProtocol", "type": ["null", "string"]},
    {"name": "serverHash", "type": "MD5"},
    {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}
  ]
}
{
  "type": "record",
  "name": "HandshakeResponse", "namespace": "org.apache.avro.ipc",
  "fields": [
    {"name": "match",
     "type": {"type": "enum", "name": "HandshakeMatch",
              "symbols": ["BOTH", "CLIENT", "NONE"]}},
    {"name": "serverProtocol",
     "type": ["null", "string"]},
    {"name": "serverHash",
     "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},
    {"name": "meta",
     "type": ["null", {"type": "map", "values": "bytes"}]}
  ]
}
        </source>

        <ul>
          <li>A client first prefixes each request with
          a <code>HandshakeRequest</code> containing just the hash of
          its protocol and of the server's protocol
          (<code>clientHash!=null, clientProtocol=null,
          serverHash!=null</code>), where the hashes are 128-bit MD5
          hashes of the JSON protocol text. If a client has never
          connected to a given server, it sends its hash as a guess of
          the server's hash, otherwise it sends the hash that it
          previously obtained from this server.</li>

          <li>The server responds with
          a <code>HandshakeResponse</code> containing one of:
            <ul>
              <li><code>match=BOTH, serverProtocol=null,
              serverHash=null</code> if the client sent the valid hash
              of the server's protocol and the server knows what
              protocol corresponds to the client's hash. In this case,
              the request is complete and the response data
              immediately follows the HandshakeResponse.</li>

              <li><code>match=CLIENT, serverProtocol!=null,
              serverHash!=null</code> if the server has previously
              seen the client's protocol, but the client sent an
              incorrect hash of the server's protocol. The request is
              complete and the response data immediately follows the
              HandshakeResponse. The client must use the returned
              protocol to process the response and should also cache
              that protocol and its hash for future interactions with
              this server.</li>

              <li><code>match=NONE</code> if the server has not
              previously seen the client's protocol.
              The <code>serverHash</code>
              and <code>serverProtocol</code> may also be non-null if
              the client sent an incorrect hash of the server's protocol.

              <p>In this case the client must then re-submit its request
              with its protocol text (<code>clientHash!=null,
              clientProtocol!=null, serverHash!=null</code>) and the
              server should respond with a successful match
              (<code>match=BOTH, serverProtocol=null,
              serverHash=null</code>) as above.</p>
              </li>
            </ul>
          </li>
        </ul>
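
        <p>The server side of this exchange might be sketched as
        follows.  This is an illustration under simplifying
        assumptions: requests and responses are plain dictionaries
        rather than serialized records, the server's cache of client
        protocols is a dictionary keyed by hash, and the function
        names are hypothetical.</p>
        <source><![CDATA[
```python
import hashlib


def protocol_hash(protocol_json):
    """128-bit MD5 hash of the JSON protocol text."""
    return hashlib.md5(protocol_json.encode("utf-8")).digest()


def respond_to_handshake(request, server_protocol, known_clients):
    """Decide the HandshakeResponse for one HandshakeRequest.

    `known_clients` maps client hashes to protocol text (the server cache).
    """
    server_hash = protocol_hash(server_protocol)
    if request["clientProtocol"] is not None:
        # The client re-submitted with its full protocol text: cache it.
        known_clients[request["clientHash"]] = request["clientProtocol"]
    client_known = request["clientHash"] in known_clients
    server_hash_ok = request["serverHash"] == server_hash
    if client_known and server_hash_ok:
        match = "BOTH"
    elif client_known:
        match = "CLIENT"
    else:
        match = "NONE"
    return {
        "match": match,
        # The server's protocol is only returned when the client's guess was wrong.
        "serverProtocol": None if server_hash_ok else server_protocol,
        "serverHash": None if server_hash_ok else server_hash,
        "meta": None,
    }
```
]]></source>
        <p>In this sketch, a first request from an unknown client yields
        <code>NONE</code>; after the client re-submits with its protocol
        text and the returned server hash, the response is
        <code>BOTH</code>.</p>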

        <p>The <code>meta</code> field is reserved for future
        handshake enhancements.</p>

      </section>

      <section>
        <title>Call Format</title>
        <p>A <em>call</em> consists of a request message paired with
        its resulting response or error message.  Requests and
        responses contain extensible metadata, and both kinds of
        messages are framed as described above.</p>

        <p>The format of a call request is:</p>
        <ul>
          <li><em>request metadata</em>, a map with values of
          type <code>bytes</code></li>
          <li>the <em>message name</em>, an Avro string,
          followed by</li>
          <li>the message <em>parameters</em>.  Parameters are
          serialized according to the message's request
          declaration.</li>
        </ul>

        <p>When the empty string is used as a message name a server
        should ignore the parameters and return an empty response.  A
        client may use this to ping a server or to perform a handshake
        without sending a protocol message.</p>
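
        <p>A sketch of how a call request could be laid out before
        framing, using hypothetical helper names.  The parameters are
        assumed to be already serialized according to the message's
        request declaration.</p>
        <source><![CDATA[
```python
def encode_long(n):
    """Avro long: zig-zag mapping, then variable-length base-128 bytes."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(text):
    """Avro string: a long byte length followed by UTF-8 data."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


def encode_call_request(message_name, parameters=b"", meta=None):
    """Concatenate request metadata, message name and serialized parameters."""
    out = bytearray()
    if meta:
        out += encode_long(len(meta))
        for key, value in meta.items():
            out += encode_string(key)
            out += encode_long(len(value)) + value
    out += encode_long(0)  # end of the metadata map
    out += encode_string(message_name)
    out += parameters
    return bytes(out)


# An empty message name acts as a ping; "hello" carries one string field.
ping = encode_call_request("")
hello = encode_call_request("hello", encode_string("hi"))
```
]]></source>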

        <p>When a message is declared one-way and a stateful
        connection has been established by a successful handshake
        response, no response data is sent.  Otherwise the format of
        the call response is:</p>
        <ul>
          <li><em>response metadata</em>, a map with values of
          type <code>bytes</code></li>
          <li>a one-byte <em>error flag</em> boolean, followed by either:
            <ul>
              <li>if the error flag is false, the message <em>response</em>,
                serialized per the message's response schema.</li>
              <li>if the error flag is true, the <em>error</em>,
              serialized per the message's effective error union
              schema.</li>
            </ul>
          </li>
        </ul>
      </section>

    </section>

    <section>
      <title>Schema Resolution</title>

      <p>A reader of Avro data, whether from an RPC or a file, can
        always parse that data because the original schema must be
        provided along with the data.  However, the reader may be
        programmed to read data into a different schema.
        For example, if the data was written with a different version
        of the software than the one reading it, then fields may have been
        added to or removed from records.  This section specifies how such
        schema differences should be resolved.</p>

      <p>We refer to the schema used to write the data as
        the <em>writer's</em> schema, and the schema that the
        application expects as the <em>reader's</em> schema.  Differences
        between these should be resolved as follows:</p>

      <ul>
        <li><p>It is an error if the two schemas do not <em>match</em>.</p>
          <p>To match, one of the following must hold:</p>
          <ul>
            <li>both schemas are arrays whose item types match</li>
            <li>both schemas are maps whose value types match</li>
            <li>both schemas are enums whose (unqualified) names match</li>
            <li>both schemas are fixed whose sizes and (unqualified) names match</li>
            <li>both schemas are records with the same (unqualified) name</li>
            <li>either schema is a union</li>
            <li>both schemas have the same primitive type</li>
            <li>the writer's schema may be <em>promoted</em> to the
              reader's as follows:
              <ul>
                <li>int is promotable to long, float, or double</li>
                <li>long is promotable to float or double</li>
                <li>float is promotable to double</li>
                <li>string is promotable to bytes</li>
                <li>bytes is promotable to string</li>
                </ul>
            </li>
          </ul>
        </li>

        <li><strong>if both are records:</strong>
          <ul>
            <li>the ordering of fields may be different: fields are
              matched by name.</li>

            <li>schemas for fields with the same name in both records
              are resolved recursively.</li>

            <li>if the writer's record contains a field with a name
              not present in the reader's record, the writer's value
              for that field is ignored.</li>

            <li>if the reader's record schema has a field that
              contains a default value, and writer's schema does not
              have a field with the same name, then the reader should
              use the default value from its field.</li>

            <li>if the reader's record schema has a field with no
              default value, and writer's schema does not have a field
              with the same name, an error is signalled.</li>
          </ul>
        </li>

        <li><strong>if both are enums:</strong>
          <p>if the writer's symbol is not present in the reader's
            enum and the reader has a <code>default</code> value, then
            that value is used, otherwise an error is signalled.</p>
        </li>

        <li><strong>if both are arrays:</strong>
          <p>This resolution algorithm is applied recursively to the reader's and
            writer's array item schemas.</p>
        </li>

        <li><strong>if both are maps:</strong>
          <p>This resolution algorithm is applied recursively to the reader's and
            writer's value schemas.</p>
        </li>

        <li><strong>if both are unions:</strong>
          <p>The first schema in the reader's union that matches the
            selected writer's union schema is recursively resolved
            against it.  If none match, an error is signalled.</p>
        </li>

        <li><strong>if reader's is a union, but writer's is not</strong>
          <p>The first schema in the reader's union that matches the
            writer's schema is recursively resolved against it.  If none
            match, an error is signalled.</p>
        </li>

        <li><strong>if writer's is a union, but reader's is not</strong>
          <p>If the reader's schema matches the selected writer's schema,
            it is recursively resolved against it.  If they do not
            match, an error is signalled.</p>
        </li>

      </ul>
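
      <p>The <em>match</em> test at the top of this list can be sketched
        as a recursive function over parsed JSON schemas.  This is a
        simplified illustration: it assumes named types are written
        out in full rather than referenced by name, and the function
        names are hypothetical.</p>
      <source><![CDATA[
```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

# Writer type -> reader types it may be promoted to.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def type_of(schema):
    """Return the type name of a parsed schema ("union" for JSON arrays)."""
    if isinstance(schema, list):
        return "union"
    if isinstance(schema, dict):
        return schema["type"]
    return schema


def short_name(schema):
    """Unqualified name of a named schema."""
    return schema["name"].rsplit(".", 1)[-1]


def schemas_match(writer, reader):
    """Apply the matching rules to a writer's and a reader's schema."""
    w, r = type_of(writer), type_of(reader)
    if w == "union" or r == "union":
        return True
    if w == r:
        if w == "array":
            return schemas_match(writer["items"], reader["items"])
        if w == "map":
            return schemas_match(writer["values"], reader["values"])
        if w in ("enum", "record"):
            return short_name(writer) == short_name(reader)
        if w == "fixed":
            return (short_name(writer) == short_name(reader)
                    and writer["size"] == reader["size"])
        return w in PRIMITIVES
    return r in PROMOTIONS.get(w, set())
```
]]></source>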

      <p>A schema's "doc" fields are ignored for the purposes of schema resolution.  Hence,
        the "doc" portion of a schema may be dropped at serialization.</p>

    </section>

    <section>
      <title>Parsing Canonical Form for Schemas</title>

      <p>One of the defining characteristics of Avro is that a reader
      must use the schema used by the writer of the data in
      order to know how to read the data.  This assumption results in a data
      format that's compact and also amenable to many forms of schema
      evolution.  However, the specification so far has not defined
      what it means for the reader to have the "same" schema as the
      writer.  Does the schema need to be textually identical?  Well,
      clearly adding or removing some whitespace to a JSON expression
      does not change its meaning.  At the same time, reordering the
      fields of records clearly <em>does</em> change the meaning.  So
      what does it mean for a reader to have "the same" schema as a
      writer?</p>

      <p><em>Parsing Canonical Form</em> is a transformation of a
      writer's schema that lets us define what it means for two
      schemas to be "the same" for the purpose of reading data written
      against the schema.  It is called <em>Parsing</em> Canonical Form
      because the transformations strip away parts of the schema, like
      "doc" attributes, that are irrelevant to readers trying to parse
      incoming data.  It is called <em>Canonical Form</em> because the
      transformations normalize the JSON text (such as the order of
      attributes) in a way that eliminates unimportant differences
      between schemas.  If the Parsing Canonical Forms of two
      different schemas are textually equal, then those schemas are
      "the same" as far as any reader is concerned, i.e., there is no
      serialized data that would allow a reader to distinguish data
      generated by a writer using one of the original schemas from
      data generated by a writer using the other original schema.
      (We sketch a proof of this property in a companion
      document.)</p>

      <p>The next subsection specifies the transformations that define
      Parsing Canonical Form.  But with a well-defined canonical form,
      it can be convenient to go one step further, transforming these
      canonical forms into simple integers ("fingerprints") that can
      be used to uniquely identify schemas.  The subsection after next
      recommends some standard practices for generating such
      fingerprints.</p>

      <section>
        <title>Transforming into Parsing Canonical Form</title>

        <p>Assuming an input schema (in JSON form) that's already
        UTF-8 text for a <em>valid</em> Avro schema (including all
        quotes as required by JSON), the following transformations
        will produce its Parsing Canonical Form:</p>
        <ul>
          <li> [PRIMITIVES] Convert primitive schemas to their simple
          form (e.g., <code>int</code> instead of
          <code>{"type":"int"}</code>).</li>

          <li> [FULLNAMES] Replace short names with fullnames, using
          applicable namespaces to do so.  Then eliminate
          <code>namespace</code> attributes, which are now redundant.</li>

          <li> [STRIP] Keep only attributes that are relevant to
          parsing data, which are: <code>type</code>,
          <code>name</code>, <code>fields</code>,
          <code>symbols</code>, <code>items</code>,
          <code>values</code>, <code>size</code>.  Strip all others
          (e.g., <code>doc</code> and <code>aliases</code>).</li>

          <li> [ORDER] Order the appearance of fields of JSON objects
          as follows: <code>name</code>, <code>type</code>,
          <code>fields</code>, <code>symbols</code>,
          <code>items</code>, <code>values</code>, <code>size</code>.
          For example, if an object has <code>type</code>,
          <code>name</code>, and <code>size</code> fields, then the
          <code>name</code> field should appear first, followed by the
          <code>type</code> and then the <code>size</code> fields.</li>

          <li> [STRINGS] For all JSON string literals in the schema
          text, replace any escaped characters (e.g., \uXXXX escapes)
          with their UTF-8 equivalents.</li>

          <li> [INTEGERS] Eliminate quotes around and any leading
          zeros in front of JSON integer literals (which appear in the
          <code>size</code> attributes of <code>fixed</code> schemas).</li>

          <li> [WHITESPACE] Eliminate all whitespace in JSON outside of string literals.</li>
        </ul>
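
        <p>The transformations above can be sketched in Python as
        follows.  This is an illustration rather than a reference
        implementation: it relies on <code>json.dumps</code> for
        string output, which may escape some characters differently
        from what [STRINGS] requires, and the function names are
        hypothetical.</p>
        <source><![CDATA[
```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def fullname(name, namespace):
    """[FULLNAMES] Qualify a short name with the applicable namespace."""
    if "." in name or not namespace:
        return name
    return namespace + "." + name


def canonical(schema, namespace=""):
    """Return a structure whose attributes are stripped and ordered."""
    if isinstance(schema, str):
        # A bare string is a primitive or a reference to a named type.
        return schema if schema in PRIMITIVES else fullname(schema, namespace)
    if isinstance(schema, list):  # union
        return [canonical(branch, namespace) for branch in schema]
    kind = schema["type"]
    if not isinstance(kind, str):
        return canonical(kind, namespace)
    if kind in PRIMITIVES:  # [PRIMITIVES]
        return kind
    out = {}  # insertion order gives [ORDER]; unlisted keys are dropped [STRIP]
    if kind in ("record", "error", "enum", "fixed"):
        name = fullname(schema["name"], schema.get("namespace", namespace))
        namespace = name.rpartition(".")[0]
        out["name"] = name
    out["type"] = kind
    if "fields" in schema:
        out["fields"] = [
            {"name": f["name"], "type": canonical(f["type"], namespace)}
            for f in schema["fields"]
        ]
    if "symbols" in schema:
        out["symbols"] = list(schema["symbols"])
    if "items" in schema:
        out["items"] = canonical(schema["items"], namespace)
    if "values" in schema:
        out["values"] = canonical(schema["values"], namespace)
    if "size" in schema:
        out["size"] = int(schema["size"])  # [INTEGERS]
    return out


def parsing_canonical_form(schema_json):
    """[WHITESPACE] Serialize without any whitespace between tokens."""
    return json.dumps(canonical(json.loads(schema_json)),
                      separators=(",", ":"), ensure_ascii=False)
```
]]></source>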
      </section>

      <section>
        <title>Standard Canonical Form for Schemas</title>

        <p>One defined way to normalize an Avro schema is the
          <em>Standard Canonical Form Transformation</em>.  It strips
          unwanted properties and places the remaining ones in a
          canonical order: Avro reserved properties first, followed by
          any custom properties requested at transformation time.
          Because unwanted properties and whitespace are removed, the
          normalized schema takes less memory when it is transferred
          between two systems, and it reduces the parsing time for
          compatibility checks and schema evolution.
        </p>

        <p>The <em>Standard Canonical Form</em> of a schema contains
          only the Avro reserved properties
          <code>"name", "type", "fields", "symbols", "items", "values",
            "logicalType", "size", "order", "doc", "aliases", "default"</code>,
          in canonical order, followed by any <em>custom</em> schema
          properties that were explicitly requested.
        </p>

        <section>
          <title>Transforming into Standard Canonical Form</title>

          <p>Assuming an input schema (in JSON form) that's already
            UTF-8 text for a <em>valid</em> Avro schema (including all
            quotes as required by JSON), the following transformations
            will produce its Standard Canonical Form:</p>
          <ul>
            <li> [PRIMITIVES] Convert primitive schemas to their simple
              form (e.g., <code>int</code> instead of
              <code>{"type":"int"}</code>).</li>

            <li> [FULLNAMES] Replace short names with fullnames, using
              applicable namespaces to do so.  Then eliminate
              <code>namespace</code> attributes, which are now redundant.</li>

            <li> [STRIP] Keep only the reserved attributes, which are:
              <code>type</code>, <code>name</code>,
              <code>fields</code>, <code>symbols</code>,
              <code>items</code>, <code>values</code>,
              <code>logicalType</code>, <code>size</code>,
              <code>order</code>, <code>doc</code>,
              <code>aliases</code> and <code>default</code>.
              Strip all other, user-defined properties (e.g., <code>format</code>).</li>

            <li> [ORDER] Order the appearance of fields of JSON objects
              as follows: <code>name</code>, <code>type</code>,
              <code>fields</code>, <code>symbols</code>,
              <code>items</code>, <code>values</code>,
              <code>logicalType</code>, <code>size</code>,
              <code>order</code>, <code>doc</code>,
              <code>aliases</code>, <code>default</code>.
              For example, if an object has <code>type</code>,
              <code>name</code>, and <code>size</code> fields, then the
              <code>name</code> field should appear first, followed by the
              <code>type</code> and then the <code>size</code> fields.</li>

            <li> [STRINGS] For all JSON string literals in the schema
              text, replace any escaped characters (e.g., \uXXXX escapes)
              with their UTF-8 equivalents.</li>

            <li> [INTEGERS] Eliminate quotes around and any leading
              zeros in front of JSON integer literals (which appear in the
              <code>size</code> attributes of <code>fixed</code> schemas).</li>

            <li> [WHITESPACE] Eliminate all whitespace in JSON outside of string literals.</li>
          </ul>
        </section>

        <section>
          <title>Transforming with Custom Properties</title>

          <p>In addition to the standard canonical form transformation,
            <em>custom</em> <code>Schema</code> or <code>Field</code> properties
            can be retained by passing their names to the transformation.
            Custom properties are ordered after the reserved ones.
            For example, if an object has <code>format</code>, <code>type</code>,
            <code>name</code>, and <code>size</code> fields, then the
            <code>name</code> field should appear first, followed by the
            <code>type</code>, <code>size</code> and then <code>format</code>
            (custom property) fields.
          </p>
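          <p>The ordering can be illustrated with a short sketch.  The
            function name is hypothetical, and placing custom properties
            in the order in which they were requested is an assumption
            of this example, not something this document specifies.</p>
          <source><![CDATA[
```python
RESERVED_ORDER = [
    "name", "type", "fields", "symbols", "items", "values",
    "logicalType", "size", "order", "doc", "aliases", "default",
]


def order_attributes(obj, custom=()):
    """Keep reserved attributes plus requested custom ones, reserved first."""
    ordered = {key: obj[key] for key in RESERVED_ORDER if key in obj}
    for key in custom:
        # Assumption: custom properties keep the order they were requested in.
        if key in obj and key not in ordered:
            ordered[key] = obj[key]
    return ordered


fixed = {"format": "hex", "type": "fixed", "name": "Id", "size": 4, "other": 1}
with_custom = order_attributes(fixed, custom=["format"])
```
]]></source>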
        </section>
      </section>

      <section id="schema_fingerprints">
        <title>Schema Fingerprints</title>

        <p>"[A] fingerprinting algorithm is a procedure that maps an
        arbitrarily large data item (such as a computer file) to a
        much shorter bit string, its <em>fingerprint,</em> that
        uniquely identifies the original data for all practical
        purposes" (quoted from [<a
        href="https://en.wikipedia.org/wiki/Fingerprint_(computing)">Wikipedia</a>]).
        In the Avro context, fingerprints of Parsing Canonical Form
        can be useful in a number of applications; for example, to
        cache encoder and decoder objects, to tag data items with a
        short substitute for the writer's full schema, and to quickly
        negotiate common-case schemas between readers and writers.</p>

        <p>In designing fingerprinting algorithms, there is a
        fundamental trade-off between the length of the fingerprint
        and the probability of collisions.  To help application
        designers find appropriate points within this trade-off space,
        while encouraging interoperability and ease of implementation,
        we recommend using one of the following three algorithms when
        fingerprinting Avro schemas:</p>

        <ul>
          <li> When applications can tolerate longer fingerprints, we
          recommend using the <a
          href="https://en.wikipedia.org/wiki/SHA-2">SHA-256 digest
          algorithm</a> to generate 256-bit fingerprints of Parsing
          Canonical Forms.  Most languages today have SHA-256
          implementations in their libraries.</li>

          <li> At the opposite extreme, the smallest fingerprint we
          recommend is a 64-bit <a
          href="https://en.wikipedia.org/wiki/Rabin_fingerprint">Rabin
          fingerprint</a>.  Below, we provide pseudo-code for this
          algorithm that can be easily translated into any programming
          language.  64-bit fingerprints should be effectively unique
          for schema caches of up to a million entries (for such a
          cache, the chance of a collision is 3E-8).  We don't
          recommend shorter fingerprints, as the chance of a collision
          is too great (for example, with 32-bit fingerprints, a cache
          with as few as 100,000 schemas has a better than 50% chance
          of having a collision).</li>

          <li>Between these two extremes, we recommend using the <a
          href="https://en.wikipedia.org/wiki/MD5">MD5 message
          digest</a> to generate 128-bit fingerprints.  These make
          sense only where very large numbers of schemas are being
          manipulated (tens of millions); otherwise, 64-bit
          fingerprints should be sufficient.  As with SHA-256, MD5
          implementations are found in most libraries today.</li>
        </ul>
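
        <p>The collision figures quoted above follow from the birthday
        bound. As a rough check, the sketch below (Python; the function
        name <code>collision_probability</code> is ours and is not part
        of any Avro API) approximates the chance that at least two of
        <em>n</em> fingerprints coincide, assuming the fingerprints are
        uniformly distributed over <em>b</em> bits:</p>

        <source><![CDATA[
```python
import math


def collision_probability(num_schemas: int, bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    # Number of distinct pairs of schemas that could collide.
    pairs = num_schemas * (num_schemas - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2 ** bits)
```
]]></source>

        <p>Under that assumption,
        <code>collision_probability(10**6, 64)</code> comes out at
        roughly 2.7E-8, in line with the figure given for a
        million-entry cache.</p>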

        <p> These fingerprints are <em>not</em> meant to provide any
        security guarantees, even the longer SHA-256-based ones.  Most
        Avro applications should be surrounded by security measures
        that prevent attackers from writing random data and otherwise
        interfering with the consumers of schemas.  We recommend that
        these surrounding mechanisms be used to prevent collision and
        pre-image attacks (i.e., "forgery") on schema fingerprints,
        rather than relying on the security properties of the
        fingerprints themselves.</p>

        <p>Rabin fingerprints are <a
        href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">cyclic
        redundancy checks</a> computed using irreducible polynomials.
        In the style of the Appendix of <a
        href="https://www.ietf.org/rfc/rfc1952.txt">RFC&nbsp;1952</a>
        (pg 10), which defines the CRC-32 algorithm, here's our
        definition of the 64-bit Avro fingerprinting algorithm:</p>

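        <source>
// Computes the 64-bit fingerprint of buf one byte at a time,
// using a lazily built 256-entry lookup table.
long fingerprint64(byte[] buf) {
  if (FP_TABLE == null) initFPTable();
  long fp = EMPTY;
  for (int i = 0; i &lt; buf.length; i++)
    fp = (fp &gt;&gt;&gt; 8) ^ FP_TABLE[(int)(fp ^ buf[i]) &amp; 0xff];
  return fp;
}

// Fingerprint of the empty message; also used as the polynomial
// when building the table.
static long EMPTY = 0xc15d213aa4d7a795L;
static long[] FP_TABLE = null;

// For each possible byte value, precomputes the result of shifting
// that byte through the fingerprint one bit at a time.
void initFPTable() {
  FP_TABLE = new long[256];
  for (int i = 0; i &lt; 256; i++) {
    long fp = i;
    for (int j = 0; j &lt; 8; j++)
      fp = (fp &gt;&gt;&gt; 1) ^ (EMPTY &amp; -(fp &amp; 1L));
    FP_TABLE[i] = fp;
  }
}
        </source>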

        <p>Readers interested in the mathematics behind this
          algorithm may want to read
        <a href="https://books.google.com/books?id=XD9iAwAAQBAJ&amp;pg=PA319"
          >Chapter 14 of the Second Edition of <em>Hacker's Delight</em></a>.
        (Unlike RFC&nbsp;1952 and the book chapter, we prepend
        a single one-bit to messages.  We do this because CRCs ignore
        leading zero bits, which can be problematic.  Our code
        prepends a one-bit by initializing fingerprints using
        <code>EMPTY</code>, rather than initializing using zero as in
        RFC&nbsp;1952 and the book chapter.)</p>
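
        <p>As an illustration of how the pseudo-code above carries over
        to another language, here is a minimal Python sketch. It is a
        direct transliteration rather than a reference implementation.
        The helper names <code>_build_table</code> and
        <code>to_signed</code> are ours. Because Python integers are
        unbounded, <code>fingerprint64</code> returns the fingerprint as
        an unsigned 64-bit value, and <code>to_signed</code> converts it
        to the signed value a Java <code>long</code> would hold:</p>

        <source><![CDATA[
```python
# Fingerprint of the empty message; also used as the polynomial.
EMPTY = 0xC15D213AA4D7A795


def _build_table() -> list:
    """Precompute the effect of each possible byte value on the fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            # Shift right one bit; XOR in EMPTY if the dropped bit was 1.
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(buf: bytes) -> int:
    """Return the 64-bit fingerprint of buf as an unsigned integer."""
    fp = EMPTY
    for byte in buf:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def to_signed(fp: int) -> int:
    """Reinterpret an unsigned 64-bit value as a two's-complement signed one."""
    return fp - (1 << 64) if fp >= (1 << 63) else fp
```
]]></source>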
      </section>
    </section>

    <section>
      <title>Logical Types</title>

      <p>A logical type is an Avro primitive or complex type with extra attributes to
        represent a derived type. The attribute <code>logicalType</code> must
        always be present for a logical type, and is a string with the name of one of
        the logical types listed later in this section. Other attributes may be defined
        for particular logical types.</p>

      <p>A logical type is always serialized using its underlying Avro type so
        that values are encoded in exactly the same way as the equivalent Avro
        type that does not have a <code>logicalType</code> attribute. Language
        implementations may choose to represent logical types with an
        appropriate native type, although this is not required.</p>

      <p>Language implementations must ignore unknown logical types when
        reading, and should use the underlying Avro type. If a logical type is
        invalid, for example a decimal with scale greater than its precision,
        then implementations should ignore the logical type and use the
        underlying Avro type.</p>
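
      <p>A reader-side fallback of this kind might look like the
        following Python sketch. The names
        <code>KNOWN_LOGICAL_TYPES</code> and
        <code>effective_logical_type</code> are hypothetical, schemas
        are assumed to be already-parsed JSON objects, and only the
        decimal precision and scale constraints are checked; a real
        implementation would also verify the underlying type:</p>

      <source><![CDATA[
```python
# Logical type names defined in this section.
KNOWN_LOGICAL_TYPES = {
    "decimal", "uuid", "date",
    "time-millis", "time-micros",
    "timestamp-millis", "timestamp-micros",
    "local-timestamp-millis", "local-timestamp-micros",
    "duration",
}


def effective_logical_type(schema: dict):
    """Return the logical type to apply, or None to use the underlying type."""
    name = schema.get("logicalType")
    if name not in KNOWN_LOGICAL_TYPES:
        return None  # absent or unknown: ignore and use the underlying type
    if name == "decimal":
        precision = schema.get("precision")
        scale = schema.get("scale", 0)
        valid = (
            isinstance(precision, int)
            and isinstance(scale, int)
            and precision > 0
            and 0 <= scale <= precision
        )
        if not valid:
            return None  # invalid decimal: fall back to the underlying type
    return name
```
]]></source>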

      <section>
        <title>Decimal</title>
        <p>The <code>decimal</code> logical type represents an arbitrary-precision signed
          decimal number of the form <em>unscaled &#215; 10<sup>-scale</sup></em>.</p>

        <p>A <code>decimal</code> logical type annotates Avro
          <code>bytes</code> or <code>fixed</code> types. The byte array must
          contain the two's-complement representation of the unscaled integer
          value in big-endian byte order. The scale is fixed, and is specified
          using an attribute.</p>
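
        <p>To make the byte layout concrete, the following Python
          sketch converts between a decimal value and its encoded
          bytes. The function names are ours, and using the shortest
          possible byte array is a choice made for this example rather
          than something stated above:</p>

        <source><![CDATA[
```python
from decimal import Decimal


def encode_decimal(value: Decimal, scale: int) -> bytes:
    """Encode the unscaled integer as big-endian two's-complement bytes."""
    scaled = value.scaleb(scale)  # multiply by 10**scale
    if scaled != scaled.to_integral_value():
        raise ValueError("value has more fractional digits than the scale allows")
    unscaled = int(scaled)
    # Shortest signed length: magnitude bits plus a sign bit, rounded up to bytes.
    magnitude = unscaled if unscaled >= 0 else ~unscaled
    length = (magnitude.bit_length() + 8) // 8
    return unscaled.to_bytes(length, "big", signed=True)


def decode_decimal(data: bytes, scale: int) -> Decimal:
    """Rebuild the decimal value from its bytes and the schema's scale."""
    unscaled = int.from_bytes(data, "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)
```
]]></source>

        <p>With a scale of 2, the value 12.34 has the unscaled integer
          1234 and is encoded as the two bytes <code>04 D2</code>. For a
          <code>fixed</code> type, the same integer would presumably be
          sign-extended to the declared size.</p>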

        <p>The following attributes are supported:</p>
        <ul>
          <li><code>scale</code>, a JSON integer representing the scale
            (optional). If not specified the scale is 0.</li>
          <li><code>precision</code>, a JSON integer representing the (maximum)
            precision of decimals stored in this type (required).</li>
        </ul>

        <p>For example, the following schema represents decimal numbers with a
          maximum precision of 4 and a scale of 2:</p>
        <source>
{
  "type": "bytes",
  "logicalType": "decimal",
  "precision": 4,
  "scale": 2
}
</source>

        <p>Precision must be an integer greater than zero. If the
          underlying type is a <code>fixed</code>, then the precision is
          limited by its size. An array of length <code>n</code> can store at
          most <em>floor(log<sub>10</sub>(2<sup>8 &#215; n - 1</sup> - 1))</em>
          base-10 digits of precision.</p>
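
        <p>The bound can be evaluated exactly with integer arithmetic,
          as in this short Python sketch (the name
          <code>max_precision</code> is ours). It relies on the fact
          that a positive integer with <em>d</em> decimal digits has a
          floored base-10 logarithm of <em>d</em> - 1:</p>

        <source><![CDATA[
```python
def max_precision(size_in_bytes: int) -> int:
    """floor(log10(2^(8n - 1) - 1)) for a fixed of n bytes."""
    # Largest positive value representable in n bytes of two's complement.
    largest = 2 ** (8 * size_in_bytes - 1) - 1
    # Counting digits avoids floating-point rounding in log10.
    return len(str(largest)) - 1
```
]]></source>

        <p>For instance, a 16-byte <code>fixed</code> allows a precision
          of at most 38 digits by this formula.</p>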

        <p>Scale must be zero or a positive integer less than or equal to the
          precision.</p>

        <p>For the purposes of schema resolution, two schemas that are
          <code>decimal</code> logical types <em>match</em> if their scales and
          precisions match.</p>

      </section>

      <section>
        <title>UUID</title>
        <p>
          The <code>uuid</code> logical type represents a randomly generated universally unique identifier (UUID).
        </p>

        <p>
          A <code>uuid</code> logical type annotates an Avro <code>string</code>. The string must conform to <a href="https://www.ietf.org/rfc/rfc4122.txt">RFC-4122</a>.
        </p>
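
        <p>
          One possible conformance check, sketched in Python with the standard <code>uuid</code> module, is shown below. The function name <code>is_canonical_uuid</code> is ours, and the check assumes that only the hyphenated 8-4-4-4-12 hexadecimal form is acceptable, which is our reading of the RFC rather than something this document spells out.
        </p>

        <source><![CDATA[
```python
import uuid


def is_canonical_uuid(text: str) -> bool:
    """True if text is a UUID in hyphenated 8-4-4-4-12 hexadecimal form."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes and missing
    # hyphens, so compare against the canonical rendering.
    return str(parsed) == text.lower()
```
]]></source>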
      </section>

      <section>
        <title>Date</title>
        <p>
          The <code>date</code> logical type represents a date within the calendar, with no reference to a particular time zone or time of day.
        </p>
        <p>
          A <code>date</code> logical type annotates an Avro <code>int</code>, where the int stores the number of days from the Unix epoch, 1 January 1970 (ISO calendar).
        </p>
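
        <p>
          As a small illustration in Python (the function names are ours), the day count can be obtained by subtracting the epoch date:
        </p>

        <source><![CDATA[
```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(value: date) -> int:
    """Number of days from 1 January 1970; negative for earlier dates."""
    return (value - EPOCH).days


def days_to_date(days: int) -> date:
    """Inverse of date_to_days."""
    return EPOCH + timedelta(days=days)
```
]]></source>

        <p>
          For example, 1 January 2000 corresponds to the value 10957.
        </p>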
      </section>

      <section>
        <title>Time (millisecond precision)</title>
        <p>
          The <code>time-millis</code> logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond.
        </p>
        <p>
          A <code>time-millis</code> logical type annotates an Avro <code>int</code>, where the int stores the number of milliseconds after midnight, 00:00:00.000.
        </p>
      </section>

      <section>
        <title>Time (microsecond precision)</title>
        <p>
          The <code>time-micros</code> logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one microsecond.
        </p>
        <p>
          A <code>time-micros</code> logical type annotates an Avro <code>long</code>, where the long stores the number of microseconds after midnight, 00:00:00.000000.
        </p>
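
        <p>
          The two time-of-day encodings can be sketched in Python as follows. The function names are ours, and dropping sub-millisecond digits for <code>time-millis</code> is an assumption of this example, since the text above does not say how finer-grained values should be handled.
        </p>

        <source><![CDATA[
```python
from datetime import time


def time_to_millis(value: time) -> int:
    """Milliseconds after midnight, for time-millis (fits in an Avro int)."""
    seconds = (value.hour * 60 + value.minute) * 60 + value.second
    # Assumption: digits below one millisecond are truncated.
    return seconds * 1000 + value.microsecond // 1000


def time_to_micros(value: time) -> int:
    """Microseconds after midnight, for time-micros (needs an Avro long)."""
    seconds = (value.hour * 60 + value.minute) * 60 + value.second
    return seconds * 1_000_000 + value.microsecond
```
]]></source>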
      </section>

      <section>
        <title>Timestamp (millisecond precision)</title>
        <p>
          The <code>timestamp-millis</code> logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one millisecond.
          Note that no time zone information is stored with the value. Upon reading a value back, only the instant can be reconstructed, not the original representation.
          In practice, such timestamps are typically displayed to users in their local time zones, so they may be displayed differently depending on the execution environment.
        </p>
        <p>
          A <code>timestamp-millis</code> logical type annotates an Avro <code>long</code>, where the long stores the number of milliseconds from the Unix epoch, 1 January 1970 00:00:00.000 UTC.
        </p>
      </section>

      <section>
        <title>Timestamp (microsecond precision)</title>
        <p>
          The <code>timestamp-micros</code> logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one microsecond.
          Note that no time zone information is stored with the value. Upon reading a value back, only the instant can be reconstructed, not the original representation.
          In practice, such timestamps are typically displayed to users in their local time zones, so they may be displayed differently depending on the execution environment.
        </p>
        <p>
          A <code>timestamp-micros</code> logical type annotates an Avro <code>long</code>, where the long stores the number of microseconds from the Unix epoch, 1 January 1970 00:00:00.000000 UTC.
        </p>
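
        <p>
          The following Python sketch shows one way to compute both timestamp encodings from a time-zone-aware <code>datetime</code>. The function names are ours, and rejecting naive values is a design choice of the example. It also illustrates why the original zone cannot be recovered: two representations of the same instant produce the same number.
        </p>

        <source><![CDATA[
```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_millis(instant: datetime) -> int:
    """Milliseconds from the Unix epoch for a time-zone-aware datetime."""
    if instant.tzinfo is None:
        raise ValueError("an aware datetime is needed to identify an instant")
    # Dividing timedeltas with // yields an exact integer count.
    return (instant - EPOCH_UTC) // timedelta(milliseconds=1)


def to_timestamp_micros(instant: datetime) -> int:
    """Microseconds from the Unix epoch for a time-zone-aware datetime."""
    if instant.tzinfo is None:
        raise ValueError("an aware datetime is needed to identify an instant")
    return (instant - EPOCH_UTC) // timedelta(microseconds=1)
```
]]></source>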
      </section>

      <section>
        <title>Local timestamp (millisecond precision)</title>
        <p>
          The <code>local-timestamp-millis</code> logical type represents a timestamp in a local time zone, regardless of what specific time zone is considered local, with a precision of one millisecond.
        </p>
        <p>
          A <code>local-timestamp-millis</code> logical type annotates an Avro <code>long</code>, where the long stores the number of milliseconds from 1 January 1970 00:00:00.000.
        </p>
      </section>

      <section>
        <title>Local timestamp (microsecond precision)</title>
        <p>
          The <code>local-timestamp-micros</code> logical type represents a timestamp in a local time zone, regardless of what specific time zone is considered local, with a precision of one microsecond.
        </p>
        <p>
          A <code>local-timestamp-micros</code> logical type annotates an Avro <code>long</code>, where the long stores the number of microseconds from 1 January 1970 00:00:00.000000.
        </p>
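
        <p>
          For the local timestamp types, the count is taken between wall-clock readings, with no zone conversion. A Python sketch under that reading (function names are ours; naive <code>datetime</code> values stand in for local wall-clock time) could be:
        </p>

        <source><![CDATA[
```python
from datetime import datetime, timedelta

# Wall-clock reference point; deliberately carries no time zone.
LOCAL_EPOCH = datetime(1970, 1, 1)


def to_local_timestamp_millis(wall_clock: datetime) -> int:
    """Milliseconds from 1 January 1970 00:00:00.000 on the same wall clock."""
    if wall_clock.tzinfo is not None:
        raise ValueError("expected a naive (zone-less) datetime")
    return (wall_clock - LOCAL_EPOCH) // timedelta(milliseconds=1)


def to_local_timestamp_micros(wall_clock: datetime) -> int:
    """Microseconds from 1 January 1970 00:00:00.000000 on the same wall clock."""
    if wall_clock.tzinfo is not None:
        raise ValueError("expected a naive (zone-less) datetime")
    return (wall_clock - LOCAL_EPOCH) // timedelta(microseconds=1)
```
]]></source>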
      </section>

      <section>
        <title>Duration</title>
        <p>
          The <code>duration</code> logical type represents an amount of time defined by a number of months, days and milliseconds. This is not equivalent to a number of milliseconds, because, depending on the moment in time from which the duration is measured, the number of days in the month and number of milliseconds in a day may differ. Other standard periods such as years, quarters, hours and minutes can be expressed through these basic periods.
        </p>
        <p>
          A <code>duration</code> logical type annotates an Avro <code>fixed</code> type of size 12, which stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds.
        </p>
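
        <p>
          Since three integers share 12 bytes, each occupies four bytes. A minimal Python sketch of the layout, using the standard <code>struct</code> module (the function names are ours), might be:
        </p>

        <source><![CDATA[
```python
import struct


def encode_duration(months: int, days: int, millis: int) -> bytes:
    """Pack months, days and milliseconds into the 12-byte fixed value."""
    # "<III": three little-endian unsigned 32-bit integers, no padding.
    return struct.pack("<III", months, days, millis)


def decode_duration(data: bytes) -> tuple:
    """Return (months, days, millis) from a 12-byte duration value."""
    return struct.unpack("<III", data)
```
]]></source>

        <p>
          Values outside the unsigned 32-bit range are rejected by <code>struct.pack</code> with a <code>struct.error</code>.
        </p>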
      </section>

    </section>

  <p><em>Apache Avro, Avro, Apache, and the Avro and Apache logos are
   trademarks of The Apache Software Foundation.</em></p>

  </body>
</document>