Konstantin Baierer edited this page Mar 23, 2018 · 4 revisions

OCRD XML API

This document describes an application programming interface to the input and output format used for processes within the OCR-D project. The format itself is based on METS as a container and for descriptive metadata and PAGE XML for the content.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Input can be either a single METS XML file or a ZIP container holding a single mets.xml plus the files it references.

Conventions

fileGrp USE attribute

#9 #7

A METS file can have 1..n <fileGrp> elements. The USE attribute of each MUST be one of:

@USE               Type of use for OCR-D
OCR-D-IMG          The unmanipulated source images
OCR-D-IMG-BIN      Black-and-White images
OCR-D-IMG-GRAY     Gray images
OCR-D-IMG-CROP     Cropped images
OCR-D-IMG-DESKEW   Deskewed images
OCR-D-IMG-DESPECK  Despeckled images
OCR-D-IMG-DEWARP   Dewarped images
OCR-D-SEG-PAGE     Page segmentation
OCR-D-SEG-BLOCK    Block segmentation
OCR-D-SEG-LINE     Line segmentation
OCR-D-OCR-TESS3    Tesseract 3.04 OCR
OCR-D-OCR-TESS4    Tesseract 4.00 OCR
OCR-D-OCR-ANY      AnyOCR
OCR-D-COR-CIS      CIS post-correction
OCR-D-COR-ASV      ASV post-correction
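
A check of this rule could look like the following minimal sketch, written in Python with only the standard library. The names ALLOWED_USES and invalid_file_group_uses are hypothetical and not part of any OCR-D API; the only external fact assumed is the standard METS namespace http://www.loc.gov/METS/.

```python
import xml.etree.ElementTree as ET

# Standard METS namespace in ElementTree's "{namespace}tag" notation.
METS_FILEGRP = "{http://www.loc.gov/METS/}fileGrp"

# The USE values listed in the table above.
ALLOWED_USES = frozenset({
    "OCR-D-IMG", "OCR-D-IMG-BIN", "OCR-D-IMG-GRAY", "OCR-D-IMG-CROP",
    "OCR-D-IMG-DESKEW", "OCR-D-IMG-DESPECK", "OCR-D-IMG-DEWARP",
    "OCR-D-SEG-PAGE", "OCR-D-SEG-BLOCK", "OCR-D-SEG-LINE",
    "OCR-D-OCR-TESS3", "OCR-D-OCR-TESS4", "OCR-D-OCR-ANY",
    "OCR-D-COR-CIS", "OCR-D-COR-ASV",
})


def invalid_file_group_uses(mets_xml: str) -> list:
    """Return the USE values of all mets:fileGrp elements that are not allowed.

    A fileGrp without a USE attribute is reported as the empty string.
    """
    root = ET.fromstring(mets_xml)
    uses = (grp.get("USE", "") for grp in root.iter(METS_FILEGRP))
    return [use for use in uses if use not in ALLOWED_USES]
```

An empty result means every <fileGrp> in the document uses one of the listed values.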

Generated file ID attributes

The ID of each file produced SHOULD be <USE>_<INDEX>, where <USE> is the USE of the surrounding <mets:fileGrp> and <INDEX> is the zero-padded four-digit index of the file within the group. This way, file IDs are unique within the document.

Example:

<mets:fileGrp USE="OCR-D-SEG-LINE">
  <mets:file ID="OCR-D-SEG-LINE_0001">[...]</mets:file>
</mets:fileGrp>
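
Generating such IDs can be sketched in Python as follows. The function name make_file_id is hypothetical. Whether counting starts at 0 or 1 is not stated here, so the index is left to the caller.

```python
def make_file_id(use: str, index: int) -> str:
    """Build a file ID of the form <USE>_<INDEX> with a zero-padded four-digit index."""
    if not 0 <= index <= 9999:
        raise ValueError("index must fit into four digits")
    return f"{use}_{index:04d}"
```

For instance, make_file_id("OCR-D-SEG-LINE", 1) returns "OCR-D-SEG-LINE_0001", matching the example above.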

One PAGE XML document per document page

A single PAGE XML file represents one page in the original document.

Every <pc:Page> element MUST have an image attribute, which MUST always reference the source image.

The PAGE XML root element <pc:PcGts> MUST have exactly one <pc:Page>.
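
The two rules in this section can be checked with a short Python sketch like the one below. The function names are hypothetical. Elements are matched by local name only, so the sketch makes no assumption about which PAGE namespace version a document declares.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def check_page_document(page_xml: str) -> list:
    """Return a list of rule violations for one PAGE XML document (empty if none)."""
    root = ET.fromstring(page_xml)
    problems = []
    if local_name(root.tag) != "PcGts":
        problems.append("root element is not PcGts")
    # Only direct children of the root count as the document's Page elements.
    pages = [child for child in root if local_name(child.tag) == "Page"]
    if len(pages) != 1:
        problems.append(f"expected exactly one Page, found {len(pages)}")
    for page in pages:
        if not page.get("image"):
            problems.append("Page has no image attribute")
    return problems
```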

Images and coordinates

Coordinates are always absolute, i.e. relative to the extent defined by the imageWidth/imageHeight attributes of the nearest <pc:Page>.

When a processor wants to access the image of a layout element like a TextRegion or TextLine, the algorithm should be:

  • If the element in question has an attribute imageFilename, resolve this value
  • If the element has a <pc:Coords> subelement, resolve by passing the attribute imageFilename of the nearest <pc:Page> and the points attribute of the <pc:Coords> element
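
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: image_request and parse_points are hypothetical names, the nearest page is passed in explicitly because ElementTree elements have no parent pointer, and the points attribute is assumed to hold space-separated "x,y" pairs. The function only decides what to hand to a resolver; it does not load or crop any image.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Point = Tuple[int, int]


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points: str) -> List[Point]:
    """Parse a points attribute such as '10,20 110,20 110,60 10,60'."""
    return [tuple(int(value) for value in pair.split(",")) for pair in points.split()]


def image_request(element: ET.Element,
                  page: ET.Element) -> Optional[Tuple[str, Optional[List[Point]]]]:
    """Decide which image to resolve for a layout element.

    Returns (image filename, polygon or None), or None if neither step applies.
    """
    # Step 1: the element carries its own image reference.
    own_image = element.get("imageFilename")
    if own_image:
        return own_image, None
    # Step 2: cut the element out of the page image using its coordinates.
    for child in element:
        if local_name(child.tag) == "Coords":
            return page.get("imageFilename"), parse_points(child.get("points", ""))
    return None
```

A caller would pass the returned filename, and the polygon if there is one, to the resolver's resolveImage.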

API

📦TODO📦 https://github.com/PRImA-Research-Lab/prima-core-libs and its apidocs.


Resolver

📦TODO📦 Describe

  • Data Repository
  • backend for the transparency in handling input and output
  • cutting out images
  • etc.

new Ocrd.Resolver()

Creates a resolver and configures it, e.g. with the ZIP container against which file URLs should be resolved.

OcrdPage resolvePage(String url)

Resolve a URL to an OcrdPage.

OcrdMets resolveMets(String url)

Resolve a URL to an OcrdMets.

OcrdImage resolveImage(String url)

Resolve a URL to an OcrdImage.

OcrdImage resolveImage(String url, OcrdCoords coords)

Resolve a URL to an image, then crop it to the coordinates provided.
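
Since the resolver is not yet specified in detail, the following Python sketch only illustrates how file URLs might be resolved transparently against either a ZIP container or a plain directory. The class layout, the constructor parameters zip_source and base_dir, and the method resolve_bytes are assumptions made for this illustration and are not defined by this document.

```python
import zipfile
from pathlib import Path


class Resolver:
    """Hypothetical sketch: resolve relative file URLs against a ZIP or a directory."""

    def __init__(self, zip_source=None, base_dir="."):
        # zip_source may be a path or a file-like object, as accepted by zipfile.ZipFile.
        self._zip = zipfile.ZipFile(zip_source) if zip_source is not None else None
        self._base_dir = Path(base_dir)

    def resolve_bytes(self, url: str) -> bytes:
        """Return the raw content behind a relative file URL."""
        if self._zip is not None:
            # ZipFile.read raises KeyError if the member does not exist.
            return self._zip.read(url)
        return (self._base_dir / url).read_bytes()
```

The typed methods (resolvePage, resolveMets, resolveImage) could then parse the bytes returned by such a low-level lookup.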


OcrdMets

Represents the METS file as used for input and output of the processors.

List<OcrdPage> listInputPages()

If fileGrp USE="INPUT" contains files with mimetype="text/xml", parse each of them into an OcrdPage and list them.

Otherwise, if fileGrp USE="INPUT" contains file mimetype="image/*", generate empty PAGE XML from these by

  • Creating a pc:PcGts and therein
  • an empty pc:Page element with image="<URL>"
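
The behaviour described for listInputPages can be sketched in Python as below. All names are hypothetical. The input is simplified to (mimetype, url) pairs taken from fileGrp USE="INPUT" plus a read_text callable that returns the XML behind a URL. The PAGE namespace version used here is an assumption; substitute the one your schema requires.

```python
import xml.etree.ElementTree as ET

# Assumption: one published PAGE namespace version, chosen for illustration.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"


def empty_page_for_image(image_url: str) -> ET.Element:
    """Create a pc:PcGts element holding one empty pc:Page that points at the image."""
    root = ET.Element(f"{{{PAGE_NS}}}PcGts")
    ET.SubElement(root, f"{{{PAGE_NS}}}Page", {"image": image_url})
    return root


def list_input_pages(files, read_text):
    """Return one parsed or generated PAGE root element per input page.

    files: iterable of (mimetype, url) pairs; read_text: callable url -> XML string.
    """
    files = list(files)
    # Prefer existing PAGE XML files if the input group contains any.
    xml_urls = [url for mimetype, url in files if mimetype == "text/xml"]
    if xml_urls:
        return [ET.fromstring(read_text(url)) for url in xml_urls]
    # Otherwise generate an empty PAGE document for every image.
    image_urls = [url for mimetype, url in files if mimetype.startswith("image/")]
    return [empty_page_for_image(url) for url in image_urls]
```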

listVariants

📦TODO📦 Wrong here

Lists all variants, i.e. nested METS files used as INPUT. In the common case that there is no nesting, this will return just one variant with all the files listed in INPUT.

OcrdPage getInputPage(i)

List<OcrdPage> listOutputs()

addOutput(OcrdPage page)


OcrdPage

Should be generated by the resolver.

Image getImage()

Image getAlternativeImage(type)


TextRegion

Image getImage()


TextLine

Image getAlternativeImage(type)

Glossary

Processor

A processor is a tool that accepts METS/PAGE input and produces METS/PAGE output according to this spec.