GitHub - CUSAT/hocr-tools: Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

hocr-tools

Tools for manipulating and evaluating the hOCR microformat for representing multi-lingual OCR results.

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survives editing and manipulation. hOCR markup is independent of the presentation.

The programs

hocr-check file.html

Perform consistency checks on the hOCR file.

hocr-combine file1.html file2.html...

Combine the OCR pages contained in each HTML file into a single document. The document metadata is taken from the first file.

hocr-split file.html pattern

Split a multipage hOCR file into hOCR files containing one page each. The pattern should something like "base-%03d.html"

hocr-eval-lines [-v] true-lines.txt hocr-actual.html

Evaluate hOCR output against ASCII ground truth. This evaluation method requires that the line breaks in true-lines.txt and the ocr_line elements in hocr-actual.html agree (most ASCII output from OCR systems satisfies this requirement).

hocr-eval-geom [-e element-name] [-o overlap-threshold] hocr-truth hocr-actual

Compare the segmentations at the level of the element name (default: ocr_line). Computes undersegmentation, oversegmentation, and missegmentation.

hocr-eval hocr-true.html hocr-actual.html

Evaluate the actual OCR with respect to the ground truth. This outputs the number of OCR errors due to incorrect segmentation and the number of OCR errors due to character recognition errors.

It works by aligning segmentation components geometrically, and for each segmentation component that can be aligned, computing the string edit distance of the text the segmentation component contains.

hocr-merge-dc dc.xml hocr.html > hocr-new.html

Merges the Dublin Core metadata into the hOCR file by encoding the data in its header.

About the code

Each command line program is self contained; if you have the right Python packages installed, it should just work. (Unfortunately, that means some code duplication; we may revisit this issue in later revisions.)

Pointers

The format itself is defined here:

https://github.com/CUSAT/hocr-tools/blob/master/format/format.pdf?raw=true

The project is hosted here:

https://cusat.github.com/hocr-tools

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
format		format
README.md		README.md
dcsample.xml		dcsample.xml
dcsample2.xml		dcsample2.xml
hocr-check		hocr-check
hocr-combine		hocr-combine
hocr-eval		hocr-eval
hocr-eval-geom		hocr-eval-geom
hocr-eval-lines		hocr-eval-lines
hocr-extract-g1000		hocr-extract-g1000
hocr-extract-images		hocr-extract-images
hocr-lines		hocr-lines
hocr-merge-dc		hocr-merge-dc
hocr-split		hocr-split
sample.html		sample.html
sample.txt		sample.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hocr-tools

The programs

About the code

Pointers

About

Releases

Packages

Languages

CUSAT/hocr-tools

Folders and files

Latest commit

History

Repository files navigation

hocr-tools

The programs

About the code

Pointers

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages