Rtesseract package

This is an R interface to the tesseract OCR (Optical Character Recognition) system.

tesseract is available at https://code.google.com/p/tesseract-ocr/.

More recent versions are available on GitHub: https://github.com/tesseract-ocr/tesseract

Installing tesseract involves first installing leptonica: http://www.leptonica.com/.

This is currently a basic interface to the essential functionality, with some added R functionality to visualize the results.

  1. The package provides functionality to get the recognized text, and it allows us to do this at different levels, e.g., character, word, or line.
  2. We can create a searchable and selectable PDF version of the image(s).
  3. We can output the results of the OCR to a tab-separated-value file, an HTML (hocr) file, a BoxText, a UNLV, or an OSD file.
  4. We can also use different page segmentation modes so that we can detect and recognize lines on the image, which is useful for processing tables where the lines separate rows or columns.
  5. We can get the confidence for each recognized text element to understand whether it is a good match or not.
  6. We can get the location and dimensions of each of the text elements. Again, this is necessary for processing tables and other structured content.
  7. We can display the matched text and the associated confidences to see spatial patterns. We can also overlay these on the original image.
  8. We can restrict the recognition to a sub-rectangle of the image.
  9. The package provides lower-level access to the C++ API, allowing for more fine-grained and efficient use and flexible programmatic access.
  10. We can set and query many variables controlling tesseract's behaviour.
  11. We can query details about the image.
  12. We can query the metadata about the version of tesseract, the supported image formats, etc.
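
To illustrate why the confidences and bounding boxes in items 5 and 6 matter for tables, here is a minimal base R sketch. It does not call Rtesseract at all: the `words` data frame, its column names, the threshold, and the tolerance are all hypothetical stand-ins for word-level OCR output, chosen only for illustration.

```r
# Hypothetical word-level OCR results: one row per recognized word.
# The column names are illustrative assumptions, not the package's
# actual output format.
words <- data.frame(
  text       = c("Name", "Age", "Alice", "31", "B0b", "27"),
  confidence = c(96.1, 94.8, 91.3, 88.0, 42.5, 90.2),
  left       = c(10, 120, 10, 120, 10, 120),
  top        = c(5, 6, 35, 34, 65, 66),
  right      = c(60, 150, 70, 140, 50, 140),
  bottom     = c(20, 21, 50, 49, 80, 81),
  stringsAsFactors = FALSE
)

# Flag words whose confidence falls below an arbitrary threshold.
threshold <- 60
low_conf <- words$text[words$confidence < threshold]

# Group words into table rows: after sorting by vertical center, a gap
# larger than `tol` pixels between neighbours starts a new row.
tol <- 10
y_center <- (words$top + words$bottom) / 2
ord <- order(y_center)
row_id <- integer(nrow(words))
row_id[ord] <- cumsum(c(TRUE, diff(y_center[ord]) > tol))

# Within each row, order words left to right and join them with tabs.
by_left <- order(words$left)
table_rows <- vapply(
  split(words[by_left, ], row_id[by_left]),
  function(r) paste(r$text, collapse = "\t"),
  character(1)
)

print(low_conf)
print(unname(table_rows))
```

With real OCR output, the same idea applies: low-confidence elements are candidates for manual review, and the positions let us reassemble rows and columns.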

We can machine-generate the interface to the other methods and classes in the tesseract API/library.

History

We - Matt Espe & Duncan Temple Lang - started developing this package in April 2015.
