Rtesseract package

This is an R interface to the tesseract OCR (Optical Character Recognition) system.

Installing tesseract involves first installing leptonica (http://www.leptonica.com/).

This is currently a basic interface to the essential functionality, with some added R functionality to visualize the results.

Of course, the package provides functionality to get the recognized text. It also allows us to do this at various levels, e.g., word, character, or line.
We can create a searchable and selectable PDF version of the image(s).
We can output the results of the OCR to a tab-separated-value file, an HTML (hocr) file, a BoxText, a UNLV, or an OSD file.
We can also use different page segmentation modes to detect lines on the image, which is useful for processing tables where lines separate rows or columns.
We can get the confidence for each recognized text element to understand whether it is a good match or not.
We can get the location and dimensions of each of the text elements. Again, this is necessary for processing tables and other structured content.
We can display the matched text and the associated confidences to see spatial patterns, and we can overlay these on the original image.
We can restrict the recognition to a sub-rectangle of the image.
The package provides lower-level access to the C++ API, allowing more fine-grained, efficient, and flexible programmatic use.
We can set and query many variables controlling tesseract's behaviour.
We can query details about the image.
We can query the metadata about the version of tesseract, the supported image formats, etc.

We can machine-generate the interface to the other methods and classes in the tesseract API/library.

History

We - Matt Espe & Duncan Temple Lang - started developing this package in April 2015.
