OCR images of old allometry data.
Trying to OCR the images directly has not worked well. The pages are old and were never of high quality to begin with.
- Train some models to recognize the distorted and broken characters (MNIST-style).
- Dissect the images into lines and then into characters on each line (see the segmentation sketch after this list).
- Read the allometry sheets character by character and run each character through every model.
- Create a single "best" version by using the models as an ensemble (a voting sketch follows the list).
- Clean up the text. See if we can differentiate 0s from Os and 1s from Is, etc. Also use spell checking and other heuristics (a sample heuristic follows the list).
- Write the data to CSV files. Try to recreate the tables as closely as possible. For other report formats, separate the labels from the values.
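To make the dissection step concrete, here is a rough sketch using projection profiles. The fixed binarization threshold and the lack of any gap tolerance are assumptions that real sheets would almost certainly need tuned:

```python
import numpy as np
from PIL import Image

def split_on_runs(profile):
    """Return (start, end) pairs for contiguous runs where the profile is nonzero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_characters(path, threshold=128):
    """Crude projection-profile segmentation: page -> text lines -> characters."""
    page = np.array(Image.open(path).convert("L")) < threshold  # True where there is ink
    characters = []
    for top, bottom in split_on_runs(page.sum(axis=1)):      # rows with ink = a text line
        line = page[top:bottom]
        for left, right in split_on_runs(line.sum(axis=0)):  # columns with ink = a character
            characters.append(line[:, left:right])
    return characters
```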
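The ensemble step could start as a simple majority vote over the per-character predictions. This assumes each model exposes a `predict` method that returns a single character, which is an interface invented here for illustration:

```python
from collections import Counter

def ensemble_read(char_images, models):
    """Run every character crop through every model and keep the majority vote."""
    text = []
    for image in char_images:
        votes = Counter(model.predict(image) for model in models)
        text.append(votes.most_common(1)[0][0])
    return "".join(text)
```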
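One possible heuristic for the 0/O and 1/I confusion: if a token is mostly digits, treat the lookalike letters as digits. The half-digit cutoff is just a guess:

```python
def fix_digit_letter_confusion(token):
    """If a token looks numeric, assume stray O/o/I/l characters are really 0s and 1s."""
    if token and sum(ch.isdigit() for ch in token) >= len(token) / 2:
        return token.translate(str.maketrans("OoIl", "0011"))
    return token

def clean_line(line):
    """Apply the digit/letter heuristic token by token; spell checking would slot in here too."""
    return " ".join(fix_digit_letter_confusion(token) for token in line.split())
```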
The notebooks directory contains experiments where I figure out how to approach things. The code is not expected to be useful outside of that.
This code is in the history directory, so you don't have to go Git spelunking to find it.
We tried running the images through the tesseract OCR program in various configurations. It's a great program, but it's not designed to work with "distressed" fonts.
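For reference, those configuration sweeps mostly amounted to varying tesseract's page-segmentation and engine modes. With pytesseract that looks roughly like the following; the file name and the modes shown here are only illustrative:

```python
import pytesseract
from PIL import Image

page = Image.open("sheet_001.png")  # hypothetical page image

# Sweep a few page-segmentation (--psm) and OCR-engine (--oem) modes.
for psm in (4, 6, 11):
    text = pytesseract.image_to_string(page, config=f"--oem 1 --psm {psm}")
    print(f"--psm {psm}:\n{text}\n")
```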
We're going to break this into three steps:
- Clean up the images before the OCR step.
    - Remove stray marks and fix problems with fonts, bad printing, etc.
    - We're going to train a neural net to do this, either a denoising autoencoder or a U-Net.
    - We will generate and save formatted text for the pages, render that text to an image to use as ground truth, and then dirty the image. We can then use these image pairs to train the net (a sketch of this follows the list).
- OCR the images.
    - The plan is to use tesseract to do this.
- Clean up the text after the OCR.
    - Compare the OCR output to the text generated in step 1 (a comparison sketch also follows the list).
    - If they compare favorably, then just use the OCR output.
    - If they do not, then we will do our best to correct the problems.
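A sketch of the ground-truth generation described in step 1: render the saved text onto a clean page image, then degrade a copy of it. The blur radius and noise level are placeholders; each (dirty, clean) pair would become one training example for the denoising net:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_page(text, size=(1200, 1600)):
    """Render ground-truth text onto a clean white page image."""
    clean = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(clean)
    draw.multiline_text((40, 40), text, fill=0, font=ImageFont.load_default())
    return clean

def dirty_page(clean, noise_level=30, seed=None):
    """Degrade the clean page with blur and speckle noise to mimic bad printing."""
    rng = np.random.default_rng(seed)
    blurred = clean.filter(ImageFilter.GaussianBlur(radius=1))
    pixels = np.asarray(blurred, dtype=np.int16)
    pixels = pixels + rng.integers(-noise_level, noise_level, pixels.shape)
    return Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
```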
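And a sketch of the step-3 comparison, using a plain sequence-similarity ratio. The 0.95 cutoff is a placeholder, not a tested value:

```python
from difflib import SequenceMatcher

def ocr_agreement(ocr_text, truth_text):
    """Similarity ratio (0..1) between the OCR output and the generated text."""
    return SequenceMatcher(None, ocr_text, truth_text).ratio()

def accept_ocr(ocr_text, truth_text, cutoff=0.95):
    """Keep the OCR output as-is only when it agrees closely with the ground truth."""
    return ocr_agreement(ocr_text, truth_text) >= cutoff
```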
No matter how much the images were cleaned, they just would not work with the major open-source OCR programs.