OCR images of old allometry data.
Trying to OCR the images directly has not worked well. The pages are old and were never of high quality to begin with.
- Train some models to recognize the distorted and broken characters (MNIST-style).
- Dissect the images into lines and then into characters on each line (see the segmentation sketch after this list).
- Read the allometry sheets character by character and run each character through every model.
- Create a single "best" version by using the models as an ensemble (a voting sketch follows the list).
- Clean up the text. See if we can differentiate 0s from Os and 1s from Is, etc. Also use spell checking and other heuristics (a sample heuristic follows the list).
- Write the data to CSV files. Try to recreate the tables as closely as possible. For other report formats, separate the labels from the values.
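To make the dissection step concrete, here is a rough sketch using projection profiles. The fixed binarization threshold and the lack of any gap tolerance are assumptions that real sheets would almost certainly need tuned:

```python
import numpy as np
from PIL import Image

def split_on_runs(profile):
    """Return (start, end) pairs for contiguous runs where the profile is nonzero."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_characters(path, threshold=128):
    """Crude projection-profile segmentation: page -> text lines -> characters."""
    page = np.array(Image.open(path).convert("L")) < threshold  # True where there is ink
    characters = []
    for top, bottom in split_on_runs(page.sum(axis=1)):      # rows with ink = a text line
        line = page[top:bottom]
        for left, right in split_on_runs(line.sum(axis=0)):  # columns with ink = a character
            characters.append(line[:, left:right])
    return characters
```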
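The ensemble step could start as a simple majority vote over the per-character predictions. This assumes each model exposes a `predict` method that returns a single character, which is an interface invented here for illustration:

```python
from collections import Counter

def ensemble_read(char_images, models):
    """Run every character crop through every model and keep the majority vote."""
    text = []
    for image in char_images:
        votes = Counter(model.predict(image) for model in models)
        text.append(votes.most_common(1)[0][0])
    return "".join(text)
```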
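One possible heuristic for the 0/O and 1/I confusion: if a token is mostly digits, treat the lookalike letters as digits. The half-digit cutoff is just a guess:

```python
def fix_digit_letter_confusion(token):
    """If a token looks numeric, assume stray O/o/I/l characters are really 0s and 1s."""
    if token and sum(ch.isdigit() for ch in token) >= len(token) / 2:
        return token.translate(str.maketrans("OoIl", "0011"))
    return token

def clean_line(line):
    """Apply the digit/letter heuristic token by token; spell checking would slot in here too."""
    return " ".join(fix_digit_letter_confusion(token) for token in line.split())
```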
The notebooks directory contains experiments where I figure out how to approach things. The code is not expected to be useful outside of that.
This code is in the history directory, so you don't have to go Git spelunking to find it.
We tried running the images through the tesseract OCR program in various configurations. It's a great program, but it's not designed to work with "distressed" fonts.
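For reference, those configuration sweeps mostly amounted to varying tesseract's page-segmentation and engine modes. With pytesseract that looks roughly like the following; the file name and the modes shown here are only illustrative:

```python
import pytesseract
from PIL import Image

page = Image.open("sheet_001.png")  # hypothetical page image

# Sweep a few page-segmentation (--psm) and OCR-engine (--oem) modes.
for psm in (4, 6, 11):
    text = pytesseract.image_to_string(page, config=f"--oem 1 --psm {psm}")
    print(f"--psm {psm}:\n{text}\n")
```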
We're going to break this into three steps:
- Clean up the images before the OCR step.
    - Remove stray marks and fix problems with fonts, bad printing, etc.
    - We're going to train a neural net to do this, either a denoising autoencoder or a U-Net.
    - We will generate and save formatted text for the pages, render that text to an image to use as ground truth, and then dirty the image. We can then use these image pairs to train the net (a sketch of this follows the list).
- OCR the images.
    - The plan is to use tesseract to do this.
- Clean up the text after the OCR.
    - Compare the OCR output to the text generated in step 1 (a comparison sketch also follows the list).
    - If they compare favorably, then just use the OCR output.
    - If they do not, then we will do our best to correct the problems.
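A sketch of the ground-truth generation described in step 1: render the saved text onto a clean page image, then degrade a copy of it. The blur radius and noise level are placeholders; each (dirty, clean) pair would become one training example for the denoising net:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_page(text, size=(1200, 1600)):
    """Render ground-truth text onto a clean white page image."""
    clean = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(clean)
    draw.multiline_text((40, 40), text, fill=0, font=ImageFont.load_default())
    return clean

def dirty_page(clean, noise_level=30, seed=None):
    """Degrade the clean page with blur and speckle noise to mimic bad printing."""
    rng = np.random.default_rng(seed)
    blurred = clean.filter(ImageFilter.GaussianBlur(radius=1))
    pixels = np.asarray(blurred, dtype=np.int16)
    pixels = pixels + rng.integers(-noise_level, noise_level, pixels.shape)
    return Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))
```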
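And a sketch of the step-3 comparison, using a plain sequence-similarity ratio. The 0.95 cutoff is a placeholder, not a tested value:

```python
from difflib import SequenceMatcher

def ocr_agreement(ocr_text, truth_text):
    """Similarity ratio (0..1) between the OCR output and the generated text."""
    return SequenceMatcher(None, ocr_text, truth_text).ratio()

def accept_ocr(ocr_text, truth_text, cutoff=0.95):
    """Keep the OCR output as-is only when it agrees closely with the ground truth."""
    return ocr_agreement(ocr_text, truth_text) >= cutoff
```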
No matter how much the images were cleaned, they just would not work with the major open-source OCR programs.