
segmentation: tesseract5.3.0 vs ocrd/all:2022-08-15 #346

Open
jbarth-ubhd opened this issue Mar 1, 2023 · 10 comments
@jbarth-ubhd

I am a bit puzzled by the different recognition results from standalone tesseract5.3.0 versus OCR-D with ocrd-olena-binarize && ocrd-tesserocr-segment.

Original TIF: https://digi.ub.uni-heidelberg.de/diglitData/v/heidelberg1592_-_04manual.tif

Result using tesseract5.3.0 -l Fraktur_GT4Hist... (right column = ground truth)
[screenshot: tesseract5.3.0 output vs. ground truth]

and using tesserocr-segment and calamari-recognize (fraktur_historical1.0) with OCR-D:
[screenshot: tesserocr-segment + calamari-recognize output vs. ground truth]

and using tesserocr-segment and tesserocr-recognize (Fraktur_GT4Hist...) with OCR-D:
[screenshot: tesserocr-segment + tesserocr-recognize output vs. ground truth]

It seems that the OCR-D "tesserocr" segmentation differs somewhat from standalone Tesseract's segmentation (perhaps because of olena-binarize?), but I cannot find any major change to line/region segmentation in the Tesseract changelog over the last year.

@bertsky (Collaborator) commented Mar 1, 2023

Hard to tell from a diff tool I don't know and data I cannot see. It looks like two lines are duplicated in the ocrd-tesserocr result.

Binarization will have an impact, yes – both on segmentation and recognition. (For recognition, we currently do not pass the raw images, because we don't know what the model "wants". The only way to ensure recognition on the raw image is either not to have binarization in the workflow at all, or to remove the respective annotation in the fileGrp that is used as input for recognition, for example via ocrd-page-transform -P xsl page-remove-alternativeimages -P xslt-params "-s which=binarized".)

Mind that ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize is not recommended, as it needlessly throws away internal information from Tesseract's layout analysis. (You can do segmentation and recognition in one pass with ocrd-tesserocr-recognize.)

Standalone Tesseract is another beast entirely: it always uses the raw image for recognition. Also, you can now choose some new adaptive thresholding via -c thresholding_method=1 (or =2 for Sauvola). It also comes with its own parameters (thresholding_window_size, thresholding_kfactor, thresholding_tile_size, thresholding_score_fraction).
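For example, a hedged sketch of such an invocation (the input filename is illustrative, and the window/kfactor values shown are, to my knowledge, the Tesseract 5 defaults, so adjust as needed):

```shell
IMG=heidelberg1592_-_04manual.tif   # illustrative input scan

recognize_with_sauvola() {
  # thresholding_method: 1 = adaptive (tiled) Otsu, 2 = Sauvola
  tesseract "$IMG" out -l Fraktur_GT4HistOCR \
    -c thresholding_method=2 \
    -c thresholding_window_size=0.33 \
    -c thresholding_kfactor=0.34
}

# Only run when tesseract is actually installed; otherwise this is a sketch.
command -v tesseract >/dev/null 2>&1 && recognize_with_sauvola \
  || echo "tesseract not found; command shown for reference"
```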

@jbarth-ubhd (Author)

OK: ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR6 -P segmentation_level region -P textequiv_level word -P find_tables true -P overwrite_segments true -P model Fraktur_GT4HistOCR gives exactly the same results as tesseract5.3.0 (for this example).

@jbarth-ubhd (Author)

OK, and for completeness, tesserocr-segment + calamari + qurator-gt4histocr1.0:

[screenshot: tesserocr-segment + calamari (qurator-gt4histocr1.0) output vs. ground truth]

@bertsky (Collaborator) commented Mar 1, 2023

Thanks @jbarth-ubhd for checking thoroughly!

BTW, if you want to try any of the better Calamari 2 models here and there (probably also here), you currently have to switch to Calamari 2 on the standalone CLI. (In an OCR-D workflow, this can be integrated by first exporting line images with ocrd-segment-extract-lines -I SEG -O LINES -P output-types '["text"]', then predicting with calamari-predict --checkpoint path/to/best.ckpt.json --data.pred_extension .pred.txt --data.images "LINES/*.png", and finally importing the text back via ocrd-segment-replace-text -I SEG -O OCR -P file_glob "LINES/*.pred.txt".)
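The three-step round trip above can be sketched as a small script; the fileGrp names SEG/LINES/OCR and the checkpoint path are illustrative, and the commands only actually run if the tools are installed:

```shell
# Sketch of the export / predict / re-import round trip.
CKPT=path/to/best.ckpt.json   # Calamari 2 checkpoint (illustrative path)

run_all() {
  # 1. export one PNG (plus .txt metadata) per text line from fileGrp SEG
  ocrd-segment-extract-lines -I SEG -O LINES -P output-types '["text"]'
  # 2. recognize the line images with standalone Calamari 2
  calamari-predict --checkpoint "$CKPT" \
    --data.pred_extension .pred.txt --data.images "LINES/*.png"
  # 3. write the predictions back into PAGE-XML as fileGrp OCR
  ocrd-segment-replace-text -I SEG -O OCR -P file_glob "LINES/*.pred.txt"
}

command -v ocrd-segment-extract-lines >/dev/null 2>&1 && run_all \
  || echo "OCR-D/Calamari tools not installed; sketch only"
```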

@jbarth-ubhd (Author) commented Mar 2, 2023

Step 2 does not work (all pip modules installed without any conflict):

> calamari-predict --checkpoint /home/jb/calamari-models-v2/gt4histocr/*.ckpt* --data.images OCR-D-LINES/*.png 2>&1 | egrep -vi 'libnv|cuda|Nvidia' | fold -s -w 110

2023-03-02 10:39:28.644398: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO     2023-03-02 10:39:30,011     tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
INFO     2023-03-02 10:39:30,011 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,021 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,025 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,028 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,032 calamari_ocr.ocr.savedmodel.sa: Checkpoint version 5 is up-to-date.
INFO     2023-03-02 10:39:30,054     tfaip.device.device_config: Setting up device config DeviceConfigParams(gpus=None, gpu_auto_tune=False, gpu_memory=None, soft_device_placement=True, dist_strategy=<DistributionStrategy.DEFAULT: 'default'>)
CRITICAL 2023-03-02 10:39:30,061             tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/home/jb/.local/bin/calamari-predict", line 8, in <module>
    sys.exit(main())
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 191, in main
    run(args.root)
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/scripts/predict.py", line 119, in run
    predictor = MultiPredictor.from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/calamari_ocr/ocr/predict/predictor.py", line 53, in from_paths
    multi_predictor = super(MultiPredictor, cls).from_paths(
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 107, in from_paths
    models = [
  File "/home/jb/.local/lib/python3.8/site-packages/tfaip/predict/multimodelpredictor.py", line 108, in <listcomp>
    keras.models.load_model(model, compile=False, custom_objects=scenario.model_cls().all_custom_objects())
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jb/.local/lib/python3.8/site-packages/keras/utils/generic_utils.py", line 103, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

BTW, the md5sums of all *.json files are identical:

jb@pers16:~/calamari-models-v2/gt4histocr> md5sum *
dbcf154171b4d98eea43eebfeb808d2f  0.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  0.ckpt.json
95cc6d142e33f7ac1a3eb44413a71d03  1.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  1.ckpt.json
ed09d330c603958e3c89a0b46218420c  2.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  2.ckpt.json
ec1c9457824c1679e1b4cc2d49343b43  3.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  3.ckpt.json
8c7b560f08625b3f01974199f8a5921a  4.ckpt.h5
c7c6af20930f84bf2f677e8713f25752  4.ckpt.json

But the error message ValueError: bad marshal data (unknown type code) is the same for deep3...

@jbarth-ubhd (Author) commented Mar 2, 2023

tesseract5.3.0 -l Fraktur_GT4HistOCR on the manually cropped https://digi.ub.uni-heidelberg.de/diglitData/v/04-manual-crop.tif, but with perspective not corrected (very minimal perspective distortion):

Note that there are more errors than in the first OCR comparison image here (#346 (comment)), although the base image is almost the same:

[screenshot: OCR output on the cropped image vs. ground truth]

@bertsky (Collaborator) commented Mar 2, 2023

> But the error message ValueError: bad marshal data (unknown type code) is the same for deep3...

Ouch. With Python >= 3.8 we are now hit hard by Calamari-OCR/calamari#78. The solution is to convert the models from HDF5 to TensorFlow's SavedModel format, but you need a Python+TF version under which the models still load in the first place. As a workaround, you can try Python 3.7 or 3.6.
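For context: the usual cause of "bad marshal data (unknown type code)" is that these Keras HDF5 checkpoints store Python lambda code serialized with the marshal module, whose bytecode format is tied to the CPython version that wrote it. A minimal illustration of that version coupling:

```python
import marshal
import types

def double(x):
    return 2 * x

# marshal serializes raw code objects; the byte format carries no
# cross-version guarantee between CPython releases.
raw = marshal.dumps(double.__code__)

# Loading works under the *same* interpreter version...
restored = types.FunctionType(marshal.loads(raw), globals())
assert restored(21) == 42

# ...but bytes written by a different CPython version typically fail with:
# ValueError: bad marshal data (unknown type code)
```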

@bertsky (Collaborator) commented Mar 2, 2023

> on manually cropped [...] but perspective not corrected (very minimal perspective distortion)
>
> Note that there are more errors than in the first ocr-comparison-Image here

Hard to tell. Tesseract's layout analysis is very buggy (I would even say fragile), and the legacy code has not been touched (maintained) for years...

@jbarth-ubhd (Author)

Quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.«

@bertsky (Collaborator) commented Mar 13, 2023

> ok ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR6 -P segmentation_level region -P textequiv_level word -P find_tables true -P overwrite_segments true -P model Fraktur_GT4HistOCR gives exactly the same results as tesseract5.3.0 (for this example).

> Quote from Stefan Weil: »We have clear evidence that it is extremely important to have line images for recognition which are similar to those used for training.«

Yes, that's obviously true. But as a user you have no way of knowing what the model expects (raw or binarized, and if binarized, what kind of binarization). There is no model metadata in Tesseract. (And in Calamari it could be stored in the model metadata, but the trainer does not do that.)

The model's publisher (in this case, @stweil) must document what the model was trained on (both what kind of material and in what digital form). The Tesseract models from Mannheim are usually documented in the tesstrain wiki. Their Kraken models, however, point to the wiki pages of the respective GT repos.
