[Feature Request] Wrapper around training #508

Closed
forzagreen opened this issue Aug 25, 2023 · 3 comments
Comments

@forzagreen

It would be great if pytesseract offered a wrapper around the training functionality of Tesseract (https://github.com/tesseract-ocr/tesstrain).
Since training is not done often in Tesseract, the option could be added as a package extra, e.g. installed with pip install pytesseract[training]
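
To illustrate the idea, here is a minimal sketch of how code behind an optional extra could check whether the training dependency is installed before using it. The helper name `training_available` and the default module name `tesstrain` are assumptions made for illustration; pytesseract does not provide anything like this.

```python
import importlib.util


def training_available(module_name="tesstrain"):
    """Return True if the optional training dependency can be imported.

    The default module name is an assumption for illustration only.
    """
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if training_available():
        print("Training extra is installed.")
    else:
        print("Training extra is missing; install the optional dependency first.")
```

Checking with `find_spec` avoids importing the optional package until it is actually needed, so a plain install keeps working without it.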

@stefan6419846
Contributor

What exactly are you looking for?

For the training with artificial data, there already is a Python package (https://github.com/tesseract-ocr/tesstrain/tree/main/src, tesstrain on PyPI with some smaller modifications, currently maintained/owned by me in a fork of the original code).

For the training with real data, there is currently mostly just a Makefile. If I remember the discussions in some PRs correctly, one collaborator has plans to move everything to Python and provide it in one package, but there are no results for this at the moment.
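
For context, a thin wrapper around the Makefile-based workflow could look roughly like the sketch below. It only assembles and runs a `make training` call. The function names are hypothetical, and the variable names (`MODEL_NAME`, `START_MODEL`, `MAX_ITERATIONS`) follow the tesstrain README as I understand it, so they should be checked against the Makefile in your checkout.

```python
import subprocess
from pathlib import Path


def build_training_command(model_name, start_model=None, max_iterations=None):
    """Assemble the argument list for a `make training` call.

    Variable names are assumed from the tesstrain README; verify them locally.
    """
    cmd = ["make", "training", f"MODEL_NAME={model_name}"]
    if start_model is not None:
        cmd.append(f"START_MODEL={start_model}")
    if max_iterations is not None:
        cmd.append(f"MAX_ITERATIONS={max_iterations}")
    return cmd


def run_training(tesstrain_dir, model_name, **options):
    """Run the command inside a local tesstrain checkout (hypothetical helper)."""
    cmd = build_training_command(model_name, **options)
    # check=True raises CalledProcessError if make exits with a non-zero status.
    return subprocess.run(cmd, cwd=Path(tesstrain_dir), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with model names, and keeping command construction separate from execution makes the wrapper easy to test without running make.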

That being said, I see no real value in pytesseract adding functionality like this.

@forzagreen
Author

Hi @stefan6419846, thank you for sharing this information.
I didn't know about this PyPI package and the Python code behind it.

The training documentation is confusing and scattered across 3 repos (tesseract, tessdoc and tesstrain). It documents only the Makefiles. It would be worth documenting the Python options as well.

Thanks again. Closing this issue.

@stefan6419846
Contributor

tessdoc documents the training process with the Python package in a basic manner, without any actual references to tesstrain (or the tesstrain.sh script, which was the old way): https://tesseract-ocr.github.io/tessdoc/tess5/TrainingTesseract-5.html But yes, I have already pointed out the unclear docs in the past. Fixing them does not seem to be a high priority, and my own experience is mostly limited to training with artificial data.
