tesseract_get_version returns cached result after version change #411

Closed
JohnReid opened this issue Feb 4, 2022 · 8 comments · Fixed by #505

Comments

@JohnReid

JohnReid commented Feb 4, 2022

The @run_once decorator on pytesseract.get_tesseract_version() means that if the Tesseract binary is switched after get_tesseract_version() has already been called, for example via

pytesseract.pytesseract.tesseract_cmd = 'tesseract5'

the function keeps returning the cached result for the previous binary.

As a complete example:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = 'Tesseract/tesseract4'
print(pytesseract.get_tesseract_version())
pytesseract.pytesseract.tesseract_cmd = 'Tesseract/tesseract5'
print(pytesseract.get_tesseract_version())

gives the following (even though I have switched from version 4 to version 5)

4.1.3
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.1.2

4.1.3
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.1.2
@bozhodimitrov
Collaborator

bozhodimitrov commented Feb 4, 2022

This behavior is intended.
You can call the non-cached version via pytesseract.get_tesseract_version.__wrapped__()
This is a legacy of the pytesseract CLI, and to be honest, switching between different Tesseract versions during a single session should be extremely rare.
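
For readers unfamiliar with the pattern, here is a minimal sketch of how a run-once decorator produces the behaviour reported above and why `__wrapped__` bypasses it. It assumes the decorator is built on `functools.wraps`, which is what stores the original function as `__wrapped__`. The names `config` and `get_version` are hypothetical stand-ins, not pytesseract source.

```python
import functools
import types


def run_once(func):
    """Cache the first result of func and return it on every later call."""
    @functools.wraps(func)  # wraps() also stores func as wrapper.__wrapped__
    def wrapper(*args, **kwargs):
        if wrapper._result is wrapper:  # sentinel: nothing cached yet
            wrapper._result = func(*args, **kwargs)
        return wrapper._result

    wrapper._result = wrapper
    return wrapper


# Hypothetical stand-in for the pytesseract.pytesseract.tesseract_cmd setting.
config = types.SimpleNamespace(tesseract_cmd="tesseract4")


@run_once
def get_version():
    # Stand-in for the subprocess call that asks the binary for its version.
    return "version reported by " + config.tesseract_cmd


first = get_version()                # computed and cached
config.tesseract_cmd = "tesseract5"  # switch binaries mid-session
cached = get_version()               # still the tesseract4 result
fresh = get_version.__wrapped__()    # bypasses the cache
```

Note that in this sketch the `__wrapped__` call does not update the cached value, so later plain calls still return the old result.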

@JohnReid
Author

JohnReid commented Feb 4, 2022

Thanks for the quick response and the work-around.

I can see the behaviour is intended; it is just undocumented and totally counterintuitive. Memoising the return value keyed on the value of tesseract_cmd would satisfy your caching needs and also provide a consistent, intuitive interface.

Also, I don't think it is as rare as you think to switch between versions in a session, e.g. #279
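
A minimal sketch of the memoisation idea described in this comment, with the cache keyed on the configured command. `config`, `query_version`, and `get_version` are hypothetical names used for illustration only; this is not how pytesseract is implemented.

```python
import types

# Hypothetical stand-in for the pytesseract.pytesseract.tesseract_cmd setting.
config = types.SimpleNamespace(tesseract_cmd="tesseract4")

_version_cache = {}  # maps command string -> version result
calls = []           # records each "expensive" lookup, for demonstration


def query_version(cmd):
    # Stand-in for the expensive subprocess call.
    calls.append(cmd)
    return "version reported by " + cmd


def get_version():
    """Return the version for the current command, computing it at most once."""
    cmd = config.tesseract_cmd
    if cmd not in _version_cache:
        _version_cache[cmd] = query_version(cmd)
    return _version_cache[cmd]


v4 = get_version()
config.tesseract_cmd = "tesseract5"
v5 = get_version()        # new key, so a fresh lookup happens
config.tesseract_cmd = "tesseract4"
v4_again = get_version()  # served from the cache, no third lookup
```

The key here is only the command string, so two different binaries reachable under the same name would still share one cache entry.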

@bozhodimitrov
Collaborator

bozhodimitrov commented Feb 4, 2022

I see what you mean, but this could still be ambiguous. For example, the user might have two different Tesseract versions in different paths but with the same binary name. I don't see how we would distinguish between them without making the expensive call in a subshell (keep in mind that get_tesseract_version is used by some of the image_to_* methods).

Maybe we can just add a cache flag and if someone needs the non-cached version, they can make the function call with this flag.
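
A minimal sketch of the flag idea, under stated assumptions: the parameter name `use_cache` and the helpers `config` and `query_version` are hypothetical and are not taken from pytesseract. In this sketch a non-cached call also refreshes the stored value, which is one possible design choice rather than a requirement.

```python
import types

# Hypothetical stand-in for the pytesseract.pytesseract.tesseract_cmd setting.
config = types.SimpleNamespace(tesseract_cmd="tesseract4")
_cache = {}


def query_version(cmd):
    # Stand-in for the expensive subprocess call.
    return "version reported by " + cmd


def get_version(use_cache=True):
    """Return the cached version unless the caller asks for a fresh lookup."""
    if not use_cache or "version" not in _cache:
        _cache["version"] = query_version(config.tesseract_cmd)
    return _cache["version"]


before = get_version()                # first call populates the cache
config.tesseract_cmd = "tesseract5"
stale = get_version()                 # default path: cached tesseract4 result
fresh = get_version(use_cache=False)  # forces a new lookup and refreshes the cache
after = get_version()                 # now reflects tesseract5
```

Because the caller decides when to refresh, this also covers the case where the binary changes but its name does not.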

@JohnReid
Author

JohnReid commented Feb 4, 2022

I see what you mean as well. Your last suggestion sounds the simplest and the best.

@elvinagam

How do you actually update to tesseract 5.0?
I am using pytesseract on Colab, and when I pip installed it there, I got 4.0.0. How do I switch to the updated Tesseract? @int3l

@stefan6419846
Contributor

stefan6419846 commented Aug 18, 2022

As pytesseract just calls the OS package, you probably want to update Tesseract there. It seems like Google Colab is Ubuntu-based, which can be verified using !cat /etc/*release. According to https://packages.ubuntu.com/search?keywords=tesseract&searchon=names&suite=all&section=all, you will need at least Ubuntu 22.10 to get version 5 from the official package repositories.

Otherwise, you might be able to build the package yourself, which is documented in https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation.html. With some searching on the web, you will also find custom PPAs for older Ubuntu versions that provide Tesseract 5. I will not link them here, because with random PPAs in particular you should check carefully whether you can trust them.
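
As a rough sketch of the release check described above, the output of `cat /etc/*release` can be parsed in Python and compared against the 22.10 threshold mentioned in this comment. The sample text and the helper name `parse_release` are illustrative assumptions; on a real machine the text would be read from the release file instead.

```python
def parse_release(text):
    """Parse KEY=value lines (lsb-release / os-release style) into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


# Illustrative sample in lsb-release format, not read from a real system.
sample = "DISTRIB_ID=Ubuntu\nDISTRIB_RELEASE=18.04\n"

info = parse_release(sample)
release = tuple(int(part) for part in info["DISTRIB_RELEASE"].split("."))

# Threshold taken from the comment above: official packages ship
# Tesseract 5 starting with Ubuntu 22.10.
official_tesseract5 = info["DISTRIB_ID"] == "Ubuntu" and release >= (22, 10)
```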

@elvinagam

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
It actually is. Thanks, I'll look for it.

By the way, is the procedure for updating to Tesseract 5.0.0 the same on Windows?

@stefan6419846
Contributor

Apart from the fact that Windows usually has a different package management system, yes. In this case, though, all downloads/installers seem to be unofficial: https://tesseract-ocr.github.io/tessdoc/Downloads.html
