Using image_to_osd with PIL Image directly does not return results due to temporary file name clashing #408

Closed
klavdijS opened this issue Feb 1, 2022 · 7 comments


klavdijS commented Feb 1, 2022

Hey, first of all, thanks for the great wrapper around tesseract; it makes usage much more convenient.

Operating system: macOS Monterey 12.1
Tesseract version: 5.0.1

Upon trying to execute image_to_osd with an opened PIL Image, I always get the same result:
pytesseract.pytesseract.TesseractError: (1, 'UZN file /var/folders/z7/6mpq4jhn3g96kcp8wd5_fzrm0000gn/T/tess__pdgxjh0 loaded. Estimating resolution as 146 UZN file /var/folders/z7/6mpq4jhn3g96kcp8wd5_fzrm0000gn/T/tess__pdgxjh0 loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.

OSD is never returned.
A short script for reproduction:

import pytesseract
from PIL import Image

image_to_test = Image.open("out1.png")
osd = pytesseract.image_to_osd(image_to_test)
print(osd)

out1.png is a random document with text on it.

Upon further investigation it looks like tesseract tries to open and execute the temporary file which is meant to be used for saving the process output (as seen from the pasted error message).
Temporary saved files in Finder:
image
The problem is the save context manager, which creates the temporary files (starting at line 188):

def save(image):
    try:
        with NamedTemporaryFile(prefix='tess_', delete=False) as f:
            if isinstance(image, str):
                yield f.name, realpath(normpath(normcase(image)))
                return
            image, extension = prepare(image)
            input_file_name = f.name + extsep + extension
            image.save(input_file_name, format=image.format)
            yield f.name, input_file_name
    finally:
        cleanup(f.name)

Currently, I fixed it by changing the input_file_name variable to the following:
input_file_name = f"{f.name}_input" + extsep + extension

This is also my proposed solution; I believe it does not break anything.

I can create a pull request if needed.
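
The name clash described above can be illustrated without tesseract at all. The following is a minimal sketch using only the standard library. `temp_names` is a hypothetical helper (not part of pytesseract) that mirrors how `save` builds its two file names, and its `input_suffix` parameter is an assumption added here to compare the original naming with the proposed one.

```python
import os
from os.path import extsep, splitext
from tempfile import NamedTemporaryFile


def temp_names(extension, input_suffix=""):
    """Hypothetical helper: return (output_base, input_file_name) named the way save() names them."""
    with NamedTemporaryFile(prefix="tess_", delete=False) as f:
        output_base = f.name
    # The placeholder file is only needed for its unique name in this demo.
    os.remove(output_base)
    input_file_name = output_base + input_suffix + extsep + extension
    return output_base, input_file_name


# Original naming: stripping the extension from the input gives the output base.
base, image_path = temp_names("PNG")
print(splitext(image_path)[0] == base)  # True: both share one base name

# Proposed naming: the "_input" suffix keeps the two base names distinct.
base, image_path = temp_names("PNG", input_suffix="_input")
print(splitext(image_path)[0] == base)  # False
```

With the original naming, the extensionless temp file sits exactly where the image path minus its extension points, which matches the path shown in the "UZN file ... loaded" message.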

Collaborator

bozhodimitrov commented Feb 2, 2022

Hi @klavdijS
This looks strange, because the unit tests for image_to_osd pass.
Did you check whether the test image works with tesseract directly?
Also, you can try saving the image to disk yourself and passing the file path as a string to pytesseract -- it should work without creating the temp files.
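
The workaround suggested here, saving the image yourself and passing the path as a string, might look like the sketch below. `osd_from_saved_copy` and its `run_osd` parameter are hypothetical names introduced for illustration; by default the function calls `pytesseract.image_to_osd` on the saved path. It assumes the image is in a mode that Pillow can write as PNG.

```python
import os
import tempfile


def osd_from_saved_copy(image, run_osd=None):
    """Save a PIL image to a private temp directory and run OSD on the file path."""
    if run_osd is None:
        import pytesseract  # imported lazily so the helper can be exercised without it
        run_osd = pytesseract.image_to_osd
    with tempfile.TemporaryDirectory(prefix="osd_") as tmp_dir:
        path = os.path.join(tmp_dir, "page.png")
        image.save(path)  # Pillow infers the PNG format from the file extension
        # A string path is used as-is, so pytesseract does not write its own input copy.
        return run_osd(path)
```

With a real image this would be called as `osd_from_saved_copy(Image.open("out1.png"))`. The temporary directory and the saved copy are removed when the function returns.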

Author

klavdijS commented Feb 3, 2022

Hi @int3l.
Yes, I did check, invoking it with tesseract directly works as expected.
Example of a test image:
test1
Running the following command:
tesseract test1.png out -l osd --psm 0 (generated cmd from image_to_osd)
Produces expected file:
out.txt
Running it with the script provided in the first example produces the reported error.
Yes, I understand I can do that myself; if I provide the path, it works as expected. I was trying to do it with PIL images because my use case works with them directly.
I eventually found that this approach is 30% slower because of the file manipulation, so it probably will not be feasible for me. However, I believe this is not the expected behavior, which is why I reported it.

@bozhodimitrov
Collaborator

PIL (Pillow) itself does lossy conversions for the most part -- this is why there is no way to perfectly avoid such errors.
If you know a workaround, let us know and/or feel free to create a PR.


bwakkie commented Mar 6, 2022

I have the same problem and did some digging into the tmp files:

I set up a watch on the tmp folder that copies the tmp files to a tess_test folder on arrival:

inotifywait -m -e close_write /tmp/ | gawk '{print $1$3; fflush()}' | xargs -I % sh -c 'echo %; cp % tess_test'
Setting up watches.
Watches established.
/tmp/tess_rz4bpdjb.PNG
/tmp/tess_rz4bpdjb.osd
/tmp/tess_rz4bpdjb
cp: cannot stat '/tmp/tess_rz4bpdjb': No such file or directory

In my Flask app I get the following error:

pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_rz4bpdjb loaded. UZN file /tmp/tess_rz4bpdjb loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

As you can see, the /tmp/tess_rz4bpdjb file is created and removed before it can be copied, while the /tmp/tess_rz4bpdjb.PNG file holds my image and the /tmp/tess_rz4bpdjb.osd file is empty.

But the created PIL image does work directly in tesseract:

$ tesseract --psm 0 -l osd /tmp/tess_test/tess_rz4bpdjb.PNG stdout
Estimating resolution as 388
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0 
Rotate: 0
Orientation confidence: 28.76
Script: Latin
Script confidence: 15.00

Using the generated tmp image directly in pytesseract, I get the same error:

Python 3.10.2 (main, Jan 15 2022, 19:56:27) [GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pytesseract
>>> from PIL import Image
>>> print(pytesseract.image_to_osd(Image.open('/tmp/tess_test/tess_rz4bpdjb.PNG')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 545, in image_to_osd
    return {
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 548, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 286, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 262, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_zoc937hs loaded. Estimating resolution as 388 UZN file /tmp/tess_zoc937hs loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

klavdijS's solution worked for me! Thank you

using: pytesseract==0.3.9, Pillow==9.0.1, Python 3.10.2 and tesseract 5.0.1 on Linux server 5.13.19-2-MANJARO

Collaborator

bozhodimitrov commented Mar 7, 2022

Oh, I was looking into the history of pytesseract and it seems to be a regression from 2019.
The problem here is that I wanted to use NamedTemporaryFile as a name generator for filenames, not to actually create the file.

But this also seems like a deeper problem.
To me, it looks like pytesseract doesn't wait until the tesseract execution ends.
This is one possibility.

The other one is hinted at by the actual message: UZN file loaded.
Is Tesseract trying to load files without an extension as UZN files (segmentation/zone files)?
And I guess that changing the base filename prevents the temp file from being loaded as a UZN file.
Can someone test whether changing the 'tess_' prefix into a suffix makes a difference? @bwakkie or @klavdijS
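
One way to get unique names without leaving an extensionless placeholder file next to the image is to create a private temporary directory and build both names inside it. This is only a sketch of that idea, not the change that was made in pytesseract; `temp_paths` and the `input`/`output` base names are assumptions for illustration.

```python
import os
from contextlib import contextmanager
from tempfile import TemporaryDirectory


@contextmanager
def temp_paths(extension):
    """Yield (output_base, input_file_name) inside a directory that is removed on exit."""
    with TemporaryDirectory(prefix="tess_") as tmp_dir:
        # Distinct base names, and no file is created at output_base up front.
        output_base = os.path.join(tmp_dir, "output")
        input_file_name = os.path.join(tmp_dir, "input" + os.extsep + extension)
        yield output_base, input_file_name


with temp_paths("png") as (output_base, input_file_name):
    print(os.path.exists(output_base))  # False: no extensionless file exists yet
```

The directory supplies the uniqueness, so the file names themselves can stay fixed and never collide with each other.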


bwakkie commented Mar 7, 2022

I tried your suggestion, but it made no difference:
line 192: with NamedTemporaryFile(suffix='_tess', delete=False) as f:

/tmp/tmp5ltmdm2m_tess.PNG
/tmp/tmp5ltmdm2m_tess.osd
/tmp/tmp5ltmdm2m_tess
cp: cannot stat '/tmp/tmp5ltmdm2m_tess': No such file or directory

pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tmp5ltmdm2m_tess loaded. Estimating resolution as 373 UZN file /tmp/tmp5ltmdm2m_tess loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

@bozhodimitrov
Collaborator

Ok, thank you for reporting this issue, you can test the master revision if you want.
This change will be available with the next release.
