Using image_to_osd with PIL Image directly does not return results due to temporary file name clashing #408

klavdijS · 2022-02-01T12:50:28Z

Hey, firstly thanks for the great wrapper around tesseract, makes the usage much more convenient.

Operating system: macOS Monterey 12.1
Tesseract version: 5.0.1

Upon trying to execute image_to_osd with an opened PIL Image, I always get the same result:
pytesseract.pytesseract.TesseractError: (1, 'UZN file /var/folders/z7/6mpq4jhn3g96kcp8wd5_fzrm0000gn/T/tess__pdgxjh0 loaded. Estimating resolution as 146 UZN file /var/folders/z7/6mpq4jhn3g96kcp8wd5_fzrm0000gn/T/tess__pdgxjh0 loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.

OSD is never returned.
A short script for reproduction:

import pytesseract
from PIL import Image
image_to_test = Image.open("out1.png")
osd = pytesseract.image_to_osd(image_to_test)
print(osd)

out1.png is a random document with text on it.

Upon further investigation it looks like tesseract tries to open and execute the temporary file which is meant to be used for saving the process output (as seen from the pasted error message).
Temporary saved files in Finder:

The problem is the save context manager, which creates the temporary files (starting at line 188):

@contextmanager
def save(image):
    try:
        with NamedTemporaryFile(prefix='tess_', delete=False) as f:
            if isinstance(image, str):
                yield f.name, realpath(normpath(normcase(image)))
                return
            image, extension = prepare(image)
            input_file_name = f.name + extsep + extension
            image.save(input_file_name, format=image.format)
            yield f.name, input_file_name
    finally:
        cleanup(f.name)

Currently, I fixed it by changing the input_file_name variable to the following:
input_file_name = f"{f.name}_input" + extsep + extension

This is also my proposed solution; I believe it does not break anything.

I can create a pull request if needed.
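
To make the clash concrete, here is a minimal sketch of how the two names are derived. It is not pytesseract's actual code: temp_names and stem are hypothetical helpers that only mirror the naming logic quoted above, and nothing here calls tesseract. It shows that the input image name, stripped of its extension, is exactly the empty temporary file named in the "UZN file ... loaded" message, while the "_input" variant has a different stem.

```python
import os
from tempfile import NamedTemporaryFile


def temp_names(extension, input_suffix=""):
    """Hypothetical helper: derive (output_base, input_file_name) like save() does."""
    # delete=False leaves an empty file on disk at f.name after the block exits.
    with NamedTemporaryFile(prefix="tess_", delete=False) as f:
        output_base = f.name
    input_file_name = output_base + input_suffix + os.extsep + extension
    return output_base, input_file_name


def stem(path):
    """Return the path without its final extension."""
    return os.path.splitext(path)[0]


clash_base, clash_input = temp_names("PNG")
fixed_base, fixed_input = temp_names("PNG", input_suffix="_input")

# The extensionless temp file really exists and is empty.
base_is_empty_file = os.path.getsize(clash_base) == 0

print("current naming, stems equal: ", stem(clash_input) == clash_base)
print("proposed naming, stems equal:", stem(fixed_input) == fixed_base)

# Remove the empty files that NamedTemporaryFile left behind.
for leftover in (clash_base, fixed_base):
    os.remove(leftover)
```

Whether tesseract picks up the extensionless file for that reason is an inference from the error message, not something this sketch proves.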


bozhodimitrov · 2022-02-02T15:24:05Z

Hi @klavdijS
This looks strange, because the unit tests for image_to_osd pass.
Did you check if the test image works directly with tesseract itself?
Also, you can try to save the image by yourself to disk and just pass the image file path as string to pytesseract -- it should work without making the temp files.
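
A minimal sketch of that workaround, assuming pytesseract and Pillow are installed: osd_from_saved_copy is a hypothetical helper, not part of pytesseract, and the run_osd parameter is an assumption added so the OCR call can be swapped out. The image is written to a file chosen by the caller, and only the path string is handed to image_to_osd.

```python
import os
import tempfile


def osd_from_saved_copy(image, run_osd=None, image_format="PNG"):
    """Hypothetical helper: save the image ourselves, then pass the path as a string."""
    if run_osd is None:
        # Imported lazily so the helper can be exercised without tesseract installed.
        import pytesseract
        run_osd = pytesseract.image_to_osd

    # mkstemp creates the file and returns an open descriptor; close it before saving.
    fd, path = tempfile.mkstemp(prefix="osd_input_", suffix="." + image_format.lower())
    os.close(fd)
    try:
        image.save(path, format=image_format)
        return run_osd(path)
    finally:
        # Always remove our own copy, even if OSD fails.
        os.remove(path)
```

Usage would look like osd_from_saved_copy(Image.open("out1.png")). The extra write to disk is the cost of this approach.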

klavdijS · 2022-02-03T08:52:05Z

Hi @int3l.
Yes, I did check, invoking it with tesseract directly works as expected.
Example of a test image:

Running the following command:
tesseract test1.png out -l osd --psm 0 (generated cmd from image_to_osd)
Produces expected file:
out.txt
Running it with the script provided in the first example produces the reported error.
Yes, I understand I can do that myself - if I provide the path it works as expected. I was trying to do it directly with PIL images since my case involves working directly with them.
I found out in the end that this approach is slower by 30% because of file manipulation, so it probably will not be feasible for me in the end. However, I believe this is not the expected behavior and that is why I reported this.

bozhodimitrov · 2022-02-04T16:37:34Z

PIL (Pillow) itself does lossy conversions for the most part -- this is why there is no way to perfectly avoid such errors.
If you know a workaround, let us know and/or feel free to create a PR.

bwakkie · 2022-03-06T12:06:03Z

I have the same problem and did some digging with the tmp files:

I created a watch on the tmp folder and copied the tmp files to a tess_test folder on arrival:

inotifywait -m -e close_write /tmp/ | gawk '{print $1$3; fflush()}' | xargs -I % sh -c 'echo %; cp % tess_test'
Setting up watches.
Watches established.
/tmp/tess_rz4bpdjb.PNG
/tmp/tess_rz4bpdjb.osd
/tmp/tess_rz4bpdjb
cp: cannot stat '/tmp/tess_rz4bpdjb': No such file or directory

in my flask app I get the following error:

pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_rz4bpdjb loaded. UZN file /tmp/tess_rz4bpdjb loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

As you can see, the /tmp/tess_rz4bpdjb file is created and then removed before it can be copied, while the /tmp/tess_rz4bpdjb.PNG file holds my image and the /tmp/tess_rz4bpdjb.osd file is empty.

But the created PIL image does work directly in tesseract:

$ tesseract --psm 0 -l osd /tmp/tess_test/tess_rz4bpdjb.PNG stdout
Estimating resolution as 388
Warning. Invalid resolution 0 dpi. Using 70 instead.
Page number: 0
Orientation in degrees: 0 
Rotate: 0
Orientation confidence: 28.76
Script: Latin
Script confidence: 15.00

In pytesseract, using the generated tmp image directly, I get the same error:

Python 3.10.2 (main, Jan 15 2022, 19:56:27) [GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pytesseract
>>> from PIL import Image
>>> print(pytesseract.image_to_osd(Image.open('/tmp/tess_test/tess_rz4bpdjb.PNG')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 545, in image_to_osd
    return {
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 548, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 286, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/me/.local/lib/python3.10/site-packages/pytesseract/pytesseract.py", line 262, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tess_zoc937hs loaded. Estimating resolution as 388 UZN file /tmp/tess_zoc937hs loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

klavdijS's solution worked for me! Thank you

using: pytesseract==0.3.9, Pillow==9.0.1, Python 3.10.2 and tesseract 5.0.1 on Linux server 5.13.19-2-MANJARO

bozhodimitrov · 2022-03-07T00:10:59Z

Oh, I was looking into the history of pytesseract and it seems to be a regression from 2019.
The problem here is that I wanted to use NamedTemporaryFile as a name generator for filenames and not to actually create the file.

But also, this seems like a deeper problem.
To me, it looks like pytesseract doesn't wait until the tesseract execution ends.
This is one possibility.

The other one is hinted by the actual message: UZN file loaded.
Tesseract is trying to load files without an extension as UZN files (segmentation/zone files) ???
And I guess that changing the base filename disables loading the temp file as a UZN file.
Can someone test whether changing the 'tess_' prefix into a suffix makes a difference? @bwakkie or @klavdijS
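
On the "name generator" point above: one way to get unique names without creating a stray extensionless file is to create a private temporary directory and build both paths inside it. This is only a sketch of an alternative design, not the change made in pytesseract; temp_paths and the "input"/"output" file names are hypothetical.

```python
import os
from contextlib import contextmanager
from tempfile import TemporaryDirectory


@contextmanager
def temp_paths(extension):
    """Hypothetical helper: yield (output_base, input_path) inside a private directory."""
    with TemporaryDirectory(prefix="tess_") as tmp_dir:
        # Only the directory is created; neither path exists until something writes to it.
        output_base = os.path.join(tmp_dir, "output")
        input_path = os.path.join(tmp_dir, "input" + os.extsep + extension)
        yield output_base, input_path
    # Leaving the block removes the directory and everything in it.


with temp_paths("PNG") as (output_base, input_path):
    created_nothing = not os.path.exists(output_base)
    distinct_stems = os.path.splitext(input_path)[0] != output_base
    print(created_nothing, distinct_stems)
```

Because the input and output names never share a stem and no empty placeholder file is created, there is nothing for tesseract to mistake for a zone file. Cleanup is also a single directory removal.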

bwakkie · 2022-03-07T11:13:45Z

I tried your suggestion but no difference:
line 192: with NamedTemporaryFile(suffix='_tess', delete=False) as f:

/tmp/tmp5ltmdm2m_tess.PNG
/tmp/tmp5ltmdm2m_tess.osd
/tmp/tmp5ltmdm2m_tess
cp: cannot stat '/tmp/tmp5ltmdm2m_tess': No such file or directory

pytesseract.pytesseract.TesseractError: (1, 'UZN file /tmp/tmp5ltmdm2m_tess loaded. Estimating resolution as 373 UZN file /tmp/tmp5ltmdm2m_tess loaded. Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')

bozhodimitrov · 2022-03-07T17:27:04Z

Ok, thank you for reporting this issue, you can test the master revision if you want.
This change will be available with the next release.

bozhodimitrov closed this as completed in a515a26 Mar 7, 2022

bozhodimitrov mentioned this issue Mar 14, 2022

image_to_osd with PIL.Image argument raises TesseractError for tesseract 5.0.1 #416

Closed
