FileNotFoundError tmp file #454

Closed
filirnd opened this issue Sep 30, 2022 · 10 comments · Fixed by #503

Comments

@filirnd

filirnd commented Sep 30, 2022

Hi,
I'm working on Fedora 32 with Tesseract 5.0.0-alpha-20201224 and pytesseract 0.3.10.

When I call this function:
image_to_string(image, lang="letsgodigital", config="--oem 4 --psm 100 -c tessedit_char_whitelist=.0123456789")
I get this error:

Traceback (most recent call last):
  File "main2.py", line 8, in <module>
    str_Res = digital_display_ocr.ocr_image(image)
  File "/home/fili/workspace/python/digit_recognition/digital_display_ocr.py", line 107, in ocr_image
    return image_to_string(otsu_thresh_image, lang="letsgodigital", config="--oem 4 --psm 100 -c tessedit_char_whitelist=.0123456789")
  File "/home/fili/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 423, in image_to_string
    return {
  File "/home/fili/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 426, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/fili/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 290, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_4p1bawg8.txt'

How can I resolve this?

@stefan6419846
Contributor

Could you please try quoting the tessedit_char_whitelist value, i.e. config="--oem 4 --psm 100 -c tessedit_char_whitelist='.0123456789'"?

@filirnd
Author

filirnd commented Sep 30, 2022

> Could you please try quoting the tessedit_char_whitelist value, i.e. config="--oem 4 --psm 100 -c tessedit_char_whitelist='.0123456789'"?

Same error.
It seems like pytesseract can't write the output file.

@stefan6419846
Contributor

stefan6419846 commented Sep 30, 2022

pytesseract never writes the output file itself; the underlying Tesseract binary does. Do you get any output if you call Tesseract manually with your options? What is the final cmd_args in your case in pytesseract.pytesseract.run_tesseract?

Additionally: It seems like you are using a pre-release version of Tesseract. Are you able to test this with a regular release, as Tesseract 5.2.0 was released about half a year ago?
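
For anyone wanting to follow the suggestion above, here is a minimal sketch of calling the Tesseract binary directly and checking whether it produced the output file. The helper names (build_tesseract_cmd, run_and_check) and the image name display.png are hypothetical and not part of pytesseract; the sketch splits the config string with shlex.split, which is an assumption about how the options should reach the command line.

```python
import shlex
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, lang=None, config=""):
    """Assemble a CLI call of the form: tesseract IMAGE OUTPUTBASE [options]."""
    cmd = ["tesseract", str(image_path), str(output_base)]
    if lang:
        cmd += ["-l", lang]
    # Split the config string into separate arguments, honouring shell quoting.
    cmd += shlex.split(config)
    return cmd


def run_and_check(cmd, expected_suffix=".txt"):
    """Run the command and report whether the expected output file exists."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True)
    output_file = Path(cmd[2] + expected_suffix)
    return result.returncode, result.stderr, output_file.exists()


# Options taken verbatim from the report; "display.png" is a placeholder image.
config = "--oem 4 --psm 100 -c tessedit_char_whitelist=.0123456789"
cmd = build_tesseract_cmd("display.png", "out", lang="letsgodigital", config=config)
print(cmd)
```

Calling run_and_check(cmd) on a machine with Tesseract installed returns the exit code, anything Tesseract printed to stderr, and whether out.txt was created. A non-empty stderr together with a missing file would point at the Tesseract invocation rather than at pytesseract. Note that in this sketch the quoted and unquoted whitelist values produce the same argument list after splitting.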

@jeb2112

jeb2112 commented Dec 23, 2022

I got the same error on the image_to_pdf_or_hocr method and saw at line 290 of pytesseract.py (and as is seen above in quoted error message from filirnd), that the output tmp file is being opened as 'rb', when it seems like it should be 'wb'.

@stefan6419846
Contributor

> I got the same error on the image_to_pdf_or_hocr method [...]

As mentioned in my previous comment: What command is sent to Tesseract? Which Tesseract version are you using? What happens if you call Tesseract on this file directly? Do you have a reproducer? Nevertheless, this probably is a Tesseract issue, not a pytesseract one.

> [...] and saw at line 290 of pytesseract.py (and as is seen above in quoted error message from filirnd), that the output tmp file is being opened as 'rb', when it seems like it should be 'wb'.

rb is completely fine here, as the corresponding file should be created by Tesseract. If it is not being created, this seems to be a Tesseract issue which might be related to your input data (see the first part of my comment).
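
To illustrate the point about 'rb', here is a small self-contained sketch with no Tesseract involved. It reads a file that an external tool was supposed to create, first when the tool wrote nothing and then after a stand-in for the tool has written it. The function name read_tool_output and the file name tess_example.txt are made up for the illustration.

```python
import os
import tempfile


def read_tool_output(path):
    """Read a file that an external tool should have created.

    Returns the file's bytes, or None if the tool never wrote the file.
    """
    try:
        with open(path, "rb") as output_file:
            return output_file.read()
    except FileNotFoundError:
        return None


with tempfile.TemporaryDirectory() as tmp_dir:
    expected = os.path.join(tmp_dir, "tess_example.txt")  # hypothetical name

    # Case 1: the external tool produced nothing, so reading fails.
    missing = read_tool_output(expected)

    # Case 2: a stand-in for the external tool writes its result first.
    with open(expected, "wb") as fake_tool_output:
        fake_tool_output.write(b"0123\n")
    present = read_tool_output(expected)

print(missing, present)
```

Switching the reader to 'wb' would hide the error by creating an empty file, and would discard any result the tool had written.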

@jeb2112

jeb2112 commented Dec 27, 2022

> rb is completely fine here, as the corresponding file should be created by Tesseract. If it is not being created, this seems to be a Tesseract issue which might be related to your input data (see the first part of my comment).

OK, this info helped me dig into it a bit further. I hadn't understood that pytesseract calls the tesseract binary underneath, but I now see that 'rb' is correct, because the file is being read as input. What I eventually found is that my call to pytesseract.image_to_pdf_or_hocr was missing an argument. I had copied this from the example on the pytesseract doc page at https://pypi.org/project/pytesseract/:
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')
What was missing was the additional config argument:
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr', config='-c tessedit_create_hocr=1')
Without this additional argument, the tesseract binary creates a file with a .txt extension by some kind of default, even though the extension arg is correctly set to 'hocr'.
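
A quick way to confirm this kind of mismatch is to look at which files actually appear next to the output base after running Tesseract by hand. The sketch below uses only the standard library; find_outputs is a hypothetical helper, and the demo fakes Tesseract's output with an empty .txt file rather than running the binary.

```python
import glob
import os
import tempfile


def find_outputs(output_base):
    """Return the sorted extensions of all files named OUTPUT_BASE.<ext>."""
    pattern = glob.escape(output_base) + ".*"
    return sorted(os.path.splitext(path)[1] for path in glob.glob(pattern))


with tempfile.TemporaryDirectory() as tmp_dir:
    base = os.path.join(tmp_dir, "tess_example")  # hypothetical output base

    # Stand-in for a Tesseract run that wrote .txt instead of .hocr.
    open(base + ".txt", "wb").close()

    found = find_outputs(base)
    wanted = ".hocr"
    if wanted not in found:
        print(f"expected {wanted}, but found {found}")
```

If the wanted extension is absent while another one is present, Tesseract did run but wrote a different output format than the one the caller then tries to open.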

@stefan6419846
Contributor

I cannot reproduce this issue on Linux running Tesseract 4.1.0 (shortened output):

stefan@localhost:~/tmp$ python script.py 
b'<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n    "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n <head>\n  <title></title>\n  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>\n  <meta name=\'ocr-system\' content=\'tesseract 4.1.0\' />\n  [...]
stefan@localhost:~/tmp$ vi script.py
stefan@localhost:~/tmp$ cat script.py 
import pytesseract

hocr = pytesseract.image_to_pdf_or_hocr('file.png', extension='pdf')

print(hocr)

stefan@localhost:~/tmp$ python script.py 
b'%PDF-1.5\n%\xde\xad\xbe\xeb\n1 0 obj\n<<\n  /Type /Catalog\n  /Pages 2 0 [...]

Could you please provide some more details about your setup and versions?

@DrPlanecraft

DrPlanecraft commented Jun 20, 2023

Hi! I got the same issue. My setup is as follows:
Python:

3.11.4

Tesseract:

tesseract 5.3.0
 leptonica-1.78.0 (Apr  9 2021, 08:55:04) [MSC v.1916 LIB Release x64]
  libjpeg 9d : libpng 1.6.39 : libtiff 4.4.0 : zlib 1.2.13
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.2.6 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2

Command being sent:

['tesseract', 'C:\\Users\\DRPLAN~1\\AppData\\Local\\Temp\\tess_45thtfem_input.PNG', 'C:\\Users\\DRPLAN~1\\AppData\\Local\\Temp\\tess_45thtfem', 'batch.nochop', 'makebox']

In my case, I never see the tess_<temp code>.box file being generated.

This is my error message: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\DRPLAN~1\\AppData\\Local\\Temp\\tess_r3wvoqfl.box'

@stefan6419846
Contributor

Have you tried running the same command (with the temporary files replaced by "real" files) with Tesseract directly? Does this generate the correct files?

@DrPlanecraft

DrPlanecraft commented Jun 22, 2023

I have tried, and it does generate the correct files. The issue eventually turned out to be that I had not included config=" -c tessedit_create_boxfile=1".
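
Both workarounds reported in this thread add a tessedit_create_* variable to the config string. A minimal sketch of doing that in one place follows. The names CREATE_FLAGS, with_create_flag and boxes_with_flag are hypothetical, the mapping only lists the two variables mentioned in this thread, and whether the flag is needed at all seems to depend on the setup (it was not needed on Linux with Tesseract 4.1.0 above).

```python
# Config variables named in this thread, keyed by the output extension.
CREATE_FLAGS = {
    "hocr": "tessedit_create_hocr",
    "box": "tessedit_create_boxfile",
}


def with_create_flag(extension, config=""):
    """Return config with '-c <flag>=1' appended, unless it is already set."""
    flag = CREATE_FLAGS[extension]
    if flag in config:
        return config
    return f"{config} -c {flag}=1".strip()


def boxes_with_flag(image, config=""):
    """Call pytesseract.image_to_boxes with the box-file flag forced on."""
    # Imported lazily so the helper above can be used without pytesseract.
    import pytesseract

    return pytesseract.image_to_boxes(image, config=with_create_flag("box", config))


print(with_create_flag("hocr"))
print(with_create_flag("box", "--psm 6"))
```

For the hOCR case, the same string can be passed as config to pytesseract.image_to_pdf_or_hocr(..., extension='hocr').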
