image_to_boxes crashing #106

Closed
tlcyr4 opened this issue Mar 10, 2018 · 25 comments · Fixed by #504
Comments

@tlcyr4

tlcyr4 commented Mar 10, 2018

When I run image_to_string, it works great, but when I run either image_to_boxes or image_to_data, I get an error message like this:

IOError: [Errno 2] No such file or directory: 'c:\users\tlcyr\appdata\local\temp\tess_kqx1fs_out.box'

with some random text in place of 'kqx1fs' each time I run it.

I have tesseract 3.05.01 installed on Windows.

@bozhodimitrov
Collaborator

Hi @tlcyr4, can you try the new version 4.x of Tesseract for Windows?

@bozhodimitrov
Collaborator

Please feel free to reopen if you have problems with the new 4.x version.
It would also help if you could provide a sample image for reproducing the problem.

@trehman65

trehman65 commented Apr 1, 2018

image_to_boxes is not working for me either. I have tesseract 4.0 on macOS.

from PIL import Image
from pytesseract import pytesseract
import argparse
import cv2
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
    help="path to input image to be OCR'd")
ap.add_argument("-p", "--preprocess", type=str, default="thresh",
    help="type of preprocessing to be done")
args = vars(ap.parse_args())

# load the example image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# check to see if we should apply thresholding to preprocess the image
if args["preprocess"] == "thresh":
    gray = cv2.threshold(gray, 0, 255,
        cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]

# make a check to see if median blurring should be done to remove noise
elif args["preprocess"] == "blur":
    gray = cv2.medianBlur(gray, 3)

# write the grayscale image to disk as a temporary file so we can apply OCR to it
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, gray)

# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
text = pytesseract.image_to_boxes(Image.open(filename))

os.remove(filename)
print(text)

The error I am getting is:

IOError: [Errno 2] No such file or directory: '/var/folders/gh/ytdtnjmx6t7dwc325f3xsky80000gn/T/tess_3zJn_y_out.box'

Tesseract version:

talha (tess *) VisionxNLTK-v2.0 $ tesseract -v
tesseract 4.00.00alpha
leptonica-1.74.4
libjpeg 9b : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

The image I am using is:
[image: temp]

@bozhodimitrov
Collaborator

Hi @trehman65 - did you test the same options directly with tesseract itself?

@trehman65

trehman65 commented Apr 1, 2018

You mean on the command line? I am sorry, I am a bit of a noob. Can you tell me the command for it?

@bozhodimitrov
Collaborator

You can patch the pytesseract.py library temporarily on line 133 and you can print the command with:

print(' '.join(command))

In order to find the full pytesseract.py library file path, you need the following snippet of code:

import pytesseract
print(pytesseract.__file__)

@trehman65

trehman65 commented Apr 1, 2018

The command printed by patching pytesseract.py is:

tesseract /var/folders/gh/ytdtnjmx6t7dsky80000gn/T/tess_UhXf0J.PNG /var/folders/gh/ytdtnjmx6t7dwc325f3xsky80000gn/T/tess_UhXf0J_out batch.nochop makebox

The following command does not work:

tesseract temp.jpg out makebox

The error is:

read_params_file: Can't open makebox
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

The following command works but it only shows the text, not the boxes.

tesseract temp.jpg out

@bozhodimitrov
Collaborator

And what about tesseract temp.jpg out batch.nochop makebox - what error does that produce?

@trehman65

This is the error:

talha (tess *) VisionxNLTK-v2.0 $ tesseract temp.jpg out batch.nochop makebox
read_params_file: Can't open batch.nochop
read_params_file: Can't open makebox
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

@bozhodimitrov
Collaborator

Thank you for the feedback. Can you report your OS version and how you installed tesseract?
Maybe this 4.00.00alpha build of tesseract is a bit problematic.

@trehman65

My OS version is macOS High Sierra version 10.13.2. I built tesseract from source code, by cloning the git repo.

@qnkhuat

qnkhuat commented Aug 3, 2018

same issue.

@qnkhuat

qnkhuat commented Aug 3, 2018

I'm able to run it with tesseract itself, but I still get this error when running pytesseract.

@chahna107

Is there any further update on this issue? I am having the same problem with Tesseract 4.0.

@jxu

jxu commented Jun 3, 2019

I have Tesseract 4.0.0.20190314 installed, but I replaced eng.traineddata with the one from https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata to support Tesseract v3. My tessdata folder is also bare-bones, with no files other than eng.traineddata.
With the default tessdata folder everything works fine.

@HongChow

When I run image_to_string, it works great, but when I run either image_to_boxes or image_to_data, I get an error message like this:

IOError: [Errno 2] No such file or directory: 'c:\users\tlcyr\appdata\local\temp\tess_kqx1fs_out.box'

with some random text in place of 'kqx1fs' each time I run it.

I have tesseract 3.05.01 installed on Windows.

I have the same problem on Ubuntu 18 with Tesseract 4.0.
Has anyone fixed this?

@bozhodimitrov
Collaborator

bozhodimitrov commented Sep 17, 2019

I can't reproduce this issue. I am using the sample image from this issue and it works as expected within the official python docker container.

Tested with:
Python 3.7.4
pytesseract 0.3.0
tesseract 4.0.0 (libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0)

I tested with both image_to_boxes and image_to_data.

@HongChow try to execute this command in your terminal in order to check if it works:

tesseract /test.jpg /tmp/test_output_file batch.nochop makebox

PS: It also works ok with:
Python 3.6.8 ( Ubuntu 18.04.3 LTS )
pytesseract 0.3.0
tesseract 4.0.0-beta.1 (leptonica-1.75.3 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0)
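
For anyone who prefers to script the terminal check above, here is a minimal sketch in Python. The helper names (build_makebox_command, box_file_created) are made up for illustration and are not part of pytesseract; the argument order simply mirrors the command shown in this thread, and the sketch assumes a tesseract binary is on the PATH.

```python
import os
import subprocess
import tempfile


def build_makebox_command(image_path, output_base, tesseract_cmd="tesseract"):
    """Return the argument list for a box-file run, mirroring the command in this thread."""
    return [tesseract_cmd, image_path, output_base, "batch.nochop", "makebox"]


def box_file_created(image_path, tesseract_cmd="tesseract"):
    """Run tesseract and report whether <output_base>.box appeared, plus its stderr."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_base = os.path.join(tmp_dir, "test_output_file")
        command = build_makebox_command(image_path, output_base, tesseract_cmd)
        # Hypothetical helper: raises FileNotFoundError if the binary is not installed.
        result = subprocess.run(command, capture_output=True, text=True)
        return os.path.exists(output_base + ".box"), result.stderr
```

If the first returned value is False and stderr mentions read_params_file, the problem is most likely in the tesseract installation rather than in pytesseract.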

@amtam0

amtam0 commented Oct 18, 2019

I had the same issue yesterday. I think it is more a Tesseract config issue.

You may need to set up the configs and tessconfigs folders under .../tesseract/share/data/

image_to_boxes() uses the batch.nochop and makebox configs. Check the link for the download.
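
A rough way to check for this from Python is sketched below. It assumes you already know the path to your tessdata directory and that the two config files live in either the configs or the tessconfigs subfolder; the function name missing_configs is hypothetical, and the layout may differ between installations.

```python
import os

# Config names that image_to_boxes() relies on, per the comment above.
REQUIRED_CONFIGS = ("batch.nochop", "makebox")
# Subfolders assumed to hold config files inside the tessdata directory.
CONFIG_SUBDIRS = ("configs", "tessconfigs")


def missing_configs(tessdata_dir):
    """Return the required config names not found in any assumed subfolder."""
    missing = []
    for name in REQUIRED_CONFIGS:
        found = any(
            os.path.isfile(os.path.join(tessdata_dir, subdir, name))
            for subdir in CONFIG_SUBDIRS
        )
        if not found:
            missing.append(name)
    return missing
```

An empty list means both files were found; anything else names the configs that tesseract would probably fail to open.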

@bozhodimitrov
Collaborator

bozhodimitrov commented Oct 18, 2019

@HongChow take a look at the above ^
@hazimora33d thank you for clarifying that - I can add additional documentation about this in the README. The other option is to extract every specific option out of the tessconfigs and hard code it into pytesseract.

@JoelStansbury

JoelStansbury commented Mar 29, 2020

This fixed it for me:

pytesseract.image_to_boxes(myImg, config=" -c tessedit_create_boxfile=1")

For whatever reason, my installation of tesseract 4.1.1 from conda-forge needs this argument to be set explicitly in order for the tesseract ... call to generate a .box file. Injecting this into the subprocess call feels real hacky though, so it's very possible that a future update would break this workaround.

EDIT:

Note the <SPACE> in front of -c and tessedit.... Those are very important

I found this setting by looking through the output of tesseract --print-parameters
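
If you only want the extra argument applied when the plain call fails, a small wrapper like the one below might do. This is just a sketch: image_to_boxes_with_fallback is a made-up name, it assumes Python 3 (where the missing .box file surfaces as FileNotFoundError), and the ocr parameter exists only so the function can be exercised without tesseract installed.

```python
# Config string from the workaround above, including the leading space.
BOXFILE_WORKAROUND = " -c tessedit_create_boxfile=1"


def image_to_boxes_with_fallback(image, ocr=None):
    """Call image_to_boxes, retrying with the box-file workaround if needed."""
    if ocr is None:
        # Imported lazily so the wrapper can be tested with a stand-in callable.
        import pytesseract
        ocr = pytesseract.image_to_boxes
    try:
        return ocr(image)
    except FileNotFoundError:
        # The .box output was never written; retry with the explicit parameter.
        return ocr(image, config=BOXFILE_WORKAROUND)
```

Usage would be boxes = image_to_boxes_with_fallback(myImg). If the retry fails as well, the exception propagates unchanged.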

@bozhodimitrov
Collaborator

@JoelStansbury thanks for reporting the workaround.
I think that the conda-forge packages have GitHub repositories (just like pytesseract has a conda-forge repo), so we can file an issue there.
But I am not sure of the name of the conda-forge tesseract package.

@JoelStansbury

@int3l No problem! Thanks for working on pytesseract!
Here is the tesseract page if you're curious: https://anaconda.org/conda-forge/tesseract. I don't know enough about the cause to justify starting a new issue; I just wanted to share for future victims. If I find out enough to point out a flaw, I will definitely let them know.

@eveningkid

eveningkid commented Oct 25, 2020

Same thing happened to me, running macOS 10.15.6 and tesseract 4.1.1.

@JoelStansbury's workaround worked for me, thank you. Very odd!

@deduble

deduble commented Jun 11, 2021

@JoelStansbury I am reviving this issue since it is still present with Python 3.7.3, the latest pytesseract, and tesseract 5.0.0. It wasn't privileges in my case, but your workaround fixes the problem for me as well. Were you able to find out what is causing this issue?

@JoelStansbury

JoelStansbury commented Jun 11, 2021

@deduble No, not really. This config option looks suspicious to me. Maybe it should be "tessedit_create_boxfile 1", as "tessedit_create_wordstrbox" doesn't seem to be a valid config option:
https://github.com/tesseract-ocr/tesseract/blob/7a308edcb1fc7455008b531bc2a49de583d7b171/tessdata/configs/wordstrbox

Pure speculation though. I haven't tested this at all.
