pytesseract.image_to_string not preserving interword spaces #482

Closed
willdave865 opened this issue Apr 6, 2023 · 12 comments

@willdave865

Using Tesseract

PS C:\Program Files\Tesseract-OCR> .\tesseract --version
tesseract v5.3.0.20221222
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

I have tested Tesseract successfully on command line:

PS C:\Program Files\Tesseract-OCR> .\tesseract C:\ocr\target\31832_226140__0001-00002b.jpg C:\ocr\results\31832_226140__0001-00002bb6523dpi300fullest --dpi 300 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist='abcdefghijklm
nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '

Partial output

269 Wellington Road Wainumomats Marned       101 ARNOLD. Frank Witham ...............................15 Rossiter Avenue.Lower Hutt. Butcher
002 ANKER. Doreen Akson .............................4 Bledisioe Crescent. Wamuiomata. Teacher       102 ARONA. Amosa ...............0000...........3 Donnelley Drve.Wasnuiomata.Pub. Servant
004 ANKER. Robert James ..........................269 Wellington Road.WainuiomataBank Off       104 ARPS. Velde Lucia ................ ..........53 Westminster Road Wamnuomata Resch Intvr
005 ANNESLEV. Boyne Evan .............................. 13 Manurewa GroveWainwomata Clerk       105 ARPS. Wilkem David ..........................53 Westmnster Road. Waimuomata.Foreman
006 ANNESLEY. Janet Maree ....................13 Manurewa Grove Wainuomats Housewite       106 ARROWSMITH. Margaret Bessie .... ... . 4 Isabel Grove. Wainuiomata. Mamed
007 ANSELL. Anme Ena Elizabeth .........................3 Lewghton Av. Lower Hutt. Homemaker       107 ARROWSMITH. Morns Anthony ................ 

I need to process hundreds of files so I downloaded and installed pytesseract.
Successfully installed pytesseract-0.3.10
I have run tox

PS C:\Program Files\Tesseract-OCR> tox
ROOT: No tox.ini or setup.cfg or pyproject.toml found, assuming empty tox.ini at C:\Program Files\Tesseract-OCR
  py: OK (4.34 seconds)
  congratulations :) (4.67 seconds)

However, when I run the following Python script with the same path to the executable, interword spacing is not preserved.

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
image = 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
target = print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '))

Partial output

.269WellngtonRoadWainumomatsMarned       101ARNOLD.FrankWitham...............................15RossiterAvenue.LowerHutt.Butcher       
002ANKER.DoreenAkson.............................4BledisioeCrescent.Wamuiomata.Teacher       102ARONA.Amosa...............0000...........3DonnelleyDrve.Wasnuiomata.Pub.Servant      
004ANKER.RobertJames..........................269WellingtonRoad.WainuiomataBankOff       104ARPS.ValdaLucis..........................53WestminsterRoadWamnuomataReschIntvr
005ANNESLEV.BoyneEvan..............................13ManurewaGroveWainwomataClerk       105ARPS.WilkemDavid..........................53WestmnsterRoad.Waimuomata.Foreman
006ANNESLEY.JanotMaree....................13ManurewaGroveWainuomatsHousewite       106ARROWSMITH.MargaretBessie........4IsabelGrove.Wainuiomata.Mamed
007ANSELL.AnmeEnaElizabeth.........................3LewghtonAv.LowerHutt.Homemaker       107ARROWSMITH.MornsAnthony.................4IsabelGrWamuomata.FetryMagr
O008ANMGELL.Ebsebyyceeseceereeess76BellRoad.LowerHutt.Housewrfe       108ARTHUR.BruceJames....................65MoohanStreet.WainuomataApp.Mouider 

This issue github.com/tesseract-ocr/tesseract/issues/781 was fixed in 2017. It may have been broken again.

@stefan6419846
Contributor

Your whitelist simply disallows any whitespace character in your output.

In your direct call to Tesseract, you have -c tessedit_char_whitelist='abcdefghijklm nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. ', while you basically have -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. in your pytesseract call (for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract). Using

print(pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "'))

(or any variation which properly quotes the allowed characters) will generate the correct Tesseract call and correct output accordingly.
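The effect of the missing quotes can be reproduced without Tesseract. The sketch below assumes the config string is tokenized with shell-like rules, as Python's shlex.split does in its default POSIX mode. That is an assumption about pytesseract's internals, and the shortened whitelist is only for illustration.

```python
import shlex

# Shortened whitelist for illustration; the trailing space is the point.
unquoted = '--psm 6 -c tessedit_char_whitelist=abcXYZ09. '
quoted = '--psm 6 -c tessedit_char_whitelist="abcXYZ09. "'

# Without quotes, the trailing space is a separator and gets dropped.
unquoted_args = shlex.split(unquoted)

# With quotes, the space stays part of the whitelist value.
quoted_args = shlex.split(quoted)

print(repr(unquoted_args[-1]))
print(repr(quoted_args[-1]))
```

Comparing the two printed values shows whether the space survived tokenizing, which is the same check that printing cmd_args would give.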

@stefan6419846
Contributor

> I have run tox

Just out of curiosity: Why? Where did you read that this would be required?

@willdave865
Author

willdave865 commented Apr 7, 2023

Thank you. You are correct. For me the issue was the use of single/double quotes. I used double quotes because I want to include a single quote in my whitelist, for names like O'Leary. So I guess I need to escape my single quote to include it.
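One way to get a literal single quote into the whitelist without hand-written escaping is to let shlex.quote build the argument. This is only a sketch: it assumes the config string is later split with POSIX-style shell rules (an assumption about pytesseract, which may behave differently on Windows), and the shortened whitelist is illustrative.

```python
import shlex

# Illustrative whitelist containing an apostrophe and a trailing space.
whitelist = "abcOLeary' "
option = 'tessedit_char_whitelist=' + whitelist

# shlex.quote wraps the option so that shlex.split returns it unchanged.
config = '--psm 6 -c ' + shlex.quote(option)
tokens = shlex.split(config)

print(config)
print(tokens)
```

The round trip through shlex.split is a quick way to confirm that both the apostrophe and the space reach the last token intact.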

I ran tox because somewhere in the install I was advised to. I thought it checked the system configuration.

I don't know the syntax you mean when you say "for debugging, try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract". I haven't got that far in the manual yet.

You can close this issue or let me know if I should.

@willdave865
Author

willdave865 commented Apr 8, 2023

Sorry. You may have found some of my previous comments confusing.

Now that I have paid closer attention to what you have written, please consider the following:

  • The issue of interword spaces not being preserved is non-existent based on your example producing the required spaces
  • My misunderstanding of the syntax is in regard to the use of single and double quotation marks
  • My reference to using double quotes around the literal single quote for the name O'Leary has come later and should not be considered a part of this discussion

Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"

@stefan6419846
Contributor

Thanks for the explanations and glad to see that this could be solved.

> Regarding tox. Checking back in the docs I am now reminded it is a test suite. I had forgotten this. So please disregard.

No problem.

> I'm sorry, even after reading the docs, I don't know what you mean when you write "try printing the cmd_args parameter in pytesseract.pytesseract.run_tesseract"

This basically should show what commands are being passed to Tesseract and requires modifying the distribution file to include the corresponding print statement. In your case, this enabled me to see that parsing the parameters would indeed drop the desired whitespace characters due to the invalid/missing quoting/escaping. A possible debugging solution can be seen in #483.

> You can close this issue or let me know if I should.

As I am just a Tesseract/pytesseract user myself who just looks at the issues here and tries to debug usage errors, I cannot close this issue myself. If it is resolved for you, feel free to close it yourself to keep the issue list clean.

@willdave865
Author

User error was the problem.

@willdave865
Author

willdave865 commented Apr 9, 2023 via email

@stefan6419846
Contributor

> I tried your code & included my own path\to\tesseract.exe (which seemed to be missing from your #483 example)

This is correct, as I am on Linux where the system-wide installation of the binary is callable out-of-the-box.

> Received no warning level debug message - just the same output with missing spaces

You should receive a debug-level message, not a warning-level one. The first logging configuration in my example code just sets the default logging level (of the root logger) to WARNING, so that debug-level messages are shown only for pytesseract.

If the log messages are still not visible after applying the patch from #483, your execution environment might filter the output further, for example when used inside the interactive Python shell. Plain python script.py should usually show these messages in this case.
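A minimal sketch of such a logging setup follows. It is not the exact code from #483, and the logger name 'pytesseract' is an assumption about how the package names its logger.

```python
import logging

# Root logger: only warnings and above from every other library.
logging.basicConfig(level=logging.WARNING)

# Assumed logger name; lower its threshold so debug records get through.
pytesseract_logger = logging.getLogger('pytesseract')
pytesseract_logger.setLevel(logging.DEBUG)

# Any later pytesseract call in the same script would then be able to
# emit its debug-level records through the root handler.
```

Placing this at the top of the script, before the first OCR call, keeps other libraries quiet while letting the one logger of interest through.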

> It seems to me that you are talking about recompiling the tesseract.exe with an extra print statement for debugging purposes? A little clarification would be appreciated.

You have to differentiate between the Tesseract CLI (tesseract.exe) and the pytesseract Python package files here. I am only talking about pytesseract - no need to recompile anything here.

With #483 being merged now, you have (at least) two options:

  1. Install the package from GitHub using the latest source (see https://github.com/madmaze/pytesseract#installation)
  2. Edit the installed Python package files manually. This is only recommended if you have at least some further knowledge on this topic. Determine the location of the installed package files (Location field of pip show pytesseract) and edit the pytesseract/pytesseract.py file in there.
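As an alternative to reading the Location field of pip show, the directory of an installed package can be looked up from Python itself. This is a sketch; the helper name package_dir is made up for illustration, and the demo call uses the standard-library json package only so that it runs anywhere.

```python
import importlib.util
from pathlib import Path


def package_dir(name: str) -> Path:
    """Return the directory that holds the importable package `name`."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f'{name} is not installed in this environment')
    return Path(spec.origin).parent


# Demo with a standard-library package.
print(package_dir('json'))
```

With pytesseract installed, package_dir('pytesseract') would point at the directory that contains the pytesseract.py file mentioned above.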

@willdave865
Author

willdave865 commented Apr 9, 2023 via email

@stefan6419846
Contributor

> I don't see what part of this debug logging shows me which CLI arguments pytesseract passes to Tesseract itself. What I see is extra escaping of path components, a reference to tess_1zkcz_p9 and finally a reference to 'txt'?

The log entry shows the parameters for the actual subprocess call. It consists of the following parts:

  • C:\\\\Program Files\\\\Tesseract-OCR\\\\tesseract is the actual Tesseract binary, which you specify via tesseract_cmd. Because your value is a raw string (r prefix) that already contains doubled backslashes, the logged representation escapes each of them again; in a regular string, a single \ would otherwise start a (possibly invalid) escape sequence.
  • C:\\ocr\\target\\31832_226140__0001-00002b.jpg is your input file you have specified. As you already escaped your backslashes, no transformations have been done.
  • C:\\Users\\david\\AppData\\Local\\Temp\\tess_1zkcz_p9 is a temporary file which acts as the reference name/basename for any output files Tesseract will generate.
  • The DPI, PSM and configuration parameters (-c) are the parsed version of the config parameter you are passing.
  • txt tells Tesseract which configuration file to use - in this case to generate a plain text .txt file (due to pytesseract.image_to_string). See the CONFIGFILE section of https://manpages.ubuntu.com/manpages/jammy/man1/tesseract.1.html for example to see common values.

This list is passed to subprocess.Popen (could be subprocess.run as well) and evaluated there. The final internal system call basically joins the components together using ' '.join(), while correctly adding quotes around components where necessary.
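The joining and quoting step can be previewed with subprocess.list2cmdline, a helper in CPython's subprocess module that implements the Windows quoting rules (it is not part of the documented public API, so treat this as a sketch). The argument list is hypothetical: the file names and the shortened whitelist are placeholders.

```python
import subprocess

# Hypothetical argument list; file names and whitelist are placeholders.
cmd_args = [
    r'C:\Program Files\Tesseract-OCR\tesseract',
    'input.jpg',
    'output_base',
    '--psm', '6',
    '-c', 'tessedit_char_whitelist=abc. ',
    'txt',
]

# Arguments containing spaces are wrapped in double quotes.
command_line = subprocess.list2cmdline(cmd_args)
print(command_line)
```

Only the elements that contain a space, here the binary path and the whitelist option, end up quoted in the resulting command line.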

@willdave865
Author

willdave865 commented Apr 9, 2023 via email

@stefan6419846
Contributor

Yes, for the tesseract_cmd, your double backslashes are not required due to the raw string. The translation essentially is:

>>> r'C:\Program Files\Tesseract-OCR\tesseract' == 'C:\\Program Files\\Tesseract-OCR\\tesseract'
True
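The four backslashes in the log entry come from combining both conventions. A small sketch with a shortened path shows the difference:

```python
# Raw string with single backslashes: one backslash per separator.
single = r'C:\Program Files'

# Raw string with doubled backslashes: two backslashes per separator.
doubled = r'C:\\Program Files'

print(len(doubled) - len(single))  # one extra character
print(repr(doubled))               # repr escapes each backslash again
```

Many Windows tools tolerate the repeated separator, but the raw string with single backslashes is the form that matches the intended path.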

The image path is not a raw string in this case, so the \t and \31 escape sequences have to be escaped properly with additional backslashes, as you do. You could write it as a raw string as well, avoiding the escapes:

>>> 'C:\ocr\target\31832_226140__0001-00002b.jpg'
'C:\\ocr\target\x19832_226140__0001-00002b.jpg'
>>> print('C:\ocr\target\31832_226140__0001-00002b.jpg')
C:\ocr	arget832_226140__0001-00002b.jpg
>>> r'C:\ocr\target\31832_226140__0001-00002b.jpg' == 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
True
