I think you need to improve character recognition by using and implementing ChatGPT in OCR #525

Closed
me-suzy opened this issue Nov 22, 2023 · 3 comments

Comments

@me-suzy

me-suzy commented Nov 22, 2023

This is a screenshot of a page from the Internet Archive. I tested Mathpix, and it recognizes only about 55% of the characters correctly. That is acceptable, but I think you should incorporate an AI such as ChatGPT into the character recognition process to correct words whose characters were not recognized properly. ChatGPT, Bard, and similar AIs understand language and vocabulary, so they can recognize and correct misspelled words. In essence, if a word is missing a few letters or some letters are hard to distinguish, ChatGPT can reconstruct the word and add the correct diacritics at the same time.

image

@stefan6419846
Contributor

pytesseract is a simple wrapper around Tesseract, which itself already uses a trainable neural net (and thus ML, as you propose), so you should probably redirect your enhancement proposal there. Personally, I do not think that LLMs will be incorporated directly as you propose; instead, they should be part of your own implementation that relies on Tesseract.

With some custom training, you can already improve the overall detection quality for specific use cases, including extended handling of ligatures etc. Even without training, Tesseract and pytesseract already give you "low level" access to the OCR results, for example TSV data including confidences, or hOCR output, which you can feed into any other post-processing step you like.
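
A rough sketch of how that confidence data might be used, not a definitive implementation: `pytesseract.image_to_data` with `Output.DICT` returns per-word `text` and `conf` lists, and words below some threshold can be picked out for whatever correction step you prefer. The helper names `low_confidence_words` and `ocr_words`, the threshold of 60, and the default `lang="eng"` are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple


def low_confidence_words(
    data: Dict[str, list], threshold: float = 60.0
) -> List[Tuple[int, str, float]]:
    """Return (index, text, confidence) for recognized words below `threshold`.

    `data` is expected to have the shape produced by
    pytesseract.image_to_data(..., output_type=Output.DICT).
    The threshold value is an illustrative assumption, not a recommendation.
    """
    flagged = []
    for i, (text, conf) in enumerate(zip(data["text"], data["conf"])):
        conf = float(conf)  # conf may arrive as int, float, or str
        if conf < 0 or not text.strip():
            continue  # conf == -1 marks non-word rows (page, block, line)
        if conf < threshold:
            flagged.append((i, text, conf))
    return flagged


def ocr_words(image_path: str, lang: str = "eng") -> Dict[str, list]:
    """Run Tesseract on an image and return word-level data as a dict.

    Hypothetical convenience wrapper; `lang` must match an installed
    Tesseract language pack.
    """
    # Imported lazily so low_confidence_words() works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
```

The flagged indices point back into the same `text` list, so a post-processing step of your own (a dictionary lookup, a spell checker, or an LLM call) can replace only the uncertain words and leave the high-confidence ones untouched.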

@me-suzy
Author

me-suzy commented Nov 24, 2023

Also, Tesseract cannot OCR this kind of document very well, especially when the writing is slanted.

https://archive.org/details/florenski-pavel-iconostasul-scan_202311/page/n33/mode/2up

@stefan6419846
Contributor

Sorry, but this still is out of scope for pytesseract. Please discuss such issues on the Tesseract mailing list.


3 participants