Improve paragraph detection #502

Open
Pawwwle opened this issue Sep 8, 2023 · 4 comments
Labels
enhancement New feature or request ocr
Milestone

Comments


Pawwwle commented Sep 8, 2023

What happened?

The application does not preserve paragraph breaks (newlines) in the captured text. Please fix this, because inserting the breaks manually is very cumbersome.

[Image: screenshot, 2023-08-27 195351]

How did you install NormCap?

MSI installer (Windows)

Operating System + Version?

Windows 10/11

[Linux only] Display Server (DS) + Desktop environment (DE)?

No response

Debug log output?*

No response

@Pawwwle Pawwwle added bug Something isn't working triage Needs confirmation and prioritization labels Sep 8, 2023
@dynobo dynobo changed the title from "The application does not use paragraphs (newlines)" to "Improve paragraph detection" Sep 19, 2023
@dynobo dynobo added enhancement New feature or request ocr and removed bug Something isn't working triage Needs confirmation and prioritization labels Sep 19, 2023
Owner

dynobo commented Sep 19, 2023

Hi @Pawwwle, what you are experiencing is NormCap's "parse" mode, which tries to detect certain common text layouts and automatically (re-)formats the text accordingly.

In your example, NormCap detects the selected text as a "Paragraph" (multiple lines of continuous text). In that situation you usually don't want to preserve the line breaks, which is why they get removed.

However, in your case this is a false detection. NormCap should instead have detected the text as "Multiline" (multiple lines of text that are not continuous, such as lists). Then it would have preserved the line breaks.
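To make the difference between the two layouts concrete, here is a minimal sketch of how the formatting step could look once a layout has been decided. The function name `format_detected_lines` and the labels `"paragraph"` and `"multiline"` are hypothetical and only illustrate the idea; this is not NormCap's actual code.

```python
def format_detected_lines(lines: list[str], layout: str) -> str:
    """Join OCR text lines according to a detected layout label.

    The layout labels are assumptions made for this sketch.
    """
    # Drop empty lines and surrounding whitespace before joining.
    cleaned = [line.strip() for line in lines if line.strip()]
    if layout == "paragraph":
        # Continuous text: line breaks are only visual wrapping, so drop them.
        return " ".join(cleaned)
    if layout == "multiline":
        # Independent lines (e.g. list items): keep one line per item.
        return "\n".join(cleaned)
    raise ValueError(f"unknown layout: {layout!r}")
```

With a list that is wrongly labelled as a paragraph, all items end up on one line, which matches the behaviour reported above.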

Workaround:
As a short-term solution, when "parse" mode fails and returns unexpected results, try switching NormCap's "Capture Mode" to "raw", which outputs the text exactly as detected by Tesseract (including line breaks):



The desired solution: (to be implemented)
The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

  • Lines of similar length should indicate "Paragraphs"; lines of differing length should indicate "Multilines"
  • Relatively small gaps between lines should indicate "Paragraphs"; larger gaps between lines should indicate "Multilines"
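The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `LineBox` type, the function name `classify_layout`, and both threshold values are made up for the example and would need tuning against real captures.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class LineBox:
    """Bounding box of one detected text line, in pixels (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int


def classify_layout(
    boxes: list[LineBox],
    max_width_spread: float = 0.25,  # assumed threshold
    max_gap_ratio: float = 0.6,      # assumed threshold
) -> str:
    """Return "paragraph" or "multiline" based on line widths and gaps."""
    if len(boxes) < 2:
        return "multiline"
    ordered = sorted(boxes, key=lambda b: b.top)

    # The last line of a paragraph is often short, so skip it when
    # comparing widths (only if enough lines remain to compare).
    body = ordered[:-1] if len(ordered) > 2 else ordered
    widths = [b.width for b in body]
    widest = max(widths)
    median_height = median(b.height for b in ordered)
    if widest <= 0 or median_height <= 0:
        return "multiline"
    width_spread = (widest - min(widths)) / widest

    # Vertical gap between the bottom of one line and the top of the next.
    gaps = [
        max(0, nxt.top - (cur.top + cur.height))
        for cur, nxt in zip(ordered, ordered[1:])
    ]
    gap_ratio = mean(gaps) / median_height

    similar_width = width_spread <= max_width_spread
    small_gaps = gap_ratio <= max_gap_ratio
    return "paragraph" if similar_width and small_gaps else "multiline"
```

Requiring both conditions makes the sketch lean towards "multiline", which keeps line breaks when in doubt; whether that is the right default is a design choice.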

Author

Pawwwle commented Sep 19, 2023

I ran further tests. Unfortunately, neither mode reflects the original text layout.

[Image: screenshot, 2023-09-19 201415]

[Image: screenshot, 2023-09-19 201516]

@dynobo dynobo added this to the 0.5.0 milestone Oct 2, 2023
Owner

dynobo commented Oct 8, 2023

Thanks for trying, @Pawwwle. This does indeed seem odd. I'll take a look at it and hopefully get it improved a bit for the 0.5.0 final version 🙂

Owner

dynobo commented Nov 5, 2023

I was able to identify an issue in "raw"-mode which caused some missing line-breaks. I was also able to improve the paragraph parsing a bit:

With 8ad2f6a, when detecting the image ...
test image

... then "raw"-mode comes quite close to original layout:

The desired solution: (to be implemented)
The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

« Lines of similar length should indicate "Paragraphs”, different length indicate "Multilines"

« Relatively small gaps between lines should indicate "Paragraphs”, larger gaps between lines indicate "Multilines"

... while "parse"-mode does swallow the first line-break:

The desired solution: (to be implemented) The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:
« Lines of similar length should indicate "Paragraphs", different length indicate "Multilines"
« Relatively small gaps between lines should indicate "Paragraphs", larger gaps between lines indicate "Multilines"

This is not ideal here, but in most cases you don't want to preserve such intra-paragraph line-breaks, as those should be added by the application you are pasting the text into, depending on the supported line-width. So I guess this is fine.

The results for the example picture in the initial bug report are also a bit better, but Tesseract detects an intra-paragraph line-break between the first and second bullet points, while it detects a paragraph break between the second and third bullet points:

"parse"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie
E kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy G sprzedajesz z darmową dostawą do Paczkomatów
0 zwiększasz atrakcyjność swoich ogłoszeń, dzięki oznaczeniu Smart!

"raw"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie

Eb kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy
G sprzedajesz z darmową dostawą do Paczkomatów

9 zwiększasz atrakcyjność swoich ogłoszeń, dzieki oznaczeniu Smart!

I'm afraid this is as good as I can get it for now, as this is a limitation of Tesseract... 🙁
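For context on where the two kinds of breaks come from: Tesseract's TSV output tags every word with `block_num`, `par_num` and `line_num`. The following is a minimal sketch of rebuilding text from such rows, assuming they are already parsed into dicts and in reading order. The helper name `rebuild_text` is hypothetical, and whether NormCap assembles its "raw" output this way is an assumption.

```python
from itertools import groupby


def rebuild_text(words: list[dict]) -> str:
    """Rebuild text from word rows carrying Tesseract-style layout numbers.

    Each dict needs the keys block_num, par_num, line_num and text.
    Rows are assumed to be in reading order.
    """
    words = [w for w in words if w["text"].strip()]
    paragraphs = []
    # A new (block, paragraph) pair starts a new paragraph.
    for _, par_words in groupby(
        words, key=lambda w: (w["block_num"], w["par_num"])
    ):
        # Within a paragraph, a new line number is only a line break.
        lines = [
            " ".join(w["text"] for w in line_words)
            for _, line_words in groupby(par_words, key=lambda w: w["line_num"])
        ]
        paragraphs.append("\n".join(lines))
    # Paragraphs are separated by a blank line, lines by a single newline.
    return "\n\n".join(paragraphs)
```

If Tesseract assigns two bullet points to the same paragraph, as in the example above, they are separated only by a single newline, and any later step that unwraps paragraphs will merge them.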

@dynobo dynobo modified the milestones: 0.5.0b2, 0.5.0 Nov 5, 2023