Improve paragraph detection #502

Pawwwle · 2023-09-08T07:41:22Z

What happened?

The application does not use paragraphs (newlines). Please fix it because inserting paragraphs manually is very cumbersome.

How did you install NormCap?

MSI installer (Windows)

Operating System + Version?

Windows 10/11

[Linux only] Display Server (DS) + Desktop environment (DE)?

No response

Debug log output?*

No response

dynobo · 2023-09-19T16:21:18Z

Hi @Pawwwle, what you experience is NormCap's "parse" mode, which tries to detect certain common text layouts and automatically (re-)format the text accordingly.

In your example, NormCap detects the selected text as a "Paragraph" (multiple lines of continuous text). In that situation you usually don't want to preserve the line breaks, so they get removed.

However, in your case this is a false detection. NormCap should instead have detected the text as "Multiline" (multiple lines of text that are not continuous, like lists), in which case it would have preserved the line breaks.

Workaround:
As a short-term solution, when the "parse" mode fails and returns unexpected results, try switching NormCap's "Capture Mode" to "raw", which will output the text exactly as detected by Tesseract (including line breaks).

The desired solution: (to be implemented)
The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

Lines of similar length should indicate "Paragraphs", different length indicate "Multilines"
Relatively small gaps between lines should indicate "Paragraphs", larger gaps between lines indicate "Multilines"
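
The two rules above could be sketched roughly as follows. This is only an illustration, not NormCap's actual code: the `LineBox` type, the `classify_layout` function and both thresholds are hypothetical, and the sketch assumes one bounding box per detected line. It also leaves the last line out of the width comparison, since the final line of a paragraph is often shorter.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class LineBox:
    """Bounding box of one detected text line, in pixels (hypothetical type)."""

    left: int
    top: int
    width: int
    height: int


def classify_layout(
    lines: list[LineBox],
    max_width_spread: float = 0.25,  # assumed threshold, would need tuning
    max_gap_ratio: float = 0.6,  # assumed threshold, would need tuning
) -> str:
    """Return "paragraph", "multiline" or "single line" for the given boxes."""
    if len(lines) < 2:
        return "single line"

    ordered = sorted(lines, key=lambda box: box.top)

    # Rule 1: lines of similar length hint at a paragraph.
    # The last line is skipped when there are more than two lines.
    body = ordered[:-1] if len(ordered) > 2 else ordered
    widths = [box.width for box in body]
    width_spread = pstdev(widths) / (mean(widths) or 1)

    # Rule 2: small vertical gaps (relative to line height) hint at a paragraph.
    gaps = [
        max(0, nxt.top - (cur.top + cur.height))
        for cur, nxt in zip(ordered, ordered[1:])
    ]
    gap_ratio = mean(gaps) / (mean(box.height for box in ordered) or 1)

    if width_spread <= max_width_spread and gap_ratio <= max_gap_ratio:
        return "paragraph"
    return "multiline"
```

Requiring both rules to agree before reporting "paragraph" makes the sketch err on the side of keeping line breaks, which is the less destructive failure for list-like text.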

Pawwwle · 2023-09-19T18:41:29Z

I ran further tests. Unfortunately, neither mode reflects the original text layout.

dynobo · 2023-10-08T21:34:39Z

Thanks for trying, @Pawwwle. This indeed seems odd. I'll take a look at it and hopefully get it improved a bit for the 0.5.0 final version 🙂

dynobo · 2023-11-05T02:37:45Z

I was able to identify an issue in "raw"-mode which caused some missing line-breaks. I was also able to improve the paragraph parsing a bit:

With 8ad2f6a, when detecting the image ...

... then "raw"-mode comes quite close to the original layout:

The desired solution: (to be implemented)
The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:

« Lines of similar length should indicate "Paragraphs”, different length indicate "Multilines"

« Relatively small gaps between lines should indicate "Paragraphs”, larger gaps between lines indicate "Multilines"

... while "parse"-mode does swallow the first line-break:

The desired solution: (to be implemented) The "Paragraph" heuristic should be improved by taking the dimensions of the detection boxes into account:
« Lines of similar length should indicate "Paragraphs", different length indicate "Multilines"
« Relatively small gaps between lines should indicate "Paragraphs", larger gaps between lines indicate "Multilines"

This is not ideal here, but in most cases you don't want to preserve such intra-paragraph line-breaks, as those should be added by the application you are pasting the text into, depending on the supported line-width. So I guess this is fine.
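
To illustrate what dropping intra-paragraph line-breaks means, here is a minimal sketch. It is not NormCap's implementation: the function name `unwrap_paragraphs` is hypothetical, and it assumes paragraphs in the raw text are separated by a blank line while single line-breaks are mere wrapping.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join wrapped lines inside each paragraph, keep one break between paragraphs."""
    # A blank line (possibly containing whitespace) is treated as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Within a paragraph, any whitespace run (including newlines) becomes one space.
    return "\n".join(" ".join(paragraph.split()) for paragraph in paragraphs)
```

With this, the application receiving the pasted text can re-wrap each paragraph to its own line width.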

The results for the example picture in the initial bug report are also a bit better, but Tesseract detects an intra-paragraph line-break between the first and second bullet points, while it detects a paragraph break between the second and third bullet points:

"parse"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie
E kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy G sprzedajesz z darmową dostawą do Paczkomatów
0 zwiększasz atrakcyjność swoich ogłoszeń, dzięki oznaczeniu Smart!

"raw"-mode:

Bądź Smart! i kupuj oraz sprzedawaj z darmową dostawą na Allegro Lokalnie

Eb kupujesz bez kosztów przesytki przy zakupie za min. 45 zł od jednego sprzedawcy
G sprzedajesz z darmową dostawą do Paczkomatów

9 zwiększasz atrakcyjność swoich ogłoszeń, dzieki oznaczeniu Smart!

I'm afraid this is as good as I can get it for now, as this is a limitation of Tesseract... 🙁

Pawwwle added bug Something isn't working triage Needs confirmation and priotization labels Sep 8, 2023

dynobo changed the title ~~The application does not use paragraphs (newlines)~~ Improve paragraph detection Sep 19, 2023

dynobo added enhancement New feature or request ocr and removed bug Something isn't working triage Needs confirmation and priotization labels Sep 19, 2023

dynobo added this to the 0.5.0 milestone Oct 2, 2023

dynobo mentioned this issue Oct 8, 2023

Please help by testing 0.5.0-beta1 #536

Closed

dynobo modified the milestones: 0.5.0, 0.5.0b2 Oct 25, 2023

dynobo modified the milestones: 0.5.0b2, 0.5.0 Nov 5, 2023
