Poetry from dirty OCR

eliaspro · on March 23, 2023

This reminds me of the experiment that ran paint splatters through OCR to check whether the result was valid Perl code (spoiler: 93% evaluated just fine).

https://www.mcmillen.dev/sigbovik/

vintermann · on March 23, 2023

OCR is hard, but maybe we can make some real progress on it now with modern AI. A context-smart church records handwriting transcriber would be pretty great.

msla · on March 23, 2023

The problem with that is the same as the problem with all modern AI, it seems: AI hallucinations, which get more plausible the better the model is. Humans see things, too, but humans can have the insight to know when they're unsure and leave lacunae in the transcription with a note to come back later and discuss it with others. AIs don't have that ability, and so always seem certain.

vintermann · on March 24, 2023

No, they already have the ability to be uncertain if you are careful about what you ask for. And I don't buy that it's some fundamental weakness: I'm sure we'll be much better at dealing with uncertainty a month and a year from now.

But of course, they will always sometimes be wrong, just like humans.

flangola7 · on March 24, 2023

It's not cheap, but GPT-4 handles it. I'm in the image processing private beta and its failure rate on text is well below the human baseline. Cursive, damaged, pixelated, a 5-year-old's handwriting, weird lighting or angles, highly stylized, artificially distorted or obstructed, poor contrast. Doesn't matter.

My instincts tell me they haven't made it public yet because it will end captchas for good and they're uneasy about rug pulling the entire public internet. Any image obfuscated to a point of defeating the LLM will also defeat the majority of humans.

vintermann · on March 24, 2023

Can I suggest some truly horrible church record handwriting for you to try?

Here's a fairly moderate example:

https://media.digitalarkivet.no/view/7442/155

conceptme · on March 23, 2023

Transkribus (https://readcoop.eu) is quite nice for handwritten OCR.

driscoll42 · on March 23, 2023

Know of any good handwritten OCR libraries that are FOSS? And ideally can be called from Python? TrOCR is the best I've seen so far, though it's not amazing.

version_five · on March 23, 2023

  I've poured over ((ok, grepped) ~500GB of Chroincling America data to find lines that meet my low standard for nonsene, basically ones that match egrep "[^a-zA-Z0-9 ]{3,}"

I'm super curious to know how fast this was. grep is generally very fast and this should be doable on a normal computer, though it might take a little while.
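For anyone who wants to try the same filter outside the shell, here is a minimal Python sketch of it. It assumes the data is plain text with one OCR line per input line; the name `noisy_lines` and the sample strings are made up for illustration, and this is not how the original author ran it.

```python
import re
from typing import Iterable, Iterator

# Same pattern as the quoted egrep: three or more consecutive characters
# that are not ASCII letters, digits, or spaces.
NOISE = re.compile(r"[^a-zA-Z0-9 ]{3,}")


def noisy_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines containing a run of 3+ non-alphanumeric, non-space characters."""
    for line in lines:
        # Drop the trailing newline so it cannot count toward a run,
        # which matches how grep sees each line.
        text = line.rstrip("\n")
        if NOISE.search(text):
            yield text


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the real data set.
    sample = [
        "The Chroincling of America\n",
        "t~~^ .;,' of the orthe county\n",
        "Plain words, nothing odd.\n",
    ]
    for hit in noisy_lines(sample):
        print(hit)
```

Because the function takes any iterable of strings, an open file object can be passed in directly and lines are streamed rather than loaded all at once. A pure-Python loop like this will most likely be slower than grep on hundreds of gigabytes; it is only meant to show what the pattern selects.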

totetsu · on March 24, 2023

There are faster tools than grep for dealing with large files. ag (The Silver Searcher) works okay.

version_five · on March 24, 2023

  silver searcher

Thanks for the reminder, I knew something like this existed but I couldn't remember what it was called!

tgv · on March 23, 2023

Would the typos (Chroincling, orthe, etc.) be on purpose?

riceart · on March 23, 2023

That regex does not match typos.

chaps · on March 23, 2023

Spent a load of time doing OCR and dealing with its failures... this is absolutely wonderful, thanks for sharing!

BubbleRings · on March 23, 2023

Yes, sir, we got a parrot.