Poetry from dirty OCR (github.com/bibliotechy)
63 points by MaysonL on March 23, 2023 | 15 comments



This reminds me of the experiment that ran paint splatters through OCR and checked whether the result was valid Perl code (spoiler: 93% evaluated just fine).

https://www.mcmillen.dev/sigbovik/


OCR is hard, but maybe we can make some real progress on it now with modern AI. A context-smart church records handwriting transcriber would be pretty great.


The problem with that is the same as the problem with all modern AI, it seems: AI hallucinations, which get more plausible the better the model is. Humans see things, too, but humans can have the insight to know when they're unsure and leave lacunae in the transcription with a note to come back later and discuss it with others. AIs don't have that ability, and so always seem certain.


No, they already have the ability to be uncertain if you are careful what you ask for. And I don't buy that it's some fundamental weakness: I'm sure we'll be much better at dealing with uncertainty a month and a year from now.

But of course, they will always sometimes be wrong, just like humans.


It's not cheap, but GPT-4 handles it. I'm in the image processing private beta and its failure rate on text is well below the human baseline. Cursive, damaged, pixelated, 5-year-old handwriting, weird lighting or angles, highly stylized, artificially distorted or obstructed, poor contrast. Doesn't matter.

My instincts tell me they haven't made it public yet because it will end captchas for good and they're uneasy about rug-pulling the entire public internet. Any image obfuscated to the point of defeating the LLM will also defeat the majority of humans.


Can I suggest some truly horrible church record handwriting for you to try?

Here's a fairly moderate example:

https://media.digitalarkivet.no/view/7442/155


Transkribus (https://readcoop.eu) is quite nice for handwritten OCR.


Know of any good handwritten OCR libraries that are FOSS? And ideally ones that can be called from Python? TrOCR is the best I've seen so far, though it's not amazing.


  I've poured over ((ok, grepped) ~500GB of Chroincling America data to find lines that meet my low standard for nonsene, basically ones that match egrep "[^a-zA-Z0-9 ]{3,}"
I'm super curious to know how fast this was. grep is generally very fast and this should be doable on a normal computer, though it might take a little while.
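
For a sense of what that filter does, here is a minimal Python sketch of the same idea, assuming the data is plain text read line by line. Only the pattern comes from the quoted post; the helper name `noisy_lines` and the sample lines are hypothetical, and this says nothing about how fast the original grep run was.

```python
import re
from typing import Iterable, Iterator

# Pattern quoted from the post: a run of three or more characters
# that are not ASCII letters, digits, or spaces.
NOISE = re.compile(r"[^a-zA-Z0-9 ]{3,}")


def noisy_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines containing a match (hypothetical helper, not the author's code)."""
    for line in lines:
        # Strip the newline first: grep never sees it, but Python's
        # negated character class would count it as a matching character.
        text = line.rstrip("\n")
        if NOISE.search(text):
            yield text


# Made-up sample lines for illustration only.
sample = [
    "The quick brown fox\n",
    "tbe qnick ~~^* brown f0x\n",
    "price: $5.00\n",
]
result = list(noisy_lines(sample))
```

Since a file object iterates line by line, passing `open(path, errors="replace")` to the same helper would stream a large file without loading it into memory.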


There are faster tools than grep for dealing with large files. ag (The Silver Searcher) works okay.


  silver searcher
Thanks for the reminder, I knew something like this existed but I couldn't remember what it was called!


Would the typos (Chroincling, orthe, etc.) be on purpose?


That regex does not match typos.
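
A quick way to see why, sketched in Python on the assumption that its `re` module treats this pattern the same way egrep does. The second test string is invented for illustration.

```python
import re

# Same character class as the egrep pattern in the post.
PATTERN = re.compile(r"[^a-zA-Z0-9 ]{3,}")

# A misspelled word is still made of letters, so nothing matches.
typo_match = PATTERN.search("Chroincling America")

# A run of three or more symbols (made-up OCR debris) does match.
junk_match = PATTERN.search("Chron|<~!ing Amer")
```

Here `typo_match` is `None`, while `junk_match` covers the run of symbols in the middle of the second string.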


Spent a load of time doing OCR and dealing with its failures... this is absolutely wonderful, thanks for sharing!


Yes, sir, we got a parrot.



