OCR-free Document Understanding Transformer

Kim, Geewook; Hong, Teakgyu; Yim, Moonbin; Nam, Jeongyeon; Park, Jinyoung; Yim, Jinyeong; Hwang, Wonseok; Yun, Sangdoo; Han, Dongyoon; Park, Seunghyun

arXiv:2111.15664v2 (cs)

[Submitted on 30 Nov 2021 (v1), revised 21 Jul 2022 (this version, v2), latest version 6 Oct 2022 (v5)]

Abstract: Understanding document images (e.g., invoices) is a core but challenging task since it requires complex functions such as reading text and a holistic understanding of the document. Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) high computational costs for using OCR; 2) inflexibility of OCR models on languages or types of document; 3) OCR error propagation to the subsequent process. To address these issues, in this paper, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As the first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show a simple OCR-free VDU model, Donut, achieves state-of-the-art performance on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that helps the model pre-training to be flexible in various languages and domains. The code, trained model and synthetic data are available at this https URL.
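The abstract names a cross-entropy loss as the pre-training objective. As a rough illustration only, and not the Donut implementation, the sketch below computes the mean token-level cross-entropy for a sequence of next-token predictions. The function names, the toy vocabulary size and the logit values are all hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step, target_ids):
    """Mean negative log-likelihood of the target token at each step.

    logits_per_step: one list of vocabulary scores per output position
                     (hypothetical decoder outputs).
    target_ids:      the index of the correct token at each position.
    """
    if len(logits_per_step) != len(target_ids):
        raise ValueError("one target id is required per step")
    if not target_ids:
        raise ValueError("at least one step is required")
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example with an assumed vocabulary of 4 tokens.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
confident_loss = cross_entropy([[0.0, 0.0, 5.0, 0.0]], [2])
```

With uniform scores over four tokens the loss equals log(4), about 1.386, and it drops as the score of the correct token rises, which is the quantity that training would push down.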

Comments:	ECCV 2022
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2111.15664 [cs.LG]
	(or arXiv:2111.15664v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2111.15664

Submission history

From: Geewook Kim
[v1] Tue, 30 Nov 2021 18:55:19 UTC (4,992 KB)
[v2] Thu, 21 Jul 2022 16:10:17 UTC (5,924 KB)
[v3] Tue, 23 Aug 2022 10:30:19 UTC (5,924 KB)
[v4] Tue, 4 Oct 2022 13:34:02 UTC (5,928 KB)
[v5] Thu, 6 Oct 2022 06:50:39 UTC (5,928 KB)
