Document Understanding Evaluation Framework

This repository provides a structured framework for evaluating the performance of Document Understanding models on various tasks, including Layout Analysis, Text Extraction (OCR), Table Detection, and Table Extraction. The models currently supported are the Azure Layout model, the Surya model, and the Table Transformer model from Microsoft Research.

OCR on documents is fundamental to many applications. Given the high costs of commercial models and the existence of accurate open-source alternatives, benchmarking document understanding models across a variety of tasks and datasets helps organizations optimize the performance of their systems while managing their resources effectively. Evaluating document understanding models is challenging because of the lack of standardized output formats and the variety of tasks involved. This repository aims to address these challenges by providing a canonical structure for evaluating models across different tasks, datasets, and metrics.

Contributing

This repo is a work in progress, so contributions to this evaluation framework are welcome! To add a model:

  1. Examine how models for the same task are defined.
  2. Implement the following methods in the model class (see the sketch after this list):
    • load_models
    • run_ocr
    • results_transform
  3. Check the docstring of other model classes for guidance on the output format of results_transform.
  4. Import the model in the models/__init__.py file.
  5. Alter the prepare_models method in the relevant task evaluation script in the evaluations folder to initialize and run the model.
  6. Edit the main.py file to allow a new model_name to be inputted.
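
As a rough guide, a new model class might look like the sketch below. The class name, constructor arguments, and the canonical output fields are illustrative assumptions, not the framework's actual API; check the existing model classes and the results_transform docstrings for the exact interface.

```python
# Illustrative only: names and the canonical output schema are assumptions,
# not the framework's exact API. See existing model classes for the real one.
class MyOCRModel:
    def __init__(self, model_weights_dir=None):
        self.model_weights_dir = model_weights_dir
        self.model = None

    def load_models(self):
        # Load checkpoints, set up API clients, etc.
        self.model = object()  # placeholder for the real model or client
        return self.model

    def run_ocr(self, doc_paths):
        # Run inference on each document and return the raw model outputs.
        return [{"path": p, "raw": None} for p in doc_paths]

    def results_transform(self, raw_results):
        # Convert raw outputs into the canonical format the evaluation expects
        # (hypothetical fields shown here; see other classes' docstrings).
        return [
            {"doc": r["path"], "bboxes": [], "labels": []}
            for r in raw_results
        ]
```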

To add a new metric:

  • Add the metric to the run_eval method in the evaluation class.

To add a new dataset:

  • Convert the dataset to the canonical form in the evaluation class.
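
For example, converting a COCO-style layout dataset (such as PubLayNet) into a per-image list of boxes and labels might look like the sketch below. The actual canonical schema is defined in the evaluation classes, so treat the field names here as placeholders.

```python
import json

# Hypothetical conversion of COCO-style annotations into a canonical
# {image_id: [{"bbox": [...], "label": ...}, ...]} mapping. The real
# canonical form lives in the evaluation classes; adapt accordingly.
def to_canonical(coco_json_path):
    with open(coco_json_path) as f:
        coco = json.load(f)
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    records = {}
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]            # COCO bbox is [x, y, width, height]
        records.setdefault(ann["image_id"], []).append({
            "bbox": [x, y, x + w, y + h],   # convert to [x0, y0, x1, y1]
            "label": categories[ann["category_id"]],
        })
    return records
```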

Other Potential Models to Test

Consider testing the following models:

  • Tesseract OCR (in progress)
  • LayoutLMv3 (I looked into this. I believe this requires bboxes and text, so I'm not sure how to fairly evaluate it.)
  • TrOCR (requires line detection. The current text extraction evaluation combines line detection and OCR)
  • VGT (high reported performance on PubLayNet)
  • VSR (high reported performance on PubLayNet)
  • Textract (commercial)
  • PaddleOCR
  • EasyOCR
  • OCR-D
  • Calamari OCR

Datasets to Test

We are looking for additional datasets to test, particularly text extraction datasets. There are a number of datasets for layout and tables, but fewer for text (perhaps because it is considered easier?).

Models

The framework currently supports the following models:

Azure

The Azure Layout model is part of Azure Document Intelligence, and it performs all of the tasks above. It does not allow the tasks to be run separately.

Tasks: Layout Analysis, Text Extraction, Table Detection, Table Extraction

Surya

Surya is a toolkit of open-source models (weights are conditionally licensed for commercial use depending on revenue) developed by Vik Paruchuri. It offers accurate text extraction and layout analysis capabilities.

Tasks: Layout Analysis, Text Extraction

Table Transformer

Table Transformer is an open-source model for table detection and extraction built by Microsoft Research. It is trained on the PubTables-1M and FinTabNet datasets.

Tasks: Table Detection, Table Extraction

Usage

Installation

  1. Clone the repository.
  2. Install the required dependencies:
pip install -r requirements.txt
  3. Download the relevant datasets and run any processing scripts (processing is only needed for FinTabNet).
  4. If using commercial models like Azure, create a .env file in the doceval folder with your API keys (an example .env is shown after this list). For Azure, the code looks up the keys as follows:
endpoint = str(os.getenv('AZURE_API_URL'))
key = str(os.getenv('AZURE_API_KEY'))
  5. Clone the Table-Transformer repo into the doceval folder:
https://github.com/microsoft/table-transformer.git
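
For reference, the .env file would contain entries like the following (placeholder values, matching the variable names read above):

```
AZURE_API_URL=https://<your-resource>.cognitiveservices.azure.com/
AZURE_API_KEY=<your-key>
```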

Running Evaluations

To run evaluations, use the following command:

python -m doceval.main --evals <task> --model_names <models> --dataset_gt_name <dataset>

Replace <task>, <models>, and <dataset> with the desired values.

Additional flags:

  • --dataset_root_dir: Specify the root directory for the dataset. If not provided, a default directory is used.
  • --model_weights_dir: Specify the directory containing model weights.
  • --metrics: Specify the evaluation metrics (e.g., precision, recall, text similarity).
  • --max_doc: Specify the maximum number of documents to evaluate.
  • --visualize: Enable visualization of evaluation results.

Example usages:
python -m doceval.main --evals layout --model_names Azure,Surya --dataset_gt_name publaynet --metrics precision,recall --max_doc 10 --visualize
python -m doceval.main --evals text_extraction --model_names Azure,Surya --dataset_gt_name vik_text_extraction_bench  --max_doc 10
python -m doceval.main --evals table_detection --model_names Azure,Table_Transformer --dataset_gt_name pubtables --max_doc 10 --visualize
python -m doceval.main --evals table_extraction --model_names Azure,Table_Transformer --dataset_gt_name fintabnet --max_doc 10 --visualize

NOTE: the evaluation will look in the data_ocr_results folder for OCR results. If a pkl file is found for the model+task combination, it will load that file rather than rerun OCR.
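
To sanity-check a cached file, you can unpickle it directly; the path and filename below are taken from the Results section further down, and the inner structure depends on the model's results_transform, so this is just for inspection:

```python
import pickle

# Peek at a cached OCR results file (filename from the Results section).
with open("data/ocr_results/Azure_layout_results.pkl", "rb") as f:
    results = pickle.load(f)

print(type(results))
print(results[0] if isinstance(results, list) and results else results)
```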

Data

  1. Clone the repo.
  2. Download and process the relevant datasets:
  • Layout and text extraction datasets:
    • Run the load_data script in the data folder.
    • Specify the destination directories as:
      • data/layout_bench/publaynet
      • data/text_extraction_bench/vik_text_extraction
  • Table detection dataset:
    • Download the testing set of the PubTables-1M dataset by following the directions here.
    • Save the PubTables-1M-Detection folder in data/table_detection_bench/pubtables.
  • Table extraction dataset:
    • Download the testing set of the FinTabNet dataset by following the directions here.
    • Save the dataset to data/table_extraction_bench/fintabnet/fintabnet_raw.
    • Run the utils/process_fintabnet.py file (from Microsoft).
    • Save the results to data/table_extraction_bench/fintabnet/fintabnet_processed.

Table Transformer Model and Weights

  1. If you want to use the Table Transformer model:
  • Clone the repo into the doceval folder:
    git clone https://github.com/microsoft/table-transformer.git
  • Download the weights for the detection model here and the structure model here.
  • Save the weights in the model_weights/table-transformer folder with the following names:
    • detection_pubtables1m.pth
    • extraction_fintabnet.pth

Results

The evaluation results, including performance metrics and visualizations, will be stored in the results directory. The results are stored in a JSON file, and if the visualize flag is set, the resulting jpgs with bboxes will be saved in the benchmark directory. For table detection, follow the directions here.

If you want to skip OCR and just load results from a file, you can download these files (100 samples each) and put them in data/ocr_results:

  • Azure_layout_results.pkl
  • Surya_layout_results.pkl
  • text_extraction_Azure_results.pkl
  • text_extraction_Surya_results.pkl
  • table_detection_Azure_results.pkl
  • table_detection_Table_Transformer_results.pkl

Layout

Azure Average:

| Metric    | Figures | Tables | Text  | Titles | Total |
|-----------|---------|--------|-------|--------|-------|
| Precision | 1.0     | 1.0    | 0.897 | 0.955  | 0.914 |
| Recall    | 1.0     | 1.0    | 0.941 | 0.929  | 0.941 |

Surya Average:

| Metric    | Figures | Tables | Text  | Titles | Total |
|-----------|---------|--------|-------|--------|-------|
| Precision | 0.956   | 1.0    | 0.917 | 0.941  | 0.925 |
| Recall    | 1.0     | 1.0    | 0.959 | 0.893  | 0.947 |

Text Extraction

| Model | Text Similarity |
|-------|-----------------|
| Azure | 0.957           |
| Surya | 0.934           |

Table Detection

| Model             | Average Precision (50%) | Average Recall (50%) |
|-------------------|-------------------------|----------------------|
| Table Transformer | 1.0                     | 1.0                  |
| Azure             | 0.992                   | 1.0                  |

Table Extraction

Additional Information

Tasks

Layout Analysis

Layout analysis focuses on identifying and extracting the structural elements of a document, such as text blocks, figures, tables, and headings. The metrics currently supported are precision and recall under a default of 50% coverage. Challenges to canonicalization include differences in layout categories and how they can be accessed, differences in bbox units/format, and the amount of text contained within a text bbox. The only dataset currently supported is PubLayNet; the subset was downloaded from here.
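
As a rough illustration of precision and recall under a 50% coverage threshold, the sketch below scores boxes by the fraction of their area covered by a box from the other set; the framework's per-category matching logic may differ in the details.

```python
def frac_covered(box, other):
    """Fraction of `box`'s area covered by `other`. Boxes are [x0, y0, x1, y1]."""
    ix0, iy0 = max(box[0], other[0]), max(box[1], other[1])
    ix1, iy1 = min(box[2], other[2]), min(box[3], other[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def precision_recall(preds, gts, threshold=0.5):
    # A prediction is a true positive if some ground-truth box covers at least
    # `threshold` of it; a ground-truth box is recalled if some prediction
    # covers at least `threshold` of it. Simplified relative to the framework.
    tp_pred = sum(any(frac_covered(p, g) >= threshold for g in gts) for p in preds)
    tp_gt = sum(any(frac_covered(g, p) >= threshold for p in preds) for g in gts)
    precision = tp_pred / len(preds) if preds else 0.0
    recall = tp_gt / len(gts) if gts else 0.0
    return precision, recall

# Example: one exact match and one spurious prediction.
print(precision_recall([[0, 0, 10, 10], [50, 50, 60, 60]], [[0, 0, 10, 10]]))  # (0.5, 1.0)
```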

Text Extraction

Text extraction involves recognizing and extracting the textual content from an image or document. This task is fundamental to OCR and enables the conversion of visual information into machine-readable text. Challenges to canonicalization include differences in the amount of text extracted (some models ignore headers, for example) and how reading order can affect text similarity metrics. We use the Smith-Waterman algorithm and fuzzy string matching to mitigate this. The only dataset currently supported was made by Vik Paruchuri here.
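
As a rough stand-in for the kind of similarity score reported here, the sketch below uses Python's stdlib difflib; note this is not the repo's actual Smith-Waterman plus fuzzy-matching pipeline, which is more robust to reading-order differences.

```python
from difflib import SequenceMatcher

# Rough illustration of a text-similarity score between predicted and
# ground-truth text; not the framework's actual metric.
def text_similarity(pred_text: str, gt_text: str) -> float:
    pred = " ".join(pred_text.split()).lower()
    gt = " ".join(gt_text.split()).lower()
    return SequenceMatcher(None, pred, gt).ratio()

print(text_similarity("Hello wor1d", "Hello world"))  # ~0.91
```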

Table Detection

Table detection aims to locate and identify the presence of tables within a document. The metrics currently supported are precision and recall under a default of 50% coverage. In addition to differences in bbox calculations, one canonicalization issue stems from detecting tables split across consecutive pages. The only dataset currently supported is PubTables-1M; follow the directions here to download it. I only downloaded the test set.

Table Extraction

Table extraction focuses on extracting the content and structure of identified tables. This task involves recognizing table cells, rows, columns, and their relationships. Specifically, we focus on the three aspects of table extraction detailed in GriTS: location, topology, and content. Cell topology recognition considers the layout of the cells, specifically the rows and columns each cell occupies over a two-dimensional grid. Cell content recognition considers the layout of cells and the text content of each cell. Cell location recognition considers the layout of cells and the absolute coordinates of each cell within a document.

This is the hardest output to standardize: not all models extract table details such as the row and column span of a cell, models can make different decisions about what constitutes a row or column (both of which may be correct interpretations), and models differ in the size of the table cells they predict. Some models default to predicting a grid while others predict smaller bboxes around the cell content. Aligning cells in a way that allows for a fair comparison of table contents is also challenging. The only dataset currently supported is FinTabNet from IBM; follow the directions here to download it. I only downloaded the test set.
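
As a very simplified illustration of content comparison, the sketch below assumes both tables have already been expanded onto a rows-by-columns grid of cell text. This is not GriTS itself, which aligns the two grids with a two-dimensional generalization of longest common subsequence and handles spanning cells and partial credit.

```python
# Simplified cell-content comparison on pre-aligned grids; the real GriTS
# metrics handle grid alignment, spanning cells, and partial credit.
def grid_content_accuracy(pred_grid, gt_grid):
    correct = 0
    for r in range(min(len(pred_grid), len(gt_grid))):
        for c in range(min(len(pred_grid[r]), len(gt_grid[r]))):
            if pred_grid[r][c].strip() == gt_grid[r][c].strip():
                correct += 1
    pred_cells = sum(len(row) for row in pred_grid)
    gt_cells = sum(len(row) for row in gt_grid)
    # Dividing by the larger grid penalizes extra or missing cells.
    return correct / max(pred_cells, gt_cells, 1)

pred = [["Year", "Revenue"], ["2021", "$10M"], ["2022", "$12M"]]
gt   = [["Year", "Revenue"], ["2021", "$10M"], ["2022", "$13M"]]
print(grid_content_accuracy(pred, gt))  # 5 of 6 cells match -> ~0.83
```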

To Do

  • Make evaluations faster
  • Improve table extraction
  • Add models, datasets, and evaluation tasks

Acknowledgements

This repo builds upon work in Surya and Microsoft's Table Transformer.
