Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
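To make the word-patch alignment (WPA) objective concrete, here is a minimal, hypothetical sketch (not the authors' implementation): for each text token that is itself unmasked, a binary head predicts whether the image patch covering that token was masked. All tensor names and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordPatchAlignmentHead(nn.Module):
    """Illustrative WPA head: classify, per unmasked text token,
    whether its corresponding image patch was masked ("unaligned")."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # 0 = aligned, 1 = unaligned

    def forward(self, word_states, word_to_patch, patch_is_masked, word_is_masked):
        # word_states:     (num_words, hidden_size) encoder outputs for text tokens
        # word_to_patch:   (num_words,) index of the image patch covering each word
        # patch_is_masked: (num_patches,) bool, True where the patch was masked
        # word_is_masked:  (num_words,) bool, True where the text token was masked
        logits = self.classifier(word_states)           # (num_words, 2)
        labels = patch_is_masked[word_to_patch].long()  # 1 if the covering patch is masked
        keep = ~word_is_masked                          # score only unmasked text tokens
        return nn.functional.cross_entropy(logits[keep], labels[keep])

# Tiny smoke test with random tensors (shapes are arbitrary).
head = WordPatchAlignmentHead(hidden_size=768)
loss = head(torch.randn(50, 768),
            torch.randint(0, 196, (50,)),
            torch.rand(196) < 0.4,
            torch.rand(50) < 0.3)
print(loss.item())
```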
```bash
conda create --name layoutlmv3 python=3.7
conda activate layoutlmv3
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutlmv3
pip install -r requirements.txt
# install pytorch and torchvision; refer to https://pytorch.org/get-started/locally/
pip install torch==1.10.0+cu111 torchvision==0.11.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# install detectron2; refer to https://detectron2.readthedocs.io/en/latest/tutorials/install.html
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html
pip install -e .
```
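Optionally, run a quick sanity check of the environment after installation (a minimal sketch; the printed version strings will depend on the wheels you picked above):

```python
import torch
import torchvision
import detectron2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
```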
| Model | Model Name (Path) |
|---|---|
| layoutlmv3-base | microsoft/layoutlmv3-base |
| layoutlmv3-large | microsoft/layoutlmv3-large |
| layoutlmv3-base-chinese | microsoft/layoutlmv3-base-chinese |
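These names are Hugging Face Hub model identifiers, so the checkpoints can also be loaded through the transformers library. Below is a minimal sketch for the English base checkpoint; whether the Chinese checkpoint ships compatible processor files is an assumption we do not rely on here.

```python
from transformers import AutoModel, AutoProcessor

# apply_ocr=True lets the processor run Tesseract OCR on raw images
# (requires pytesseract); set it to False to supply words/boxes yourself.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
print(type(model).__name__, model.config.hidden_size)  # LayoutLMv3Model 768
```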
We provide some fine-tuned models and their train/test logs. For example, to fine-tune and evaluate LayoutLMv3 on FUNSD for form understanding:
- Train

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
    --dataset_name funsd \
    --do_train --do_eval \
    --model_name_or_path microsoft/layoutlmv3-base \
    --output_dir /path/to/layoutlmv3-base-finetuned-funsd \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 100 \
    --learning_rate 1e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --dataloader_num_workers 8
  ```
- Test

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
    --dataset_name funsd \
    --do_eval \
    --model_name_or_path HYPJUDY/layoutlmv3-base-finetuned-funsd \
    --output_dir /path/to/layoutlmv3-base-finetuned-funsd \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --dataloader_num_workers 8
  ```
| Model on FUNSD | precision | recall | f1 |
|---|---|---|---|
| layoutlmv3-base-finetuned-funsd | 0.8955 | 0.9165 | 0.9059 |
| layoutlmv3-large-finetuned-funsd | 0.9219 | 0.9210 | 0.9215 |
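As a rough illustration of using one of these fine-tuned checkpoints for inference through transformers (a sketch, not the repository's evaluation pipeline; it assumes pytesseract is installed for OCR and that the checkpoint's config stores usable label names, which may instead be generic LABEL_i):

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "HYPJUDY/layoutlmv3-base-finetuned-funsd")

image = Image.open("form.png").convert("RGB")      # any scanned form image
encoding = processor(image, return_tensors="pt", truncation=True)
outputs = model(**encoding)

pred_ids = outputs.logits.argmax(-1).squeeze().tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze().tolist())
for token, pred in list(zip(tokens, pred_ids))[:20]:
    print(token, model.config.id2label[pred])
```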
For document layout analysis on PubLayNet, please follow unilm/dit/object_detection to prepare the data and to read more details about this task. Then, in the layoutlmv3/examples/object_detection folder:
- Train

  First download the pre-trained model to /path/to/microsoft/layoutlmv3-base, then run:

  ```bash
  python train_net.py --config-file cascade_layoutlmv3.yaml --num-gpus 16 \
    MODEL.WEIGHTS /path/to/microsoft/layoutlmv3-base/pytorch_model.bin \
    OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
  ```
- Test

  If you want to test the layoutlmv3-base-finetuned-publaynet model, download it to /path/to/layoutlmv3-base-finetuned-publaynet, then run:

  ```bash
  python train_net.py --config-file cascade_layoutlmv3.yaml --eval-only --num-gpus 8 \
    MODEL.WEIGHTS /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pth \
    OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
  ```
| Model on PubLayNet (mAP) | Text | Title | List | Table | Figure | Overall |
|---|---|---|---|---|---|---|
| layoutlmv3-base-finetuned-publaynet | 94.5 | 90.6 | 95.5 | 97.9 | 97.0 | 95.1 |
An example of training and evaluating the LayoutLMv3 Chinese model on XFUND.
Download the Chinese data in XFUND from this link. The resulting directory structure looks like the following:
```
│── data
│   ├── zh.train.json
│   ├── zh.val.json
│   └── images
│       ├── zh_train_*.jpg
│       └── zh_val_*.jpg
```
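A small sketch for inspecting the downloaded annotations; the field names below follow the public XFUND release and should be treated as assumptions if your copy differs:

```python
import json

with open("data/zh.train.json", encoding="utf-8") as f:
    data = json.load(f)

doc = data["documents"][0]                      # one annotated page
print(doc["img"]["fname"], doc["img"]["width"], doc["img"]["height"])
for segment in doc["document"][:5]:             # first few labeled text segments
    print(segment["label"], segment["box"], segment["text"])
```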
- Train

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
    --data_dir data --language zh \
    --do_train --do_eval \
    --model_name_or_path microsoft/layoutlmv3-base-chinese \
    --output_dir path/to/output \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 20 \
    --learning_rate 7e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --dataloader_num_workers 8
  ```
- Test

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
    --data_dir data --language zh \
    --do_eval \
    --model_name_or_path path/to/model \
    --output_dir /path/to/output \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --dataloader_num_workers 8
  ```
| Pre-trained Model | precision | recall | f1 |
|---|---|---|---|
| layoutlmv3-base-chinese | 0.8980 | 0.9435 | 0.9202 |
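Since Tesseract-based OCR is generally not used for these Chinese documents, the transformers processor can instead be given pre-extracted words and boxes directly. The following is a sketch against the English base checkpoint with a randomly initialized classification head; the label count, file name, and example words are placeholders, and boxes must already be normalized to the 0-1000 range LayoutLM models expect.

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# apply_ocr=False: we supply the words and their (0-1000 normalized) boxes ourselves.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7)  # placeholder label count

image = Image.open("data/images/zh_train_0.jpg").convert("RGB")  # any page image
words = ["姓名", "张三"]                                # placeholder tokens
boxes = [[110, 60, 220, 90], [240, 60, 330, 90]]        # one box per word

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```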
We also fine-tune the LayoutLMv3 Chinese model on EPHOIE for reference.
| Pre-trained Model | Subject | Test Time | Name | School | Examination Number | Seat Number | Class | Student Number | Grade | Score | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| layoutlmv3-base-chinese | 98.99 | 100 | 99.77 | 99.2 | 100 | 100 | 98.82 | 99.78 | 98.31 | 97.27 | 99.21 |
If you find LayoutLMv3 helpful, please cite us:
@inproceedings{huang2022layoutlmv3,
author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
Portions of the source code are based on the transformers, layoutlmv2, layoutlmft, beit, dit and Detectron2 projects. We sincerely thank them for their contributions!
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
For help or issues using LayoutLMv3, please email Yupan Huang or submit a GitHub issue.
For other communications related to LayoutLM, please contact Lei Cui or Furu Wei.