Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.
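To make the word-patch alignment (WPA) objective concrete, here is a minimal, hypothetical sketch (not the authors' implementation): for each text token that is itself unmasked, a binary head predicts whether the image patch covering that token was masked. All tensor names and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WordPatchAlignmentHead(nn.Module):
    """Illustrative WPA head: classify, per unmasked text token,
    whether its corresponding image patch was masked ("unaligned")."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # 0 = aligned, 1 = unaligned

    def forward(self, word_states, word_to_patch, patch_is_masked, word_is_masked):
        # word_states:     (num_words, hidden_size) encoder outputs for text tokens
        # word_to_patch:   (num_words,) index of the image patch covering each word
        # patch_is_masked: (num_patches,) bool, True where the patch was masked
        # word_is_masked:  (num_words,) bool, True where the text token was masked
        logits = self.classifier(word_states)           # (num_words, 2)
        labels = patch_is_masked[word_to_patch].long()  # 1 if the covering patch is masked
        keep = ~word_is_masked                          # score only unmasked text tokens
        return nn.functional.cross_entropy(logits[keep], labels[keep])

# Tiny smoke test with random tensors (shapes are arbitrary).
head = WordPatchAlignmentHead(hidden_size=768)
loss = head(torch.randn(50, 768),
            torch.randint(0, 196, (50,)),
            torch.rand(196) < 0.4,
            torch.rand(50) < 0.3)
print(loss.item())
```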
```bash
conda create --name layoutlmv3 python=3.7
conda activate layoutlmv3
git clone https://github.com/microsoft/unilm.git
cd unilm/layoutlmv3
pip install -r requirements.txt
# install pytorch and torchvision; refer to https://pytorch.org/get-started/locally/
pip install torch==1.10.0+cu111 torchvision==0.11.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# install detectron2; refer to https://detectron2.readthedocs.io/en/latest/tutorials/install.html
python -m pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html
pip install -e .
```
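Optionally, run a quick sanity check of the environment after installation (a minimal sketch; the printed version strings will depend on the wheels you picked above):

```python
import torch
import torchvision
import detectron2

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
```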
| Model | Model Name (Path) |
|---|---|
| layoutlmv3-base | microsoft/layoutlmv3-base |
| layoutlmv3-large | microsoft/layoutlmv3-large |
| layoutlmv3-base-chinese | microsoft/layoutlmv3-base-chinese |
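These names are Hugging Face Hub model identifiers, so the checkpoints can also be loaded through the transformers library. Below is a minimal sketch for the English base checkpoint; whether the Chinese checkpoint ships compatible processor files is an assumption we do not rely on here.

```python
from transformers import AutoModel, AutoProcessor

# apply_ocr=True lets the processor run Tesseract OCR on raw images
# (requires pytesseract); set it to False to supply words/boxes yourself.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
print(type(model).__name__, model.config.hidden_size)  # LayoutLMv3Model 768
```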
We provide some fine-tuned models and their train/test logs. For example, to fine-tune and evaluate LayoutLMv3 on FUNSD for form understanding:
- Train

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
    --dataset_name funsd \
    --do_train --do_eval \
    --model_name_or_path microsoft/layoutlmv3-base \
    --output_dir /path/to/layoutlmv3-base-finetuned-funsd \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 100 \
    --learning_rate 1e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --dataloader_num_workers 8
  ```
- Test

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_funsd_cord.py \
    --dataset_name funsd \
    --do_eval \
    --model_name_or_path HYPJUDY/layoutlmv3-base-finetuned-funsd \
    --output_dir /path/to/layoutlmv3-base-finetuned-funsd \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --dataloader_num_workers 8
  ```
| Model on FUNSD | precision | recall | f1 |
|---|---|---|---|
| layoutlmv3-base-finetuned-funsd | 0.8955 | 0.9165 | 0.9059 |
| layoutlmv3-large-finetuned-funsd | 0.9219 | 0.9210 | 0.9215 |
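As a rough illustration of using one of these fine-tuned checkpoints for inference through transformers (a sketch, not the repository's evaluation pipeline; it assumes pytesseract is installed for OCR and that the checkpoint's config stores usable label names, which may instead be generic LABEL_i):

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "HYPJUDY/layoutlmv3-base-finetuned-funsd")

image = Image.open("form.png").convert("RGB")      # any scanned form image
encoding = processor(image, return_tensors="pt", truncation=True)
outputs = model(**encoding)

pred_ids = outputs.logits.argmax(-1).squeeze().tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(encoding["input_ids"].squeeze().tolist())
for token, pred in list(zip(tokens, pred_ids))[:20]:
    print(token, model.config.id2label[pred])
```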
For document layout analysis on PubLayNet, please follow unilm/dit/object_detection to prepare the data and to read more details about this task. Then, in the layoutlmv3/examples/object_detection folder:
- Train

  First download the pre-trained model to /path/to/microsoft/layoutlmv3-base, then run:

  ```bash
  python train_net.py --config-file cascade_layoutlmv3.yaml --num-gpus 16 \
    MODEL.WEIGHTS /path/to/microsoft/layoutlmv3-base/pytorch_model.bin \
    OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
  ```
- Test

  If you want to test the layoutlmv3-base-finetuned-publaynet model, download it to /path/to/layoutlmv3-base-finetuned-publaynet, then run:

  ```bash
  python train_net.py --config-file cascade_layoutlmv3.yaml --eval-only --num-gpus 8 \
    MODEL.WEIGHTS /path/to/layoutlmv3-base-finetuned-publaynet/model_final.pth \
    OUTPUT_DIR /path/to/layoutlmv3-base-finetuned-publaynet
  ```
| Model on PubLayNet (mAP) | Text | Title | List | Table | Figure | Overall |
|---|---|---|---|---|---|---|
| layoutlmv3-base-finetuned-publaynet | 94.5 | 90.6 | 95.5 | 97.9 | 97.0 | 95.1 |
An example of training and evaluating the LayoutLMv3 Chinese model on XFUND.
Download the Chinese data in XFUND from this link. The resulting directory structure looks like the following:
```
│── data
│   ├── zh.train.json
│   ├── zh.val.json
│   └── images
│       ├── zh_train_*.jpg
│       └── zh_val_*.jpg
```
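A small sketch for inspecting the downloaded annotations; the field names below follow the public XFUND release and should be treated as assumptions if your copy differs:

```python
import json

with open("data/zh.train.json", encoding="utf-8") as f:
    data = json.load(f)

doc = data["documents"][0]                      # one annotated page
print(doc["img"]["fname"], doc["img"]["width"], doc["img"]["height"])
for segment in doc["document"][:5]:             # first few labeled text segments
    print(segment["label"], segment["box"], segment["text"])
```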
- Train

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
    --data_dir data --language zh \
    --do_train --do_eval \
    --model_name_or_path microsoft/layoutlmv3-base-chinese \
    --output_dir path/to/output \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --max_steps 1000 --save_steps -1 --evaluation_strategy steps --eval_steps 20 \
    --learning_rate 7e-5 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 \
    --dataloader_num_workers 8
  ```
- Test

  ```bash
  python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port 4398 examples/run_xfund.py \
    --data_dir data --language zh \
    --do_eval \
    --model_name_or_path path/to/model \
    --output_dir /path/to/output \
    --segment_level_layout 1 --visual_embed 1 --input_size 224 \
    --dataloader_num_workers 8
  ```
| Pre-trained Model | precision | recall | f1 |
|---|---|---|---|
| layoutlmv3-base-chinese | 0.8980 | 0.9435 | 0.9202 |
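Since Tesseract-based OCR is generally not used for these Chinese documents, the transformers processor can instead be given pre-extracted words and boxes directly. The following is a sketch against the English base checkpoint with a randomly initialized classification head; the label count, file name, and example words are placeholders, and boxes must already be normalized to the 0-1000 range LayoutLM models expect.

```python
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# apply_ocr=False: we supply the words and their (0-1000 normalized) boxes ourselves.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7)  # placeholder label count

image = Image.open("data/images/zh_train_0.jpg").convert("RGB")  # any page image
words = ["姓名", "张三"]                                # placeholder tokens
boxes = [[110, 60, 220, 90], [240, 60, 330, 90]]        # one box per word

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```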
We also fine-tune the LayoutLMv3 Chinese model on EPHOIE for reference.
| Pre-trained Model | Subject | Test Time | Name | School | Examination Number | Seat Number | Class | Student Number | Grade | Score | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| layoutlmv3-base-chinese | 98.99 | 100 | 99.77 | 99.2 | 100 | 100 | 98.82 | 99.78 | 98.31 | 97.27 | 99.21 |
If you find LayoutLMv3 helpful, please cite us:
@inproceedings{huang2022layoutlmv3,
author={Yupan Huang and Tengchao Lv and Lei Cui and Yutong Lu and Furu Wei},
title={LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
Portions of the source code are based on the transformers, layoutlmv2, layoutlmft, beit, dit and Detectron2 projects. We sincerely thank them for their contributions!
The content of this project itself is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
For help or issues using LayoutLMv3, please email Yupan Huang or submit a GitHub issue.
For other communications related to LayoutLM, please contact Lei Cui or Furu Wei.