[Update] Update MAE Pretraining part
Mountchicken committed Jun 2, 2023
1 parent a2c174f commit 0b37d73
Showing 9 changed files with 185 additions and 54 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -7,4 +7,6 @@ baselines/
dataset_funcs/
mmocr-dev-1.x/work_dirs
add_data/
mmocr-0.x/
mmocr-0.x/
mae/output_dir
*.pyc
90 changes: 60 additions & 30 deletions README.md
@@ -1,13 +1,24 @@
# Union14M Dataset
<div align=center>

# Rethinking Scene Text Recognition: A Data Perspective

</div>
<div align=center>
<img src='github/cover.png' width=600 >
</div>
<div align=center>
<p>Union14M is a large scene text recognition (STR) dataset collected from 17 publicly available datasets. It contains 4M labeled images (Union14M-L) and 10M unlabeled images (Union14M-U), and is intended to provide a deeper analysis for the STR community.</p>

<div align=center>

[![arXiv preprint](https://img.shields.io/badge/arXiv-2207.06966-b31b1b)](https://arxiv.org/abs/2207.06966) [![Gradio demo](https://img.shields.io/badge/%F0%9F%A4%97%20demo-Gradio-ff7c00)](https://huggingface.co/spaces/baudm/PARSeq-OCR) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bipinKrishnan/fastai_course/blob/master/bear_classifier.ipynb)


</div>


</div>
<p align="center">
<strong><a href="#sota">arXiv </a></strong> •
<strong><a href="#1-introduction">Introduction </a></strong> •
<strong><a href="#34-download">Download </a></strong> •
<strong><a href="#5-maerec">MAERec</a></strong> •
@@ -26,21 +37,26 @@
- To explore the challenges that STR models still face, we consolidate a large-scale STR dataset for analysis and identify seven open challenges. Furthermore, we propose a challenge-driven benchmark to facilitate the future development of STR. Additionally, we reveal that utilizing massive unlabeled data through self-supervised pre-training can remarkably enhance the performance of STR models in real-world scenarios, suggesting a practical solution for STR from a data perspective. We hope this work can spark future research beyond the realm of existing data paradigms.

## 2. Contents
- [1. Introduction](#1-introduction)
- [2. Contents](#2-contents)
- [3. Union14M Dataset](#3-union14m-dataset)
  - [3.1. Union14M-L](#31-union14m-l)
  - [3.2. Union14M-U](#32-union14m-u)
  - [3.3. Union14M-Benchmark](#33-union14m-benchmark)
  - [3.4. Download](#34-download)
- [4. STR Models trained on Union14M-L](#4-str-models-trained-on-union14m-l)
  - [4.1. Checkpoints](#41-checkpoints)
- [5. MAERec](#5-maerec)
  - [5.1. Pre-training](#51-pre-training)
  - [5.2. Fine-tuning](#52-fine-tuning)
  - [5.3 Inferencing](#53-inferencing)
- [6. QAs](#6-qas)
- [7. License](#7-license)
- [Rethinking Scene Text Recognition: A Data Perspective](#rethinking-scene-text-recognition-a-data-perspective)
  - [1. Introduction](#1-introduction)
  - [2. Contents](#2-contents)
  - [3. Union14M Dataset](#3-union14m-dataset)
    - [3.1. Union14M-L](#31-union14m-l)
    - [3.2. Union14M-U](#32-union14m-u)
    - [3.3. Union14M-Benchmark](#33-union14m-benchmark)
    - [3.4. Download](#34-download)
  - [4. STR Models trained on Union14M-L](#4-str-models-trained-on-union14m-l)
    - [4.1. Checkpoints](#41-checkpoints)
  - [5. MAERec](#5-maerec)
    - [5.1. Pre-training](#51-pre-training)
    - [5.2. Fine-tuning](#52-fine-tuning)
    - [5.3. Evaluation](#53-evaluation)
    - [5.4. Inferencing](#54-inferencing)
    - [5.5. ONNX Conversion](#55-onnx-conversion)
  - [6. QAs](#6-qas)
  - [7. License](#7-license)
  - [8. Acknowledgement](#8-acknowledgement)
  - [9. Citation](#9-citation)

## 3. Union14M Dataset
### 3.1. Union14M-L
@@ -73,6 +89,7 @@
| Union14M-U (36.63GB) | [Google Drive]() | [Baidu Netdisk]() |
| 6 Common Benchmarks (17.6MB) | [Google Drive]() | [Baidu Netdisk](https://pan.baidu.com/s/1XifQS0v-0YxEXkGTfWMDWQ?pwd=35cz) |

<!-- TODO: Add Google Drive Links -->

- The structure of Union14M is organized as follows:

@@ -109,7 +126,7 @@
<details close>
<summary><strong>Structure of Union14M-U</strong></summary>

We store images in LMDB format, and the structure of Union14M-U is organized as below. Here is an example of using it: [LMDB Example]()
We store images in [LMDB](https://github.com/Mountchicken/Efficient-Deep-Learning/blob/main/Efficient_DataProcessing.md#21-efficient-data-storage-methods) format, and the structure of Union14M-U is organized as below. Here is an example of using it: [LMDB Example]()
```text
|--Union14M-U
  |--book32_lmdb
@@ -122,7 +139,7 @@
- We train several STR models on Union14M-L using [MMOCR-1.0](https://github.com/open-mmlab/mmocr/tree/dev-1.x)

### 4.1. Checkpoints
- Evaluated on both common benchmarks and Union14M-Benchmark. Accuracy (WAICS, word accuracy ignoring case and symbols) in $\color{grey}{grey}$ is from the original implementation (trained on synthetic datasets), and accuracy in $\color{green}{green}$ is from training on Union14M-L. Our models are trained to predict **upper- and lower-case text, symbols, and spaces.**
- Evaluated on both common benchmarks and Union14M-Benchmark. Accuracy (WAICS, word accuracy ignoring case and symbols) in $\color{grey}{grey}$ is from the original implementation (trained on synthetic datasets), and accuracy in $\color{green}{green}$ is from training on Union14M-L. All the re-trained models are trained to predict **upper- and lower-case text, symbols, and spaces.**

| Models | Checkpoint | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CUTE80 | Avg. |
| :---------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: | :--------------------------------------------: |
@@ -155,29 +172,42 @@


### 5.1. Pre-training
- Pre-trained ViT
- ViT pretrained on Union14M-U.

| Variants | Input Size | Patch Size | Embedding | Depth | Heads | Parameters | Download |
| --------- | ---------- | ---------- | --------- | ----- | ----- | ---------- | --------------------------------------------------------------------------------------- |
| ViT-Small | 32x128 | 4x4 | 384 | 12 | 6 | | [Google Drive]() / [BaiduYun](https://pan.baidu.com/s/1nZL5veMyWhxpk8DGj0UZMw?pwd=xecv) |
| ViT-Base | 32x128 | 4x4 | 768 | 12 | 12 | | [Google Drive]() / [BaiduYun](https://pan.baidu.com/s/17CjAOV-1kf1__a2RBo9NUg?pwd=3rvx) |
| Variants | Input Size | Patch Size | Embedding | Depth | Heads | Parameters | Download |
| -------- | ---------- | ---------- | --------- | ----- | ----- | ---------- | --------------------------------------------------------------------------------------- |
| ViT-S | 32x128 | 4x4 | 384 | 12 | 6 | 21M | [Google Drive]() / [BaiduYun](https://pan.baidu.com/s/1nZL5veMyWhxpk8DGj0UZMw?pwd=xecv) |
| ViT-B | 32x128 | 4x4 | 768 | 12 | 12 | 85M | [Google Drive]() / [BaiduYun](https://pan.baidu.com/s/17CjAOV-1kf1__a2RBo9NUg?pwd=3rvx) |
- If you want to pre-train the ViT backbone on your own dataset, check [pre-training](docs/pretrain.md)

<!-- TODO: Add Google Drive Link -->
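For reference, the two variants in the table can be expressed as plain timm `VisionTransformer` configurations. The sketch below is an assumed illustration that only mirrors the hyper-parameters listed above; it is not the repository's exact model code:

```python
# Minimal sketch of the two backbone variants above in timm terms
# (an assumed illustration, not the repository's own model definition).
# timm's VisionTransformer accepts a (height, width) img_size tuple.
from timm.models.vision_transformer import VisionTransformer

vit_small = VisionTransformer(
    img_size=(32, 128), patch_size=4,       # 8 x 32 = 256 patches per image
    embed_dim=384, depth=12, num_heads=6,   # ~21M parameters
    mlp_ratio=4, qkv_bias=True,
    num_classes=0)                          # no classification head

vit_base = VisionTransformer(
    img_size=(32, 128), patch_size=4,
    embed_dim=768, depth=12, num_heads=12,  # ~85M parameters
    mlp_ratio=4, qkv_bias=True,
    num_classes=0)
```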

### 5.2. Fine-tuning
- Fine-tuned MAERec
- MAERec finetuned on Union14M-L

| Variants | Acc on Common Benchmarks | Acc on Union14M-Benchmarks | Download |
| ------------ | ------------------------ | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| MAERec-Small | 95.1 | 78.6 | [Google Drive](https://drive.google.com/file/d/1dKLS_r3_ysWK155pSmkm7NBf5ALsEJYd/view?usp=sharing) / [BaiduYun](https://pan.baidu.com/s/1wFhLQLrn9dm77TMpdxyNAg?pwd=trg4) |
| MAERec-Base | 96.2 | 85.2 | [Google Drive](https://drive.google.com/file/d/13E0cmvksKwvjNuR62xZhwkg8eQJfb_Hp/view?usp=sharing) / [BaiduYun](https://pan.baidu.com/s/1EhoJ-2WqkzOQFCNg55-KcA?pwd=5yx1) |
| Variants | Acc on Common Benchmarks | Acc on Union14M-Benchmarks | Download |
| -------- | ------------------------ | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| MAERec-S | 95.1 | 78.6 | [Google Drive](https://drive.google.com/file/d/1dKLS_r3_ysWK155pSmkm7NBf5ALsEJYd/view?usp=sharing) / [BaiduYun](https://pan.baidu.com/s/1wFhLQLrn9dm77TMpdxyNAg?pwd=trg4) |
| MAERec-B | 96.2 | 85.2 | [Google Drive](https://drive.google.com/file/d/13E0cmvksKwvjNuR62xZhwkg8eQJfb_Hp/view?usp=sharing) / [BaiduYun](https://pan.baidu.com/s/1EhoJ-2WqkzOQFCNg55-KcA?pwd=5yx1) |

- If you want to fine-tune MAERec on your own dataset, check [fine-tuning](docs/finetune.md)

### 5.3 Inferencing
### 5.3. Evaluation
- If you want to evaluate MAERec on benchmarks, check [evaluation](docs/evaluation.md)

### 5.4. Inferencing
- If you want to run inference with MAERec on your own images, check [inferencing](docs/inferencing.md)


### 5.5. ONNX Conversion

## 6. QAs


## 7. License
- The repository is released under the [MIT license](LICENSE).

## 8. Acknowledgement
- We sincerely thank the creators of the 17 datasets used in Union14M, as well as the developers of MMOCR, a powerful toolbox for OCR research.

## 9. Citation
60 changes: 43 additions & 17 deletions docs/pretrain.md
@@ -1,29 +1,54 @@
## Pre-training Using MAE
We adopt the framework of [MAE](https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Learners_CVPR_2022_paper.html) for pre-training. The code is heavily borrowed from [Masked Autoencoders: A PyTorch Implementation](https://github.com/facebookresearch/mae).

### 1. Install
### 1. Installation
```bash
conda create -n mae python=3.7
cd mae/
conda create -n mae python=3.8
conda activate mae
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```
- **Attention**: This repo is based on `timm==0.3.2`, for which a [fix](https://github.com/huggingface/pytorch-image-models/issues/420#issuecomment-776459842) is needed to work with PyTorch 1.8.1+.
- **Attention**: The pre-training code is based on `timm==0.3.2`, for which a [fix](https://github.com/huggingface/pytorch-image-models/issues/420#issuecomment-776459842) is needed to work with PyTorch 1.8.1+. Add the below code to `timm/models/layers/helpers.py`:
```python
import torch

TORCH_MAJOR = int(torch.__version__.split('.')[0])
TORCH_MINOR = int(torch.__version__.split('.')[1])

if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
    from torch._six import container_abcs
else:
    import collections.abc as container_abcs
```

### 2. Prepare dataset
- You need to prepare the dataset(s) in torchvision.datasets.ImageFolder format. The basic structure of the dataset is as follows:
```text
|--dataset
  |--subfolder1
    |--image1.jpg
    |--image2.jpg
    |--...
  |--subfolder2
    |--image1.jpg
    |--image2.jpg
    |--...
```
- You can also use Union14M-U for pre-training, which is organized in ImageFolder format.

### 2. Prepare dataset
- We support two types of datasets: ImageFolder and LMDB.
- torchvision.datasets.ImageFolder format:
```text
|--dataset
  |--book32
    |--image1.jpg
    |--image2.jpg
    |--...
  |--openvino
    |--image1.jpg
    |--image2.jpg
    |--...
```
- LMDB format. To learn more about the LMDB structure and how to create an LMDB, see this [repo](https://github.com/Mountchicken/Efficient-Deep-Learning/blob/main/Efficient_DataProcessing.md#21-efficient-data-storage-methods).
```text
|--dataset
  |--book32
    |--data.mdb
    |--lock.mdb
  |--openvino
    |--data.mdb
    |--lock.mdb
  |--cc
    |--data.mdb
    |--lock.mdb
```
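If you need to build such an LMDB from a folder of images yourself, the sketch below shows one way to do it, matching the key layout that `mae/datasets/lmdb_dataset.py` reads (a `num-samples` entry plus 1-indexed `image-%09d` keys). The `build_lmdb` helper and the paths are hypothetical examples:

```python
# Minimal sketch (not the repository's own tooling): pack raw image bytes
# into an LMDB with the key layout expected by the pre-training code.
import os
import lmdb

def build_lmdb(image_dir: str, lmdb_path: str):
    # map_size is an upper bound on the database size (1 TB here)
    env = lmdb.open(lmdb_path, map_size=1099511627776)
    files = sorted(f for f in os.listdir(image_dir)
                   if f.lower().endswith(('.jpg', '.jpeg', '.png')))
    with env.begin(write=True) as txn:
        for i, name in enumerate(files, start=1):  # image keys are 1-indexed
            with open(os.path.join(image_dir, name), 'rb') as fp:
                txn.put(b'image-%09d' % i, fp.read())
        txn.put(b'num-samples', str(len(files)).encode())
    env.close()

build_lmdb('Union14M-U/book32', 'dataset/book32')  # hypothetical paths
```

For millions of images, committing in batches (one transaction per few thousand puts) keeps memory usage bounded.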

### 3. Pre-training
- Pre-training ViT-Small on Union14M-U with 4 GPUs:
@@ -38,8 +63,9 @@ pip install -r requirements.txt
--norm_pix_loss \
--blr 1.5e-4 \
--weight_decay 0.05 \
--data_path Union14M-U/book32 Union14M-U/openvino /Union14M-U/CC
--data_path ../data/Union14M-U/book32_lmdb ../data/Union14M-U/cc_lmdb ../data/Union14M-U/openvino_lmdb
```
- To pretrain ViT-Base, use `--model mae_vit_base_patch4`.
- Here the effective batch size is 256 (batch_size per GPU) * 1 (node) * 4 (GPUs per node) = 1024. If memory or the number of GPUs is limited, use --accum_iter to maintain the effective batch size, which is batch_size (per GPU) * nodes * GPUs per node * accum_iter; e.g., with 2 GPUs, keep --batch_size 256 and set --accum_iter 2 to preserve an effective batch size of 1024.
- Here we use --norm_pix_loss as the target for better representation learning. To train a baseline model (e.g., for visualization), use pixel-based reconstruction and turn off --norm_pix_loss.
- To train ViT-Base set --model mae_vit_base_patch4
56 changes: 56 additions & 0 deletions mae/datasets/lmdb_dataset.py
@@ -0,0 +1,56 @@
import sys

import lmdb
import six
from PIL import Image
from torch.utils.data import Dataset


class lmdbDataset(Dataset):
    """LMDB dataset for raw images.

    Args:
        root (str): Root path for lmdb files.
        transform (callable, optional): A function/transform that takes in a
            PIL image and returns a transformed version.
    """

    def __init__(self, root: str = None, transform=None):
        self.env = lmdb.open(
            root,
            max_readers=1,
            readonly=True,
            lock=False,
            readahead=False,
            meminit=False)

        if not self.env:
            print('cannot create lmdb from %s' % (root))
            sys.exit(0)

        with self.env.begin(write=False) as txn:
            nSamples = int(txn.get('num-samples'.encode()))
        self.nSamples = nSamples
        self.transform = transform

    def __len__(self):
        return self.nSamples

    def __getitem__(self, index):
        assert index < len(self), 'index range error'
        # image keys in the LMDB are 1-indexed: 'image-000000001', ...
        index += 1
        with self.env.begin(write=False) as txn:
            img_key = 'image-%09d' % index
            imgbuf = txn.get(img_key.encode())

        buf = six.BytesIO()
        buf.write(imgbuf)
        buf.seek(0)
        try:
            img = Image.open(buf).convert('RGB')
        except IOError:
            # skip corrupted images by falling back to the next sample;
            # index was already advanced above, so wrap around at the end
            print('Corrupted image for %d' % index)
            return self[index % len(self)]

        if self.transform is not None:
            img = self.transform(img)

        # MAE pre-training is unsupervised, so a dummy label is returned
        return img, 'test'
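A minimal usage sketch for the dataset above (the LMDB path is an assumed example; the string label is a dummy value, since MAE pre-training is unsupervised):

```python
# Usage sketch for lmdbDataset; the path below is an assumed example.
import torch
from torchvision import transforms
from datasets.lmdb_dataset import lmdbDataset

transform = transforms.Compose([
    transforms.Resize((32, 128)),  # (height, width), matching pre-training
    transforms.ToTensor(),
])
dataset = lmdbDataset(root='../data/Union14M-U/book32_lmdb',
                      transform=transform)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

images, _ = next(iter(loader))  # labels are dummies during pre-training
print(images.shape)             # torch.Size([256, 3, 32, 128])
```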
27 changes: 21 additions & 6 deletions mae/main_pretrain.py
@@ -21,6 +21,7 @@
import util.misc as misc
from engine_pretrain import train_one_epoch
from util.misc import NativeScalerWithGradNormCount as NativeScaler
from datasets.lmdb_dataset import lmdbDataset

assert timm.__version__ == "0.3.2" # version check

@@ -172,15 +173,27 @@ def main(args):
transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

    # check if it is lmdb dataset
    if isinstance(args.data_path, list):
        dataset_train = datasets.ImageFolder(args.data_path[0],
                                             transform_train)
        files = os.listdir(args.data_path[0])
    else:
        files = os.listdir(args.data_path)
    for f in files:
        if '.mdb' in f:
            dataset_type = lmdbDataset
            break
        if os.path.isdir(os.path.join(args.data_path, f)):
            dataset_type = datasets.ImageFolder
            break

    if isinstance(args.data_path, list):
        dataset_train = dataset_type(args.data_path[0], transform_train)
        for p in args.data_path[1:]:
            dataset_train = torch.utils.data.ConcatDataset(
                [dataset_train,
                 datasets.ImageFolder(p, transform_train)])
                 dataset_type(p, transform_train)])
    else:
        dataset_train = datasets.ImageFolder(
        dataset_train = dataset_type(
            os.path.join(args.data_path), transform=transform_train)
    print(dataset_train)

Expand Down Expand Up @@ -273,8 +286,10 @@ def main(args):
epoch=epoch)

        log_stats = {
            **{f'train_{k}': v
               for k, v in train_stats.items()},
            **{
                f'train_{k}': v
                for k, v in train_stats.items()
            },
            'epoch': epoch,
        }

2 changes: 2 additions & 0 deletions mae/requirements.txt
@@ -1,2 +1,4 @@
timm==0.3.2
tensorboard==2.11.0
lmdb==1.4.1
numpy<=1.23.0
Binary file modified mae/util/__pycache__/lr_sched.cpython-38.pyc
Binary file modified mae/util/__pycache__/misc.cpython-38.pyc
Binary file modified mae/util/__pycache__/pos_embed.cpython-38.pyc
