Commit
update doc for rec and training
tink2123 committed Sep 6, 2021
1 parent 91f8478 commit ef9101b
Showing 5 changed files with 286 additions and 114 deletions.
2 changes: 1 addition & 1 deletion doc/doc_ch/recognition.md
@@ -375,7 +375,7 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec
Depending on the `save_model_dir` and `save_epoch_step` fields set in the configuration file, the following parameters are saved:

```diff
-seed_ch/
+output/rec/
 ├── best_accuracy.pdopt
 ├── best_accuracy.pdparams
 ├── best_accuracy.states
 ...
```
18 changes: 9 additions & 9 deletions doc/doc_ch/training.md
@@ -4,14 +4,14 @@

It also briefly describes what the PaddleOCR training data consists of, and how to prepare data for finetuning models in vertical scenarios.

### 1. Basic Concepts

OCR (Optical Character Recognition) refers to analyzing and recognizing images to obtain text and layout information. It is a typical computer vision task, usually composed of two subtasks: text detection and text recognition.

When tuning a model, pay attention to the following parameters:

#### 1.1 Learning Rate

The learning rate is one of the most important hyperparameters when training a neural network: it is the step size by which the gradient moves toward the optimum of the loss function at each iteration. PaddleOCR provides several learning rate update strategies, which can be selected in the configuration file, for example:
@@ -30,7 +30,7 @@ Piecewise denotes piecewise constant decay, where different learning rates are specified for different learning stages,
warmup_epoch means that during the first 5 epochs the learning rate gradually increases from 0 to base_lr. For the full set of strategies, see the code in [learning_rate.py](../../ppocr/optimizer/learning_rate.py).
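A Piecewise schedule with warmup might be configured like the sketch below. The field names follow the style of PaddleOCR's shipped recognition configs, but treat them as an illustrative assumption and check the actual config files for the authoritative schema:

```yaml
Optimizer:
  lr:
    name: Piecewise
    decay_epochs: [700, 800]   # epochs at which the learning rate drops
    values: [0.001, 0.0001]    # learning rate used in each segment
    warmup_epoch: 5            # ramp linearly from 0 to the base lr first
```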


#### 1.2 Regularization

Regularization is an effective way to avoid overfitting. PaddleOCR provides the L1 and L2 regularizers, the two most commonly used regularization methods. L1 regularization adds a penalty term to the objective function that reduces the sum of the absolute values of the parameters, while the L2 penalty reduces the sum of their squares. The configuration is as follows:

@@ -43,15 +43,15 @@
```
Optimizer:
  ...
```
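For reference, an L2 entry in the Optimizer section of a PaddleOCR-style config typically looks like the following sketch (optimizer choice and factor value are illustrative assumptions, not recommendations):

```yaml
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  regularizer:
    name: L2          # or L1
    factor: 0.00004   # regularization strength
```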


#### 1.3 Evaluation Metrics

(1) Detection stage: detections are first evaluated by the IoU between detection boxes and ground-truth boxes; a detection is counted as correct when its IoU exceeds a given threshold. Unlike in generic object detection, the detection and ground-truth boxes here are represented as polygons. Detection precision: the proportion of correct detection boxes among all detection boxes, mainly an indicator of detection quality. Detection recall: the proportion of correct detection boxes among all ground-truth boxes, mainly an indicator of missed detections.

(2) Recognition stage: character recognition accuracy, i.e. the proportion of correctly recognized text lines among all annotated text lines; a line counts as correct only if the entire line is recognized correctly.

(3) End-to-end statistics: end-to-end recall is the proportion of text lines that are accurately detected and correctly recognized among all annotated text lines; end-to-end precision is the proportion of such lines among all detected text lines. A detection counts as accurate when the IoU between the detection box and the ground-truth box exceeds a given threshold, and a detection box counts as correctly recognized when its text is identical to the annotation.
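To make the detection-stage definitions concrete, here is a minimal Python sketch that computes precision and recall from IoU matching. It deliberately simplifies: it uses axis-aligned boxes `(x1, y1, x2, y2)` rather than the polygons PaddleOCR actually evaluates with, and a greedy one-to-one matching scheme; the function names are ours, not PaddleOCR's:

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def det_precision_recall(pred_boxes, gt_boxes, thr=0.5):
    # Greedily match each prediction to an unmatched ground truth with IoU >= thr.
    matched_gt, correct = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched_gt and iou(p, g) >= thr:
                matched_gt.add(i)
                correct += 1
                break
    precision = correct / len(pred_boxes) if pred_boxes else 0.0
    recall = correct / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall
```

With one of two predictions overlapping one of two ground truths, both precision and recall come out to 0.5, matching the definitions above.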

### 2. FAQ

**Q**: What kinds of deep-learning-based text detection methods are there? What are their respective advantages and disadvantages?

@@ -77,9 +77,9 @@
(2) Count the number of characters in the training samples, and choose the maximum character count so that it covers 80% of the training samples. Then, taking the aspect ratio of a Chinese character as roughly 1 and of an English character as 3:1, estimate a maximum width.
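The 80%-coverage rule above can be sketched in a few lines of Python (function names and the default height are our illustrative assumptions, not part of PaddleOCR):

```python
def choose_max_text_len(labels, coverage=0.8):
    """Pick the smallest max-length that covers `coverage` of the samples."""
    lengths = sorted(len(text) for text in labels)
    idx = min(int(len(lengths) * coverage), len(lengths) - 1)
    return lengths[idx]

def estimate_max_width(max_len, height=32, char_ratio=1.0):
    """Estimate input width from max length; a Chinese char is roughly
    square (ratio ~1), an English char roughly 1:3 (ratio ~1/3)."""
    return int(max_len * height * char_ratio)
```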


### 3. Data and Vertical Scenarios

#### 3.1 Training Data
The datasets used by the currently open-sourced models, and their scale, are as follows:

- Detection:
@@ -94,12 +94,12 @@
The public datasets are all open source; users can search for and download them themselves, or refer to [Chinese datasets](./datasets.md). The synthetic data is not open-sourced; users can generate their own with open-source synthesis tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).


#### 3.2 Vertical Scenarios

PaddleOCR focuses mainly on general-purpose OCR. For vertical-domain needs, you can train your own model with PaddleOCR plus your vertical-domain data; if you lack annotated data, or do not want to invest in R&D, it is recommended to call the open APIs directly, which cover the most common vertical scenarios.

#### 3.3 Building Your Own Dataset

A few rules of thumb are worth keeping in mind when building a dataset:

83 changes: 83 additions & 0 deletions doc/doc_en/config_en.md
@@ -120,3 +120,86 @@ In ppocr, the network is divided into four stages: Transform, Backbone, Neck and
| batch_size_per_card | Single card batch size during training | 256 | \ |
| drop_last | Whether to discard the last incomplete mini-batch when the number of samples in the dataset is not divisible by batch_size | True | \ |
| num_workers | The number of sub-processes used to load data; if it is 0, no sub-process is started and data is loaded in the main process | 8 | \ |
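Put together, a loader section using the fields from the table above might look like the following sketch (the values shown are just the listed defaults, not a recommendation):

```yaml
Train:
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
```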


## MULTILINGUAL CONFIG FILE GENERATION

PaddleOCR currently supports recognition for 80 languages (in addition to Chinese). A multi-language configuration file template is
provided under the path `configs/rec/multi_language`: [rec_multi_language_lite_train.yml](../../configs/rec/multi_language/rec_multi_language_lite_train.yml)

There are two ways to create the required configuration file:

### Automatically generated by script

[generate_multi_language_configs.py](../../configs/rec/multi_language/generate_multi_language_configs.py) can help you generate configuration files for multi-language models.

- Take Italian as an example: if your data is prepared in the following format:
```
|-train_data
|- it_train.txt # train_set label
|- it_val.txt # val_set label
|- data
|- word_001.jpg
|- word_002.jpg
|- word_003.jpg
| ...
```

You can use the default parameters to generate a configuration file:

```bash
# The code needs to be run in the specified directory
cd PaddleOCR/configs/rec/multi_language/
# Set the configuration file of the language to be generated through the -l or --language parameter.
# This command will write the default parameters into the configuration file
python3 generate_multi_language_configs.py -l it
```

- If your data is placed in another location, or you want to use your own dictionary, you can generate the configuration file by specifying the relevant parameters:

```bash
# -l or --language field is required
# --train to modify the training set
# --val to modify the validation set
# --data_dir to modify the data set directory
# --dict to modify the dict path
# -o to modify the corresponding default parameters
cd PaddleOCR/configs/rec/multi_language/
python3 generate_multi_language_configs.py -l it \
    --train {path/of/train_label.txt} \
    --val {path/of/val_label.txt} \
    --data_dir {train_data/path} \
    --dict {path/of/dict} \
    -o Global.use_gpu=False
...

```
Italian is written with the Latin alphabet, so after executing the command you will get `rec_latin_lite_train.yml`.

### Manually modify the configuration file

You can also manually modify the following fields in the template:

```
Global:
use_gpu: True
epoch_num: 500
...
character_type: it # language
character_dict_path: {path/of/dict} # path of dict
Train:
dataset:
name: SimpleDataSet
data_dir: train_data/ # root directory of training data
label_file_list: ["./train_data/train_list.txt"] # train label path
...
Eval:
dataset:
name: SimpleDataSet
data_dir: train_data/ # root directory of val data
label_file_list: ["./train_data/val_list.txt"] # val label path
...
```