Commit
update doc for rec and training
tink2123 committed Sep 6, 2021
1 parent 91f8478 commit ef9101b
Showing 5 changed files with 286 additions and 114 deletions.
2 changes: 1 addition & 1 deletion doc/doc_ch/recognition.md
@@ -375,7 +375,7 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec
Depending on the `save_model_dir` and `save_epoch_step` fields set in the configuration file, the following parameters are saved:

```diff
-seed_ch/
+output/rec/
 ├── best_accuracy.pdopt
 ├── best_accuracy.pdparams
 ├── best_accuracy.states
 ...
```
18 changes: 9 additions & 9 deletions doc/doc_ch/training.md
@@ -4,14 +4,14 @@

It also briefly describes what the PaddleOCR training data consists of, and how to prepare data for finetuning models in vertical scenarios.

### 1. Basic Concepts

OCR (Optical Character Recognition) refers to analyzing and recognizing images to obtain text and layout information. It is a typical computer vision task, usually composed of two subtasks: text detection and text recognition.

When tuning a model, pay attention to the following parameters:

#### 1.1 Learning Rate

The learning rate is one of the most important hyperparameters when training a neural network: it is the step size by which the gradient moves toward the optimum of the loss function at each iteration. PaddleOCR provides several learning rate update strategies, which can be selected in the configuration file, for example:
@@ -30,7 +30,7 @@ Piecewise denotes piecewise constant decay, where different learning rates are specified for different learning stages,
warmup_epoch means that during the first 5 epochs the learning rate gradually increases from 0 to base_lr. For the full set of strategies, see the code in [learning_rate.py](../../ppocr/optimizer/learning_rate.py).
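A Piecewise schedule with warmup might be configured like the sketch below. The field names follow the style of PaddleOCR's shipped recognition configs, but treat them as an illustrative assumption and check the actual config files for the authoritative schema:

```yaml
Optimizer:
  lr:
    name: Piecewise
    decay_epochs: [700, 800]   # epochs at which the learning rate drops
    values: [0.001, 0.0001]    # learning rate used in each segment
    warmup_epoch: 5            # ramp linearly from 0 to the base lr first
```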


#### 1.2 Regularization

Regularization is an effective way to avoid overfitting. PaddleOCR provides the L1 and L2 regularizers, the two most commonly used regularization methods. L1 regularization adds a penalty term to the objective function that reduces the sum of the absolute values of the parameters, while the L2 penalty reduces the sum of their squares. The configuration is as follows:

@@ -43,15 +43,15 @@
```
Optimizer:
  ...
```
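For reference, an L2 entry in the Optimizer section of a PaddleOCR-style config typically looks like the following sketch (optimizer choice and factor value are illustrative assumptions, not recommendations):

```yaml
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  regularizer:
    name: L2          # or L1
    factor: 0.00004   # regularization strength
```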


#### 1.3 Evaluation Metrics

(1) Detection stage: detections are first evaluated by the IoU between detection boxes and ground-truth boxes; a detection is counted as correct when its IoU exceeds a given threshold. Unlike in generic object detection, the detection and ground-truth boxes here are represented as polygons. Detection precision: the proportion of correct detection boxes among all detection boxes, mainly an indicator of detection quality. Detection recall: the proportion of correct detection boxes among all ground-truth boxes, mainly an indicator of missed detections.

(2) Recognition stage: character recognition accuracy, i.e. the proportion of correctly recognized text lines among all annotated text lines; a line counts as correct only if the entire line is recognized correctly.

(3) End-to-end statistics: end-to-end recall is the proportion of text lines that are accurately detected and correctly recognized among all annotated text lines; end-to-end precision is the proportion of such lines among all detected text lines. A detection counts as accurate when the IoU between the detection box and the ground-truth box exceeds a given threshold, and a detection box counts as correctly recognized when its text is identical to the annotation.
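To make the detection-stage definitions concrete, here is a minimal Python sketch that computes precision and recall from IoU matching. It deliberately simplifies: it uses axis-aligned boxes `(x1, y1, x2, y2)` rather than the polygons PaddleOCR actually evaluates with, and a greedy one-to-one matching scheme; the function names are ours, not PaddleOCR's:

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def det_precision_recall(pred_boxes, gt_boxes, thr=0.5):
    # Greedily match each prediction to an unmatched ground truth with IoU >= thr.
    matched_gt, correct = set(), 0
    for p in pred_boxes:
        for i, g in enumerate(gt_boxes):
            if i not in matched_gt and iou(p, g) >= thr:
                matched_gt.add(i)
                correct += 1
                break
    precision = correct / len(pred_boxes) if pred_boxes else 0.0
    recall = correct / len(gt_boxes) if gt_boxes else 0.0
    return precision, recall
```

With one of two predictions overlapping one of two ground truths, both precision and recall come out to 0.5, matching the definitions above.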

### 2. FAQ

**Q**: What kinds of deep-learning-based text detection methods are there? What are their respective advantages and disadvantages?

@@ -77,9 +77,9 @@
(2) Count the number of characters in the training samples, and choose the maximum character count so that it covers 80% of the training samples. Then, taking the aspect ratio of a Chinese character as roughly 1 and of an English character as 3:1, estimate a maximum width.
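The 80%-coverage rule above can be sketched in a few lines of Python (function names and the default height are our illustrative assumptions, not part of PaddleOCR):

```python
def choose_max_text_len(labels, coverage=0.8):
    """Pick the smallest max-length that covers `coverage` of the samples."""
    lengths = sorted(len(text) for text in labels)
    idx = min(int(len(lengths) * coverage), len(lengths) - 1)
    return lengths[idx]

def estimate_max_width(max_len, height=32, char_ratio=1.0):
    """Estimate input width from max length; a Chinese char is roughly
    square (ratio ~1), an English char roughly 1:3 (ratio ~1/3)."""
    return int(max_len * height * char_ratio)
```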


### 3. Data and Vertical Scenarios

#### 3.1 Training Data
The datasets used by the currently open-sourced models, and their scale, are as follows:

- Detection:
@@ -94,12 +94,12 @@
The public datasets are all open source; users can search for and download them themselves, or refer to [Chinese datasets](./datasets.md). The synthetic data is not open-sourced; users can generate their own with open-source synthesis tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).


#### 3.2 Vertical Scenarios

PaddleOCR focuses mainly on general-purpose OCR. For vertical-domain needs, you can train your own model with PaddleOCR plus your vertical-domain data; if you lack annotated data, or do not want to invest in R&D, it is recommended to call the open APIs directly, which cover the most common vertical scenarios.

#### 3.3 Building Your Own Dataset

A few rules of thumb are worth keeping in mind when building a dataset:

83 changes: 83 additions & 0 deletions doc/doc_en/config_en.md
@@ -120,3 +120,86 @@ In ppocr, the network is divided into four stages: Transform, Backbone, Neck and
| batch_size_per_card | Single card batch size during training | 256 | \ |
| drop_last | Whether to discard the last incomplete mini-batch when the number of samples in the dataset is not divisible by batch_size | True | \ |
| num_workers | The number of sub-processes used to load data; if it is 0, no sub-process is started and data is loaded in the main process | 8 | \ |
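Put together, a loader section using the fields from the table above might look like the following sketch (the values shown are just the listed defaults, not a recommendation):

```yaml
Train:
  loader:
    shuffle: True
    batch_size_per_card: 256
    drop_last: True
    num_workers: 8
```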


## MULTILINGUAL CONFIG FILE GENERATION

PaddleOCR currently supports recognition for 80 languages (in addition to Chinese). A multi-language configuration file template is
provided under the path `configs/rec/multi_language`: [rec_multi_language_lite_train.yml](../../configs/rec/multi_language/rec_multi_language_lite_train.yml)

There are two ways to create the required configuration file:

### Automatically generated by script

[generate_multi_language_configs.py](../../configs/rec/multi_language/generate_multi_language_configs.py) can help you generate configuration files for multi-language models.

- Take Italian as an example: if your data is prepared in the following format:
```
|-train_data
|- it_train.txt # train_set label
|- it_val.txt # val_set label
|- data
|- word_001.jpg
|- word_002.jpg
|- word_003.jpg
| ...
```

You can use the default parameters to generate a configuration file:

```bash
# The code needs to be run in the specified directory
cd PaddleOCR/configs/rec/multi_language/
# Set the configuration file of the language to be generated through the -l or --language parameter.
# This command will write the default parameters into the configuration file
python3 generate_multi_language_configs.py -l it
```

- If your data is placed in another location, or you want to use your own dictionary, you can generate the configuration file by specifying the relevant parameters:

```bash
# -l or --language field is required
# --train to modify the training set
# --val to modify the validation set
# --data_dir to modify the data set directory
# --dict to modify the dict path
# -o to modify the corresponding default parameters
cd PaddleOCR/configs/rec/multi_language/
python3 generate_multi_language_configs.py -l it \
    --train {path/of/train_label.txt} \
    --val {path/of/val_label.txt} \
    --data_dir {train_data/path} \
    --dict {path/of/dict} \
    -o Global.use_gpu=False
...

```
Italian is written with the Latin alphabet, so after executing the command you will get `rec_latin_lite_train.yml`.

### Manually modify the configuration file

You can also manually modify the following fields in the template:

```
Global:
use_gpu: True
epoch_num: 500
...
character_type: it # language
character_dict_path: {path/of/dict} # path of dict
Train:
dataset:
name: SimpleDataSet
data_dir: train_data/ # root directory of training data
label_file_list: ["./train_data/train_list.txt"] # train label path
...
Eval:
dataset:
name: SimpleDataSet
data_dir: train_data/ # root directory of val data
label_file_list: ["./train_data/val_list.txt"] # val label path
...
```