
Commit

mv layout and doc vqa dataset to docs/dataset
WenmuZhou committed Apr 27, 2022
1 parent 8ea84de commit e4348f6
Showing 15 changed files with 52 additions and 53 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -101,7 +101,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [PP-Structure 🔥](./ppstructure/README.md)
- [Quick Start](./ppstructure/docs/quickstart_en.md)
- [Model Zoo](./ppstructure/docs/models_list_en.md)
- [Model training](./doc/doc_en/training_en.md)
- [Model training](./doc/doc_en/training_en.md)
- [Layout Parser](./ppstructure/layout/README.md)
- [Table Recognition](./ppstructure/table/README.md)
- [DocVQA](./ppstructure/vqa/README.md)
@@ -121,9 +121,9 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [Other Data Annotation Tools](./doc/doc_en/data_annotation_en.md)
- [Other Data Synthesis Tools](./doc/doc_en/data_synthesis_en.md)
- Datasets
- [General OCR Datasets(Chinese/English)](./doc/doc_en/datasets_en.md)
- [HandWritten_OCR_Datasets(Chinese)](./doc/doc_en/handwritten_datasets_en.md)
- [Various OCR Datasets(multilingual)](./doc/doc_en/vertical_and_multilingual_datasets_en.md)
- [General OCR Datasets(Chinese/English)](doc/doc_en/dataset/datasets_en.md)
- [HandWritten_OCR_Datasets(Chinese)](doc/doc_en/dataset/handwritten_datasets_en.md)
- [Various OCR Datasets(multilingual)](doc/doc_en/dataset/vertical_and_multilingual_datasets_en.md)
- [Code Structure](./doc/doc_en/tree_en.md)
- [Visualization](#Visualization)
- [Community](#Community)
@@ -170,4 +170,4 @@ More details, please refer to [Multilingual OCR Development Plan](https://github

<a name="LICENSE"></a>
## License
This project is released under <a href="https://github.com/PaddlePaddle/PaddleOCR/blob/master/LICENSE">Apache 2.0 license</a>
This project is released under <a href="https://github.com/PaddlePaddle/PaddleOCR/blob/master/LICENSE">Apache 2.0 license</a>
10 changes: 5 additions & 5 deletions README_ch.md
@@ -128,12 +128,12 @@ PaddleOCR aims to build a rich, leading, and practical OCR toolkit, helping
- [Other Data Annotation Tools](./doc/doc_ch/data_annotation.md)
- [Other Data Synthesis Tools](./doc/doc_ch/data_synthesis.md)
- Datasets
- [General Chinese and English OCR Datasets](./doc/doc_ch/datasets.md)
- [Handwritten Chinese OCR Datasets](./doc/doc_ch/handwritten_datasets.md)
- [Vertical and Multilingual OCR Datasets](./doc/doc_ch/vertical_and_multilingual_datasets.md)
- [Layout Analysis Datasets](./doc/doc_ch/layout_datasets.md)
- [General Chinese and English OCR Datasets](doc/doc_ch/dataset/datasets.md)
- [Handwritten Chinese OCR Datasets](doc/doc_ch/dataset/handwritten_datasets.md)
- [Vertical and Multilingual OCR Datasets](doc/doc_ch/dataset/vertical_and_multilingual_datasets.md)
- [Layout Analysis Datasets](doc/doc_ch/dataset/layout_datasets.md)
- [Table Recognition Datasets](doc/doc_ch/dataset/table_datasets.md)
- [DocVQA Datasets](./doc/doc_ch/docvqa_datasets.md)
- [DocVQA Datasets](doc/doc_ch/dataset/docvqa_datasets.md)
- [Code Structure](./doc/doc_ch/tree.md)
- [Visualization](#效果展示)
- [*Dive Into OCR* E-book 📚](./doc/doc_ch/ocr_book.md)
22 changes: 11 additions & 11 deletions doc/doc_ch/datasets.md → doc/doc_ch/dataset/datasets.md
@@ -6,17 +6,17 @@
- [Chinese Document Text Recognition](#中文文档文字识别)
- [ICDAR2019-ArT](#ICDAR2019-ArT)

In addition to the open-source data, users can also synthesize data themselves with synthesis tools; see [Data Synthesis Tools](./data_synthesis.md).
In addition to the open-source data, users can also synthesize data themselves with synthesis tools; see [Data Synthesis Tools](../data_synthesis.md).

To annotate your own data, see [Data Annotation Tools](./data_annotation.md).
To annotate your own data, see [Data Annotation Tools](../data_annotation.md).

<a name="ICDAR2019-LSVT"></a>
#### 1. ICDAR2019-LSVT
- **Data source**: https://ai.baidu.com/broad/introduction?dataset=lsvt
- **Introduction**: 450k Chinese street-view images in total, including 50k fully annotated images (20k test + 30k training; text coordinates + text content) and 400k weakly annotated images (text content only), as shown below:
![](../datasets/LSVT_1.jpg)
![](../../datasets/LSVT_1.jpg)
(a) Fully annotated data
![](../datasets/LSVT_2.jpg)
![](../../datasets/LSVT_2.jpg)
(b) Weakly annotated data
- **Download link**: https://ai.baidu.com/broad/download?dataset=lsvt
- **Note**: The labels of the test set have not been released. To evaluate results, submit them on the official site: https://rrc.cvc.uab.es/?ch=16
@@ -25,16 +25,16 @@
#### 2. ICDAR2017-RCTW-17
- **Data source**: https://rctw.vlrlab.net/
- **Introduction**: Contains 12,000+ images, most of them collected in the wild with mobile phone cameras; some are screenshots. The images cover a wide variety of scenes, including street views, posters, menus, indoor scenes, and screenshots of mobile applications.
![](../datasets/rctw.jpg)
![](../../datasets/rctw.jpg)
- **Download link**: https://rctw.vlrlab.net/dataset/

<a name="中文街景文字识别"></a>
#### 3. Chinese Street View Text Recognition
#### 3. Chinese Street View Text Recognition
- **Data source**: https://aistudio.baidu.com/aistudio/competition/detail/8
- **Introduction**: The line-level recognition task of ICDAR2019-LSVT, with 290k images in total: 210k as the training set (labeled) and 80k as the test set (unlabeled). The data is collected from Chinese street views and formed by cropping out the text-line regions (e.g., shop signs, landmarks) from the street-view images. All images are preprocessed: using an affine transform, the text region is proportionally mapped to an image 48 pixels high, as shown below:
![](../datasets/ch_street_rec_1.png)
![](../../datasets/ch_street_rec_1.png)
(a) Label: 魅派集成吊顶
![](../datasets/ch_street_rec_2.png)
![](../../datasets/ch_street_rec_2.png)
(b) Label: 母婴用品连锁
- **Download link**:
https://aistudio.baidu.com/aistudio/datasetdetail/8429
@@ -48,15 +48,15 @@
- Contains 5,990 characters in total, including Chinese characters, English letters, digits, and punctuation (character set: https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt)
- Each sample contains exactly 10 characters, randomly cropped from sentences in the corpus
- Image resolution is uniformly 280x32
![](../datasets/ch_doc1.jpg)
![](../datasets/ch_doc3.jpg)
![](../../datasets/ch_doc1.jpg)
![](../../datasets/ch_doc3.jpg)
- **Download link**: https://pan.baidu.com/s/1QkI7kjah8SPHwOQ40rS1Pw (password: lu7m)

<a name="ICDAR2019-ArT"></a>
#### 5. ICDAR2019-ArT
- **Data source**: https://ai.baidu.com/broad/introduction?dataset=art
- **Introduction**: Contains 10,166 images in total, 5,603 in the training set and 4,563 in the test set. It is composed of three parts: Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text (the curved subset of ICDAR2019-LSVT), and includes text of various shapes such as horizontal, multi-oriented, and curved.
![](../datasets/ArT.jpg)
![](../../datasets/ArT.jpg)
- **Download link**: https://ai.baidu.com/broad/download?dataset=art

## References
File renamed without changes.
@@ -9,7 +9,7 @@
- **Data introduction**:
* Includes both online and offline handwriting data. `HWDB1.0~1.2` contains 3,895,135 handwritten single-character samples in total, belonging to 7,356 classes (7,185 Chinese characters and 171 English letters, digits, and symbols); `HWDB2.0~2.2` contains 5,091 page images in total, segmented into 52,230 text lines and 1,349,414 characters. All characters and text samples are stored as grayscale images. Some single-character samples are shown below.

![](../datasets/CASIA_0.jpg)
![](../../datasets/CASIA_0.jpg)

- **Download link**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Usage suggestions**: The data consists of single characters on a white background, from which large numbers of text lines can be composed for training. The white background can be made transparent, which makes it convenient to add various backgrounds. When semantics matter, it is suggested to extract single characters from a real corpus to form text lines.
@@ -22,7 +22,7 @@

- **Data introduction**: The NIST19 dataset is suitable for training handwritten-document and character-recognition models. It is extracted from the handwriting sample forms of 3,600 writers and contains 810,000 character images in total. Nine example images are shown below.

![](../datasets/nist_demo.png)
![](../../datasets/nist_demo.png)


- **Download link**: [https://www.nist.gov/srd/nist-special-database-19](https://www.nist.gov/srd/nist-special-database-19)
File renamed without changes.
@@ -22,7 +22,7 @@
* CCPD-Challenge: some of the most challenging images for license plate detection and recognition to date
* CCPD-NP: images of new cars with no license plate installed.

![](../datasets/ccpd_demo.png)
![](../../datasets/ccpd_demo.png)


- **Download link**:
@@ -46,7 +46,7 @@
* Expiry date: 07/41
* Cardholder name in pinyin: MICHAEL

![](../datasets/cmb_demo.jpg)
![](../../datasets/cmb_demo.jpg)

- **Download link**: [https://cdn.kesci.com/cmb2017-2.zip](https://cdn.kesci.com/cmb2017-2.zip)

@@ -59,7 +59,7 @@

- **Data introduction**: This is a data-synthesis toolkit that generates CAPTCHA images from input text. A few demo images generated with the toolkit are shown below (a toy generation example is sketched after this entry).

![](../datasets/captcha_demo.png)
![](../../datasets/captcha_demo.png)

- **Download link**: This dataset is generated on the fly, so there is no download link.
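
A toy sketch of this kind of generation (the toolkit itself is not named in this excerpt, so the widely used `captcha` Python package is assumed here purely for illustration):

```python
# Sketch only: generate a CAPTCHA image from input text.
# Assumes the third-party `captcha` package (pip install captcha); the
# toolkit referenced above may work differently.
from captcha.image import ImageCaptcha

generator = ImageCaptcha(width=160, height=60)
generator.write("3a7k", "captcha_demo.png")  # renders the text to an image file
```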

17 changes: 8 additions & 9 deletions doc/doc_ch/training.md
@@ -81,13 +81,13 @@ Optimizer:
- Detection:
- English dataset: ICDAR2015
- Chinese dataset: 30k training images from the LSVT street-view dataset

- Recognition:
- English dataset: MJSynth and SynthText synthetic data, tens of millions of samples.
- Chinese dataset: 300k images in total, cropped from the LSVT street-view dataset according to the ground truth and position-calibrated. In addition, 5 million synthetic samples were generated from the LSVT corpus.
- Minority-language datasets: 1 million synthetic samples were generated per language using different corpora and fonts, with ICDAR-MLT as the validation set.

Among these, the public datasets are all open source and users can search for and download them themselves, or refer to the [Chinese datasets](./datasets.md). The synthetic data is not open-sourced for now; users can synthesize it themselves with open-source tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).
Among these, the public datasets are all open source and users can search for and download them themselves, or refer to the [Chinese datasets](dataset/datasets.md). The synthetic data is not open-sourced for now; users can synthesize it themselves with open-source tools such as [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), and [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator).

<a name="垂类场景"></a>
### 3.2 Vertical Scenarios
@@ -120,17 +120,17 @@ PaddleOCR mainly focuses on general OCR; for vertical-domain needs, you can use PaddleOCR+
**Q**: When training CRNN recognition, how do I choose a suitable network input shape?

A: The height is usually set to 32. There are two ways to choose the maximum width:

(1) Compute the aspect-ratio distribution of the training images, and pick a maximum aspect ratio large enough to cover 80% of the training samples.

(2) Count the characters in the training labels, and pick a maximum character count that covers 80% of the training samples. Then estimate a maximum width by treating the height-to-width ratio of a Chinese character as roughly 1:1 and of an English character as roughly 3:1. Both estimates are sketched below.
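
A minimal sketch of the two estimates, assuming you have the training image paths and their label strings (the helper below is illustrative, not part of PaddleOCR):

```python
# Rough estimate of a CRNN input width for a fixed height of 32.
# Illustrative only -- not part of PaddleOCR.
import numpy as np
from PIL import Image

def estimate_input_width(image_paths, labels, height=32, coverage=0.80):
    # (1) percentile of width/height aspect ratios over the training images
    ratios = []
    for p in image_paths:
        w, h = Image.open(p).size
        ratios.append(w / h)
    width_by_ratio = height * np.percentile(ratios, coverage * 100)

    # (2) percentile of widths estimated from character counts:
    # a Chinese character is roughly square, an English character ~1/3 as wide
    def char_width(c):
        return height if ord(c) > 127 else height / 3
    widths = [sum(char_width(c) for c in text) for text in labels]
    width_by_chars = np.percentile(widths, coverage * 100)

    # take the larger estimate and round up to a multiple of 4
    return int(np.ceil(max(width_by_ratio, width_by_chars) / 4) * 4)
```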

**Q**: During recognition training, the training-set accuracy has reached 90 but the validation-set accuracy stays around 70 and will not improve. What can I do?

A: With about 90 training accuracy and 70-odd validation accuracy, the model is probably overfitting. Two things to try:

(1) Add more augmentation methods, or increase the augmentation [probability](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/rec_img_aug.py#L341), which defaults to 0.4.

(2) Increase the [L2 decay value](https://github.com/PaddlePaddle/PaddleOCR/blob/a501603d54ff5513fc4fc760319472e59da25424/configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml#L47) in the config, as sketched below.
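
A rough sketch of bumping that value programmatically, assuming the config follows the usual `Optimizer -> regularizer -> factor` layout of PaddleOCR YAML files (verify against your own config before relying on it):

```python
# Sketch: raise the L2 regularization factor in a training config.
# The exact key path and a sensible value depend on your config; treat this
# as an illustration, not an official recipe.
import yaml

cfg_path = "configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml"
with open(cfg_path, "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

old = cfg["Optimizer"]["regularizer"]["factor"]
cfg["Optimizer"]["regularizer"]["factor"] = old * 10  # stronger weight decay

with open(cfg_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(cfg, f, sort_keys=False, allow_unicode=True)
```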

**Q**: When training the recognition model, the loss decreases normally but the accuracy stays at 0.
@@ -141,12 +141,11 @@

***

For detailed training tutorials, follow the links below:
For detailed training tutorials, follow the links below:

- [Text Detection Model Training](./detection.md)
- [Text Detection Model Training](./detection.md)

- [Text Recognition Model Training](./recognition.md)

- [Text Direction Classifier Training](./angle_class.md)
- [Knowledge Distillation](./knowledge_distillation.md)

2 changes: 1 addition & 1 deletion doc/doc_ch/update.md
@@ -22,7 +22,7 @@
- 2020.7.15 Organized OCR-related datasets and commonly used data annotation and synthesis tools
- 2020.7.9 Added a recognition model that supports spaces; for recognition results and prediction/training instructions, refer to the Quick Start and text recognition training documentation
- 2020.7.9 Added data augmentation and learning-rate decay strategies; see the [configuration file](./config.md) for details
- 2020.6.8 Added [datasets](./datasets.md), to be continuously updated
- 2020.6.8 Added [datasets](dataset/datasets.md), to be continuously updated
- 2020.6.5 Supported exporting the `attention` model as an `inference_model`
- 2020.6.5 Supported outputting prediction scores when running recognition alone
- 2020.5.30 Provided an online demo of the ultra-lightweight Chinese OCR model
2 changes: 1 addition & 1 deletion doc/doc_en/FAQ_en.md
@@ -42,7 +42,7 @@ At present, the open source model, dataset and magnitude are as follows:
English dataset: MJSynth and SynthText synthetic datasets, with tens of millions of samples.
Chinese dataset: the LSVT street view dataset with cropped text regions, 300k images in total. In addition, 5 million samples were synthesized from the LSVT corpus.

Among them, the public datasets are open source and users can search for and download them themselves, or refer to the [Chinese datasets](./datasets_en.md). The synthetic data is not open-sourced; users can synthesize data themselves with open-source synthesis tools. Currently available synthesis tools include [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator), etc.
Among them, the public datasets are open source and users can search for and download them themselves, or refer to the [Chinese datasets](dataset/datasets_en.md). The synthetic data is not open-sourced; users can synthesize data themselves with open-source synthesis tools. Currently available synthesis tools include [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator), etc.

10. **Error in using the model with TPS module for prediction**
Error message: Input(X) dims[3] and Input(Grid) dims[2] should be equal, but received X dimension[3]\(108) != Grid dimension[2]\(100)
18 changes: 9 additions & 9 deletions doc/doc_en/datasets_en.md → doc/doc_en/dataset/datasets_en.md
@@ -12,30 +12,30 @@ In addition to opensource data, users can also use synthesis tools to synthesize
#### 1. ICDAR2019-LSVT
- **Data source**: https://ai.baidu.com/broad/introduction?dataset=lsvt
- **Introduction**: A total of 450k Chinese street view images, including 50k fully labeled images (20k test + 30k training; text coordinates + text content) and 400k weakly labeled images (text content only), as shown in the following figure:
![](../datasets/LSVT_1.jpg)
![](../../datasets/LSVT_1.jpg)

(a) Fully labeled data

![](../datasets/LSVT_2.jpg)
![](../../datasets/LSVT_2.jpg)

(b) Weakly labeled data
- **Download link**: https://ai.baidu.com/broad/download?dataset=lsvt

<a name="ICDAR2017-RCTW-17"></a>
#### 2. ICDAR2017-RCTW-17
- **Data source**: https://rctw.vlrlab.net/
- **Introduction**: It contains 12,000+ images, most of them collected in the wild with mobile phone cameras; some are screenshots. These images show a variety of scenes, including street views, posters, menus, indoor scenes, and screenshots of mobile applications.
![](../datasets/rctw.jpg)
![](../../datasets/rctw.jpg)
- **Download link**: https://rctw.vlrlab.net/dataset/

<a name="中文街景文字识别"></a>
#### 3. Chinese Street View Text Recognition
- **Data source**: https://aistudio.baidu.com/aistudio/competition/detail/8
- **Introduction**: A total of 290,000 images are included, of which 210,000 are used as the training set (with labels) and 80,000 as the test set (without labels). The dataset is collected from Chinese street views and is formed by cutting out the text-line regions (such as shop signs and landmarks) from the street-view images. All the images are preprocessed: using an affine transform, the text region is proportionally mapped to an image with a height of 48 pixels, as shown in the figure (a toy sketch of this resizing step follows the download link below):

![](../datasets/ch_street_rec_1.png)
![](../../datasets/ch_street_rec_1.png)
(a) Label: 魅派集成吊顶
![](../datasets/ch_street_rec_2.png)
![](../../datasets/ch_street_rec_2.png)
(b) Label: 母婴用品连锁
- **Download link**
https://aistudio.baidu.com/aistudio/datasetdetail/8429
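
A toy sketch of the resizing step described above (an assumption about the general approach, not the dataset organizers' actual preprocessing code; file names are placeholders):

```python
# Sketch only: rescale a cropped text-line image to a fixed height of 48 px
# while preserving its aspect ratio, similar to the preprocessing described
# for this dataset. File names are placeholders.
import cv2

def resize_to_height(img, target_h=48):
    h, w = img.shape[:2]
    scale = target_h / h
    new_w = max(1, int(round(w * scale)))
    return cv2.resize(img, (new_w, target_h), interpolation=cv2.INTER_LINEAR)

crop = cv2.imread("text_line.jpg")          # a cropped text-line region
normalized = resize_to_height(crop, 48)
cv2.imwrite("text_line_48.jpg", normalized)
```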
Expand All @@ -49,13 +49,13 @@ https://aistudio.baidu.com/aistudio/datasetdetail/8429
- 5,990 characters in total, including Chinese characters, English letters, numbers, and punctuation (character set: https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt)
- Each sample contains exactly 10 characters, randomly cropped from sentences in the corpus
- Image resolution is 280x32
![](../datasets/ch_doc1.jpg)
![](../datasets/ch_doc3.jpg)
![](../../datasets/ch_doc1.jpg)
![](../../datasets/ch_doc3.jpg)
- **Download link**: https://pan.baidu.com/s/1QkI7kjah8SPHwOQ40rS1Pw (Password: lu7m)

<a name="ICDAR2019-ArT"></a>
#### 5. ICDAR2019-ArT
- **Data source**: https://ai.baidu.com/broad/introduction?dataset=art
- **Introduction**: It includes 10,166 images, 5,603 in the training set and 4,563 in the test set. It is composed of three parts: Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, and includes text of various shapes such as horizontal, multi-oriented, and curved.
![](../datasets/ArT.jpg)
![](../../datasets/ArT.jpg)
- **Download link**: https://ai.baidu.com/broad/download?dataset=art
@@ -9,7 +9,7 @@ Here we have sorted out the commonly used handwritten OCR dataset datasets, whic
- **Data introduction**:
* It includes both online and offline handwriting data. `HWDB1.0~1.2` contains 3,895,135 handwritten single-character samples in total, belonging to 7,356 categories (7,185 Chinese characters and 171 English letters, numbers, and symbols); `HWDB2.0~2.2` contains 5,091 page images in total, divided into 52,230 text lines and 1,349,414 characters. All characters and text samples are stored as grayscale images. Some sample characters are shown below.

![](../datasets/CASIA_0.jpg)
![](../../datasets/CASIA_0.jpg)

- **Download address**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Usage suggestions**: The data consists of single characters on a white background, from which large numbers of text lines can be composed for training. The white background can be made transparent, which makes it convenient to add various backgrounds (see the sketch below). When semantics are needed, it is suggested to extract single characters from a real corpus to form text lines.
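
A rough sketch of that suggestion (illustrative only; the character crop file names are placeholders): make the white background transparent, then paste single-character crops side by side onto an arbitrary background to synthesize a text line.

```python
# Sketch: compose a synthetic text line from single-character crops on a
# white background. White pixels are made transparent before pasting onto
# a new background. File names are placeholders.
from PIL import Image

def white_to_transparent(img, threshold=245):
    rgba = img.convert("RGBA")
    data = [
        (r, g, b, 0) if min(r, g, b) > threshold else (r, g, b, a)
        for (r, g, b, a) in rgba.getdata()
    ]
    rgba.putdata(data)
    return rgba

char_paths = ["char_0.png", "char_1.png", "char_2.png"]  # placeholder crops
chars = [white_to_transparent(Image.open(p)) for p in char_paths]

line_w = sum(c.width for c in chars)
line_h = max(c.height for c in chars)
canvas = Image.new("RGBA", (line_w, line_h), (210, 220, 200, 255))  # any background

x = 0
for c in chars:
    canvas.paste(c, (x, (line_h - c.height) // 2), mask=c)
    x += c.width

canvas.convert("RGB").save("synth_text_line.jpg")
```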
@@ -22,7 +22,7 @@

- **Data introduction**: The NIST19 dataset is suitable for training handwritten-document and character-recognition models. It is extracted from the handwriting sample forms of 3,600 writers and contains 810,000 character images in total. Nine of them are shown below.

![](../datasets/nist_demo.png)
![](../../datasets/nist_demo.png)


- **Download address**: [https://www.nist.gov/srd/nist-special-database-19](https://www.nist.gov/srd/nist-special-database-19)