add dataset desc
WenmuZhou committed Aug 16, 2022
1 parent 02e881e commit d69b74e
Showing 4 changed files with 40 additions and 2 deletions.
11 changes: 11 additions & 0 deletions doc/doc_ch/dataset/table_datasets.md
@@ -3,6 +3,7 @@
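- [Dataset Summary](#数据集汇总)
- [1. PubTabNet Dataset](#1-pubtabnet数据集)
- [2. TAL Table Recognition Competition Dataset](#2-好未来表格识别竞赛数据集)
- [3. WTW Chinese Scene Table Dataset](#3-wtw中文场景表格数据集)

Commonly used table recognition datasets are collected here and updated continuously. Contributions of new datasets are welcome.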

@@ -12,6 +13,7 @@
|---|---|---|
| PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| WTW Chinese Scene Table Dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |

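## 1. PubTabNet Dataset
- **Data Introduction**: The training set of the PubTabNet dataset contains 500,000 images and the validation set contains 9,000 images. Some sample images are shown below.
@@ -31,3 +33,12 @@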
<img src="../../datasets/table_tal_demo/1.jpg" width="500">
<img src="../../datasets/table_tal_demo/2.jpg" width="500">
</div>

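## 3. WTW Chinese Scene Table Dataset
- **Data Introduction**: The WTW Chinese scene table dataset consists of two parts, table detection and table data. It contains images from two kinds of scenes: scanned and photographed.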

https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif

<div align="center">
<img src="https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif" width="500">
</div>
10 changes: 10 additions & 0 deletions doc/doc_en/dataset/table_datasets_en.md
@@ -3,6 +3,7 @@
- [Dataset Summary](#dataset-summary)
- [1. PubTabNet](#1-pubtabnet)
- [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
- [3. WTW Chinese scene table dataset](#3-wtw-chinese-scene-table-dataset)

Here are the commonly used table recognition datasets. The list is updated continuously, and contributions of new datasets are welcome.

@@ -12,6 +13,7 @@ Here are the commonly used table recognition datasets, which are being updated c
|---|---|---|
| PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| WTW Chinese scene table dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
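
The table above describes two of the datasets as "jsonl format". As a rough illustration of what that means, the sketch below reads a JSON Lines file one record per line. It is not the code in `pubtab_dataset.py`; the helper name `read_jsonl`, the demo file, and the field names in the sample records are assumptions made up for this example.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical two-record file; the field names are illustrative only
# and are not taken from the dataset documentation above.
sample_records = [
    {"filename": "a.png", "html": {"structure": {"tokens": ["<tr>", "<td>", "</td>", "</tr>"]}}},
    {"filename": "b.png", "html": {"structure": {"tokens": ["<tr>", "</tr>"]}}},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    text = "\n".join(json.dumps(r) for r in sample_records) + "\n"
    demo_path.write_text(text, encoding="utf-8")
    records = list(read_jsonl(demo_path))

print(len(records), records[0]["filename"])
```

Reading line by line keeps memory use flat for large annotation files, which matters for a dataset with hundreds of thousands of entries.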

## 1. PubTabNet
- **Data Introduction**: The training set of the PubTabNet dataset contains 500,000 images and the validation set contains 9,000 images. Some sample images are shown below.
@@ -30,3 +32,11 @@ Here are the commonly used table recognition datasets, which are being updated c
<img src="../../datasets/table_tal_demo/1.jpg" width="500">
<img src="../../datasets/table_tal_demo/2.jpg" width="500">
</div>

## 3. WTW Chinese scene table dataset
- **Data Introduction**: The WTW Chinese scene table dataset consists of two parts: table detection and table data. The dataset contains images from two kinds of scenes, scanned and photographed.
https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif

<div align="center">
<img src="https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif" width="500">
</div>
11 changes: 10 additions & 1 deletion ppstructure/table/README.md
@@ -63,7 +63,16 @@ After the operation is completed, the excel table of each image will be saved to
This chapter only covers training of the table structure model. For training the [text detection](../../doc/doc_en/detection_en.md) and [text recognition](../../doc/doc_en/recognition_en.md) models, please refer to the corresponding documents.

* Data preparation
The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in html format.

The Chinese and English models are trained on different data sources, described below:

English dataset: The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in html format.

Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training.
> 1. Generated dataset: 40,000 images generated with the [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration).
> 2. 10,000 images cropped from [WTW](https://github.com/wangwen-whu/WTW-Dataset).

For a detailed introduction to the public datasets, please refer to [table_datasets](../../doc/doc_en/dataset/table_datasets_en.md). The following training and evaluation procedures use the English dataset as an example.
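
The README does not say how the 1:1 sampling ratio is implemented. One plausible reading is that each training pass draws the same number of images from the 40,000 generated images and from the 10,000 WTW crops, so the smaller source is revisited more often. The sketch below illustrates that reading only; the function `sample_one_to_one` and the file names are hypothetical and are not part of PaddleOCR.

```python
import random


def sample_one_to_one(generated, wtw, num_samples, seed=0):
    """Draw num_samples items, taking equally from both sources.

    Sampling is with replacement, so the smaller source is simply
    revisited more often than the larger one.
    """
    rng = random.Random(seed)
    half = num_samples // 2
    picked = [("generated", rng.choice(generated)) for _ in range(half)]
    picked += [("wtw", rng.choice(wtw)) for _ in range(num_samples - half)]
    rng.shuffle(picked)  # interleave the two sources
    return picked


# Placeholder file lists with the sizes quoted in the README.
generated = [f"gen_{i}.jpg" for i in range(40000)]
wtw = [f"wtw_{i}.jpg" for i in range(10000)]

epoch = sample_one_to_one(generated, wtw, num_samples=1000)
counts = {"generated": 0, "wtw": 0}
for source, _ in epoch:
    counts[source] += 1
print(counts)  # {'generated': 500, 'wtw': 500}
```

Under this scheme each WTW image is expected to appear about four times as often as each generated image, which is what balances the two sources despite their 4:1 size difference.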

* Start training
*If you installed the CPU version of paddle, set the `use_gpu` field in the configuration file to false*
10 changes: 9 additions & 1 deletion ppstructure/table/README_ch.md
@@ -75,7 +75,15 @@ note: the models above are table recognition models trained on the PubLayNet dataset, only

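* Data preparation

The training data uses the public dataset PubTabNet ([paper](https://arxiv.org/abs/1911.10683), [download](https://github.com/ibm-aur-nlp/PubTabNet)). The PubTabNet dataset contains about 500,000 table images, along with the corresponding annotations in html format.
The Chinese and English models are trained on different data sources, described below:

English dataset: The training data uses the public dataset PubTabNet ([paper](https://arxiv.org/abs/1911.10683), [download](https://github.com/ibm-aur-nlp/PubTabNet)). The PubTabNet dataset contains about 500,000 table images, along with the corresponding annotations in html format.

Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training.
> 1. Generated dataset: 40,000 images generated with the [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration).
> 2. 10,000 images obtained from [WTW](https://github.com/wangwen-whu/WTW-Dataset).

For a detailed introduction to the public datasets, see [table_datasets](../../doc/doc_ch/dataset/table_datasets.md). The training and evaluation procedures below use the English dataset as an example.

* Start training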
