T-H-E Dataset

This dataset contains handwritten Turkish, Hungarian and English characters collected from 200 participants. The T-H-E (Turkish-Hungarian-English) Dataset represents 78 different characters, covering both upper-case and lower-case forms, as 156000 binary character images. It can be downloaded in six versions, which lets users combine the alphabets for different recognition purposes. The versions are explained below. For further details, see the article cited at the bottom of the page.

SUMMARY

VERSION	Number of Characters	Number of Classes	Data	Contained characters
Version I	156000	78	Original + Augmented	English, Hungarian, Turkish
Version II	24000	12	Original + Augmented	Turkish special
Version III	36000	18	Original + Augmented	Hungarian special
Version IV	104000	52	Original + Augmented	English
Version V	156000	55	Original + Augmented	English, Hungarian, Turkish
Version VI	78000	78	Original	English, Hungarian, Turkish
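As a quick sanity check, the totals of the balanced versions can be reproduced from the number of classes and the number of samples per class. The sketch below is illustrative only: the per-class counts (2000, or 1000 for Version VI) are taken from, or implied by, the version descriptions in the VERSIONS section, and the names `SUMMARY` and `check_summary` are hypothetical.

```python
# Rows copied from the SUMMARY table: (version, characters, classes, samples per class).
# Version V is left out because its classes are not balanced.
SUMMARY = [
    ("Version I", 156000, 78, 2000),
    ("Version II", 24000, 12, 2000),
    ("Version III", 36000, 18, 2000),
    ("Version IV", 104000, 52, 2000),
    ("Version VI", 78000, 78, 1000),
]


def check_summary(rows):
    """Return the versions whose total differs from classes * samples per class."""
    return [
        name
        for name, total, classes, per_class in rows
        if total != classes * per_class
    ]


mismatches = check_summary(SUMMARY)
print(mismatches)  # an empty list means every row is consistent
```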

VERSIONS

Version I: This version represents the entire dataset. It includes all the 28x28-pixel binary characters from the three alphabets, forming a balanced dataset of 156000 characters in 78 classes.

Version II: This version consists of only the 12 Turkish special characters (6 upper-case and 6 lower-case). It contains 2000 samples of each character, forming a 24000-character dataset.

Version III: Similar to Version II, this version includes only the 18 Hungarian special characters (9 lower-case and 9 upper-case), forming a 36000-character dataset.

Version IV: The fourth version includes 2000 samples of each of the 52 English characters (26 upper-case and 26 lower-case). Keeping the English letters separate makes it possible to work on the Hungarian alphabet alone, simply by combining Version IV with the Hungarian special characters of Version III. A word of caution about the Turkish alphabet: combining Version II and Version IV does not yield the Turkish alphabet, because the letters “q”, “w” and “x” do not exist in Turkish. Users may want to exclude those 3 letters (3 lower-case and 3 upper-case) from Version IV in order to represent the Turkish alphabet accurately.
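The exclusion described above can be sketched as a simple label filter. This is a minimal sketch under two assumptions that the README does not confirm: that the Version IV file uses the label numbers of the class label table further down (q=17, w=23, x=24, Q=56, W=62, X=63), and that each sample is held in memory as a `(label, pixels)` tuple. The names `NON_TURKISH_LABELS` and `drop_non_turkish` are hypothetical.

```python
# Labels of q, w, x and Q, W, X, read from the class label table.
# Assumption: Version IV uses the same label numbers as that table.
NON_TURKISH_LABELS = {17, 23, 24, 56, 62, 63}


def drop_non_turkish(rows):
    """Keep only the rows whose label (first element) is not q/w/x/Q/W/X."""
    return [row for row in rows if row[0] not in NON_TURKISH_LABELS]


# Toy stand-in for Version IV: one placeholder row per English class.
english_labels = list(range(1, 27)) + list(range(40, 66))
rows = [(label, None) for label in english_labels]

# Labels of the 12 Turkish special characters in the same table.
turkish_special_labels = list(range(27, 33)) + list(range(66, 72))

kept = drop_non_turkish(rows)
turkish_class_count = len(kept) + len(turkish_special_labels)
print(turkish_class_count)  # 52 - 6 + 12 = 58
```

The result, 58 classes, matches the 29 letters of the Turkish alphabet in upper and lower case.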

Version V: This version is derived from Version I, which includes all the characters from the three alphabets. Characters whose upper-case and lower-case forms are written similarly, such as lower-case “o” and upper-case “O”, are put into the same class. The merged characters are listed in the MERGED Letters section below. This version has 55 classes and 156000 samples. It is the only version in which the classes are not balanced: unmerged classes have 2000 samples each, whereas merged classes have 4000.
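The arithmetic behind Version V can be checked in a few lines. The 23 merged pairs are counted from the MERGED Letters section; the inverse-frequency weight shown here is one common heuristic for handling unbalanced classes and is an illustrative assumption, not something the dataset authors prescribe.

```python
# Counts taken from the description of Version V; the 23 merged pairs are
# the ones listed in the MERGED Letters section.
MERGED_CLASSES = 23
TOTAL_CLASSES = 55
SAMPLES_MERGED = 4000
SAMPLES_UNMERGED = 2000

unmerged_classes = TOTAL_CLASSES - MERGED_CLASSES  # 32
total_samples = (
    MERGED_CLASSES * SAMPLES_MERGED + unmerged_classes * SAMPLES_UNMERGED
)


def balanced_weight(class_count, total, n_classes):
    """Inverse-frequency class weight: total / (n_classes * class_count)."""
    return total / (n_classes * class_count)


# Hypothetical use: weight each class so that rarer classes count more.
weight_merged = balanced_weight(SAMPLES_MERGED, total_samples, TOTAL_CLASSES)
weight_unmerged = balanced_weight(SAMPLES_UNMERGED, total_samples, TOTAL_CLASSES)

print(total_samples)  # 156000
```

Under this heuristic an unmerged class receives exactly twice the weight of a merged class, since it has half as many samples.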

Version VI: The sixth version contains the original handwritten characters without augmentation (1000 instances for each of the 78 classes), 78000 characters in total. With this version it is possible to experiment with different distortion techniques and test their impact on classification performance.

CLASS LABELS

a-1	t-20	A-40	U-60
b-2	u-21	B-41	V-61
c-3	v-22	C-42	W-62
d-4	w-23	D-43	X-63
e-5	x-24	E-44	Y-64
f-6	y-25	F-45	Z-65
g-7	z-26	G-46	Ç-66
h-8	ç-27	H-47	Ğ-67
i-9	ğ-28	I-48	İ-68
j-10	ı-29	J-49	Ş-69
k-11	ş-30	K-50	Ö-70
l-12	ö-31	L-51	Ü-71
m-13	ü-32	M-52	Á-72
n-14	á-33	N-53	É-73
o-15	é-34	O-54	Í-74
p-16	í-35	P-55	Ó-75
q-17	ó-36	Q-56	Ő-76
r-18	ő-37	R-57	Ú-77
s-19	ú-38	S-58	Ű-78
	ű-39	T-59
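The table above follows a regular order (English lower-case, Turkish special, Hungarian special, then the same sequence in upper case), so a lookup can be built from two strings. This is a minimal sketch; the names `LABEL_TO_CHAR`, `CHAR_TO_LABEL` and `decode` are hypothetical helpers and not part of the dataset.

```python
# Characters in label order, as listed in the class label table.
LOWER = "abcdefghijklmnopqrstuvwxyz" + "çğışöü" + "áéíóőúű"  # labels 1-39
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "ÇĞİŞÖÜ" + "ÁÉÍÓŐÚŰ"  # labels 40-78

# Labels start at 1, matching the table.
LABEL_TO_CHAR = dict(enumerate(LOWER + UPPER, start=1))
CHAR_TO_LABEL = {char: label for label, char in LABEL_TO_CHAR.items()}


def decode(labels):
    """Turn a sequence of class labels into the corresponding string."""
    return "".join(LABEL_TO_CHAR[label] for label in labels)


print(decode([59, 47, 44]))  # THE
```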

MERGED Class Labels

a	1	l	12	w-W	23	é	34	G	45
b	2	m-M	13	x-X	24	í-Í	35	H	46
c-C	3	n	14	y-Y	25	ó-Ó	36	L	47
d	4	o-O	15	z-Z	26	ő-Ő	37	N	48
e	5	p-P	16	ç	27	ú-Ú	38	Q	49
f	6	q	17	ğ	28	ű-Ű	39	R	50
g	7	r	18	ı-İ	29	A	40	T	51
h	8	s-S	19	ş-Ş	30	B	41	Ç	52
i-I	9	t	20	ö-Ö	31	D	42	Ğ	53
j-J	10	u-U	21	ü-Ü	32	E	43	Á	54
k-K	11	v-V	22	á	33	F	44	É	55

MERGED Letters

c-C	s-S
i-I	ş-Ş
í-Í	u-U
ı-İ	ú-Ú
j-J	ü-Ü
k-K	ű-Ű
m-M	v-V
o-O	w-W
ó-Ó	x-X
ö-Ö	y-Y
ő-Ő	z-Z
p-P
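The merged labels can be derived from the 78-class labels with a simple rule that reproduces the merged class label table: lower-case labels 1-39 stay unchanged, each merged upper-case letter takes the label of its lower-case partner, and the remaining upper-case letters are renumbered consecutively from 40. The sketch below is consistent with both tables, but how the authors actually generated Version V is not stated; `build_v1_to_v5` is a hypothetical name.

```python
# Characters in 78-class label order; LOWER[k] pairs with UPPER[k].
LOWER = "abcdefghijklmnopqrstuvwxyz" + "çğışöü" + "áéíóőúű"
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + "ÇĞİŞÖÜ" + "ÁÉÍÓŐÚŰ"

# Lower-case members of the 23 merged pairs listed above.
MERGED_LOWER = set("cijkmopsuvwxyz" + "ışöü" + "íóőúű")


def build_v1_to_v5():
    """Map each 78-class label to its label in the 55-class merged scheme."""
    mapping = {label: label for label in range(1, len(LOWER) + 1)}
    next_label = len(LOWER) + 1  # first label for unmerged upper-case letters
    for offset, lower in enumerate(LOWER):
        upper_label = len(LOWER) + 1 + offset  # 78-class label of UPPER[offset]
        if lower in MERGED_LOWER:
            mapping[upper_label] = offset + 1  # share the lower-case label
        else:
            mapping[upper_label] = next_label
            next_label += 1
    return mapping


V1_TO_V5 = build_v1_to_v5()
print(len(set(V1_TO_V5.values())))  # 55
```

For example, upper-case “O” (label 54) maps to 15, the label of lower-case “o”, while upper-case “D” (label 43) becomes 42.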

How to cite

Please cite the following paper when using or referencing the dataset:

G. Ediboğlu Bartos, Y. Hoscan, A. Kauer, and É. Hajnal, “A Multilingual Handwritten Character Dataset: T-H-E Dataset,” Acta Polytechnica Hungarica, 2020 (under acceptance).

