A parallel corpus of article-headline pairs obtained from Japanese Wikinews.

tm4roon/jawikinews-headline-dataset

Japanese-Wikinews Headline Dataset

The dataset contains article-headline pairs obtained from Japanese Wikinews. Articles and headlines are segmented into words using MeCab with the mecab-ipadic dictionary.

This repository provides three versions of the dataset, which differ in article length:

  • full-articles: complete articles longer than 10 tokens, paired with their headlines;
  • long version: articles cut to the first five sentences or 256 tokens, paired with their headlines;
  • short version: articles cut to the first three sentences or 128 tokens, paired with their headlines.
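The length rules above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the dataset: the function names `truncate_article` and `is_full_article` are hypothetical, the sentence boundary is assumed to be the token `。`, and "N sentences or M tokens" is assumed to mean that truncation stops at whichever limit is reached first.

```python
def truncate_article(tokens, max_sentences, max_tokens, sentence_end="。"):
    """Keep leading tokens until either limit is reached.

    Assumption: a sentence ends at the token `sentence_end`, and the
    cut happens at whichever limit (sentences or tokens) is hit first.
    """
    kept = []
    sentences = 0
    for token in tokens:
        if len(kept) >= max_tokens or sentences >= max_sentences:
            break
        kept.append(token)
        if token == sentence_end:
            sentences += 1
    return kept


def is_full_article(tokens, min_tokens=10):
    """Filter for the full-articles version: more than `min_tokens` tokens."""
    return len(tokens) > min_tokens


# Toy article, already segmented into space-separated tokens (four sentences).
article = (
    "今日 は 晴れ です 。 明日 は 雨 です 。 "
    "明後日 は 曇り です 。 その 次 は 雪 です 。"
).split()

short_version = truncate_article(article, max_sentences=3, max_tokens=128)
long_version = truncate_article(article, max_sentences=5, max_tokens=256)
```

With this toy input, `short_version` ends after the third `。`, while `long_version` keeps the whole article because it has fewer than five sentences.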

Data Statistics

Table 1: Number of documents

Table 2: N-gram overlaps in headline
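The exact overlap measure behind Table 2 is not defined here. As a rough sketch, one common choice is the fraction of headline n-grams that also occur in the article. The helper names `ngrams` and `headline_ngram_overlap` below are hypothetical, and this definition is an assumption rather than the one used for the table.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def headline_ngram_overlap(headline_tokens, article_tokens, n):
    """Fraction of headline n-grams that also appear in the article.

    Assumed definition: counts every headline n-gram occurrence and
    returns 0.0 when the headline is shorter than n tokens.
    """
    headline_ngrams = ngrams(headline_tokens, n)
    if not headline_ngrams:
        return 0.0
    article_ngrams = set(ngrams(article_tokens, n))
    matched = sum(1 for gram in headline_ngrams if gram in article_ngrams)
    return matched / len(headline_ngrams)


# Toy pair, already segmented into space-separated tokens.
headline = "東京 で 雪".split()
article_text = "東京 で 大雪 が 降っ た 。".split()

unigram_overlap = headline_ngram_overlap(headline, article_text, 1)
bigram_overlap = headline_ngram_overlap(headline, article_text, 2)
```

A high overlap suggests headlines that largely copy wording from the article, while a low overlap suggests more abstractive headlines.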
