Data

analogy

analogy.txt is the analogical reasoning dataset for Chinese.

wordsim

240.txt and 297.txt are wordsim-240 and wordsim-296, respectively.

The word pair OPEC 石油 in 297.txt is removed in the test.
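For readers who load 297.txt themselves, the exclusion could be reproduced with something like the sketch below. It is not the authors' evaluation code. The file layout (two words and a numeric score per line, separated by whitespace) is an assumption, and the function name `load_wordsim` and the sample scores are made up for illustration.

```python
def load_wordsim(lines, excluded=(("OPEC", "石油"),)):
    """Parse similarity pairs from an iterable of text lines.

    Assumed (unverified) line layout: word1 word2 score, whitespace-separated.
    Pairs listed in `excluded` are dropped regardless of word order.
    """
    excluded_sets = {frozenset(pair) for pair in excluded}
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            score = float(fields[2])
        except ValueError:
            continue  # skip header-like lines without a numeric score
        if frozenset(fields[:2]) in excluded_sets:
            continue  # drop the excluded pair
        pairs.append((fields[0], fields[1], score))
    return pairs


# Made-up sample lines, only to show the filtering behaviour.
sample = ["OPEC 石油 8.1", "老虎 猫 7.0"]
print(load_wordsim(sample))
```

With a real file, the same function could be called as `load_wordsim(open("297.txt", encoding="utf-8"))`, provided the file is UTF-8 encoded and matches the assumed layout.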

These two datasets are conventional similarity tests for Chinese. The files are uploaded for convenience and were NOT created by the authors of the paper. However, it is hard to find the original sources. I welcome any suggestions for refining the references.

Wordsim-240 (original name: words-240) is from 汪祥, 贾焰, 周斌, 丁兆云, 梁政. 基于中文维基百科链接结构与分类体系的语义相关度计算. 小型微型计算机系统. 2011, 32(11):2237-2242. (pdf) The English citation is Wang Xiang, Jia Yan, Zhou Bin, et al. Computing Semantic Relatedness using Chinese Wikipedia Links and Taxonomy. Journal of Chinese Computer Systems, 2011, 32(11): 2237-2242. (pdf)

Wordsim-296 is from SemEval-2012 Task 4: Evaluating Chinese Word Similarity. (Abstract) (pdf)

Non-compositional wordlist

The wordlist is uploaded as non-composition.txt.