Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. This project uses two benchmarks for evaluation. The first is CA-translated (Chen et al., 2015), in which most analogy questions are translated directly from an English benchmark. Although CA-translated has been widely used in Chinese word embedding papers, it contains only three types of semantic questions and covers only 134 Chinese words. In contrast, CA8 (Li et al., 2018) is designed specifically for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations.
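As a rough illustration of how an analogy question "a is to b as c is to ?" can be answered with word vectors, the sketch below uses the common vector-offset approach (pick the word closest to b - a + c by cosine similarity). The function names and the two-dimensional toy vectors are assumptions made up for this example; they are not taken from this project's evaluation code, and the 日本 / 东京 pair is added only to complete the toy question.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-d vectors invented for illustration; real embeddings have hundreds of dimensions.
toy_vectors = {
    "中国": [1.0, 0.0],
    "北京": [1.0, 1.0],
    "日本": [2.0, 0.0],
    "东京": [2.0, 1.0],
    "人民币": [0.0, -1.0],
}

# 中国 : 北京 :: 日本 : ?
answer = solve_analogy(toy_vectors, "中国", "北京", "日本")
print(answer)
```

A benchmark score would then be the fraction of questions for which the returned word matches the expected answer.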

CA8

CA8 incorporates comprehensive morphological and semantic relations in Chinese. Specifically, CA8-morphological (CA8-Mor) contains 10177 morphological questions, constructed from two types of relations: reduplication and semi-affixation. CA8-semantic (CA8-Sem) contains 7636 semantic questions, divided into 4 categories and 28 sub-categories. Detailed descriptions follow.

Morphological Questions: Reduplication

| Category | Sub-category | POS | Morphological Function | Example |
| --- | --- | --- | --- | --- |
| A | AA | Noun | Form kinship terms | 爸 (dad) → 爸爸 (dad) |
| A | AA | Noun | Yield every / each meaning | 天 (day) → 天天 (everyday) |
| A | AA | Measure | Yield every / each meaning | 个 (-) → 个个 (every/each) |
| A | AA | Verb | Signal doing something a little bit | 说 (say) → 说说 (say a little) |
| A | AA | Verb | Signal things happen briefly | 看 (look) → 看看 (have a brief look) |
| A | AA | Adjective | Intensify the adjective | 大 (big) → 大大 (very big) |
| A | AA | Adjective | Transform it to adverbs | 慢 (slow) → 慢慢 (slowly) |
| A | A yi A | Verb | Signal trying to do something | 吃 (eat) → 吃一吃 (try to eat) |
| A | A lai A qu | Verb | Signal doing something repeatedly | 飞 (fly) → 飞来飞去 (fly around) |
| AB | AABB | Noun | Yield many / much meaning | 山水 (mountain and river) → 山山水水 (many mountains and rivers) |
| AB | AABB | Verb | Indicate a continuous action | 说笑 (laugh and chat) → 说说笑笑 (laugh and chat for a while) |
| AB | AABB | Adjective | Intensify the adjective | 清楚 (clear) → 清清楚楚 (very clear) |
| AB | AABB | Adjective | Yield the meaning of not uniform | 大小 (size) → 大大小小 (all sizes) |
| AB | AABB | Adverb | Intensify the adverb | 彻底 (completely) → 彻彻底底 (totally and completely) |
| AB | A li A B | Adjective | Oralize the adjective and yield derogatory meaning | 慌张 (flurried) → 慌里慌张 (anxious) |
| AB | ABAB | Verb | Signal doing something a little bit | 注意 (pay attention) → 注意注意 (pay a little attention) |
| AB | ABAB | Adjective | Intensify the adjective | 雪白 (white) → 雪白雪白 (very white) |
| AB | ABAB | Adjective | Transform it to a verb | 高兴 (happy) → 高兴高兴 (make someone happy) |
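
The reduplication patterns listed above are purely positional, so they can be written down as string templates. The following is a minimal sketch under that assumption; the function name `reduplicate` and the pattern labels as dictionary keys are choices made for this example and are not part of the benchmark's tooling.

```python
def reduplicate(word, pattern):
    """Apply a reduplication pattern to a one-character (A) or two-character (AB) word."""
    if pattern in ("AA", "A yi A", "A lai A qu"):
        if len(word) != 1:
            raise ValueError(f"pattern {pattern!r} expects a one-character word")
        a = word
        forms = {
            "AA": a + a,
            "A yi A": a + "一" + a,
            "A lai A qu": a + "来" + a + "去",
        }
        return forms[pattern]
    if pattern in ("AABB", "A li A B", "ABAB"):
        if len(word) != 2:
            raise ValueError(f"pattern {pattern!r} expects a two-character word")
        a, b = word
        forms = {
            "AABB": a + a + b + b,
            "A li A B": a + "里" + a + b,
            "ABAB": word + word,
        }
        return forms[pattern]
    raise ValueError(f"unknown pattern: {pattern!r}")


# Examples taken from the table above
print(reduplicate("看", "AA"))
print(reduplicate("飞", "A lai A qu"))
print(reduplicate("清楚", "AABB"))
print(reduplicate("慌张", "A li A B"))
```

Note that the template only produces the surface form; whether a given word actually admits a pattern, and with which function, is a lexical fact that the benchmark records pair by pair.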

Affixation is a morphological process in which a bound morpheme (an affix) is attached to a root or stem to form a new language unit. Chinese is a typical isolating language with few affixes. Liu et al. (2001) point out that although affixes are rare in Chinese, some components behave like affixes yet can also be used as independent lexemes. These are called semi-affixes. We follow their work and adopt this concept.

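Morphological Questions: Semi-affixation

| Category | Semi-affix | Example |
| --- | --- | --- |
| Semi-prefix | 第 | 一 (one) → 第一 (first) |
| Semi-prefix | 初 | 一 (one) → 初一 (the first day of a lunar month) |
| Semi-prefix | 十 | 一 (one) → 十一 (eleven) |
| Semi-prefix | 周 | 一 (one) → 周一 (Monday) |
| Semi-prefix | 星期 | 一 (one) → 星期一 (Monday) |
| Semi-prefix | 老 | 虎 (tiger) → 老虎 (tiger) |
| Semi-prefix | 小 | 草 (grass) → 小草 (grass) |
| Semi-prefix | 大 | 海 (sea) → 大海 (large sea) |
| Semi-prefix | 半 | 导体 (conductor) → 半导体 (semiconductor) |
| Semi-prefix | 单 | 细胞 (cell) → 单细胞 (unicell) |
| Semi-prefix | 超 | 链接 (link) → 超链接 (hyperlink) |
| Semi-prefix | 次 | 大陆 (continent) → 次大陆 (subcontinent) |
| Semi-prefix | 非 | 常规 (conventional) → 非常规 (unconventional) |
| Semi-prefix | 每 | 次 (time) → 每次 (every time) |
| Semi-prefix | 全 | 明星 (star) → 全明星 (all star) |
| Semi-prefix | 伪 | 君子 (gentleman) → 伪君子 (hypocrite) |
| Semi-prefix | 亚 | 热带 (tropical zone) → 亚热带 (sub-tropical zone) |
| Semi-prefix | 洋 | 酒 (wine) → 洋酒 (foreign wine) |
| Semi-prefix | 总 | 比分 (score) → 总比分 (total score) |
| Semi-prefix | 反 | 物质 (matter) → 反物质 (antimatter) |
| Semi-prefix | 副 | 总统 (president) → 副总统 (vice president) |
| Semi-suffix | 们 | 我 (I) → 我们 (we) |
| Semi-suffix | 里 | 这 (this) → 这里 (here) |
| Semi-suffix | 些 | 这 (this) → 这些 (these) |
| Semi-suffix | 样 | 这 (this) → 这样 (such) |
| Semi-suffix | 个 | 这 (this) → 这个 (this one) |
| Semi-suffix | 边 | 这 (this) → 这边 (here) |
| Semi-suffix | 种 | 这 (this) → 这种 (this kind) |
| Semi-suffix | 次 | 这 (this) → 这次 (this time) |
| Semi-suffix | 儿 | 这 (this) → 这儿 (here) |
| Semi-suffix | 部 | 东 (east) → 东部 (east) |
| Semi-suffix | 中 | 心 (heart) → 心中 (in the heart) |
| Semi-suffix | 上 | 山 (mountain) → 山上 (on the mountain) |
| Semi-suffix | 面 | 前 (front) → 前面 (in the front) |
| Semi-suffix | 者 | 强 (strong) → 强者 (the strong one) |
| Semi-suffix | 家 | 科学 (science) → 科学家 (scientist) |
| Semi-suffix | 子 | 胖 (fat) → 胖子 (a fat man) |
| Semi-suffix | 头 | 木 (wood) → 木头 (wood) |
| Semi-suffix | 工 | 木 (wood) → 木工 (carpentry) |
| Semi-suffix | 匠 | 木 (wood) → 木匠 (carpenter) |
| Semi-suffix | 星 | 笑 (laugh) → 笑星 (comedian) |
| Semi-suffix | 手 | 老 (old) → 老手 (old hand) |
| Semi-suffix | 主义 | 乐观 (optimistic) → 乐观主义 (optimism) |
| Semi-suffix | 鬼 | 吝啬 (stingy) → 吝啬鬼 (miser) |
| Semi-suffix | 式 | 中 (Chinese) → 中式 (Chinese style) |
| Semi-suffix | 队 | 考古 (archaeology) → 考古队 (archaeological team) |
| Semi-suffix | 色 | 黄 (yellow) → 黄色 (the yellow color) |
| Semi-suffix | 学 | 地质 (geology) → 地质学 (discipline of geology) |
| Semi-suffix | 论 | 宿命 (fate) → 宿命论 (fatalism) |
| Semi-suffix | 站 | 汽车 (bus) → 汽车站 (bus station) |
| Semi-suffix | 仪 | 光谱 (spectrum) → 光谱仪 (spectrograph) |
| Semi-suffix | 界 | 学术 (academic) → 学术界 (academia) |
| Semi-suffix | 族 | 追星 (chasing a star) → 追星族 (fans) |
| Semi-suffix | 棍 | 赌 (gamble) → 赌棍 (gambler) |
| Semi-suffix | 灾 | 雨 (rain) → 雨灾 (rain disaster) |
| Semi-suffix | 气 | 冷 (cold) → 冷气 (cold air) |
| Semi-suffix | 性 | 酸 (acid) → 酸性 (acidic) |
| Semi-suffix | 厅 | 歌 (song) → 歌厅 (KTV) |
| Semi-suffix | 机 | 复印 (copy) → 复印机 (copier) |
| Semi-suffix | 法 | 说 (say) → 说法 (saying) |
| Semi-suffix | 剧 | 粤 (Yue) → 粤剧 (Cantonese Opera) |
| Semi-suffix | 长 | 船 (ship) → 船长 (captain of a ship) |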
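
Because a semi-affix is simply what the derived word adds to its base, the affix and its position can be recovered from any (base, derived) pair by string comparison. The sketch below shows one way to do this; `split_semi_affix` is a hypothetical helper written for this example, not a function shipped with the benchmark.

```python
def split_semi_affix(base, derived):
    """Return (category, affix) for a base word and its semi-affixed form.

    The semi-prefix case is checked first; a pair such as ("一", "一一") would be
    ambiguous, but no pair of that shape appears in the examples above.
    """
    if len(derived) <= len(base):
        raise ValueError("derived form must be longer than the base")
    if derived.endswith(base):
        return ("semi-prefix", derived[: len(derived) - len(base)])
    if derived.startswith(base):
        return ("semi-suffix", derived[len(base):])
    raise ValueError(f"{derived!r} does not contain {base!r} at either edge")


# Examples taken from the table above
print(split_semi_affix("一", "星期一"))
print(split_semi_affix("乐观", "乐观主义"))
```

The final `ValueError` doubles as a consistency check: a pair whose derived form does not contain its base at either edge is likely a data-entry mistake.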

Semantic Questions

| Category | Sub-category | Example |
| --- | --- | --- |
| Geography | country - capital | 中国 (China) - 北京 (Beijing) |
| Geography | country - currency | 中国 (China) - 人民币 (Chinese yuan) |
| Geography | province - abbreviation | 广东 (Guangdong) - 粤 (Yue) |
| Geography | province - capital | 广东 (Guangdong) - 广州 (Guangzhou) |
| Geography | province - drama | 广东 (Guangdong) - 粤剧 (Cantonese Opera) |
| Geography | province - channel | 广东 (Guangdong) - 广东卫视 (Guangdong Satellite TV) |
| Geography | province - university | 浙江 (Zhejiang) - 浙江大学 (Zhejiang University) |
| Geography | city - university | 南京 (Nanjing) - 南京大学 (Nanjing University) |
| Geography | university - abbreviation | 师范大学 (Normal University) - 师大 (Normal University) |
| History | dynasty - emperor | 汉 (Han) - 刘邦 (Liu Bang) |
| History | dynasty - capital | 秦 (Qin) - 咸阳 (Xian Yang) |
| History | title - emperor | 汉高祖 (Emperor Gaozu of Han) - 刘邦 (Liu Bang) |
| History | celebrity - country | 屈原 (Qu Yuan) - 楚国 (Country Chu) |
| Nature | number | 第一 (first) - 状元 (the first in an imperial examination) |
| Nature | time | 春节 (Spring Festival) - 正月 (the first month in a lunar year) |
| Nature | animal | 公鸡 (cock) - 母鸡 (hen) |
| Nature | plant | 杏树 (apricot tree) - 杏 (apricot) |
| Nature | ornament | 手指 (finger) - 戒指 (ring) |
| Nature | chemistry | 盐 (salt) - 氯化钠 (sodium chloride) |
| Nature | physics | 冰 (ice) - 水蒸气 (steam) |
| Nature | weather | 小满 (Grain Full) - 夏天 (summer) |
| Nature | reverse | 松 (loose) - 紧 (tight) |
| Nature | color | 海 (sea) - 蓝色 (blue) |
| People | company - founder | 阿里巴巴 (Alibaba) - 马云 (Ma Yun) |
| People | work - scientist | 地动仪 (seismograph) - 张衡 (Zhang Heng) |
| People | work - writer | 朝花夕拾 (Dawn Blossoms Plucked at Dusk) - 鲁迅 (Lu Xun) |
| People | family - member | 爷爷 (grandfather) - 孙子 (grandson) |
| People | student - degree | 小学 (elementary school) - 小学生 (schoolchild) |
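
Each row above gives one word pair per sub-category; an analogy question needs two pairs that share the same relation. A common way to turn a list of pairs into questions is to combine every ordered choice of two distinct pairs, as sketched below. Whether CA8 was built with exactly this procedure is not stated here, so treat it as an assumption. The function name is hypothetical, and the 浙江 - 杭州 and 江苏 - 南京 pairs are added only for illustration.

```python
from itertools import permutations


def build_questions(pairs):
    """Return (a, b, c, d) questions, read as "a is to b as c is to d",
    for every ordered choice of two distinct pairs."""
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


# province - capital pairs; only the first one appears in the table above
province_capital = [
    ("广东", "广州"),
    ("浙江", "杭州"),
    ("江苏", "南京"),
]

questions = build_questions(province_capital)
for q in questions:
    print(" ".join(q))
```

With n pairs in a sub-category this yields n × (n − 1) questions, which is why a modest number of pairs can produce thousands of questions.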

Reference

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. 2018. Analogical reasoning on Chinese morphological and semantic relations. In ACL.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. In IJCAI. pages 1236–1242.

Yuehua Liu, Wenyu Pan, and Wei Gu. 2001. Practical grammar of modern Chinese. The Commercial Press.