The quality of word vectors is often evaluated with analogy question tasks. This project uses two benchmarks for evaluation. The first is CA-translated (Chen et al., 2015), in which most analogy questions are translated directly from an English benchmark. Although CA-translated has been widely used in Chinese word embedding papers, it contains only three types of semantic questions and covers only 134 Chinese words. In contrast, CA8 (Li et al., 2018) is designed specifically for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations.
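
To make the evaluation concrete, here is a minimal sketch of how one analogy question "a is to b as c is to ?" could be scored. It assumes the widely used vector-offset scoring (often called 3CosAdd); the text above does not say which scoring method this project uses. The sketch also skips the usual unit-normalisation of vectors. The function names and the 2-d toy vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_analogy(vectors, a, b, c):
    """Return the word d that maximises cos(d, b - a + c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped (an assumption).
    """
    covered = [q for q in questions if all(w in vectors for w in q)]
    if not covered:
        return 0.0
    correct = sum(answer_analogy(vectors, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered)

# Toy 2-d vectors, invented for illustration only (not real embeddings).
toy_vectors = {
    "中国": [1.0, 0.0],
    "北京": [1.0, 1.0],
    "日本": [2.0, 0.0],
    "东京": [2.0, 1.0],
    "广东": [0.0, 3.0],
}
toy_questions = [("中国", "北京", "日本", "东京")]
print(accuracy(toy_vectors, toy_questions))
```

With real embeddings, `vectors` would map each vocabulary word to its trained vector, and the candidate search would run over the whole vocabulary.
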
CA8 is split into a morphological part and a semantic part. CA8-morphological (CA8-Mor) contains 10177 morphological questions, constructed from two types of relations: reduplication and semi-affixation. CA8-semantic (CA8-Sem) contains 7636 semantic questions, divided into 4 categories and 28 sub-categories. The details are as follows:

Morphological Questions: Reduplication

| Category | Sub-category | POS | Morphological Function | Example |
|---|---|---|---|---|
| A | AA | Noun | Form kinship terms | 爸 (dad) → 爸爸 (dad) |
| A | AA | Noun | Yield every / each meaning | 天 (day) → 天天 (everyday) |
| A | AA | Measure | Yield every / each meaning | 个 (-) → 个个 (every/each) |
| A | AA | Verb | Signal doing something a little bit | 说 (say) → 说说 (say a little) |
| A | AA | Verb | Signal things happen briefly | 看 (look) → 看看 (have a brief look) |
| A | AA | Adjective | Intensify the adjective | 大 (big) → 大大 (very big) |
| A | AA | Adjective | Transform it to adverbs | 慢 (slow) → 慢慢 (slowly) |
| A | A yi A | Verb | Signal trying to do something | 吃 (eat) → 吃一吃 (try to eat) |
| A | A lai A qu | Verb | Signal doing something repeatedly | 飞 (fly) → 飞来飞去 (fly around) |
| AB | AABB | Noun | Yield many / much meaning | 山水 (mountain and river) → 山山水水 (many mountains and rivers) |
| AB | AABB | Verb | Indicate a continuous action | 说笑 (laugh and chat) → 说说笑笑 (laugh and chat for a while) |
| AB | AABB | Adjective | Intensify the adjective | 清楚 (clear) → 清清楚楚 (very clear) |
| AB | AABB | Adjective | Yield the meaning of not uniform | 大小 (size) → 大大小小 (all sizes) |
| AB | AABB | Adverb | Intensify the adverb | 彻底 (completely) → 彻彻底底 (totally and completely) |
| AB | A li A B | Adjective | Oralize the adjective and yield derogatory meaning | 慌张 (flurried) → 慌里慌张 (anxious) |
| AB | ABAB | Verb | Signal doing something a little bit | 注意 (pay attention) → 注意注意 (pay a little attention) |
| AB | ABAB | Adjective | Intensify the adjective | 雪白 (white) → 雪白雪白 (very white) |
| AB | ABAB | Adjective | Transform it to a verb | 高兴 (happy) → 高兴高兴 (make someone happy) |

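
The reduplication sub-categories in the table above are purely positional, so they can be written as string templates. The following is a minimal sketch of that idea, not the procedure used to build CA8; the names `PATTERNS` and `reduplicate` are hypothetical. It only builds the surface form and does not check whether the result is an attested Chinese word.

```python
# Templates for the six sub-categories; {a} and {b} are the characters of the base word.
PATTERNS = {
    "AA": "{a}{a}",
    "A yi A": "{a}一{a}",
    "A lai A qu": "{a}来{a}去",
    "AABB": "{a}{a}{b}{b}",
    "A li A B": "{a}里{a}{b}",
    "ABAB": "{a}{b}{a}{b}",
}

def reduplicate(word, pattern):
    """Build the reduplicated form of a one-character (A) or two-character (AB) word."""
    template = PATTERNS[pattern]
    expected_length = 2 if "{b}" in template else 1
    if len(word) != expected_length:
        raise ValueError(f"pattern {pattern!r} needs a {expected_length}-character word")
    # For one-character words b is unused by the template, so its value does not matter.
    return template.format(a=word[0], b=word[-1])

for word, pattern in [("慢", "AA"), ("吃", "A yi A"), ("飞", "A lai A qu"),
                      ("清楚", "AABB"), ("慌张", "A li A B"), ("雪白", "ABAB")]:
    print(word, "→", reduplicate(word, pattern))
```
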
Affixation is a morphological process in which a bound morpheme (an affix) is attached to a root or stem to form a new language unit. Chinese is a typical isolating language with few affixes. Liu et al. (2001) point out that although affixes are rare in Chinese, some components behave like affixes yet can also be used as independent lexemes. These are called semi-affixes. We follow their work and adopt this concept.

Morphological Questions: Semi-affixation

| Category | Semi-affix | Example |
|---|---|---|
| Semi-prefix | 第 | 一 (one) → 第一 (first) |
| Semi-prefix | 初 | 一 (one) → 初一 (the first day of a lunar month) |
| Semi-prefix | 十 | 一 (one) → 十一 (eleven) |
| Semi-prefix | 周 | 一 (one) → 周一 (Monday) |
| Semi-prefix | 星期 | 一 (one) → 星期一 (Monday) |
| Semi-prefix | 老 | 虎 (tiger) → 老虎 (tiger) |
| Semi-prefix | 小 | 草 (grass) → 小草 (grass) |
| Semi-prefix | 大 | 海 (sea) → 大海 (large sea) |
| Semi-prefix | 半 | 导体 (conductor) → 半导体 (semiconductor) |
| Semi-prefix | 单 | 细胞 (cell) → 单细胞 (unicell) |
| Semi-prefix | 超 | 链接 (link) → 超链接 (hyperlink) |
| Semi-prefix | 次 | 大陆 (continent) → 次大陆 (subcontinent) |
| Semi-prefix | 非 | 常规 (conventional) → 非常规 (unconventional) |
| Semi-prefix | 每 | 次 (time) → 每次 (every time) |
| Semi-prefix | 全 | 明星 (star) → 全明星 (all star) |
| Semi-prefix | 伪 | 君子 (gentlemen) → 伪君子 (hypocrites) |
| Semi-prefix | 亚 | 热带 (tropical zone) → 亚热带 (sub-tropical zone) |
| Semi-prefix | 洋 | 酒 (wine) → 洋酒 (foreign wine) |
| Semi-prefix | 总 | 比分 (score) → 总比分 (total score) |
| Semi-prefix | 反 | 物质 (matter) → 反物质 (antimatter) |
| Semi-prefix | 副 | 总统 (president) → 副总统 (vice president) |
| Semi-suffix | 们 | 我 (I) → 我们 (we) |
| Semi-suffix | 里 | 这 (this) → 这里 (here) |
| Semi-suffix | 些 | 这 (this) → 这些 (these) |
| Semi-suffix | 样 | 这 (this) → 这样 (such) |
| Semi-suffix | 个 | 这 (this) → 这个 (this one) |
| Semi-suffix | 边 | 这 (this) → 这边 (here) |
| Semi-suffix | 种 | 这 (this) → 这种 (this kind) |
| Semi-suffix | 次 | 这 (this) → 这次 (this time) |
| Semi-suffix | 儿 | 这 (this) → 这儿 (here) |
| Semi-suffix | 部 | 东 (east) → 东部 (east) |
| Semi-suffix | 中 | 心 (heart) → 心中 (in the heart) |
| Semi-suffix | 上 | 山 (mountain) → 山上 (on the mountain) |
| Semi-suffix | 面 | 前 (front) → 前面 (in the front) |
| Semi-suffix | 者 | 强 (strong) → 强者 (the strong one) |
| Semi-suffix | 家 | 科学 (science) → 科学家 (scientist) |
| Semi-suffix | 子 | 胖 (fat) → 胖子 (a fat man) |
| Semi-suffix | 头 | 木 (wood) → 木头 (wood) |
| Semi-suffix | 工 | 木 (wood) → 木工 (carpentry) |
| Semi-suffix | 匠 | 木 (wood) → 木匠 (carpenter) |
| Semi-suffix | 星 | 笑 (laugh) → 笑星 (comedian) |
| Semi-suffix | 手 | 老 (old) → 老手 (old hand) |
| Semi-suffix | 主义 | 乐观 (optimistic) → 乐观主义 (optimism) |
| Semi-suffix | 鬼 | 吝啬 (stingy) → 吝啬鬼 (miser) |
| Semi-suffix | 式 | 中 (Chinese) → 中式 (Chinese style) |
| Semi-suffix | 队 | 考古 (archaeology) → 考古队 (archaeological team) |
| Semi-suffix | 色 | 黄 (yellow) → 黄色 (the yellow color) |
| Semi-suffix | 学 | 地质 (geology) → 地质学 (discipline of geology) |
| Semi-suffix | 论 | 宿命 (fate) → 宿命论 (fatalism) |
| Semi-suffix | 站 | 汽车 (bus) → 汽车站 (bus station) |
| Semi-suffix | 仪 | 光谱 (spectrum) → 光谱仪 (spectrograph) |
| Semi-suffix | 界 | 学术 (academic) → 学术界 (academia) |
| Semi-suffix | 族 | 追星 (chasing a star) → 追星族 (fans) |
| Semi-suffix | 棍 | 赌 (gamble) → 赌棍 (gambler) |
| Semi-suffix | 灾 | 雨 (rain) → 雨灾 (rain disaster) |
| Semi-suffix | 气 | 冷 (cold) → 冷气 (cold air) |
| Semi-suffix | 性 | 酸 (acid) → 酸性 (acidic) |
| Semi-suffix | 厅 | 歌 (song) → 歌厅 (KTV) |
| Semi-suffix | 机 | 复印 (copy) → 复印机 (copier) |
| Semi-suffix | 法 | 说 (say) → 说法 (saying) |
| Semi-suffix | 剧 | 粤 (Yue) → 粤剧 (Cantonese Opera) |
| Semi-suffix | 长 | 船 (ship) → 船长 (captain of a ship) |

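
Each row of the semi-affixation table pairs a base word with the form obtained by attaching a semi-affix before it (semi-prefix) or after it (semi-suffix). A minimal sketch of that derivation follows; the helper name `attach` and the `kind` argument are hypothetical, and the example triples are copied from the table.

```python
def attach(base, semi_affix, kind):
    """Attach a semi-affix to a base word.

    kind is "prefix" (semi-affix comes first) or "suffix" (semi-affix comes last).
    """
    if kind == "prefix":
        return semi_affix + base
    if kind == "suffix":
        return base + semi_affix
    raise ValueError(f"unknown kind: {kind!r}")

# (base, semi-affix, kind) triples taken from the table above.
examples = [
    ("一", "第", "prefix"),
    ("总统", "副", "prefix"),
    ("科学", "家", "suffix"),
    ("乐观", "主义", "suffix"),
]

# Each (base, derived) pair is one side of a morphological analogy question.
pairs = [(base, attach(base, affix, kind)) for base, affix, kind in examples]
for base, derived in pairs:
    print(base, "→", derived)
```
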

Semantic Questions

| Category | Sub-category | Example |
|---|---|---|
| Geography | country - capital | 中国 (China) - 北京 (Beijing) |
| Geography | country - currency | 中国 (China) - 人民币 (Chinese yuan) |
| Geography | province - abbreviation | 广东 (Guangdong) - 粤 (Yue) |
| Geography | province - capital | 广东 (Guangdong) - 广州 (Guangzhou) |
| Geography | province - drama | 广东 (Guangdong) - 粤剧 (Cantonese Opera) |
| Geography | province - channel | 广东 (Guangdong) - 广东卫视 (Guangdong Satellite TV) |
| Geography | province - university | 浙江 (Zhejiang) - 浙江大学 (Zhejiang University) |
| Geography | city - university | 南京 (Nanjing) - 南京大学 (Nanjing University) |
| Geography | university - abbreviation | 师范大学 (Normal University) - 师大 (Normal University) |
| History | dynasty - emperor | 汉 (Han) - 刘邦 (Liu Bang) |
| History | dynasty - capital | 秦 (Qin) - 咸阳 (Xianyang) |
| History | title - emperor | 汉高祖 (Emperor Gaozu of Han) - 刘邦 (Liu Bang) |
| History | celebrity - country | 屈原 (Qu Yuan) - 楚国 (the State of Chu) |
| Nature | number | 第一 (first) - 状元 (the first in an imperial examination) |
| Nature | time | 春节 (Spring Festival) - 正月 (the first month in a lunar year) |
| Nature | animal | 公鸡 (cock) - 母鸡 (hen) |
| Nature | plant | 杏树 (apricot tree) - 杏 (apricot) |
| Nature | ornament | 手指 (finger) - 戒指 (ring) |
| Nature | chemistry | 盐 (salt) - 氯化钠 (sodium chloride) |
| Nature | physics | 冰 (ice) - 水蒸气 (steam) |
| Nature | weather | 小满 (Grain Full) - 夏天 (summer) |
| Nature | reverse | 松 (loose) - 紧 (tight) |
| Nature | color | 海 (sea) - 蓝色 (blue) |
| People | company - founder | 阿里巴巴 (Alibaba) - 马云 (Ma Yun) |
| People | work - scientist | 地动仪 (seismograph) - 张衡 (Zhang Heng) |
| People | work - writer | 朝花夕拾 (Dawn Blossoms Plucked at Dusk) - 鲁迅 (Lu Xun) |
| People | family - member | 爷爷 (grandfather) - 孙子 (grandson) |
| People | student - degree | 小学 (elementary school) - 小学生 (schoolchild) |

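
Each example in the semantic table is a single word pair, whereas an analogy question needs two pairs that share the same relation. This document does not describe how pairs are combined, so the sketch below assumes one common construction: every ordered combination of two distinct pairs from the same sub-category yields one question. The function name `build_questions` is hypothetical. Only the 中国 - 北京 pair comes from the table; the other two pairs are added for illustration.

```python
from itertools import permutations

def build_questions(pairs):
    """Turn word pairs sharing one relation into (a, b, c, d) analogy questions.

    Each ordered choice of two distinct pairs (a, b) and (c, d) gives the
    question "a is to b as c is to ?", with d as the expected answer.
    """
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]

# country - capital pairs; only the first one appears in the table above.
capital_pairs = [("中国", "北京"), ("日本", "东京"), ("法国", "巴黎")]

questions = build_questions(capital_pairs)
for question in questions:
    print(" ".join(question))
```

Under this assumption, n pairs in a sub-category produce n × (n − 1) questions, so three pairs give six questions.
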
Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, and Xiaoyong Du. 2018. Analogical reasoning on Chinese morphological and semantic relations. In ACL.
Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huanbo Luan. 2015. Joint learning of character and word embeddings. In IJCAI. pages 1236–1242.
Yuehua Liu, Wenyu Pan, and Wei Gu. 2001. Practical grammar of modern Chinese. The Commercial Press.