Split a letter string into pinyin tokens
pysplit [-intials] [-lm lm_mode_file] string1 string2 string3 ...
pysplit -h: help
pysplit -intials (-i): support splitting by single initials
# use make
make
Not tested on Windows, but the code is written in C++11, so it should work on Windows with few changes.
A bigram Pinyin language model was trained. The training set is built in the following steps:
- Convert the Chinese sentences into pinyin with pypinyin, for example:
  我爱北京天安门,我爱中国。 => wo ai bei jing tian an men,wo ai zhong guo。
- Restrict the vocabulary to pinyin tokens and initials; all other words are treated as UNK.
- To support initial patterns such as "zhangly" and "wdlei", the pinyin words are randomly changed into their initials with a specified probability (1% in the submitted model). For example: bei jing => bei jing 100%, b jing 1%, b j 1%, bei j 1%
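The last two steps can be sketched roughly as follows. This is only an illustration, not the script used to build the submitted model: it assumes the sentence has already been converted to pinyin tokens, uses a hypothetical toy vocabulary (`PINYIN_VOCAB`), treats the first letter of a syllable as its initial, and samples each token independently. The helper names `restrict_vocab` and `abbreviate` are made up for this example.

```python
import random

# Hypothetical toy vocabulary; a real one would list every valid pinyin syllable.
PINYIN_VOCAB = {"wo", "ai", "bei", "jing", "tian", "an", "men", "zhong", "guo"}
UNK = "UNK"


def restrict_vocab(tokens, vocab=PINYIN_VOCAB):
    """Map every token outside the vocabulary (punctuation, digits, ...) to UNK."""
    return [t if t in vocab else UNK for t in tokens]


def abbreviate(tokens, prob=0.01, rng=random):
    """Replace each pinyin token by its first letter with probability `prob`.

    Assumptions: tokens are sampled independently, and the "initial" is
    simply the first letter of the syllable. UNK tokens are left alone.
    """
    return [t[0] if t != UNK and rng.random() < prob else t for t in tokens]


# Example: one training sentence, already converted to pinyin tokens.
tokens = ["wo", "ai", "bei", "jing", "tian", "an", "men", ",", "wo", "ai", "zhong", "guo"]
training_line = abbreviate(restrict_vocab(tokens), prob=0.01)
```

With `prob=0.01` most lines come out unchanged, so the model still sees mostly full pinyin while learning a few initial-only patterns.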
The perplexity is calculated from the bigram model with Laplace smoothing (add-one smoothing). Because the maximum number of bigram patterns is less than 160,000, Laplace smoothing is good enough.
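As a rough illustration of the calculation, the sketch below estimates P(cur | prev) = (count(prev, cur) + 1) / (count(prev) + V), where V is the vocabulary size, and derives the perplexity of a token sequence from it. It is a minimal Python sketch, not the C++ code in this repository; the function names and the tiny training set are assumptions made for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count context tokens and bigrams over tokenised sentences."""
    context, bigram, vocab = Counter(), Counter(), set()
    for sent in sentences:
        vocab.update(sent)
        for prev, cur in zip(sent, sent[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram, len(vocab)


def laplace_prob(prev, cur, context, bigram, vocab_size):
    """P(cur | prev) with add-one smoothing; unseen bigrams get a small nonzero mass."""
    return (bigram[(prev, cur)] + 1) / (context[prev] + vocab_size)


def perplexity(tokens, context, bigram, vocab_size):
    """Perplexity of a token sequence under the smoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_sum = sum(math.log(laplace_prob(p, c, context, bigram, vocab_size))
                  for p, c in pairs)
    return math.exp(-log_sum / len(pairs))


# Toy training set (hypothetical), already split into pinyin tokens.
sentences = [["wo", "ai", "bei", "jing"], ["wo", "ai", "zhong", "guo"]]
context, bigram, vocab_size = train_bigram(sentences)
```

A candidate split with lower perplexity is the more plausible one under the model, for example comparing `perplexity(["wo", "ai"], context, bigram, vocab_size)` against a split containing unseen bigrams.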