
ai-geeker/pysplit


pysplit

Introduction

Split a string of letters into pinyin tokens.

pysplit [-intials] [-lm lm_mode_file] string1 string2 string3 ...
pysplit -h: help
pysplit -intials(-i): also allow splitting into single initials
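
To illustrate what "splitting into pinyin tokens" means, here is a minimal Python sketch that enumerates every way to cut a letter string into known syllables. It is not the C++ implementation in this repository: the `SYLLABLES` set is a tiny hypothetical vocabulary and `split_pinyin` is an illustrative name, both assumptions made for this example.

```python
from functools import lru_cache

# Tiny illustrative vocabulary (assumption); a real tool needs the full
# pinyin syllable table, plus single initials if that mode is enabled.
SYLLABLES = {"bei", "jing", "tian", "an", "men", "wo", "ai", "xi", "xian"}


def split_pinyin(text, vocab=SYLLABLES):
    """Return every way to cut `text` into tokens taken from `vocab`."""
    max_len = max(len(s) for s in vocab)

    @lru_cache(maxsize=None)
    def rec(start):
        # Reaching the end of the string yields one empty segmentation.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in rec(end):
                    results.append([piece] + rest)
        return results

    return rec(0)


print(split_pinyin("beijing"))  # [['bei', 'jing']]
print(split_pinyin("xian"))     # [['xi', 'an'], ['xian']]
```

A string such as `xian` has more than one valid split, which is the kind of ambiguity a language model score can help resolve.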

Build

# use make
make

Not tested on Windows, but the code is written in C++11, so it should work there with few changes.

Language Model and Perplexity

A bigram Pinyin language model was trained. The training set is built in the following steps:

  1. Convert each Chinese sentence into pinyin with pypinyin, for example:

    我爱北京天安门,我爱中国。
    wo ai bei jing tian an men,wo ai zhong guo。
    
  2. Restrict the vocabulary to pinyin tokens and initials; every other word is treated as UNK.

  3. To support initial patterns such as zhangly and wdlei, pinyin words are randomly replaced by their initials with a specified probability (1% in the submitted model). For example:

    bei jing =>
    bei jing 100%
    b jing 1%
    b j  1%
    bei j 1%
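
Steps 2 and 3 can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the script used to build the submitted model: the `PINYIN` set is a small hypothetical subset, the helper names `normalize` and `abbreviate` are made up here, a token's initial is taken to be its first letter, and each token is sampled independently (which may not reproduce the exact pattern frequencies listed above).

```python
import random

# Illustrative subset (assumption); the real vocabulary covers all pinyin syllables.
PINYIN = {"wo", "ai", "bei", "jing", "tian", "an", "men", "zhong", "guo"}


def normalize(token, vocab=PINYIN):
    """Step 2: map anything outside the vocabulary to UNK."""
    return token if token in vocab else "UNK"


def abbreviate(tokens, prob=0.01, rng=random):
    """Step 3: replace each pinyin token by its first letter with probability `prob`.

    Assumption: tokens are sampled independently and the initial is the
    first letter of the syllable.
    """
    out = []
    for tok in tokens:
        if tok != "UNK" and rng.random() < prob:
            out.append(tok[0])
        else:
            out.append(tok)
    return out


tokens = [normalize(t) for t in "wo ai bei jing ,".split()]
print(tokens)                        # ['wo', 'ai', 'bei', 'jing', 'UNK']
print(abbreviate(tokens, prob=1.0))  # ['w', 'a', 'b', 'j', 'UNK']
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.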
    

The perplexity is calculated with a bigram model using Laplace smoothing (add-one smoothing). Because the maximum number of distinct bigram patterns is less than 160,000, Laplace smoothing is good enough.
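
As a minimal sketch of the calculation, the Python snippet below counts bigrams and computes add-one-smoothed perplexity. The function names, the `<s>`/`</s>` boundary markers, and the toy training data are assumptions for illustration; the model file format and the C++ code in this repository may differ.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[:-1])             # counts of each context token
        bigrams.update(zip(padded, padded[1:]))  # counts of adjacent pairs
    vocab = {t for s in sentences for t in s} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)


def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Perplexity under add-one smoothing: P(w|v) = (c(v,w) + 1) / (c(v) + V)."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    n = len(padded) - 1  # number of scored bigrams
    return math.exp(-log_prob / n)


uni, bi, v = train_bigram([["bei", "jing"]])
seen = perplexity(["bei", "jing"], uni, bi, v)
unseen = perplexity(["jing", "bei"], uni, bi, v)
print(seen < unseen)  # True
```

A lower perplexity marks the more plausible token sequence, so competing splits of the same string can be ranked by this score.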
