pysplit

Introduction

Split a letter string into pinyin tokens.

pysplit [-intials] [-lm lm_mode_file] string1 string2 string3 ...
pysplit -h: help
pysplit -intials(-i): support splitting by a single initial
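
To illustrate what splitting with and without single initials means, here is a minimal Python sketch that enumerates candidate splits. It is not the C++ code in `src`; the function name `split_pinyin`, the `allow_initials` parameter, and the toy `SYLLABLES` set are all assumptions made for this example.

```python
def split_pinyin(text, syllables, allow_initials=False):
    """Return every way to cut `text` into tokens.

    A token is either a member of `syllables` or, when `allow_initials`
    is true, a single letter standing in for a syllable's initial.
    """
    if not text:
        return [[]]  # one way to split the empty string: no tokens
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        is_syllable = head in syllables
        is_initial = allow_initials and end == 1 and head.isalpha()
        if is_syllable or is_initial:
            for rest in split_pinyin(text[end:], syllables, allow_initials):
                results.append([head] + rest)
    return results


# Toy subset of pinyin syllables (hypothetical, for illustration only).
SYLLABLES = {"zhang", "bei", "jing", "lei"}

print(split_pinyin("beijing", SYLLABLES))
print(split_pinyin("zhangly", SYLLABLES, allow_initials=True))
```

Allowing initials produces many candidates for the same input, which is why some scoring step, such as a language model, is needed to choose among them.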

Build

# use make
make

Not tested on Windows, but the code is written in C++11, so it should work on Windows with few changes.

Language Model and Perplexity

A bigram Pinyin language model was trained. The training set is built in the following steps:

Convert the Chinese sentence into pinyin with pypinyin, for example:

我爱北京天安门，我爱中国。
wo ai bei jing tian an men，wo ai zhong guo。

Restrict the vocabulary to pinyin tokens and initials; all other words are treated as UNK.
To support initial patterns such as zhangly and wdlei, the pinyin words are randomly replaced by their initials with a specified probability (1% in the submitted model). For example:
```
bei jing =>
bei jing 100%
b jing 1%
b j  1%
bei j 1%
```
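
The vocabulary restriction and random initial replacement described above can be sketched in Python as follows. This is an illustration only, not the script used to build the submitted model: the names `prepare_tokens` and `to_initial` are hypothetical, and taking the first letter as the initial is an assumption (the actual handling of `zh`, `ch`, `sh`, or syllables without an initial may differ).

```python
import random

UNK = "UNK"


def to_initial(token):
    # Assumption: the initial is simply the first letter of the syllable.
    return token[0]


def prepare_tokens(tokens, vocab, p=0.01, rng=None):
    """Map out-of-vocabulary tokens to UNK and, with probability `p`,
    replace an in-vocabulary pinyin token by its initial."""
    if rng is None:
        rng = random.Random()
    out = []
    for tok in tokens:
        if tok not in vocab:
            out.append(UNK)
        elif rng.random() < p:
            out.append(to_initial(tok))
        else:
            out.append(tok)
    return out


vocab = {"wo", "ai", "bei", "jing"}
print(prepare_tokens(["wo", "ai", "bei", "jing", "，"], vocab, p=0.01))
```

Each token is sampled independently, so a two-syllable word such as `bei jing` can come out as `bei jing`, `b jing`, `bei j`, or `b j`.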

The perplexity is calculated with a bigram model using Laplace smoothing (add-one smoothing). Because the maximum number of bigram patterns is less than 160,000, Laplace smoothing is good enough.
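
As a rough illustration of bigram perplexity with add-one smoothing, the Python sketch below estimates P(cur | prev) as (count(prev, cur) + 1) / (count(prev) + V), where V is the vocabulary size, and takes the exponential of the average negative log probability. The function names, the `<s>` and `</s>` boundary markers, and the toy training data are assumptions for this example; the model file format used by pysplit is not shown here.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (assumed)


def train_bigram(sentences):
    """Count context tokens and bigrams over tokenized sentences."""
    unigrams = Counter()  # how often each token appears as a context
    bigrams = Counter()
    vocab = {EOS}         # BOS is never predicted, so it is left out
    for sent in sentences:
        tokens = [BOS] + list(sent) + [EOS]
        vocab.update(sent)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return unigrams, bigrams, len(vocab)


def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Perplexity of one token sequence under add-one smoothing."""
    seq = [BOS] + list(tokens) + [EOS]
    log_prob = 0.0
    for prev, cur in zip(seq, seq[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(seq) - 1))


uni, bi, v = train_bigram([["wo", "ai"]])
print(perplexity(["wo", "ai"], uni, bi, v))
print(perplexity(["ai", "wo"], uni, bi, v))
```

In this toy setup the sequence seen in training receives a lower perplexity than the reversed one, and unseen bigrams still get a non-zero probability thanks to the added count.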
