Statistical Language Model for Chinese Spelling Correction

Features

  • N-gram statistical language model: kenlm
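To illustrate why an n-gram language model helps with spelling correction, here is a minimal sketch of a character-level bigram model with add-one smoothing. It is not kenlm (which uses modified Kneser-Ney smoothing and a much larger corpus); the `BigramModel` class and the four-sentence corpus are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


class BigramModel:
    """Toy character-level bigram model with add-one smoothing (not kenlm)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            chars = ["<s>"] + list(sentence) + ["</s>"]
            self.unigrams.update(chars)
            self.bigrams.update(zip(chars, chars[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence):
        """Sum of smoothed log P(cur | prev) over the padded sentence."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Hypothetical toy corpus; a real model is trained on far more text.
corpus = ["今天心情很好", "我的心情不错", "今天天气很好", "新的一天"]
model = BigramModel(corpus)

good = model.log_prob("今天心情很好")
bad = model.log_prob("今天新情很好")
print(good > bad)  # the correctly spelled sentence scores higher
```

A corrector built on this idea can compare candidate replacements for a suspicious character and keep the one the language model scores highest.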

Usage

Quick start

Quick prediction with pycorrector

example: examples/kenlm/demo.py

from pycorrector import Corrector
m = Corrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))

output:

[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]},
{'source': '你找到你最喜欢的工作,我也很高心。', 'target': '你找到你最喜欢的工作,我也很高兴。', 'errors': [('心', '兴', 15)]}]
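The last element of each error tuple is the character index of the correction within `source`. As a sanity check on a result, the differing positions can be recomputed from `source` and `target`. The helper below is a hypothetical sketch, not part of pycorrector, and assumes the correction only substitutes characters, so both strings have the same length.

```python
def diff_errors(source, target):
    """Return (wrong, right, index) for each position where the strings differ.

    Hypothetical helper: only handles same-length substitutions.
    """
    if len(source) != len(target):
        raise ValueError("this sketch only handles equal-length strings")
    return [
        (s, t, i)
        for i, (s, t) in enumerate(zip(source, target))
        if s != t
    ]


first = diff_errors("今天新情很好", "今天心情很好")
second = diff_errors(
    "你找到你最喜欢的工作,我也很高心。",
    "你找到你最喜欢的工作,我也很高兴。",
)
print(first)
print(second)
```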

Dataset

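Toy training data

200 entries from Chinese Wikipedia; see examples/data/wiki_zh_200.txt

Large training data

Any Chinese Wikipedia text will do; in essence, this step trains a language model on plain text.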
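KenLM reads whitespace-separated tokens, one sentence per line, so raw Chinese text has to be tokenized before training. The sketch below splits each line into single characters. This is an assumption for illustration: this README does not say whether the model is character-level or word-level, and the function names are hypothetical.

```python
def to_char_tokens(line):
    """Drop whitespace and join the remaining characters with single spaces."""
    return " ".join(ch for ch in line if not ch.isspace())


def prepare_corpus(lines):
    """Yield one space-separated sentence per non-empty input line."""
    for line in lines:
        tokens = to_char_tokens(line)
        if tokens:
            yield tokens


# Hypothetical sample; in practice, iterate over the lines of a corpus file.
sample = ["今天心情很好\n", "\n", "你找到你最喜欢的工作\n"]
prepared = list(prepare_corpus(sample))
for sentence in prepared:
    print(sentence)
```

For a word-level model, the character split would be replaced by a word segmenter, with the rest of the pipeline unchanged.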

Train model

Reference: https://blog.csdn.net/mingzai624/article/details/79560063?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522169925331716800222836904%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=169925331716800222836904&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~rank_v31_ecpm-1-79560063-null-null.nonecase&utm_term=kenlm&spm=1018.2226.3001.4450
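As a rough outline of training, KenLM's command-line tools are typically run in two steps: `lmplz` estimates an ARPA model from tokenized text, and `build_binary` converts it to a binary file that loads faster. The sketch below only assembles and prints the commands. The file names and the n-gram order are hypothetical, and the steps in the linked reference may differ.

```python
import shlex


def kenlm_commands(corpus_path, arpa_path, binary_path, order=5):
    """Return argument lists for lmplz (estimate) and build_binary (convert)."""
    estimate = [
        "lmplz", "-o", str(order),
        "--text", corpus_path,
        "--arpa", arpa_path,
    ]
    binarize = ["build_binary", arpa_path, binary_path]
    return [estimate, binarize]


# Hypothetical paths; corpus.txt should hold one tokenized sentence per line.
commands = kenlm_commands("corpus.txt", "zh.arpa", "zh.klm", order=3)
for cmd in commands:
    print(shlex.join(cmd))
```

With KenLM installed, each argument list can be passed to `subprocess.run(cmd, check=True)` to actually run the step.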