README.md
custom_word_freq.txt
demo.py
detect_demo.py
disable_char_error.py
en_correct_demo.py
load_custom_language_model.py
my_custom_confusion.txt
my_custom_proper.txt
proper_correct_demo.py
traditional_simplified_chinese_demo.py
use_custom_confusion.py
use_custom_proper.py
use_custom_word_freq.py

README.md

Statistical Language Model for Chinese Spelling Correction

Features

N-gram statistical language model: kenlm
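To illustrate why an n-gram language model helps with spelling correction, here is a minimal toy sketch. It is not kenlm and not pycorrector's implementation: it uses a character bigram model with add-one smoothing, whereas kenlm estimates models with modified Kneser-Ney smoothing. The function names `train_bigram` and `log_prob` and the three-sentence corpus are made up for this example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count character unigrams and bigrams, adding sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed log10 probability of a sentence (toy estimate)."""
    vocab_size = len(unigrams) + 1  # one extra slot for unseen characters
    chars = ["<s>"] + list(sentence) + ["</s>"]
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        total += math.log10((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Hypothetical toy corpus; a real model is trained on far more text.
corpus = ["今天心情很好", "今天天气很好", "我的心情不错"]
uni, bi = train_bigram(corpus)

good = log_prob("今天心情很好", uni, bi)
bad = log_prob("今天新情很好", uni, bi)  # contains the typo 新 for 心
print(good > bad)
```

The sentence with the typo contains character pairs never seen in the corpus, so it scores lower. A corrector can use this kind of score difference to rank candidate replacements.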

Usage

Quick start

Quick prediction with pycorrector

example: examples/kenlm/demo.py

from pycorrector import Corrector
m = Corrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作，我也很高心。']))

output:

[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]},
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}]
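The output above can be consumed programmatically. The sketch below assumes, based only on the output shown, that each entry in `errors` is a `(wrong, right, position)` tuple, where `position` is a zero-based character index into `source`. The helper `apply_errors` is hypothetical and not part of pycorrector.

```python
def apply_errors(source, errors):
    """Rebuild the corrected text from (wrong, right, position) tuples.

    Assumes positions are zero-based character offsets into `source`.
    """
    chars = list(source)
    # Apply from right to left so earlier offsets stay valid
    # even if a replacement changes the text length.
    for wrong, right, pos in sorted(errors, key=lambda e: e[2], reverse=True):
        if source[pos:pos + len(wrong)] != wrong:
            raise ValueError(f"expected {wrong!r} at position {pos}")
        chars[pos:pos + len(wrong)] = list(right)
    return "".join(chars)


# Results copied from the output shown above.
results = [
    {"source": "今天新情很好", "target": "今天心情很好",
     "errors": [("新", "心", 2)]},
    {"source": "你找到你最喜欢的工作，我也很高心。",
     "target": "你找到你最喜欢的工作，我也很高兴。",
     "errors": [("心", "兴", 15)]},
]

for r in results:
    print(apply_errors(r["source"], r["errors"]) == r["target"])
```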

Dataset

toy train data

200 entries from Chinese Wikipedia; see examples/data/wiki_zh_200.txt

big train data

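Any Chinese text, such as Chinese Wikipedia, will do; in essence, this step trains a text language model.

16 GB of unsupervised and parallel Chinese-English corpora: Linly-AI/Chinese-pretraining-dataset
524 MB Chinese Wikipedia corpus: wikipedia-cn-20230720-filtered
People's Daily 2014 annotated corpus, Baidu Netdisk link: https://pan.baidu.com/s/1971a5XLQsIpL0zL0zxuK2A password: uc11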
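kenlm's training tools read whitespace-separated tokens, one sentence per line, so raw Chinese text usually needs tokenizing first. The sketch below assumes a character-level model, where every character is its own token. Whether this matches the tokenization of the model shipped with pycorrector is not stated here, and the helper names and the output path in the usage note are hypothetical.

```python
def to_char_tokens(line):
    """Turn one line of raw text into space-separated characters."""
    # Drop all existing whitespace, then put one space between characters.
    compact = "".join(line.split())
    return " ".join(compact)


def prepare_corpus(src_path, dst_path):
    """Write a character-tokenized copy of a corpus, skipping empty lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            tokens = to_char_tokens(line)
            if tokens:
                dst.write(tokens + "\n")


print(to_char_tokens("今天 心情很好\n"))
```

For the toy data, a call might look like `prepare_corpus("examples/data/wiki_zh_200.txt", "wiki_zh_200.chars.txt")`.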

Train model

Reference: https://blog.csdn.net/mingzai624/article/details/79560063?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522169925331716800222836904%2522%252C%2522scm%2522%253A%252220140713.130102334.pc%255Fblog.%2522%257D&request_id=169925331716800222836904&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~blog~first_rank_ecpm_v1~rank_v31_ecpm-1-79560063-null-null.nonecase&utm_term=kenlm&spm=1018.2226.3001.4450
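As a rough sketch of the training step, kenlm provides two command-line tools: `lmplz` estimates an ARPA model from tokenized text read on standard input, and `build_binary` converts the ARPA file to a binary model. The wrapper below assumes both tools are already built and on `PATH`. The function names, file paths and the order of 5 are illustrative choices, not settings taken from this README.

```python
import subprocess


def lmplz_command(order=5):
    """Command line for estimating an n-gram model of the given order."""
    return ["lmplz", "-o", str(order)]


def build_binary_command(arpa_path, binary_path):
    """Command line for converting an ARPA file to kenlm's binary format."""
    return ["build_binary", arpa_path, binary_path]


def train(corpus_path, arpa_path, binary_path, order=5):
    """Run lmplz and build_binary; requires the kenlm tools on PATH."""
    # lmplz reads the tokenized corpus on stdin and writes ARPA to stdout.
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(lmplz_command(order), stdin=src, stdout=dst, check=True)
    subprocess.run(build_binary_command(arpa_path, binary_path), check=True)


print(" ".join(lmplz_command(5)))
```

A call such as `train("corpus.chars.txt", "model.arpa", "model.klm")` would run both steps. On very small corpora `lmplz` may fail to estimate discounts; its `--discount_fallback` flag is intended for that case.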
