- Progress
  - Word segmentation
    - HMM Chinese word segmentation
    - HMM + add-delta smoothing
    - HMM + Good-Turing smoothing
    - Context-based HMM
  - POS
  - NER
  - Word segmentation
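- Project structure:
  - dataset: preprocesses the data, including reading files and removing common special characters and Chinese and English punctuation.
  - model: implements the algorithms the models need.
  - evaluate: compares two given texts and computes accuracy, recall, and F1_score.
  - run: runs the program through the interfaces exposed by the files above.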
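The comparison that evaluate performs can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `to_spans` and `prf` are hypothetical, and it assumes that the metric the README calls accuracy is the share of predicted words that exactly match a gold word (usually called precision). Each word is mapped to its character span so that a predicted word only counts as correct when both of its boundaries agree with the gold segmentation.

```python
def to_spans(words):
    """Map a list of words to the set of (start, end) character spans they cover."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def prf(gold_line, pred_line):
    """Compare two space-separated segmentations of the same sentence.

    Returns (precision, recall, f1). Assumption: the README's "accuracy"
    corresponds to the precision computed here.
    """
    gold = to_spans(gold_line.split())
    pred = to_spans(pred_line.split())
    correct = len(gold & pred)  # words whose two boundaries both match
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(prf("我 喜欢 自然 语言 处理", "我 喜欢 自然语言 处理"))
```

In the usage line, the prediction merges two gold words into one, so three of its four words are correct and three of the five gold words are recovered.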
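- Corpus (a medical corpus annotated by the class earlier in the course; fairly small)
  - Word segmentation corpus format: a txt file with words separated by spaces
  - POS corpus format
  - NER corpus format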
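A space-separated segmentation corpus is commonly turned into per-character labels before training an HMM. The sketch below assumes the widely used B/M/E/S tag set (begin, middle, end, single); the README does not say which tag set the repository uses, and the function names are hypothetical.

```python
def word_to_tags(word):
    """Label each character of one word with B/M/E, or S for a one-character word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def line_to_tagged(line):
    """Turn one space-separated corpus line into parallel lists of characters and tags."""
    chars, tags = [], []
    for word in line.split():
        chars.extend(word)
        tags.extend(word_to_tags(word))
    return chars, tags


def tags_to_words(chars, tags):
    """Rebuild the word list from characters and their tags (inverse of line_to_tagged)."""
    words, buffer = [], ""
    for char, tag in zip(chars, tags):
        buffer += char
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(buffer)
            buffer = ""
    if buffer:  # tolerate a sequence that ends mid-word
        words.append(buffer)
    return words


if __name__ == "__main__":
    chars, tags = line_to_tagged("发热 三 天")
    print(chars, tags)
    print(tags_to_words(chars, tags))
```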
- Word segmentation models and evaluation
  - Model: HMM
    - Parameter estimation: MLE
    - Decoding: Viterbi algorithm
  - Evaluation:
    - Accuracy: 0.7784937575513492
    - Recall: 0.7972449063763095
    - F1_score: 0.7877577634689054
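The two steps named above, MLE parameter estimation and Viterbi decoding, can be sketched for a character-tagging HMM as follows. This is an illustration under assumptions, not the repository's implementation: it assumes B/M/E/S states, unsmoothed relative-frequency estimates, and hypothetical function names. Zero probabilities are handled in log space as negative infinity.

```python
import math
from collections import Counter, defaultdict

STATES = "BMES"  # assumed tag set; the README does not specify one


def train_mle(tagged_sentences):
    """Estimate initial, transition and emission probabilities by relative frequency."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for chars, tags in tagged_sentences:
        if not tags:
            continue
        init[tags[0]] += 1
        for char, tag in zip(chars, tags):
            emit[tag][char] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: value / total for key, value in counter.items()}

    return (normalize(init),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(chars, init, trans, emit):
    """Return the most probable state sequence for a list of characters."""
    if not chars:
        return []
    scores = {s: log(init.get(s, 0)) + log(emit.get(s, {}).get(chars[0], 0))
              for s in STATES}
    backpointers = []
    for char in chars[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # best predecessor of state s at this position
            best = max(STATES, key=lambda p: scores[p] + log(trans.get(p, {}).get(s, 0)))
            new_scores[s] = (scores[best] + log(trans.get(best, {}).get(s, 0))
                             + log(emit.get(s, {}).get(char, 0)))
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    data = [(list("患者头痛"), list("BEBE")), (list("头痛三天"), list("BESS"))]
    init, trans, emit = train_mle(data)
    print(viterbi(list("患者头痛"), init, trans, emit))
```

Without smoothing, any character unseen in training has emission probability zero in every state, so every path through it scores negative infinity. That is the gap the smoothing variants are meant to address.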
  - Model: HMM + smoothing (each tuple is accuracy, recall, F1_score)
    - Add-one: (0.7720314111208059, 0.802854079023344, 0.7871411241407198)
    - Add-delta
      - delta = 0.2: (0.7766468985414329, 0.7993895900354697, 0.7878541522702328)
      - delta = 0.1: (0.7772249418184737, 0.798894663037202, 0.7879108363163034)
      - delta = 0.005: (0.778851574454377, 0.7977398333745772, 0.7881825590872045)
      - delta = 0.047: (0.7786173633440514, 0.7989771508702467, 0.7886658795749704)
      - delta = 0.027: (0.7794224117126538, 0.7992246143693805, 0.7891993157937607)
    - Good-Turing
      - k = 20: (0.7782969467493757, 0.796914955044131, 0.7874959243560482)
      - k = 60: (0.7784402191427651, 0.7969974428771757, 0.7876095373955575)
      - k = 100: (0.7784223672548546, 0.796914955044131, 0.7875601206488955)
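The two smoothing schemes compared above can be sketched on a single count table, for example the emission counts of one state. This is a hedged illustration with hypothetical function names. In particular, the README does not define k; the sketch assumes it is a count threshold below which Good-Turing adjusted counts replace the raw counts, which is one common convention.

```python
from collections import Counter


def add_delta(counts, vocab, delta):
    """Add-delta estimate: (count + delta) / (total + delta * |vocab|).

    delta = 1 gives add-one (Laplace) smoothing.
    """
    total = sum(counts.values()) + delta * len(vocab)
    return {item: (counts.get(item, 0) + delta) / total for item in vocab}


def good_turing_counts(counts, k):
    """Replace each count r < k by r* = (r + 1) * N_{r+1} / N_r.

    N_r is the number of items seen exactly r times. Counts of k or more,
    and counts where N_{r+1} is zero, are left unchanged. Assumption: this
    is what the k parameter in the results refers to.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, r in counts.items():
        if r < k and freq_of_freq[r + 1] > 0:
            adjusted[item] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[item] = r
    return adjusted


if __name__ == "__main__":
    counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 5}
    print(add_delta(counts, vocab=["a", "b", "c", "d", "e", "f"], delta=0.027))
    print(good_turing_counts(counts, k=3))
```

The Good-Turing sketch only adjusts observed counts. A full estimator would also reserve probability mass N_1 / N for unseen items and renormalize, which is omitted here for brevity.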
Bynax/NLP_HW
About
Implements the NLP tasks of CW (word segmentation), POS, and NER.