Bynax/NLP_HW

NLP HOMEWORK

  • Progress

    • Word segmentation
      • HMM Chinese word segmentation
      • HMM + add-delta smoothing
      • HMM + Good-Turing smoothing
      • Context-based HMM
    • POS
    • NER
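  • Project structure:

    • dataset: preprocesses the data, including reading files and removing common special characters and Chinese and English punctuation.

    • model: implements the algorithms the models need

    • evaluate: compares two given texts and computes accuracy, recall, and F1_score

    • run: runs the program through the interfaces of the files above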
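The comparison done by the evaluate module can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `to_spans` and `evaluate` are hypothetical, and it assumes both texts contain the same characters and differ only in where the spaces fall. The first returned value is computed as precision (correct words divided by predicted words). The F1_score values reported in this README equal the harmonic mean of the listed accuracy and recall, which suggests "accuracy" plays that role here.

```python
def to_spans(words):
    """Map a list of words to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def evaluate(gold_text, pred_text):
    """Compare two space-separated segmentations of the same character sequence.

    Returns (precision, recall, f1). A predicted word counts as correct only
    if both of its boundaries match a word in the gold segmentation.
    """
    gold = to_spans(gold_text.split())
    pred = to_spans(pred_text.split())
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example sentence, not taken from the course corpus.
    print(evaluate("我 喜欢 自然 语言 处理", "我 喜 欢 自然 语言处理"))
```

Comparing character offsets rather than word strings avoids counting a word as correct when the same string appears at a different position in the sentence.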

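  • Corpus (a medical corpus we annotated ourselves earlier in the course; it is small)

    • Word segmentation corpus format: txt file, with words separated by spaces
    • POS corpus format
    • NER corpus format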
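A space-separated segmentation corpus is usually turned into per-character labels before training an HMM. The sketch below assumes the common B/M/E/S tag set (begin, middle, end, single-character word); the README does not say which tag set the repository uses. The punctuation list and the names `PUNCT`, `word_to_tags`, and `line_to_tagged` are illustrative, not taken from the repository.

```python
import re

# Assumed punctuation set for illustration; the repository's actual list is not shown.
PUNCT = re.compile(r"[，。！？、；：（）《》,.!?;:()]")


def word_to_tags(word):
    """Label each character of one word with B, M, E, or S."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def line_to_tagged(line):
    """Convert one space-separated corpus line into (chars, tags) lists."""
    chars, tags = [], []
    for word in line.split():
        word = PUNCT.sub("", word)  # drop punctuation, as the dataset step describes
        if not word:
            continue
        chars.extend(word)
        tags.extend(word_to_tags(word))
    return chars, tags


if __name__ == "__main__":
    # Made-up example line, not taken from the course corpus.
    print(line_to_tagged("高血压 患者 头痛 。"))
```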
  • Word segmentation models and evaluation

    • Model: HMM

      • Parameter estimation: MLE
      • Decoding: Viterbi algorithm
    • Evaluation:

      • Accuracy: 0.7784937575513492
      • Recall: 0.7972449063763095
      • F1_score: 0.7877577634689054
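The baseline above, MLE parameter estimation followed by Viterbi decoding, can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: it assumes B/M/E/S character tags, takes training data as `(chars, tags)` pairs, and works in log space, treating zero probabilities as negative infinity. All function names are hypothetical.

```python
import math
from collections import Counter

STATES = "BMES"  # assumed tag set


def log(p):
    """Log probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def train_mle(tagged_sentences):
    """Estimate initial, transition, and emission probabilities by relative frequency."""
    init, trans, trans_from, emit, state_count = (Counter() for _ in range(5))
    n = 0
    for chars, tags in tagged_sentences:
        if not chars:
            continue
        n += 1
        init[tags[0]] += 1
        for i, (c, t) in enumerate(zip(chars, tags)):
            emit[(t, c)] += 1
            state_count[t] += 1
            if i > 0:
                trans[(tags[i - 1], t)] += 1
                trans_from[tags[i - 1]] += 1
    pi = {s: init[s] / n if n else 0.0 for s in STATES}
    A = {(a, b): trans[(a, b)] / trans_from[a] if trans_from[a] else 0.0
         for a in STATES for b in STATES}
    B = {key: count / state_count[key[0]] for key, count in emit.items()}
    return pi, A, B


def viterbi(chars, pi, A, B):
    """Return the most probable tag sequence for chars."""
    if not chars:
        return []
    score = [{s: log(pi[s]) + log(B.get((s, chars[0]), 0.0)) for s in STATES}]
    back = []
    for c in chars[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log(A[(p, s)]))
            cur[s] = prev[best] + log(A[(best, s)]) + log(B.get((s, c), 0.0))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    tags = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):  # follow back-pointers from the last position
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def segment(text, pi, A, B):
    """Cut text after every character tagged E or S."""
    words, buf = [], ""
    for c, t in zip(text, viterbi(list(text), pi, A, B)):
        buf += c
        if t in "ES":
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

With plain MLE, any character never seen in training has emission probability zero under every state, so all paths through it score negative infinity and the decoder falls back to an arbitrary tag. That weakness is one motivation for smoothing.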
    • Model: HMM + smoothing (scores are listed as Accuracy, Recall, F1_score)

      • Add-one (0.7720314111208059, 0.802854079023344, 0.7871411241407198)
      • Add-delta
        • 0.2 (0.7766468985414329, 0.7993895900354697, 0.7878541522702328)
        • 0.1 (0.7772249418184737, 0.798894663037202, 0.7879108363163034)
        • 0.005 (0.778851574454377, 0.7977398333745772, 0.7881825590872045)
        • 0.047 (0.7786173633440514, 0.7989771508702467, 0.7886658795749704)
        • 0.027 (0.7794224117126538, 0.7992246143693805, 0.7891993157937607)
      • Good-Turing
        • k=20 (0.7782969467493757, 0.796914955044131, 0.7874959243560482)
        • k=60 (0.7784402191427651, 0.7969974428771757, 0.7876095373955575)
        • k=100 (0.7784223672548546, 0.796914955044131, 0.7875601206488955)
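The two smoothing schemes compared above can be sketched as follows. This is a rough illustration, not the repository's code. Add-delta uses the standard formula (count + delta) / (total + delta * V), and add-one is the case delta = 1. For Good-Turing, the sketch assumes k is a count threshold: counts below k are replaced by r* = (r + 1) * N_{r+1} / N_r, and counts at or above k are left unchanged. The README does not define k, so that reading is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def add_delta_prob(count, total, vocab_size, delta):
    """Add-delta estimate: (count + delta) / (total + delta * vocab_size)."""
    return (count + delta) / (total + delta * vocab_size)


def good_turing_counts(counts, k):
    """Return Good-Turing adjusted counts for items seen fewer than k times.

    N_r is the number of distinct items seen exactly r times. When N_{r+1}
    is zero the raw count is kept, since the adjusted count would be zero.
    """
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for item, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        if r < k and n_r1 > 0:
            adjusted[item] = (r + 1) * n_r1 / n_r
        else:
            adjusted[item] = r
    return adjusted


if __name__ == "__main__":
    # Toy counts for illustration only.
    print(add_delta_prob(count=0, total=10, vocab_size=5, delta=0.027))
    print(good_turing_counts({"a": 1, "b": 1, "c": 1, "d": 2, "e": 5}, k=3))
```

The adjusted counts still have to be renormalized into probabilities, with the leftover mass going to unseen events, and that step is omitted here. In an HMM segmenter these estimates would typically replace the raw relative frequencies for emissions, though which distributions the repository smooths is not stated.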

About

Implements the NLP tasks of CW (word segmentation), POS, and NER
