Skip to content

amigcamel/Jseg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Jseg

A modified version of Jieba

All credit goes to fxsjy/jieba.
Find more on: https://github.com/fxsjy/jieba

Synopsis

  1. Equipped with Emoticon detection
    Emoticons will not be segmented as sequences of meaningless punctuations.

  2. Data are trained with Sinica Corpus
    Results are more accurate when dealing with Traditional Chinese (F1-score = 0.91).

  3. Using Brill Tagger
    Training data are trained with Sinica Treebank, which raises the accuracy of POS tagging.

Environment

  • Python2.7+
  • Python3.3+

Installation

pip install -U jseg

Usage

from jseg import Jieba
j = Jieba()

Add user defined dictionary

j.add_guaranteed_wordlist(lst)

Here's a sample text:

sample = '期末要爆炸啦! ◢▆▅▄▃崩╰(〒皿〒)╯潰▃▄▅▇◣'

Segmentation with POS (part-of-speech)

j.seg(sample, pos=True)

About

A modified version of Jieba

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages