EUREKA is an unsupervised model for detecting new words in a Chinese corpus.

Schlampig/EUREKA

EUREKA

Source:


Data:

  • stop-words dictionary: a stop-words dictionary file can improve EUREKA's final performance; an example can be seen here (this dictionary is copied from Lyrichu).
  • input corpus: the input corpus is one long string, such as the text of a novel or several documents concatenated together. See an example.
  • corpus in MongoDB: you can store each document as one sample in a collection of a MongoDB database, in a format like this:
{"_id": ObjectId("123456789"), "content": your_corpus(long string)}
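
As a minimal sketch of the two input forms described above, the snippet below joins several text pieces into one long corpus string and wraps each piece in a document with a `content` field. The helper names `build_corpus` and `to_mongo_docs` and the sample strings are hypothetical and not part of EUREKA; only the `content` key comes from the format shown above.

```python
def build_corpus(pieces, sep="\n"):
    """Join non-empty text pieces into one long corpus string."""
    return sep.join(p.strip() for p in pieces if p and p.strip())


def to_mongo_docs(pieces):
    """Wrap each non-empty piece as {"content": ...}.

    "_id" is omitted on purpose: MongoDB assigns one on insert.
    """
    return [{"content": p.strip()} for p in pieces if p and p.strip()]


# Hypothetical sample data; the blank piece is skipped by both helpers.
pieces = ["第一篇文档的内容。", "  ", "第二篇文档的内容。"]
corpus = build_corpus(pieces)
docs = to_mongo_docs(pieces)
print(len(corpus), len(docs))
```

With a running MongoDB instance and a pymongo collection `col`, `col.insert_many(docs)` would store these documents, one sample per document.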

Code Dependencies:

eureka -> model   

Usage Example:

from eureka import Eureka
model = Eureka()
model.load_dictionary()

# data from .txt file
####################################################################
# read the whole file as one string; the with-block closes the file afterwards
with open("document.txt", "r", encoding="utf-8") as f:
    corpus = f.read()

n = len(corpus)
if n < 5000:
    print("The corpus is too small.")
elif n < 250000:
    res = model.discover_corpus(corpus)
else:
    res = model.discover_corpus_multi(corpus, corpus_size=200000, re_list=True)  # corpus_size is the length of each sub-corpus taken from the input corpus

# data from mongo
####################################################################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
col = client["your_database_name"]["your_collection_name"]
res = model.discover_corpus_mongo(col, n=20000, corpus_size=200000, re_list=True)  # n is the number of samples used from the collection
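
The length-based choice between `discover_corpus` and `discover_corpus_multi` in the example above could be wrapped in a small helper, sketched below. The function name `discover`, the two constants, and the `FakeModel` stand-in are hypothetical and exist only so the sketch runs without EUREKA installed; the thresholds and method calls are copied from the example, and nothing here describes how EUREKA works internally.

```python
MIN_CORPUS_LEN = 5000       # below this, the example treats the corpus as too small
SINGLE_PASS_LIMIT = 250000  # below this, the example calls discover_corpus directly


def discover(model, corpus, corpus_size=200000):
    """Pick a discovery method by corpus length, mirroring the example above."""
    n = len(corpus)
    if n < MIN_CORPUS_LEN:
        raise ValueError("The corpus is too small ({} characters).".format(n))
    if n < SINGLE_PASS_LIMIT:
        return model.discover_corpus(corpus)
    return model.discover_corpus_multi(corpus, corpus_size=corpus_size, re_list=True)


class FakeModel:
    """Hypothetical stand-in for Eureka; it only reports which method was called."""

    def discover_corpus(self, corpus):
        return "single"

    def discover_corpus_multi(self, corpus, corpus_size, re_list):
        return "multi"


print(discover(FakeModel(), "字" * 6000))
print(discover(FakeModel(), "字" * 300000))
```

In real use, an `Eureka` instance with its dictionary loaded would be passed in place of `FakeModel()`.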

Requirements

  • Python>=3.5
  • pandas>=0.22.0
  • pkuseg
  • jieba>=0.39
  • tqdm>=4.19.5
  • Flask (optional, if running server.py)
  • pymongo (optional; EUREKA can read data from MongoDB, but otherwise does not need this library)
  • ipdb (optional, if debugging on the command line)
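
To see which of the packages listed above are available in the current environment, a quick check along these lines may help. It is a sketch only: the function name `check_modules` is hypothetical, the import names are assumed to match the package names (with Flask imported as `flask`), and version numbers are not checked.

```python
import importlib.util

# Import names are assumptions based on the package names listed above.
REQUIRED = ["pandas", "pkuseg", "jieba", "tqdm"]
OPTIONAL = ["flask", "pymongo", "ipdb"]


def check_modules(names):
    """Return {module_name: bool}, True when the module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
    for name, found in check_modules(names).items():
        print("{:<8} {:<8} {}".format(label, name, "ok" if found else "missing"))
```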

Allusion

  • Eureka comes from the Ancient Greek word heúrēka, which means "I have found".
  • Eureka is also the heroine of the Japanese anime Eureka Seven.
