
article_match

ref-link

In the word2vec-training Folder

Step 1 : Download the Raw Data from Wikipedia

Download a Wikipedia dump of the articles used to train the model; you can choose whichever kind of data you like: Data Link

  • Output : word2vec-model/data.xml.bz2

Step 2 : Extract the Raw Text

Use the WikiCorpus class from gensim.corpora to extract the sentences of each article:

python wiki_to_txt.py <filename of data.xml.bz2>
  • Output : word2vec-model/wiki_texts.txt
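
A minimal sketch of what wiki_to_txt.py might do, assuming gensim is installed (older gensim versions yield bytes from get_texts(), so a decode step may be needed):

import sys
from gensim.corpora import WikiCorpus

def main():
    if len(sys.argv) != 2:
        print("Usage: python wiki_to_txt.py <filename of data.xml.bz2>")
        sys.exit(1)
    # dictionary={} skips building a vocabulary, which is not needed here
    wiki = WikiCorpus(sys.argv[1], dictionary={})
    with open("wiki_texts.txt", "w", encoding="utf-8") as out:
        # each article comes back as a list of tokens; write one article per line
        for i, tokens in enumerate(wiki.get_texts(), 1):
            out.write(" ".join(tokens) + "\n")
            if i % 10000 == 0:
                print("%d articles processed" % i)

if __name__ == "__main__":
    main()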

Step 3 : Translation

Some articles contain simplified Chinese words that have the same meaning as their traditional Chinese counterparts; we use OpenCC to convert them to traditional Chinese:

opencc -i wiki_texts.txt -o wiki_zh_tw.txt -c s2tw.json
  • Output : word2vec-model/wiki_zh_tw.txt
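
If the opencc command-line tool is unavailable, the Python opencc package offers the same conversion; this alternative sketch is an assumption, since the repository itself uses the CLI:

from opencc import OpenCC

cc = OpenCC("s2tw")  # some builds expect the config name as "s2tw.json"
with open("wiki_texts.txt", encoding="utf-8") as src, \
     open("wiki_zh_tw.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(cc.convert(line))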

Step 4 : Cut the Sentence

Use jieba to segment the Chinese sentences into short words:

python wiki_seg_jieba.py
  • Output : word2vec-model/wiki_seg.txt
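
A minimal sketch of what wiki_seg_jieba.py might do; the filtering of empty tokens is an assumption:

import jieba

with open("wiki_zh_tw.txt", encoding="utf-8") as src, \
     open("wiki_seg.txt", "w", encoding="utf-8") as dst:
    for line in src:
        # jieba.cut returns a generator of segmented tokens
        words = jieba.cut(line.strip())
        # drop whitespace-only tokens; keep one article per line
        dst.write(" ".join(w for w in words if w.strip()) + "\n")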

Step 5 : Start Training

With wiki_seg.txt, we use word2vec to train the matching model

python w2v-train.py
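
A minimal sketch of what w2v-train.py might do; the hyperparameters and the output filename word2vec.model are assumptions:

from gensim.models import word2vec

# stream the segmented corpus line by line instead of loading it into memory
sentences = word2vec.LineSentence("wiki_seg.txt")
# hyperparameters are illustrative; gensim < 4.0 calls vector_size "size"
model = word2vec.Word2Vec(sentences, vector_size=250, min_count=5, workers=4)
model.save("word2vec.model")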

In the article-keywords Folder

Step 6 : Prepare the Target Article

Put a medium-length article into target_article.txt in the article-keywords folder.

Step 7 : Cut Target Article Into Short Pieces

  • We recommend using jieba to segment the words if your model was trained with jieba's segmentation.
  • If your model was trained with CKIP segmentation, use CKIP instead for a better matching effect.
python jieba_seg.py target_article.txt

or

Register an account and password on CKIP. CKIP segments Chinese sentences more accurately than jieba; its shortcoming, however, is that it is slower than jieba, because it must send the text piece by piece over the Internet and wait for the results.

python ckip_seg.py target_article.txt

Remember to put your account and password into ckip_account.txt, on two separate lines.

  • Output : article-keywords/target_article_seg.txt

Step 8 : Find keywords in target article

First, use Counter() to count the frequency of each word in the target article. Second, for each word, add the frequencies of its similar words to its own count, then sort the words by this weight. Third, walk the list from the top and eliminate any later word that is very similar to an earlier one.

python find_key_weight.py target_article_seg.txt
  • Output : article-keywords/target_article_keywords.txt, a list of keywords from the target article, none of which are similar to each other.
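
A minimal sketch of the three passes described above; the model path, the topn value, and the similarity threshold are assumptions:

import sys
from collections import Counter
from gensim.models import word2vec

SIM_THRESHOLD = 0.6  # illustrative cutoff for "very similar"

model = word2vec.Word2Vec.load("word2vec.model")  # model path is an assumption

with open(sys.argv[1], encoding="utf-8") as f:
    words = f.read().split()

# First: frequency of each word in the target article
freq = Counter(words)

# Second: weight each word by adding the frequencies of its similar words
weight = {}
for w, count in freq.items():
    if w not in model.wv:
        continue
    score = count
    for sim_w, _ in model.wv.most_similar(w, topn=10):
        score += freq.get(sim_w, 0)
    weight[w] = score
ranked = sorted(weight, key=weight.get, reverse=True)

# Third: walk the list from the top and drop words too similar to a kept one
keywords = []
for w in ranked:
    if all(model.wv.similarity(w, kept) < SIM_THRESHOLD for kept in keywords):
        keywords.append(w)

with open("target_article_keywords.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(keywords))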
