
#CLTA
This project publishes the source code of two category-correlation-based bilingual topic models, CC-BiLDA and CC-BiBTM, which can be applied to cross-lingual tasks such as cross-lingual taxonomy alignment.

###Requirements:

  1. JDK 1.8.0_111
  2. Maven 3.3.9

###Data you need:

  1. Biterm Documents or Word Documents
  2. Biterm-Category or Document-Category Distribution file

###Biterm Documents content format:
Each line represents a category biterm document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese-chinese biterm document>@#@#@<chinese-english biterm document>@#@#@<english-english biterm document>
For example:
https://www.ebay.com/chp/Fins-/16054@#@#@Fins@#@#@en@#@#@[呼吸 手套,...]@#@#@[呼吸 full,...]@#@#@[cheap sailor,...]
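
The line format above can be read with a small parser. The following is a minimal sketch, not code from this project: the class name `BitermDocumentLine` and its fields are hypothetical, and it assumes each biterm document is a bracketed, comma-separated list, as the example suggests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical holder for one line of a Biterm Documents file (not a class from this project). */
final class BitermDocumentLine {
    static final String SEP = "@#@#@";

    final String url;
    final String label;
    final String lang;
    final List<String> zhZhBiterms;
    final List<String> zhEnBiterms;
    final List<String> enEnBiterms;

    private BitermDocumentLine(String url, String label, String lang,
                               List<String> zhZh, List<String> zhEn, List<String> enEn) {
        this.url = url;
        this.label = label;
        this.lang = lang;
        this.zhZhBiterms = zhZh;
        this.zhEnBiterms = zhEn;
        this.enEnBiterms = enEn;
    }

    /** Splits a line into its six fields; the separator is quoted so it is matched literally. */
    static BitermDocumentLine parse(String line) {
        String[] f = line.split(Pattern.quote(SEP), -1);
        if (f.length != 6) {
            throw new IllegalArgumentException("expected 6 fields, got " + f.length);
        }
        return new BitermDocumentLine(f[0], f[1], f[2],
                parseList(f[3]), parseList(f[4]), parseList(f[5]));
    }

    /** Assumption: a document looks like "[w1 w2, w3 w4]"; each entry is kept as one string. */
    static List<String> parseList(String s) {
        String t = s.trim();
        if (t.startsWith("[") && t.endsWith("]")) {
            t = t.substring(1, t.length() - 1);
        }
        List<String> out = new ArrayList<>();
        for (String item : t.split(",")) {
            String v = item.trim();
            if (!v.isEmpty()) {
                out.add(v);
            }
        }
        return out;
    }
}
```

Calling `BitermDocumentLine.parse(line)` on each line of the file yields the category URL, label, language, and the three biterm lists. A line with a different number of fields is rejected with an `IllegalArgumentException`.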
###Word Documents content format:
Each line represents a category word document, organised as follows:
<category url>@#@#@<category label>@#@#@<category lang>@#@#@<chinese word document>@#@#@<translated english word document>
For example:
https://conference_en#c-7081035-6117083@#@#@committee@#@#@en@#@#@[任命, 报告...]@#@#@[elect, person...]
###Biterm-Category Distribution file content format:
Each line represents a biterm-category distribution, organised as follows:
<word1>@#@#@<word2>@#@#@<lang1_lang2>\t[<category url>@#@#@<category distribution>,...]
For example:
稿件@#@#@carry@#@#@ZH_EN [https://cmt_cn#c-8430559-8614325@#@#@1.0]
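
A biterm-category line has a key (two words and a language pair) and, after a tab, a list of category entries. The sketch below is one possible way to parse it. The class name `BitermCategoryLine` is hypothetical. It also assumes that category URLs contain no commas and that each distribution value is a decimal number.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical holder for one line of a Biterm-Category Distribution file. */
final class BitermCategoryLine {
    static final String SEP = "@#@#@";

    final String word1;
    final String word2;
    final String langPair;
    final Map<String, Double> categoryDistribution;

    private BitermCategoryLine(String word1, String word2, String langPair,
                               Map<String, Double> categoryDistribution) {
        this.word1 = word1;
        this.word2 = word2;
        this.langPair = langPair;
        this.categoryDistribution = categoryDistribution;
    }

    /** The key and the distribution list are separated by the first tab character. */
    static BitermCategoryLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab separator");
        }
        String[] key = line.substring(0, tab).split(Pattern.quote(SEP), -1);
        if (key.length != 3) {
            throw new IllegalArgumentException("expected 3 key fields, got " + key.length);
        }
        return new BitermCategoryLine(key[0], key[1], key[2],
                parseDistribution(line.substring(tab + 1)));
    }

    /** Assumption: entries look like "[url@#@#@0.5, url2@#@#@0.5]" and URLs contain no commas. */
    static Map<String, Double> parseDistribution(String s) {
        String t = s.trim();
        if (t.startsWith("[") && t.endsWith("]")) {
            t = t.substring(1, t.length() - 1);
        }
        Map<String, Double> dist = new LinkedHashMap<>();
        for (String entry : t.split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue;
            }
            int i = e.lastIndexOf(SEP);
            if (i < 0) {
                throw new IllegalArgumentException("bad entry: " + e);
            }
            dist.put(e.substring(0, i), Double.parseDouble(e.substring(i + SEP.length())));
        }
        return dist;
    }
}
```

The map preserves the order in which the categories appear on the line.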

###Document-Category Distribution file content format:
Each line represents a document-category distribution, organised as follows:
<document id>@#@#@<document label>@#@#@<document language>\t[<category url>@#@#@<category distribution>, ...]
For example:
https://cmt_cn#c-1609047-4017692@#@#@合著者@#@#@zh@#@#@ [https://cmt_cn#c-1609047-4017692@#@#@1.0]
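
In the example line, the key ends with an extra `@#@#@` before the distribution list, which the format description does not show. The following sketch parses only the key and tolerates that trailing separator. The class name `DocumentCategoryKey` is hypothetical, and it assumes that the key and the list are separated by a tab, as the format description states.

```java
import java.util.regex.Pattern;

/** Hypothetical holder for the key part of a Document-Category Distribution line. */
final class DocumentCategoryKey {
    static final String SEP = "@#@#@";

    final String id;
    final String label;
    final String language;

    private DocumentCategoryKey(String id, String label, String language) {
        this.id = id;
        this.label = label;
        this.language = language;
    }

    /** Parses the text before the first tab; a trailing separator on the key is dropped. */
    static DocumentCategoryKey parse(String line) {
        int tab = line.indexOf('\t');
        String key = (tab >= 0 ? line.substring(0, tab) : line).trim();
        if (key.endsWith(SEP)) {
            key = key.substring(0, key.length() - SEP.length());
        }
        String[] f = key.split(Pattern.quote(SEP), -1);
        if (f.length != 3) {
            throw new IllegalArgumentException("expected 3 key fields, got " + f.length);
        }
        return new DocumentCategoryKey(f[0], f[1], f[2]);
    }
}
```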

###Input file organization:
Suppose the dataset name is 'A'. For the CC-BiLDA method, the Word Documents file and the Document-Category Distribution file are expected at:
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A
corpus/A/exact matching/CC-BiLDA/TextPairs(for BiLDA)_A.(<avg_pi> or <hier_pi>)
For the CC-BiBTM method, the Biterm Documents file and the Biterm-Category Distribution file are expected at:
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A
corpus/A/exact matching/CC-BiBTM/Biterms(for BiBTM)_A.(<avg_pi> or <hier_pi>)
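
The layout above can be expressed as a small path helper, which is handy for checking that the input files are in place before a run. This is only a sketch. The class `CorpusPaths` is hypothetical, and it assumes that the distribution file suffix is the literal text `avg_pi` or `hier_pi`, which is one reading of the pattern shown above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that rebuilds the input paths described in this README. */
final class CorpusPaths {

    /** Path of the documents file for a dataset; method is "CC-BiLDA" or "CC-BiBTM". */
    static Path documents(String dataset, String method) {
        return Paths.get("corpus", dataset, "exact matching", method,
                filePrefix(method) + "_" + dataset);
    }

    /** Path of the distribution file; piSuffix is assumed to be "avg_pi" or "hier_pi". */
    static Path distribution(String dataset, String method, String piSuffix) {
        Path docs = documents(dataset, method);
        return docs.resolveSibling(docs.getFileName() + "." + piSuffix);
    }

    /** File name prefix used by each method, as listed in the layout above. */
    static String filePrefix(String method) {
        switch (method) {
            case "CC-BiLDA":
                return "TextPairs(for BiLDA)";
            case "CC-BiBTM":
                return "Biterms(for BiBTM)";
            default:
                throw new IllegalArgumentException("unknown method: " + method);
        }
    }
}
```

For instance, `java.nio.file.Files.exists(CorpusPaths.documents("A", "CC-BiBTM"))` reports whether the Biterm Documents file for dataset 'A' is present relative to the working directory.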

###Compile Project:
Before running the project, compile it with Maven:
mvn assembly:assembly

###Run Project:
After compilation, the jar package 'alignment-1.0-SNAPSHOT.jar' is generated in the target directory.

If you are using this project for the first time, run:
java -jar target/alignment-1.0-SNAPSHOT.jar -h
This prints the help options:

usage: Model Run Options
 -alpha <arg>         Hyper Parameter Alpha
 -avg                 Using Average Category Distribution to inference the
                      GibbsSampling.
 -f <arg>             File Name
 -h                   HELP_DESCRIPTION
 -hier                Using Hierarchy Category Distribution to inference
                      the GibbsSampling.
 -iter <arg>          Iteration Number
 -k <arg>             Topic Number
 -m <arg>             Method for training the corpus, one of <CCBiBTM,
                      CCBiLDA>
 -savestep <arg>      Step to Save
 -source_beta <arg>   Source Beta
 -t <arg>             Data Type
 -target_beta <arg>   Target Beta

You can then follow the help options to run the project on your own datasets. For example:
java -jar target/alignment-1.0-SNAPSHOT.jar -m CCBiBTM -f "Biterms(for BiBTM)" -t "product catalogue" -iter 300 -savestep 100 -k 100
Options that are not specified take their default values.
