Code of Biterm Topic Model

Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms). (In contrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)

A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA, which models word occurrences, BTM models the biterm occurrences in a corpus. In the generation procedure, a biterm is generated by drawing two words independently from the same topic. In other words, the distribution of a biterm b=(wi,wj) is defined as:

   P(b) = \sum_z{P(wi|z)*P(wj|z)*P(z)}.
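To make the definition concrete, here is a minimal Python sketch. It assumes the whole short text is treated as one context window, so every unordered pair of words in it forms a biterm. The function names and the toy values of P(z) and P(w|z) are hypothetical and are not part of this package.

```python
from itertools import combinations


def extract_biterms(word_ids):
    """Return every unordered pair of word ids found in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(word_ids, 2)]


def biterm_probability(b, pz, pw_z):
    """P(b) = sum over topics z of P(z) * P(wi|z) * P(wj|z), with b = (wi, wj)."""
    wi, wj = b
    return sum(pz[z] * pw_z[z][wi] * pw_z[z][wj] for z in range(len(pz)))


# Hypothetical toy parameters: 2 topics over a 3-word vocabulary.
pz = [0.5, 0.5]
pw_z = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5]]

biterms = extract_biterms([0, 1, 2])
for b in biterms:
    print(b, biterm_probability(b, pz, pw_z))
```

Words 0 and 2 never share a topic in this toy setting, so the biterm (0, 2) gets probability 0, while (0, 1) and (1, 2) each get 0.125.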

With the Gibbs sampling algorithm, we can learn topics by estimating P(w|z) and P(z).
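The following Python sketch illustrates one way such a sampler can be written. It is not the code in "src/"; it follows the collapsed Gibbs update described in the paper cited below, takes each document as a single context window, and uses a hypothetical function name (gibbs_btm) and toy data.

```python
import random
from itertools import combinations


def gibbs_btm(docs, K, W, alpha, beta, n_iter, seed=0):
    """Illustrative collapsed Gibbs sampler; returns estimates of P(z) and P(w|z)."""
    rng = random.Random(seed)
    biterms = [b for d in docs for b in combinations(d, 2)]
    n_z = [0] * K                        # number of biterms assigned to each topic
    n_wz = [[0] * W for _ in range(K)]   # number of times word w is assigned to topic z
    z_of = []                            # current topic of each biterm

    # Random initial assignment.
    for wi, wj in biterms:
        z = rng.randrange(K)
        z_of.append(z)
        n_z[z] += 1
        n_wz[z][wi] += 1
        n_wz[z][wj] += 1

    for _ in range(n_iter):
        for i, (wi, wj) in enumerate(biterms):
            # Remove the biterm from its current topic.
            z = z_of[i]
            n_z[z] -= 1
            n_wz[z][wi] -= 1
            n_wz[z][wj] -= 1
            # Each biterm contributes two words, so a topic holds 2 * n_z[k] words.
            weights = []
            for k in range(K):
                denom = 2 * n_z[k] + W * beta
                weights.append((n_z[k] + alpha)
                               * (n_wz[k][wi] + beta) * (n_wz[k][wj] + beta)
                               / (denom * denom))
            # Draw a new topic in proportion to the weights.
            z = rng.choices(range(K), weights=weights)[0]
            z_of[i] = z
            n_z[z] += 1
            n_wz[z][wi] += 1
            n_wz[z][wj] += 1

    n_b = len(biterms)
    pz = [(n_z[k] + alpha) / (n_b + K * alpha) for k in range(K)]
    pw_z = [[(n_wz[k][w] + beta) / (2 * n_z[k] + W * beta) for w in range(W)]
            for k in range(K)]
    return pz, pw_z


# Hypothetical toy corpus: three short documents over a 4-word vocabulary.
toy_docs = [[0, 1, 2], [1, 2, 3], [0, 3]]
pz, pw_z = gibbs_btm(toy_docs, K=2, W=4, alpha=1.0, beta=0.01, n_iter=20)
print(pz)
```

Both returned estimates are smoothed by the priors alpha and beta, so they stay valid probability distributions even when a topic ends up with no biterms.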

For more details, refer to the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013.

Usage

Topic learning:

$ ./btm est <n_iter> <save_step> <pt_input> <pt_outdir>
    K          int, number of topics, like 20
    W          int, size of vocabulary
    alpha      double, symmetric Dirichlet prior of P(z), like 1
    beta       double, symmetric Dirichlet prior of P(w|z), like 0.01
    n_iter     int, number of iterations of Gibbs sampling
    save_step  int, number of iterations between saves of the results
    pt_input   string, path of training docs
    pt_outdir  string, output directory

Infer topic proportions for documents, i.e., P(z|d):

$ ./btm inf <pt_input> <pt_outdir>
    K          int, number of topics, like 20
    type       string, 4 choices: sum_w, sum_b, lda, mix. sum_b is used in our paper.
    pt_input   string, path of training docs
    pt_outdir  string, output directory
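As an illustration of what the sum_b type computes, the sketch below follows the formulation in the paper: P(z|d) is the sum over the biterms b of a document of P(z|b) * P(b|d), where P(z|b) is proportional to P(z) * P(wi|z) * P(wj|z) and P(b|d) is the relative frequency of b in the document. The function name and the toy parameters are hypothetical, and documents without any usable biterm (such as single-word documents) are simply rejected here rather than handled as in the package.

```python
from itertools import combinations


def infer_doc_sum_b(word_ids, pz, pw_z):
    """Estimate P(z|d) for one document by summing P(z|b) over its biterms."""
    K = len(pz)
    pz_d = [0.0] * K
    for wi, wj in combinations(word_ids, 2):
        # Unnormalized P(z|b) for the biterm b = (wi, wj).
        joint = [pz[k] * pw_z[k][wi] * pw_z[k][wj] for k in range(K)]
        total = sum(joint)
        if total == 0.0:
            continue  # no topic can generate this biterm; skip it
        for k in range(K):
            pz_d[k] += joint[k] / total
    norm = sum(pz_d)
    if norm == 0.0:
        raise ValueError("document has no usable biterm")
    # Every biterm occurrence is weighted equally, which matches P(b|d).
    return [p / norm for p in pz_d]


# Hypothetical toy parameters: 2 topics over a 3-word vocabulary.
pz = [0.5, 0.5]
pw_z = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5]]

print(infer_doc_sum_b([0, 1], pz, pw_z))
print(infer_doc_sum_b([0, 1, 2], pz, pw_z))
```

With these toy values, the document [0, 1] is assigned entirely to the first topic, while [0, 1, 2] is split evenly between the two.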

There are two scripts in "script/" to help you run a toy example in the "data" directory.

Run the toy example:

$ bat.sh

Display the results:

$ python script/tran.py

This outputs the top 10 words of each topic for the toy example.

Input & Output

1 Input

The input file contains all the training documents. Each line records one short text document as a sequence of word indexes (starting from 0) separated by spaces. See the toy example in data/doc_wids.txt
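If your corpus is raw text, it first has to be converted to this format. The sketch below is one simple way to do it, assuming lowercase whitespace tokenization; the function names are hypothetical and not part of this package.

```python
def index_documents(texts):
    """Map each distinct word to an integer id (starting from 0) and encode the texts."""
    vocab = {}
    docs = []
    for text in texts:
        # setdefault assigns the next free id the first time a word is seen.
        ids = [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]
        docs.append(ids)
    return docs, vocab


def format_lines(docs):
    """Render each document as one line of space-separated word ids."""
    return [" ".join(str(i) for i in ids) for ids in docs]


# Hypothetical toy corpus.
texts = ["apple banana apple", "banana cherry"]
docs, vocab = index_documents(texts)
lines = format_lines(docs)
print("\n".join(lines))
print("W =", len(vocab))
```

The size of the resulting vocabulary is the value to pass as W, and writing the lines to a file gives an input in the format described above.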

2 Output

The estimation program writes the following files to the directory "pt_outdir":

pw_z.k20 a K*W matrix for P(w|z), if K=20 (W is the vocabulary size)
pz.k20 a K*1 matrix for P(z), if K=20

The inference program will produce:

pz_d.k20 a N*K matrix for P(z|d), if K=20 (N is the number of documents)
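The sketch below shows how such matrices could be read and how the top words of each topic could be picked from P(w|z). It assumes that each file stores one matrix row per line as whitespace-separated numbers; check this against your own output files. The function names and the sample text are hypothetical, and this is not the code of script/tran.py.

```python
def parse_matrix(text):
    """Parse whitespace-separated numbers, one matrix row per non-empty line."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def top_words(pw_z, n):
    """For each topic (row), return the ids of the n most probable words."""
    return [sorted(range(len(row)), key=lambda w: -row[w])[:n] for row in pw_z]


# Hypothetical content of a P(w|z) file with K=2 topics and W=3 words.
sample = "0.1 0.7 0.2\n0.6 0.3 0.1\n"
pw_z = parse_matrix(sample)
print(top_words(pw_z, 2))
```

To read a real file, pass its content to parse_matrix, for example parse_matrix(open(path).read()), and map the returned word ids back to words with your own vocabulary.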

History

2013-8-28 Add online BTM.

2013-6-1 Add inference for single-word documents.

2013-5-6 Add a doc_infer_sum_w inference procedure.

2013-5-5 v0.2, add Doc and Dataset classes. The input changes from biterms to word sequences. For an example, see test/doc_wids.txt.

2012-09-25 v0.1

Feel free to contact: Xiaohui Yan([email protected])
