Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (e.g., biterms). (In contrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)
A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA, which models word occurrences, BTM models the biterm occurrences in a corpus. In the generation procedure, a biterm is generated by drawing two words independently from the same topic. In other words, the distribution of a biterm b=(wi,wj) is defined as:
P(b) = \sum_z{P(wi|z)*P(wj|z)*P(z)}.
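The definition above can be illustrated with a short Python sketch. It is not part of this package: the function names and the toy probabilities are made up for illustration, and it assumes the whole short text is treated as a single context window.

```python
from itertools import combinations


def extract_biterms(word_ids):
    """Return every unordered word pair of one short text (one window)."""
    return [tuple(sorted(pair)) for pair in combinations(word_ids, 2)]


def biterm_probability(biterm, pz, pw_z):
    """P(b) = sum over topics z of P(z) * P(wi|z) * P(wj|z)."""
    wi, wj = biterm
    return sum(pz[z] * pw_z[z][wi] * pw_z[z][wj] for z in range(len(pz)))


# Toy model with 2 topics and 3 words (illustrative numbers only).
pz = [0.5, 0.5]
pw_z = [[0.5, 0.5, 0.0],
        [0.0, 0.5, 0.5]]

print(extract_biterms([0, 1, 2]))               # [(0, 1), (0, 2), (1, 2)]
print(biterm_probability((0, 1), pz, pw_z))     # 0.125
```

Words 0 and 1 only co-occur with non-zero probability under the first topic, so P(b) for (0, 1) reduces to 0.5 * 0.5 * 0.5.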
With the Gibbs sampling algorithm, we can learn topics by estimating P(w|z) and P(z).
For more details, please refer to the following paper:
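As a rough illustration of the estimation step, here is a minimal collapsed Gibbs sampler in Python. It is a sketch under stated assumptions, not the C++ implementation shipped in this package: the function name `gibbs_btm` is hypothetical, each document is treated as one window, and the conditional used is one common form that ignores the special case wi = wj.

```python
import random
from itertools import combinations


def gibbs_btm(docs, K, W, alpha=1.0, beta=0.01, n_iter=50, seed=0):
    """Estimate P(z) and P(w|z) from documents given as lists of word ids."""
    rng = random.Random(seed)
    biterms = [b for doc in docs for b in combinations(doc, 2)]
    n_z = [0] * K                         # number of biterms in each topic
    n_wz = [[0] * W for _ in range(K)]    # count of each word in each topic

    def update(z, wi, wj, delta):
        n_z[z] += delta
        n_wz[z][wi] += delta
        n_wz[z][wj] += delta

    # Random initial topic for every biterm.
    assignment = [rng.randrange(K) for _ in biterms]
    for (wi, wj), z in zip(biterms, assignment):
        update(z, wi, wj, +1)

    for _ in range(n_iter):
        for i, (wi, wj) in enumerate(biterms):
            update(assignment[i], wi, wj, -1)   # remove current assignment
            weights = []
            for k in range(K):
                # Each biterm contributes two words to its topic.
                total = 2 * n_z[k] + W * beta
                weights.append((n_z[k] + alpha)
                               * (n_wz[k][wi] + beta)
                               * (n_wz[k][wj] + beta)
                               / (total * (total + 1)))
            z = rng.choices(range(K), weights=weights)[0]
            assignment[i] = z
            update(z, wi, wj, +1)

    n_biterms = len(biterms)
    pz = [(n_z[k] + alpha) / (n_biterms + K * alpha) for k in range(K)]
    pw_z = [[(n_wz[k][w] + beta) / (2 * n_z[k] + W * beta) for w in range(W)]
            for k in range(K)]
    return pz, pw_z


pz, pw_z = gibbs_btm([[0, 1], [0, 1, 2], [3, 4], [3, 4, 5]], K=2, W=6)
print(pz)
```

The smoothed counts are normalized at the end, so `pz` and every row of `pw_z` sum to 1.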
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013.
- Topic learning:
    $ ./btm est <K> <W> <alpha> <beta> <n_iter> <save_step> <pt_input> <pt_outdir>
        K          int, number of topics, like 20
        W          int, size of vocabulary
        alpha      double, symmetric Dirichlet prior of P(z), like 1
        beta       double, symmetric Dirichlet prior of P(w|z), like 0.01
        n_iter     int, number of iterations of Gibbs sampling
        save_step  int, steps to save the results
        pt_input   string, path of training docs
        pt_outdir  string, output directory
- Inference of topic proportions for documents, i.e., P(z|d):
    $ ./btm inf <pt_input> <pt_outdir>
        K          int, number of topics, like 20
        type       string, 4 choices: sum_w, sum_b, lda, mix. sum_b is used in our paper.
        pt_input   string, path of training docs
        pt_outdir  string, output directory
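To give an idea of what biterm-based inference computes, the sketch below derives P(z|d) by summing over the biterms of a document: P(z|d) = sum_b P(z|b) * P(b|d), with P(z|b) proportional to P(z) * P(wi|z) * P(wj|z) and P(b|d) taken as the relative frequency of b in d. That this corresponds to the sum_b option is an assumption, the function name is hypothetical, and the code is not taken from the package.

```python
from collections import Counter
from itertools import combinations


def infer_doc_topics(word_ids, pz, pw_z):
    """Return P(z|d) for one document given as a list of word ids.

    Assumes smoothed parameters, so every biterm has non-zero probability.
    """
    K = len(pz)
    counts = Counter(combinations(word_ids, 2))
    n_biterms = sum(counts.values())
    if n_biterms == 0:
        # A document with fewer than two words has no biterm to sum over.
        raise ValueError("document needs at least two words")

    pz_d = [0.0] * K
    for (wi, wj), n in counts.items():
        joint = [pz[z] * pw_z[z][wi] * pw_z[z][wj] for z in range(K)]
        norm = sum(joint)
        for z in range(K):
            # P(z|b) * P(b|d)
            pz_d[z] += (joint[z] / norm) * (n / n_biterms)
    return pz_d


# Toy model with 2 topics and 3 words (illustrative numbers only).
pz = [0.5, 0.5]
pw_z = [[0.6, 0.3, 0.1],
        [0.1, 0.3, 0.6]]
print(infer_doc_topics([0, 0, 1], pz, pw_z))
```

Because each P(z|b) is normalized and the P(b|d) weights sum to 1, the returned proportions also sum to 1.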
There are two scripts in "script/" to help you run a toy example in the "data" directory.
- Run a toy example:
    $ bat.sh
- Display the results:
    $ python script/tran.py
  This outputs the topics of the toy example, each with its top 10 words.
1 Input
The input file contains all the training documents. Each line records one short text document as a sequence of word indexes (starting from 0) separated by spaces. See the toy example in data/doc_wids.txt
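If your documents are raw text, a small Python sketch like the following could produce lines in this format. The function name and the whitespace tokenization are assumptions for illustration, not part of the package.

```python
def encode_docs(texts):
    """Map each whitespace-separated token to an integer id starting from 0.

    Returns the encoded lines (one per document) and the word-to-id mapping.
    """
    vocab = {}
    lines = []
    for text in texts:
        # setdefault assigns the next free id the first time a token is seen.
        ids = [vocab.setdefault(token, len(vocab)) for token in text.split()]
        lines.append(" ".join(str(i) for i in ids))
    return lines, vocab


lines, vocab = encode_docs(["apple banana apple", "banana cherry"])
for line in lines:
    print(line)
# 0 1 0
# 1 2
```

The size of `vocab` is then the vocabulary size W, and the lines can be written one per row to the training file.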
2 Output
The estimation program writes the following files into the directory "pt_outdir":
- pw_z.k20 a K*W matrix for P(w|z), if K=20 (W is the vocabulary size)
- pz.k20 a K*1 matrix for P(z), if K=20
The inference program will produce:
- pz_d.k20 an N*K matrix for P(z|d), if K=20 (N is the number of documents)
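The sketch below shows one way to inspect such a matrix in Python. It assumes the files are plain text with one matrix row per line and values separated by whitespace, which is not stated above; the function names are hypothetical and this is not the script/tran.py script.

```python
def load_matrix(path):
    """Read a whitespace-separated text matrix, one row per non-empty line."""
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def top_words(pw_z, n=10):
    """For each topic (row), return the ids of the n most probable words."""
    return [sorted(range(len(row)), key=lambda w: row[w], reverse=True)[:n]
            for row in pw_z]


# Example usage, assuming the output directory contains pw_z.k20:
#   pw_z = load_matrix("pw_z.k20")
#   for k, words in enumerate(top_words(pw_z)):
#       print(k, words)
```

The word ids can be mapped back to words with whatever vocabulary was used to build the input file.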
2013-8-28 Add online BTM.
2013-6-1 Add inference for single-word documents.
2013-5-6 Add a doc_infer_sum_w inference procedure.
2013-5-5 v0.2, add Doc and Dataset classes. The input changes from biterms to word sequences. An example is test/doc_wids.txt.
2012-09-25 v0.1
Feel free to contact: Xiaohui Yan([email protected])