
btm
- The latest code has been moved to GitHub, and this site will no longer be updated.
- BTM: https://github.com/xiaohuiyan/BTM
- online BTM: https://github.com/xiaohuiyan/OnlineBTM
- bursty BTM: https://github.com/xiaohuiyan/BurstyBTM
- The papers can be found on my homepage: http://www.shortext.org
Biterm Topic Model (BTM) is a topic model developed for short texts, such as microblogs and webpage titles (for normal-length texts, a window length must be set when generating biterms). It learns topics by modeling the generation process of word co-occurrences (referred to as biterms), rather than word-document co-occurrences.
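For illustration, here is a minimal sketch of window-based biterm extraction, assuming a document is given as a list of word ids; the function name and default window size are hypothetical, not taken from this code.

```python
from itertools import combinations

def extract_biterms(doc, window=15):
    """Return the unordered word pairs (biterms) in `doc` whose
    positions are less than `window` apart."""
    return [(doc[i], doc[j])
            for i, j in combinations(range(len(doc)), 2)
            if j - i < window]

# For a short text the window spans the whole document,
# so every word pair becomes a biterm:
print(extract_biterms([3, 5, 8]))  # [(3, 5), (3, 8), (5, 8)]
```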
1. Model Description
In BTM, the probability of a biterm b = (w1, w2) is
P(w1, w2) = \sum_k P(k) P(w1|k) P(w2|k).
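In code, with P(z) stored as a length-K vector and P(w|z) as a K*W matrix (the shapes of the output files described in section 3), this mixture is a single sum over topics; the names below are illustrative:

```python
import numpy as np

def biterm_prob(w1, w2, pz, pwz):
    # P(w1, w2) = sum_k P(k) P(w1|k) P(w2|k)
    return float(np.sum(pz * pwz[:, w1] * pwz[:, w2]))
```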
Steps of the Gibbs sampling algorithm for BTM (a sketch in code follows this list):
1. Randomly assign a topic to each biterm b, uniformly at random.
2. For each biterm b:
   1. Reset the topic assignment of b.
   2. Sample a topic k according to P(k|B-b).
   3. Re-assign topic k to biterm b.
3. Repeat step 2 until convergence.
4. Infer the parameters {P(k), P(w|k)}.
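Below is a minimal sketch of this sampler, assuming biterms are (w1, w2) pairs of word ids. The conditional for P(k|B-b) uses a common collapsed form with counts excluding the current biterm (the papers give the exact formula); variable names are illustrative, not taken from the C++ code.

```python
import numpy as np

def gibbs_btm(biterms, K, W, alpha, beta, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    nz = np.zeros(K)                 # number of biterms assigned to each topic
    nwz = np.zeros((K, W))           # word counts per topic
    z = rng.integers(K, size=len(biterms))   # step 1: uniform random init
    for i, (w1, w2) in enumerate(biterms):
        nz[z[i]] += 1
        nwz[z[i], w1] += 1
        nwz[z[i], w2] += 1
    for _ in range(n_iter):
        for i, (w1, w2) in enumerate(biterms):
            k = z[i]
            nz[k] -= 1               # step 2.1: reset the assignment of b
            nwz[k, w1] -= 1
            nwz[k, w2] -= 1
            # step 2.2: P(k|B-b) is proportional to
            #   (n_k + alpha) * (n_w1|k + beta)(n_w2|k + beta) / (n_.|k + W*beta)^2
            denom = nwz.sum(axis=1) + W * beta
            p = (nz + alpha) * (nwz[:, w1] + beta) * (nwz[:, w2] + beta) / denom**2
            k = rng.choice(K, p=p / p.sum())   # step 2.3: re-assign topic k
            z[i] = k
            nz[k] += 1
            nwz[k, w1] += 1
            nwz[k, w2] += 1
    # step 4: point estimates of the parameters {P(k), P(w|k)}
    pz = (nz + alpha) / (len(biterms) + K * alpha)
    pwz = (nwz + beta) / (nwz.sum(axis=1, keepdims=True) + W * beta)
    return pz, pwz
```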
More details can be found in the following papers:
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Text. WWW 2013.
Xueqi Cheng, Xiaohui Yan, Yanyan Lan, Jiafeng Guo. BTM: Topic Modeling over Short Texts. TKDE 2014.
2. Usage
Before running the code, type "make" in the "src/batch" directory to generate the executable file "btm".
You can run a toy example with the script:
cd script
./bat.sh
It covers learning, inference, and result inspection, as described below.
1) Usage for estimation:
./btm est K W alpha beta n_iter save_step pt_input pt_outdir
K int, number of topics, like 20
W int, size of vocabulary
alpha double, symmetric Dirichlet prior of P(z), like 1
beta double, symmetric Dirichlet prior of P(w|z), like 0.01
n_iter int, number of iterations of Gibbs sampling
save_step int, steps to save the results
pt_input string, path of the training docs
pt_outdir string, output directory

2) Usage for inference:
./btm inf type K pt_input pt_outdir
type string, 4 choices: sum_w, sum_b, lda, mix. sum_b is used in our paper.
K int, number of topics, like 20
pt_input string, path of the input docs
pt_outdir string, output directory

3) Result inspection:
python script/tran.py
Outputs the top 10 words of each topic in the example.
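For example, a full toy run might look like the following; the vocabulary size, iteration settings, and output path here are illustrative values, not the ones used in bat.sh:
./btm est 20 10000 1 0.01 500 100 data/doc_wids.txt output/
./btm inf sum_b 20 data/doc_wids.txt output/
python script/tran.py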
3. Input & Output
1) Input
The input file contains all the training documents. Each line records a short text document as a list of word indexes (starting from 0) separated by spaces. See the toy example in data/doc_wids.txt.
2) Output
The estimation program will output into the directory "pt_outdir":
* pw_z.k20   a K*W matrix for P(w|z), if K=20
* pz.k20     a K*1 matrix for P(z), if K=20
The inference program will produce:
* pz_d.k20   a D*K matrix for P(z|d), if K=20
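As a minimal sketch of inspecting these files, assuming they are stored as whitespace-separated text (the output/ path is illustrative); this prints word ids rather than words, while mapping ids back to words is what script/tran.py handles:

```python
import numpy as np

pwz = np.loadtxt("output/pw_z.k20")   # K*W matrix for P(w|z)
pz = np.loadtxt("output/pz.k20")      # length-K vector for P(z)

# Print each topic's probability and its top 10 word ids by P(w|z).
for k in np.argsort(pz)[::-1]:
    top = np.argsort(pwz[k])[::-1][:10]
    print(f"topic {k}  P(z)={pz[k]:.4f}  top word ids: {' '.join(map(str, top))}")
```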
History
2013-8-28 Add online BTM.
2013-6-1 Add handling of single-word documents in inference.
2013-5-6 Add a doc_infer_sum_w inference procedure.
2013-5-5 v0.2, add Doc and Dataset classes; change the input from biterms to word sequences. See the example in test/doc_wids.txt.
2012-09-25 v0.1
Feel free to contact: Xiaohui Yan (xhcloud@gmail.com)
Project Information
The project was created on Nov 18, 2012.
- License: MIT License
- svn-based source control