A fundamental property of language is that if a word is used once in a document, it is likely to be used again; this phenomenon is known as burstiness. Statistical models of documents used in text mining must account for this property in order to be accurate. In this talk, I will describe how to model burstiness using a probability distribution called the Dirichlet compound multinomial (DCM). In particular, I will present a new topic model based on DCM distributions. The central advantage of topic models is that they allow documents to concern multiple themes, unlike standard clustering methods, which assume each document concerns a single theme. On both text and non-text datasets, the new topic model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA).
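The DCM mentioned in the abstract (also known as the multivariate Polya distribution) can be described as a two-stage sampling process: draw a document-specific word distribution from a Dirichlet, then draw word counts from a multinomial. A minimal Python sketch of that generative process, for intuition only (the function name and parameter choices are illustrative, not taken from the talk):

```python
import numpy as np

def sample_dcm(alpha, n_words, rng=None):
    """Draw one document's word-count vector from a Dirichlet
    compound multinomial (DCM / multivariate Polya) distribution.

    Stage 1: sample a word distribution p ~ Dirichlet(alpha).
    Stage 2: sample counts x ~ Multinomial(n_words, p).

    Small alpha values make p concentrate on a few words, so the
    resulting counts are bursty; large alpha values approach an
    ordinary multinomial with near-uniform counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = rng.dirichlet(alpha)
    return rng.multinomial(n_words, p)

# Compare a bursty regime (small alpha) with a flat one (large alpha)
# over a 50-word vocabulary and 100-word documents.
rng = np.random.default_rng(0)
bursty = sample_dcm(np.full(50, 0.1), 100, rng)
flat = sample_dcm(np.full(50, 100.0), 100, rng)
print("max count, small alpha:", bursty.max())
print("max count, large alpha:", flat.max())
```

Under a small symmetric alpha, a few words dominate the sampled document, which is exactly the repeated-word behavior the abstract calls burstiness.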
Host: Prof. Joydeep Ghosh
Schlumberger Centennial Chair in Engineering
Department of Electrical and Computer Engineering
University of Texas at Austin
ACES 3.118, 471-8980