Accounting for Burstiness of Words in Text Mining
Part of Seminar Series: Data Mining
Date: Wednesday, September 23, 2009
Time: 11 a.m.
Location: Avaya Auditorium: ACE 2.302
Dr. Charles Elkan
Professor
Department of Computer Science & Engineering - University of California, San Diego
Abstract
A fundamental property of language is that if a word is used once in a document, it is likely to be used again. To be accurate, statistical models of documents applied in text mining must take this property into account. In this talk, I will describe how to model burstiness using a probability distribution called the Dirichlet compound multinomial (DCM). In particular, I will present a new topic model based on DCM distributions. The central advantage of topic models is that they allow documents to concern multiple themes, unlike standard clustering methods that assume each document concerns a single theme. On both text and non-text datasets, the new topic model achieves better held-out likelihood than standard latent Dirichlet allocation (LDA).
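As a rough illustration of the burstiness the abstract describes, the sketch below draws synthetic documents from a Dirichlet compound multinomial and from a plain multinomial, then compares how often a word that appears once appears again. It is not the topic model presented in the talk. The vocabulary size, document length, number of documents, the Dirichlet parameter of 0.1, and all function names are assumptions chosen only for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_counts(probs, doc_length, rng):
    """Draw doc_length words independently from probs and return per-word counts."""
    counts = [0] * len(probs)
    for word in rng.choices(range(len(probs)), weights=probs, k=doc_length):
        counts[word] += 1
    return counts


def sample_dcm_counts(alpha, doc_length, rng):
    """DCM draw: a fresh Dirichlet word distribution per document, then multinomial counts."""
    return sample_counts(sample_dirichlet(alpha, rng), doc_length, rng)


def repeat_rate(docs):
    """Among words that occur in a document, the fraction that occur at least twice."""
    present = sum(1 for doc in docs for c in doc if c >= 1)
    repeated = sum(1 for doc in docs for c in doc if c >= 2)
    return repeated / present


def compare_repeat_rates(seed=0, vocab_size=200, doc_length=20, num_docs=500):
    """Return (DCM repeat rate, multinomial repeat rate) on synthetic documents."""
    rng = random.Random(seed)
    alpha = [0.1] * vocab_size                  # assumed value; small alpha gives bursty documents
    uniform = [1.0 / vocab_size] * vocab_size   # the mean of that Dirichlet, shared by all documents
    dcm_docs = [sample_dcm_counts(alpha, doc_length, rng) for _ in range(num_docs)]
    multi_docs = [sample_counts(uniform, doc_length, rng) for _ in range(num_docs)]
    return repeat_rate(dcm_docs), repeat_rate(multi_docs)


if __name__ == "__main__":
    dcm_rate, multi_rate = compare_repeat_rates()
    print(f"repeat rate under DCM:         {dcm_rate:.3f}")
    print(f"repeat rate under multinomial: {multi_rate:.3f}")
```

Both samplers have the same expected word frequencies. The only difference is that the DCM redraws the word distribution for each document, which concentrates each document on a few words and so raises the repeat rate.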
Host:
Prof. Joydeep Ghosh
Schlumberger Centennial Chair in Engineering
Department of Electrical and Computer Engineering
University of Texas at Austin
ACES 3.118, 471-8980
Speaker Biography
Charles Elkan is a professor in the Department of Computer Science and Engineering at the University of California, San Diego. In 2005/06 he was on sabbatical at MIT, and in 1998/99 he was a visiting associate professor at Harvard. Dr. Elkan is known for his research in machine learning, data mining, and computational biology. The MEME algorithm he developed with his Ph.D. student Tim Bailey has been used in over 1000 publications in biology. Dr. Elkan has won several best paper awards and data mining contests, and his Ph.D. students have held tenure-track or equivalent positions at Columbia University, the University of Washington, the University of Queensland, other universities, and IBM Research.

