THURSDAY seminar: Author-Topic Models for Large Text Corpora

October 14, 2004

1:30-2:30pm

Halligan Hall, Room TBA

Host: Carla Brodley

Abstract

We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. An author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. We apply the methodology to two large corpora of technical papers: (1) 160,000 abstracts and 85,000 authors from the CiteSeer digital library, and (2) ten years' worth of papers and authors from the NIPS conference. We discuss in detail how to interpret the results discovered by the system, including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. We also discuss an online query interface to the model that allows interactive exploration of author-topic models for corpora such as CiteSeer. Applications to other data sets, such as email archives and PubMed abstracts, will also be discussed if time permits.
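
As a rough illustration of the two-stage process described in the abstract, the sketch below samples the words of a toy document: each word first draws a topic from an author's topic distribution, then draws a word from that topic's word distribution. The author names, topics, vocabularies, and probabilities are invented for illustration, and the uniform choice among a document's co-authors is an assumption the abstract does not state. This is a minimal sketch of the generative side only; it does not show how the distributions are learned from a corpus.

```python
import random

# Hypothetical toy parameters (not from the talk): each author is a
# distribution over topics, each topic a distribution over words.
AUTHOR_TOPICS = {
    "alice": {"learning": 0.8, "networks": 0.2},
    "bob": {"learning": 0.3, "networks": 0.7},
}
TOPIC_WORDS = {
    "learning": {"model": 0.5, "training": 0.3, "kernel": 0.2},
    "networks": {"router": 0.4, "packet": 0.4, "latency": 0.2},
}


def generate_document(authors, author_topics, topic_words, n_words, rng):
    """Sample n_words (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: the author responsible for a word is picked
        # uniformly from the document's author list.
        author = rng.choice(authors)
        # Stage 1: draw a topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topics[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Stage 2: draw a word from that topic's word distribution.
        words, word_probs = zip(*topic_words[topic].items())
        word = rng.choices(words, weights=word_probs, k=1)[0]
        tokens.append((author, topic, word))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is repeatable
    for triple in generate_document(
        ["alice", "bob"], AUTHOR_TOPICS, TOPIC_WORDS, 5, rng
    ):
        print(triple)
```

Fitting such a model means going the other way: recovering the author and topic distributions from observed documents and their author lists.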

BIO: Padhraic Smyth is a Professor in the School of Information and Computer Science, University of California, Irvine. He is currently also a member of the Institute for Mathematical Behavioral Sciences, the Institute for Genomics and Bioinformatics, and the Center for Biomedical Engineering (all at UC Irvine). Dr. Smyth's research interests include machine learning, data mining, statistical pattern recognition, applied statistics, and information theory.

He was a recipient of best paper awards at the 2002 and 1997 ACM SIGKDD Conferences, an IBM Faculty Partnership Award in 2001, a National Science Foundation Faculty CAREER award in 1997, and the Lew Allen Award for Excellence in Research at JPL in 1993. He is co-author of Modeling the Internet and the Web: Probabilistic Methods and Algorithms (with Pierre Baldi and Paolo Frasconi), published by Wiley in 2003. He is also co-author of a graduate text in data mining, Principles of Data Mining, MIT Press, August 2001, with David Hand and Heikki Mannila. He was co-editor of Advances in Knowledge Discovery and Data Mining, published by MIT Press in 1996.

He is currently an associate editor for the Journal of the American Statistical Association and for the IEEE Transactions on Knowledge and Data Engineering, has served as an action editor for the Machine Learning Journal, is a founding associate editor for the Journal of Data Mining and Knowledge Discovery, and is a founding editorial board member of the Journal of Machine Learning Research. He served as program chair for the 33rd Symposium on Computer Science and Statistics in 2001 and as general chair for the Sixth International Workshop on AI and Statistics in 1997.

He received a first class honors degree in Electronic Engineering from University College Galway (National University of Ireland) in 1984, and the MSEE and PhD degrees from the Electrical Engineering Department at the California Institute of Technology in 1985 and 1988 respectively. From 1988 to 1996 he was a Technical Group Leader at the Jet Propulsion Laboratory, Pasadena, and he has been on the faculty at UC Irvine since 1996.