Generative Learning Algorithms for Textual Query Model Estimation

September 8, 2010
Halligan 111b


We address the problem of document retrieval with semantically structured queries, that is, queries in which each term carries a tagged field label. We introduce the Dirichlet Aspect Weighting model, which integrates terms from external databases into the query language model within a Bayesian learning framework. In this model, the Dirichlet prior distribution is governed by parameters that depend on the number of fields in the external databases. The model requires additional, semantically compatible examples to augment the semantically structured query; these examples are obtained using pseudo relevance feedback. The query augmented with this semantic content feedback is used to formulate a log-likelihood function for the Dirichlet Aspect Weighting model, which we maximize with a novel generalized EM algorithm to estimate the semantically structured query. In experiments on the TREC 2005 Genomics track dataset that incorporate terms from external databases, the Dirichlet Aspect Weighting model shows an improvement over baseline methods that use pseudo relevance feedback.
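
As a rough illustration of the general Bayesian idea only, the sketch below estimates a multinomial query language model from query and pseudo relevance feedback terms under a Dirichlet prior. It is not the Dirichlet Aspect Weighting model or its generalized EM algorithm: it uses the standard posterior-mean formula for a Dirichlet prior with pseudo-counts mu * prior[w]. The function name, the parameter mu, and the toy prior built from "external database" terms are all assumptions made for this example.

```python
from collections import Counter


def dirichlet_smoothed_query_model(query_terms, feedback_terms, prior, mu=10.0):
    """Posterior-mean estimate of a multinomial query model.

    Assumption: `prior` is a probability distribution over terms (sums to 1),
    standing in for terms drawn from an external database. `mu` is a
    hypothetical prior strength (total pseudo-count).
    """
    # Observed term counts from the query plus the feedback terms.
    counts = Counter(query_terms) + Counter(feedback_terms)
    vocab = set(counts) | set(prior)
    total = sum(counts.values()) + mu
    # p(w) = (c(w) + mu * prior(w)) / (n + mu)
    return {w: (counts[w] + mu * prior.get(w, 0.0)) / total for w in vocab}


# Toy usage with made-up terms.
model = dirichlet_smoothed_query_model(
    query_terms=["gene", "gene", "protein"],
    feedback_terms=["gene", "cell"],
    prior={"gene": 0.5, "protein": 0.25, "cell": 0.25},
    mu=5.0,
)
for term, prob in sorted(model.items()):
    print(term, round(prob, 3))
```

A larger mu pulls the estimate toward the prior; a smaller mu lets the query and feedback counts dominate.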

We also introduce a user-term feedback approach, in which the user of a text retrieval system uses terms that indicate the semantic content of feedback documents to estimate a query language model. To extract the semantic content of the feedback documents, we propose a clustering-based method: terms indicating semantic content are selected from the clusters and presented to the user for judgment, and query language models are estimated from this user-term feedback. Experiments on the TREC 2005 HARD track dataset show that the user-term feedback approach outperforms the relevance feedback approach when there are few relevant documents among the feedback documents.

Bio: Dr. Atulya Velivelli completed his M.S. and Ph.D. in Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He received his B.Tech. in Electronics and Communication Engineering from Jawaharlal Nehru Technological University, India. His industry experience includes internships at the IBM T. J. Watson Research Center and Thomson-Reuters Corp R&D. He has authored papers on text mining, text retrieval, ordinal regression, and multimedia information retrieval. He is currently employed with dimension consulting USA.