PhD Defense: Augmented Training Methods for Hidden Markov Models for the Detection of Remote Protein Homologs

May 6, 2010
Halligan 111a
Speaker: Anoop Kumar, Tufts University


While profile hidden Markov models (HMMs) are successful and powerful methods for recognizing homologous proteins, they can break down when homology becomes too distant, because sufficient training data is lacking. This thesis examines two models of evolution that generate additional data for training HMMs.

We first show that we can improve the performance of HMMs in the task of recognizing remote homologs by using a simple simulated model of evolution to create an augmented training set. We show, in two different remote protein homolog tasks, that HMMs whose training is augmented with simulated evolution outperform HMMs trained only on real data. We find that a mutation rate between 15 and 20 percent performs best for recognizing GPCR proteins in different classes, and for recognizing SCOP superfamily proteins from different families.
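
To make the idea of an augmented training set concrete, the following is a minimal sketch, not the method used in the thesis. It assumes that simulated evolution means independent point substitutions, that each replacement residue is drawn uniformly from the other 19 amino acids, and that gap symbols are left untouched. The function names `mutate` and `augment` and the parameters `rate`, `copies`, and `seed` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq, rate, rng):
    """Return a copy of seq in which each residue is substituted with probability `rate`."""
    out = []
    for res in seq:
        if res in AMINO_ACIDS and rng.random() < rate:
            # Assumption: uniform choice among the 19 other residues.
            out.append(rng.choice([a for a in AMINO_ACIDS if a != res]))
        else:
            # Gaps and non-standard symbols are kept as they are.
            out.append(res)
    return "".join(out)

def augment(seqs, rate=0.15, copies=3, seed=0):
    """Return the real sequences followed by `copies` simulated mutants of each."""
    rng = random.Random(seed)
    augmented = list(seqs)
    for s in seqs:
        augmented.extend(mutate(s, rate, rng) for _ in range(copies))
    return augmented
```

Under these assumptions, calling `augment(training_sequences, rate=0.15)` yields a list holding the original sequences plus three mutated copies of each, which could then be passed to whatever HMM training tool is in use. The default rate of 0.15 is simply the lower end of the range reported above.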

We next consider the remote homology detection problem for beta-structural motifs using HMMs. In particular, we ask whether a statistical model trained on members of only one family in a SCOP beta-structural superfamily can recognize members of other families in that superfamily. Because HMMs do not capture the pairwise statistical preferences of residues that are hydrogen bonded in beta-sheets, we explore a method of incorporating pairwise dependencies into these models. We show that HMMs trained with our pairwise model of simulated evolution achieve a median improvement of nearly 5% in AUC for beta-structural motif recognition, compared to ordinary HMMs.
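
One way a pairwise dependency could enter simulated evolution is sketched below. This is an illustration under stated assumptions, not the thesis's model: it assumes the hydrogen-bonded positions are supplied as index pairs, that unpaired positions mutate independently and uniformly, and that when a bonded pair mutates, the partner residue is drawn from a conditional table. The names `mutate_with_pairs`, `pairs`, and `cond_probs` are hypothetical, and the table values would have to be estimated from real beta-sheet data.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate_with_pairs(seq, pairs, cond_probs, rate, rng):
    """Mutate seq so that bonded positions change jointly.

    pairs      -- list of (i, j) index pairs assumed to be hydrogen bonded
    cond_probs -- hypothetical table: cond_probs[a][b] = P(partner is b | residue is a)
    """
    out = list(seq)
    paired = {k for pair in pairs for k in pair}

    # Unpaired positions: independent point substitutions (uniform, by assumption).
    for k, res in enumerate(seq):
        if k not in paired and rng.random() < rate:
            out[k] = rng.choice([a for a in AMINO_ACIDS if a != res])

    # Bonded pairs: substitute the first residue, then draw its partner
    # from the conditional distribution given the new residue.
    for i, j in pairs:
        if rng.random() < rate:
            out[i] = rng.choice([a for a in AMINO_ACIDS if a != seq[i]])
            partners = cond_probs[out[i]]
            out[j] = rng.choices(list(partners), weights=list(partners.values()))[0]
    return "".join(out)
```

The design choice here is that a bonded pair is treated as a single mutation event, so the simulated sequences preserve whatever pairing preferences the table encodes, which an ordinary HMM trained on independent columns would not see.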