Research Talk: Remote Homology Detection for Alpha-helical Proteins Using Simulated Evolution

November 3, 2011
12:30-1:20 pm
Halligan 127
Speaker: Mengfei Cao, Tufts University

Abstract

Charles Darwin described evolution as the cause of "endless forms most beautiful and most wonderful." Because evolution is so significant in the natural world, scientists have long studied it. One way to harness the power of evolution is to apply it to remote homology detection, where we attempt to determine whether a protein is evolutionarily related to a group of proteins. In 2009 and 2010, Kumar and Cowen introduced simulated evolution, a method for generating artificial protein sequences inspired by an evolutionary model, and showed that incorporating it into the training phase of a Hidden Markov Model (HMM) could improve the model's ability to detect remote protein homologs. Remote homology detection is important because it gives scientists an efficient way to predict the structure and function of newly discovered protein sequences. The number of such sequences is growing dramatically; however, the lack of prior knowledge and the shortage of proteins with solved 3D structures make the remote homology detection problem quite challenging. Moreover, the remoteness of the homology exacerbates these difficulties because the sequences are not easily compared. By constructing probabilistic models that approximate the mutation distribution of amino acids among alpha proteins, we show how to adapt Kumar and Cowen's simulated evolution paradigm to alpha-helical proteins. In particular, we consider whether an HMM trained on members of an alpha-helical SCOP superfamily can recognize proteins in that superfamily that belong to SCOP families missing from the training set. We show that our simulated evolution model improves the median AUC (area under the ROC curve) on this problem by 4.6% compared to HMMer 3.0a2, the de facto standard, without simulated evolution. On the important globin superfamily, the improvement is even more dramatic: the AUC rises from 71.40% to 92.80%.
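
For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how the area under the ROC curve can be computed from model scores. It is not the speaker's evaluation code. The function name roc_auc and all scores below are hypothetical and chosen only for illustration; the sketch assumes that higher scores mean a stronger predicted match to the superfamily. It uses the standard rank interpretation: the AUC equals the probability that a randomly chosen true superfamily member outscores a randomly chosen non-member, with ties counted as one half.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (positive, negative) pairs in which the
    positive scores higher; a tie contributes 0.5.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores (illustration only, not data from the talk):
# proteins from a held-out family of the target superfamily ...
held_out_family_scores = [12.1, 9.4, 3.0, 7.7]
# ... and proteins from outside the superfamily.
unrelated_scores = [1.2, 4.5, 2.2, 8.0, 0.3]

auc = roc_auc(held_out_family_scores, unrelated_scores)
print(f"AUC = {auc:.2f}")  # prints "AUC = 0.85"
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported change on the globin superfamily from 71.40% to 92.80% means that far more member/non-member pairs are ranked in the correct order.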