Research Talk: Remote Homology Detection for Alpha-helical Proteins Using Simulated Evolution

November 1, 2012
Halligan 127
Speaker: Mengfei Cao, Tufts University


Remote homology detection, the task of determining whether a protein is evolutionarily related to a given group of proteins, is one of the most important topics in bioinformatics because of its utility in predicting protein function and protein structure. Hidden Markov Models (HMMs), the de facto standard for remote homology detection, have an intrinsic shortcoming: when too few protein structures are known, their performance in detecting remote homologs degrades. In fact, fewer than 1% of the proteins in the known sequence space have known structures.

In 2010, Kumar and Cowen introduced simulated evolution for beta-structural proteins: a method for generating artificial protein sequences inspired by an evolutionary model. They showed that, when incorporated into the training phase of a Hidden Markov Model, it could improve the model's ability to detect remote protein homologs. We show how to adapt Kumar and Cowen's simulated evolution paradigm to alpha-helical proteins by constructing probabilistic models that estimate mutation preferences and approximate the mutation distribution of amino acids in alpha-helical proteins.
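
As a rough illustration of the general idea of augmenting a training set with artificially mutated sequences, a minimal sketch follows. It is not the model presented in the talk: the per-residue mutation rate, the number of copies, and the uniform substitution table are placeholder assumptions, and the names mutate, augment, and uniform_sub_probs are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def uniform_sub_probs():
    """Placeholder substitution table: every other residue is equally likely.

    A real model would estimate these weights from data; the uniform
    choice here is only an assumption for illustration.
    """
    return {a: {b: 1.0 for b in AMINO_ACIDS if b != a} for a in AMINO_ACIDS}


def mutate(seq, sub_probs, rate, rng):
    """Return a copy of seq where each residue is substituted with probability rate."""
    out = []
    for aa in seq:
        if rng.random() < rate:
            choices, weights = zip(*sub_probs[aa].items())
            out.append(rng.choices(choices, weights=weights, k=1)[0])
        else:
            out.append(aa)
    return "".join(out)


def augment(seqs, sub_probs, rate=0.1, copies=3, seed=0):
    """Return the original sequences followed by `copies` mutated variants of each."""
    rng = random.Random(seed)
    mutated = [mutate(s, sub_probs, rate, rng) for s in seqs for _ in range(copies)]
    return list(seqs) + mutated
```

The augmented list would then stand in for the original training sequences when building the profile HMM; how the artificial sequences are weighted against the real ones is left open here.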

In particular, we consider whether an HMM trained on members of an alpha-helical SCOP superfamily can recognize proteins in that superfamily that belong to SCOP families missing from the training set. We show that our simulated evolution model improves the median AUC on this problem by 4.9% compared to HMMER 3.0a2 without simulated evolution.
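
For readers unfamiliar with the metric, the sketch below shows the standard definition of AUC (the probability that a true superfamily member scores above a non-member, with ties counted as one half) and a median taken over several test problems. The announcement does not describe the evaluation pipeline, so the score lists and the names auc and median_auc are hypothetical.

```python
from statistics import median


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as one half.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def median_auc(problems):
    """Median AUC over a list of (pos_scores, neg_scores) test problems."""
    return median(auc(pos, neg) for pos, neg in problems)
```

Here each test problem would correspond to one held-out family, with pos_scores the HMM scores of its members and neg_scores the scores of proteins outside the superfamily.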