Research Talk: Touring Protein Space with Matt

November 17, 2009

10:30a-11:15a

Halligan 127

Speaker: Noah M. Daniels, Tufts University

Abstract

The central dogma of proteomics states that genetic sequence determines protein (amino acid) sequence, which in turn determines structure and ultimately biological function. Thus, differences in protein structure lead to differences in function. However, while differences in sequence ultimately lead to differences in structure, they do not do so at a uniform rate. For this reason, it is useful to understand both the evolutionary relationships among proteins and how those proteins relate in terms of structure.

Since 1995, biologists have organized known proteins into a hierarchy according to the three-dimensional structures they form. Structural Classification of Proteins (SCOP) [Murzin et al. 1995] is one such manually curated hierarchy. However, advances in bioinformatics have made the manually curated SCOP a serious bottleneck: it cannot keep up with the flood of newly solved protein structures. There have been attempts to automate the structural organization of protein space, such as Families of Structurally Similar Proteins (FSSP) [Holm et al. 1992]. However, such attempts have not satisfied biologists.

We believe that advances in structural alignment tools, in particular Multiple Alignment with Translations and Twists (Matt) [Menke et al. 2008], allow us to organize protein space automatically in a manner consistent with SCOP, and thus with the needs of biologists. We present Touring Protein Space with Matt, an automated clustering of the space of solved protein structures that uses the Matt alignment tool and a distance metric derived from training on a subset of SCOP. Our hierarchy of protein structure agrees well with SCOP at the family level (corresponding to a high degree of similarity among structures) and at the superfamily level (a lesser degree of similarity). We hope that this automated hierarchy will relieve the SCOP bottleneck while providing both a useful tool for understanding evolutionarily related protein structures and a benchmark set for improving the quality of sequence alignment tools.
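
To make the idea of a two-level hierarchy concrete, the following is a minimal sketch of threshold-based agglomerative clustering over pairwise structural distances. It is an illustration only and not the method presented in the talk: the protein names, the distance values, the choice of average linkage, and the two cutoffs (0.3 for "family", 0.6 for "superfamily") are all hypothetical, and no Matt output or SCOP data is used.

```python
from itertools import combinations


def average_linkage(dist, names, threshold):
    """Merge the closest clusters until the closest pair exceeds threshold."""
    clusters = [frozenset([n]) for n in names]

    def d(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        if d(a, b) > threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical pairwise distances (made up for illustration, not Matt output).
names = ["p1", "p2", "p3", "p4", "p5"]
raw = {
    ("p1", "p2"): 0.1, ("p1", "p3"): 0.4, ("p2", "p3"): 0.5,
    ("p4", "p5"): 0.2, ("p1", "p4"): 0.9, ("p1", "p5"): 0.9,
    ("p2", "p4"): 0.9, ("p2", "p5"): 0.8, ("p3", "p4"): 0.9,
    ("p3", "p5"): 0.9,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

# A tight cutoff yields "family"-like groups; a looser one, "superfamily"-like.
families = average_linkage(dist, names, threshold=0.3)
superfamilies = average_linkage(dist, names, threshold=0.6)
print(families)
print(superfamilies)
```

With these made-up distances, the tight cutoff keeps p3 separate from p1 and p2, while the looser cutoff groups all three together, mirroring how a superfamily contains one or more families.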