Undergraduate Senior Honors Thesis Defense: Exploring the Role of Pfam Families in Protein Function

April 26, 2018
9:30 AM
Halligan 209
Speaker: Daniel Meyer
Host: Lenore Cowen

Abstract

It is widely accepted that the structure of a protein determines its function. Many computational inference methods that act on proteins or sets of proteins rely on functional labels assigned from a popular ontology, often the Gene Ontology (GO). At the same time, rich and easily obtained structural motif information is captured for proteins by libraries of Hidden Markov Models (HMMs), such as Pfam. Because structure relates to function, and many of these HMM-based motifs are signatures of three-dimensional substructures within the overall protein fold, it would follow that annotating these HMM domains in a protein structure should aid in inferring its correct GO functional labels. However, creating a useful Pfam-to-GO mapping remains difficult. First, the mapping is certainly not one-to-one. Second, different Pfam-derived domains within a protein structure, either individually or as a set, might yield different amounts of specificity with regard to the set of appropriate GO labels. Estimating how much specificity a single Pfam-derived domain, or a set of them, gives with regard to GO labeling is confounded by unequal representation and incomplete annotation coverage in both annotation resources across the protein universe. We revisit issues of sequence patterns, diversity, and representation in light of the new data in current sequence databases. We have developed a suite of parsers and an Object-Relational Mapping, using Python and SQLAlchemy, to represent selected information about proteins and families from the UniProt and Pfam databases, respectively, while making it easy to access and reason about information stored in the graph structures of the GO and the Evidence Code Ontology (ECO). We use this framework to compare dcGO (Fang and Gough, 2013) and GODM (Alborzi et al., 2017), which are designed to optimize different trade-offs between coverage and false positives.
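
To make the many-to-many nature of a Pfam-to-GO mapping concrete, here is a minimal sketch in plain Python. It is not the thesis framework (which is built on SQLAlchemy), and it is neither dcGO nor GODM. Every identifier in it (the `GO:` and `PF_` names, the proteins, and the helper functions `ancestors` and `go_terms_by_family`) is made up for illustration. The one fact it assumes is that a GO annotation to a term also implies annotation to all of that term's ancestors in the ontology graph.

```python
from collections import defaultdict

# Toy GO fragment: child term -> set of parent terms (a DAG, not a tree).
# All identifiers are hypothetical, not real GO accessions.
GO_PARENTS = {
    "GO:root": set(),
    "GO:binding": {"GO:root"},
    "GO:catalysis": {"GO:root"},
    "GO:atp_binding": {"GO:binding"},
    "GO:kinase": {"GO:catalysis", "GO:atp_binding"},
}


def ancestors(term, parents=GO_PARENTS):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy proteins: name -> (Pfam families present, directly annotated GO terms).
PROTEINS = {
    "P1": ({"PF_a"}, {"GO:kinase"}),
    "P2": ({"PF_a", "PF_b"}, {"GO:atp_binding"}),
    "P3": ({"PF_b"}, {"GO:binding"}),
}


def go_terms_by_family(proteins=PROTEINS):
    """Map each family to the GO terms seen on proteins carrying it,
    with each direct annotation propagated to all of its ancestors."""
    mapping = defaultdict(set)
    for families, terms in proteins.values():
        propagated = set(terms)
        for term in terms:
            propagated |= ancestors(term)
        for family in families:
            mapping[family] |= propagated
    return dict(mapping)
```

In this toy data, one family co-occurs with several GO terms and one GO term co-occurs with several families, which is why a naive co-occurrence table cannot by itself say how specific a family is for a given label.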