Senior Honors Thesis Defense: Usage of Semantic Similarity Measurement
The study and comparison of protein functions is a central problem in modern biology and bioinformatics research, and a better understanding of protein function helps target medical and pharmacological research. Ontologies of functional terms organize and give structure to the space of possible protein functions, while annotation corpora apply functional labels to specific proteins. Many methods exist for comparing protein functions, ranging from simply counting the functional labels two proteins share to more complex methods that exploit the structure of an ontology. We focus on the Resnik semantic similarity measure, which uses both the structure of an ontology and the distribution of functional labels throughout an annotation corpus. In this paper, we show that incomplete data can lead to erroneously low Resnik values, while high Resnik values are likely to be more meaningful. We analyze Resnik scores using the Gene Ontology Consortium’s ontology (GO) and annotation corpora from UniProt Swiss-Prot and the Saccharomyces Genome Database (SGD). For each species, we build a matrix of Resnik scores containing the Resnik values between all pairs of proteins within that species. Using these matrices, we show that even high-quality datasets such as SGD and UniProt Swiss-Prot do not completely label their proteins, leaving many proteins labeled with only very general functions. We go on to discuss methods for identifying high and low Resnik values. We also show that matrix completion methods do not appear to be effective at predicting functional similarity between two proteins within a species.
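
To make the measure concrete, the following is a minimal sketch of a Resnik computation on a toy ontology, not the pipeline used in the thesis. The term names, the four-protein corpus, the natural-log base, and the rule for combining term-level scores into a protein-level score (maximum over all term pairs) are all assumptions chosen for illustration. The information content of a term is taken as the negative log of the fraction of proteins annotated with that term or any of its descendants, and the Resnik score of two terms is the highest information content among their common ancestors.

```python
import math

# Hypothetical toy ontology (not GO): term -> list of parent terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}

# Hypothetical annotation corpus: protein -> directly assigned terms.
# P4 is labeled only with the most general term, mimicking an
# incompletely annotated protein.
ANNOTATIONS = {
    "P1": {"dna_binding"},
    "P2": {"rna_binding"},
    "P3": {"catalysis"},
    "P4": {"root"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content():
    """IC(t) = log(N / n_t), i.e. -log of the fraction of proteins
    annotated with t directly or through a descendant."""
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        implied = set()
        for term in terms:
            implied |= ancestors(term)
        for term in implied:
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: math.log(total / c) for t, c in counts.items() if c > 0}


def resnik_terms(t1, t2, ic):
    """Highest information content among the common ancestors."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic.get(term, 0.0) for term in common)


def resnik_proteins(p1, p2, ic):
    """Protein-level score as the maximum over all term pairs
    (one common convention, assumed here)."""
    return max(
        resnik_terms(a, b, ic)
        for a in ANNOTATIONS[p1]
        for b in ANNOTATIONS[p2]
    )


IC = information_content()
print(round(resnik_proteins("P1", "P2", IC), 3))  # 0.693, shared ancestor "binding"
print(resnik_proteins("P1", "P4", IC))            # 0.0, only "root" is shared
```

In this toy corpus, P4 scores 0 against every other protein simply because it carries only the root label, whatever its true function may be. This is the kind of low value that incomplete annotation can produce, whereas the nonzero score for P1 and P2 reflects a specific shared ancestor.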