Computational Methods for the Prediction and Analysis of Protein-Protein Interaction Networks

April 26, 2023

9:00am ET

Cummings 601

Speaker: Kapil Devkota - PhD Defense

Host: Lenore Cowen

Abstract

A protein-protein interaction (PPI) network represents the physical interactions between pairs of proteins as an undirected graph. PPI networks have proven broadly useful for visualizing and inferring functional relationships between proteins in an organism's cells. However, the experimental screening assays that produce the underlying data are hard to run at sufficient scale to cover all potentially interacting pairs genome-wide. In the current PPI landscape, some important species, such as baker's yeast and human, have high-quality PPI data that, though still somewhat noisy and very much incomplete, has enough genome-wide coverage for network-based inference and analysis to be meaningful. In other organisms, however, there is either no PPI data whatsoever or the data is too sparse to allow any such analysis. In both cases, the missing data makes computational methods for predicting missing links useful. However, since the sparsity and coverage of PPI data differ significantly between these two settings, the task of predicting new PPI connections should be adjusted to address these differences as well.

For the high-coverage setting, where large amounts of experimental PPI data are available, we propose GLIDE, a novel network-based link prediction method that combines local heuristics with a global diffusion-based method to assess link likelihood. For the non-model organism setting, where experimental PPI data is sparse or unavailable, we propose a new deep learning framework called Topsy-Turvy, which uses only protein sequence information to predict link likelihoods but also learns about protein interactions from the structure of the PPI network of the model organism it is trained on. The protein sequences themselves are encoded by a language model, in order to imbue the resulting embeddings with some protein structural information before feeding them to the Topsy-Turvy model. We also show that GLIDE and Topsy-Turvy each outperform previous link-prediction algorithms designed for their respective settings.
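To make the idea of combining a local heuristic with a global diffusion-based score concrete, here is a minimal sketch in Python. It is not the GLIDE scoring function: the Jaccard neighbor overlap, the random walk with restart, the weight `alpha`, the restart probability, and the toy network are all assumptions chosen only for illustration.

```python
from itertools import combinations


def jaccard(adj, u, v):
    """Local heuristic (assumed): shared neighbors over all neighbors of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def random_walk_with_restart(adj, source, restart=0.3, iters=50):
    """Global diffusion (assumed): visit probabilities of a walk restarting at source."""
    p = {node: 0.0 for node in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for node, mass in p.items():
            if adj[node]:
                share = (1.0 - restart) * mass / len(adj[node])
                for neighbor in adj[node]:
                    nxt[neighbor] += share
            else:
                # An isolated node has nowhere to go, so its mass returns to the source.
                nxt[source] += (1.0 - restart) * mass
        nxt[source] += restart
        p = nxt
    return p


def diffusion_score(adj, u, v):
    """Symmetrized diffusion affinity between u and v."""
    return (random_walk_with_restart(adj, u)[v] + random_walk_with_restart(adj, v)[u]) / 2


def combined_score(adj, u, v, alpha=0.5):
    """Weighted mix of the local and global scores; alpha is a hypothetical knob."""
    return alpha * jaccard(adj, u, v) + (1.0 - alpha) * diffusion_score(adj, u, v)


def rank_candidates(adj, alpha=0.5):
    """Score every pair that is not yet an edge and sort from most to least likely."""
    pairs = [(u, v) for u, v in combinations(sorted(adj), 2) if v not in adj[u]]
    scored = [(pair, combined_score(adj, pair[0], pair[1], alpha)) for pair in pairs]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy undirected network (made-up node names): a triangle A-B-C with a tail C-D-E.
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

ranked = rank_candidates(toy_network)
for pair, score in ranked:
    print(pair, round(score, 3))
```

In this toy network, pairs that share a neighbor and sit close together in the graph rank above pairs such as A and E, which share no neighbor and are connected only through a long path.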

Finally, back in the high-information setting (i.e., in humans, where adequate amounts of experimental PPI data permit a network-based approach), we move away from link prediction and revisit one of the most classical inference tasks that PPI networks are used for: predicting the function of proteins whose function is unknown, solely from the global structure of the PPI network. We show that a slight modification of GLIDE, called GLIDER, produces an embedding with functionally enriched neighborhoods. We show that GLIDER, equipped with a simple classifier such as k-nearest neighbors, outperforms existing function prediction methods when a widely used set of Gene Ontology (GO) terms is used as functional labels. We use GLIDER to find functionally enriched neighbors of proteins implicated in Parkinson's disease by genome-wide association studies (GWAS), and discover some known and some potentially novel genes implicated in Parkinson's disease.
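As a rough illustration of function prediction with a k-nearest-neighbors vote over an embedding, the sketch below assigns labels to an unannotated protein from its closest annotated neighbors. The two-dimensional coordinates, protein names, and label strings are invented placeholders, and Euclidean distance is an assumption; none of this reproduces the GLIDER embedding or its evaluation.

```python
import math
from collections import Counter


def knn_predict(embedding, labels, query, k=3):
    """Rank functional labels for `query` by votes from its k nearest labeled proteins."""
    annotated = [protein for protein in labels if protein != query]
    nearest = sorted(
        annotated,
        key=lambda protein: math.dist(embedding[protein], embedding[query]),
    )[:k]
    votes = Counter(term for protein in nearest for term in labels[protein])
    return [term for term, _ in votes.most_common()]


# Hypothetical embedding coordinates; a real embedding would have many more dimensions.
toy_embedding = {
    "P1": (0.0, 0.1),
    "P2": (0.1, 0.0),
    "P3": (1.0, 1.0),
    "P4": (0.9, 1.1),
    "Q": (0.05, 0.05),  # protein of unknown function
}

# Placeholder functional labels standing in for GO terms.
toy_labels = {
    "P1": {"transport"},
    "P2": {"transport"},
    "P3": {"kinase activity"},
    "P4": {"kinase activity"},
}

predicted = knn_predict(toy_embedding, toy_labels, "Q", k=3)
print(predicted)
```

The vote is only as good as the embedding: if nearby proteins in the embedding tend to share function, even this simple classifier can rank the right label first.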