PhD Defense: Detecting High-Level Gene Function and Inferring Genetic Pathway Relationships Using Machine Learning Methods
My thesis addresses two different problems related to predicting gene function. In the first project, we try to predict GO (Gene Ontology) annotations for genes, using a fast support vector machine (SVM) method that models frequency profiles of very short subsequences (2-4 amino acid residues in length).
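As a rough illustration of what a frequency profile over short subsequences might look like, the sketch below counts overlapping 2-, 3- and 4-residue windows in a protein sequence and normalizes the counts within each length. The function name kmer_profile, the per-length normalization, and the sparse dictionary representation are assumptions made for this example; the abstract does not specify how the thesis builds the features it passes to the SVM.

```python
from collections import Counter


def kmer_profile(sequence, k_values=(2, 3, 4)):
    """Relative frequency of each overlapping k-mer in `sequence`.

    Hypothetical helper: frequencies are normalized separately for each k,
    and only k-mers that actually occur are stored (a sparse profile).
    """
    profile = {}
    for k in k_values:
        # Slide a window of width k along the sequence.
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        total = sum(counts.values())
        if total == 0:
            continue  # sequence is shorter than k
        for kmer, n in counts.items():
            profile[kmer] = n / total
    return profile


if __name__ == "__main__":
    # Toy sequence, not real data.
    print(kmer_profile("MKVLAAGMKV", k_values=(2,)))
```

A dictionary like this could be turned into a fixed-length vector (one coordinate per possible k-mer) before being given to any standard SVM implementation.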
In the second project, we turn to higher-order network interactions. Recent years have seen a large increase in the amount of genetic interaction data available from yeast double mutants, in which an interaction between a pair of nonessential genes is characterized by the phenotypic effect of their simultaneous suppression or deletion. One of the simplest such effects is a "synthetic lethality" (SL) interaction: both genes are nonessential, but deleting them simultaneously destroys the yeast's ability to survive. Kelley and Ideker (2005) reported that pairs of functionally redundant protein complexes/pathways tend to exhibit characteristic network topologies in the PI (physical protein-protein interaction) network for yeast, as well as in the GI (genetic interaction, including synthetic lethality) network. In particular, each pair of redundant complexes was enriched for PI interactions within the set of genes making up each complex, but sparse in PI interactions crossing between the two complexes. A complementary pattern emerged in the GI network: pairs of redundant complexes had relatively many GI interactions (e.g., synthetic lethality) between complexes, but relatively few within each complex, reflecting the "redundant buffer" relationship in which each complex serves as a backup for the other within the organism. They described this network pattern as a "between-pathway model" (BPM).
Our approach is more restricted in its use of input data but more comprehensive in coverage: we exhaustively search for bipartite subnetwork patterns with a graph-theoretic partitioning method, using only the GI portion of the interactome. Only after candidate BPMs have been identified do we look at the PI network data, with the aim of directly estimating the probability of observing the distribution of PI edges by chance.
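To make the GI side of the BPM pattern concrete, the sketch below takes a candidate pair of gene sets and counts how many GI edges run between the two sets and how many fall within either set. This only scores a given candidate; it is not the partitioning method used in the thesis, which the abstract does not describe. The function name bpm_edge_counts, the edge-list representation, and the toy gene labels are assumptions for illustration.

```python
def bpm_edge_counts(gi_edges, side_a, side_b):
    """Count GI edges between and within two candidate gene sets.

    Hypothetical helper: `gi_edges` is an iterable of (gene, gene) pairs and
    the two sides are assumed to be disjoint. Edges that touch a gene outside
    both sets are ignored.
    """
    side_a, side_b = set(side_a), set(side_b)
    between = 0
    within = 0
    for u, v in gi_edges:
        if (u in side_a and v in side_b) or (u in side_b and v in side_a):
            between += 1
        elif (u in side_a and v in side_a) or (u in side_b and v in side_b):
            within += 1
    return between, within


if __name__ == "__main__":
    # Toy edges with made-up gene labels, not real interaction data.
    edges = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a1", "a2"), ("a1", "x")]
    print(bpm_edge_counts(edges, {"a1", "a2"}, {"b1", "b2"}))
```

A candidate that fits the between-pathway pattern would show many between-set edges and few within-set edges.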
In this way we are able to generate many more candidate BPMs (i.e., candidate sets of genes that we propose are examples of redundant functional pathways) than were reported by Kelley and Ideker or by successor papers. As an added strength, because the BPM detection scheme does not use the PI data at all, the PI data can serve as independent verification of the candidate BPMs.
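One simple way to ask whether the PI edges seen inside a candidate complex could have arisen by chance is a hypergeometric tail probability, sketched below. This null model is an assumption chosen for illustration (gene pairs drawn uniformly without replacement from all possible pairs); the abstract does not state which probability model the thesis uses. The function name and all parameter names are hypothetical.

```python
from math import comb


def hypergeom_tail(total_pairs, pi_pairs, sampled_pairs, observed):
    """P(X >= observed) under a hypergeometric null.

    Assumed setup: `sampled_pairs` gene pairs are drawn without replacement
    from `total_pairs` possible pairs, of which `pi_pairs` carry a PI edge;
    X is the number of drawn pairs that carry a PI edge.
    """
    upper = min(pi_pairs, sampled_pairs)
    denominator = comb(total_pairs, sampled_pairs)
    numerator = sum(
        comb(pi_pairs, x) * comb(total_pairs - pi_pairs, sampled_pairs - x)
        for x in range(observed, upper + 1)
    )
    return numerator / denominator


if __name__ == "__main__":
    # Toy numbers, not real data: 10 possible pairs, 3 with a PI edge,
    # 2 pairs inside the candidate set, at least 1 of them a PI edge.
    print(hypergeom_tail(10, 3, 2, 1))
```

For a candidate complex of n genes, the number of within-complex pairs is n(n-1)/2, which would play the role of sampled_pairs under this assumed model.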