PhD Defense: Computational Approaches for Biological Data Analysis
Advances in molecular biology technologies and techniques have made large-scale biological data sets increasingly available. Extracting biologically useful information from these data sets has become an important challenge, and computational biology aims to address it. In this thesis, we introduce several computational techniques for analyzing different types of biological data to study human disease.
As the number of sequenced genomes has grown, we have become increasingly aware of the impact of horizontal gene transfer on our understanding of genome evolution. We introduce a new method for detecting horizontal transfer that incorporates the distances typically used by phylogeny-based methods, rather than the trees themselves. We demonstrate that the distance method is scalable and that it performs well precisely in cases where phylogenetic approaches struggle. We conclude that a distance-based approach may be a valuable addition to the set of tools currently available for identifying horizontal gene transfer.
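To make the idea of working with distances rather than trees concrete, the following is a minimal illustrative sketch and not the method developed in the thesis. It assumes that each gene family comes with pairwise distances between the same set of taxa, that a reference set of distances (for example, a genome-wide average) is available, and that a gene whose distances correlate poorly with the reference is worth a closer look. The function names, the Pearson-correlation criterion, the 0.5 threshold, and the toy data are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def distance_vector(dist, taxa):
    """Flatten a pairwise distance table into a list with a fixed pair order."""
    return [dist[frozenset(pair)] for pair in combinations(taxa, 2)]


def flag_candidates(reference, gene_dists, taxa, threshold=0.5):
    """Return genes whose distances correlate with the reference below threshold.

    The correlation criterion and the threshold are illustrative assumptions.
    """
    ref_vec = distance_vector(reference, taxa)
    flagged = []
    for gene, dist in gene_dists.items():
        if pearson(ref_vec, distance_vector(dist, taxa)) < threshold:
            flagged.append(gene)
    return flagged


def table(ab, ac, ad, bc, bd, cd):
    """Build a toy distance table for the four hypothetical taxa A, B, C, D."""
    pairs = ["AB", "AC", "AD", "BC", "BD", "CD"]
    return {frozenset(p): d for p, d in zip(pairs, [ab, ac, ad, bc, bd, cd])}


# Toy data (invented for illustration only).
taxa = ["A", "B", "C", "D"]
reference = table(0.1, 0.5, 0.6, 0.5, 0.6, 0.2)
gene_dists = {
    # Same shape as the reference, uniformly faster-evolving.
    "gene1": table(0.12, 0.6, 0.72, 0.6, 0.72, 0.24),
    # Taxon A looks unexpectedly close to C and D for this gene.
    "gene2": table(0.6, 0.2, 0.05, 0.5, 0.6, 0.2),
}
candidates = flag_candidates(reference, gene_dists, taxa)
```

In this toy setting only the second gene is flagged, because its distances from taxon A contradict the reference pattern. A comparison of this kind costs one pass over the taxon pairs per gene and needs no tree reconstruction, which is one plausible reason a distance-based approach can scale well.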
Next, we examine HIV-human protein-protein interaction (PPI) data to better characterize PPIs between viruses and their hosts. We demonstrate that HIV proteins tend to interact with many human proteins, that viral proteins tend to interact with human hubs, and that the hub proteins with which they interact tend to be preferentially conserved. Furthermore, we design two new approaches to predict PPIs between viruses and hosts, and we test these methods on HIV integrase and its human partners.
We also designed a noise-tolerant method to detect gene sets by biclustering gene expression data. The approach was evaluated on a human metabolic gene expression dataset and a mouse developmental dataset. This work showed that our approach is effective at detecting functionally related gene sets for use in the analysis of novel gene expression data.
Finally, using the chemosensitivity values of the NCI-60 panel of cancer cell lines and their gene expression profiles, we designed an approach to predict the chemosensitivity of new samples to anticancer drugs. Our approach predicts the actual sensitivity values rather than discrete sensitivity classes. Our analysis showed that the method performs significantly better than chance and works well on an identifiable subset of compounds. We then tested and validated the approach by predicting the chemosensitivity of primary tumors to the topoisomerase I inhibitor topotecan, using a model trained only on the NCI-60 cell lines.
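The distinction between predicting a sensitivity value and predicting a sensitivity class can be illustrated with a small sketch. This is not the model used in the thesis: it substitutes a plain k-nearest-neighbor regressor, and the function name, the choice of Euclidean distance, the value of k, and the toy numbers are all assumptions made for illustration.

```python
from math import dist


def knn_predict(train_profiles, train_values, query, k=3):
    """Predict a continuous sensitivity value for one expression profile.

    The prediction is the mean sensitivity of the k training samples whose
    expression profiles are closest (Euclidean distance) to the query.
    """
    ranked = sorted(
        zip(train_profiles, train_values),
        key=lambda pair: dist(pair[0], query),
    )
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Toy data (invented): four "cell lines" with two expression features each
# and one measured sensitivity value per line.
profiles = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
sensitivities = [1.0, 2.0, 3.0, 10.0]

# A new sample that resembles the first three lines.
predicted = knn_predict(profiles, sensitivities, (0.1, 0.1), k=3)
```

Here the output is a number on the same scale as the measured sensitivities, so its error can be assessed directly against measured values, whereas a classifier would report only a label such as sensitive or resistant.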