Developing Machine Learning Models with Heterogeneous Data for Biological Engineering Applications
Emerging machine learning techniques have been applied to a wide variety of biological engineering problems. The success of many machine learning models depends heavily on the availability of labeled data relevant to the prediction task. However, because experimental measurements are costly and biochemical data are scarce, labeled data are typically difficult to obtain. The hypothesis of this thesis is that training machine learning models on abundant auxiliary heterogeneous biochemical data can compensate for limited labeled data. Further, the availability of such data opens the possibility of developing novel machine learning models that significantly improve over existing models. In this thesis, we develop three methods that leverage heterogeneous data in specific domains to design accurate models for three biochemical problems. First, we develop a statistical and machine learning analysis pipeline, termed ASAP-SML, to identify salient antibody sequence features. We use heterogeneous data derived from antibody sequences to demonstrate that combinations of feature values distinguish a set of antibody sequences, which have unique properties as inhibitors, from a reference set. Second, we create a general deep learning-based Recommender System (RS) framework, termed Boost-RS. We use auxiliary data to enhance collaborative filtering-based recommender system models with existing knowledge of enzyme-substrate interactions by “boosting” the learning of embedding vectors. Third, we consider the challenging task of assigning chemical identities to measured spectra, a problem known as metabolite annotation.
We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), which builds on prior annotation models based on Multi-Layer Perceptrons (MLPs) and Graph Neural Networks (GNNs). ESP learns a weighting of the outputs of MLP and GNN spectral predictors to generate a rank-learning-based spectral prediction for a query molecule. The methods we develop cover a wide range of biological engineering applications. Because our methods are designed to make maximal use of available heterogeneous data, they outperform state-of-the-art approaches even when the latter are updated to consider such auxiliary data. Overall, the work presented here demonstrates the importance of developing machine learning models that are tailored to the target biological application.
Join meeting in Granoff Music Center 271, or via Zoom.
Password: see colloquium email
Dial-in not an option for this event.