Quals Research Talk Day!!

December 1, 2023
12:15-5:45 pm ET
Cummings 280
Speakers: G. Cicekli, E. Harvey, F. Huston, J. Hu, B. Stubbs, M. Umair
Host: Diane Souvaine

Schedule:

12:15-1:15 M. Umair, "Towards Naturalistic Turn-Taking in Spoken Dialogue Systems"

1:15-2:00 E. Harvey, "Building Effective Neuroimaging Classifiers from Small Pilot Data"

2:00-2:45 J. Hu, "Improving Traveling Salesman Problem models through generating hard problem-solution pairs"

2:45-3:15 Break with Refreshments

3:15-4:15 F. Huston, "Synthesizing Programs from Fragments with Mason"

4:15-5:00 G. Cicekli, "Towards Detecting Anomalous Module Changes in GRNs"

5:00-5:45 B. Stubbs, "Drug repositioning using LINCS and Larimar"

Abstracts:

Muhammad Umair:

Towards Naturalistic Turn-Taking in Spoken Dialogue Systems

Abstract: Smooth turn-taking is an important aspect of natural conversation. It allows interlocutors to maintain adequate mutual comprehensibility and ensures the sequential dependency of successive turns. In natural conversation, interlocutors tend to minimize gaps and overlaps while following the “one-speaker-at-a-time” rule. A consequence is that the timing between utterances is normatively constrained, and deviations convey socially relevant paralinguistic information. However, for spoken dialogue systems, smooth turn-taking continues to be a challenge. This motivates the need for spoken dialogue systems to employ a robust model of turn-taking to ensure smooth message exchange without communicating unintended social information. In this presentation, Umair will provide an overview of his research, which aims to develop models of turn-taking grounded in an understanding of natural interaction. First, we developed GailBot, an automated system for extracting highly relevant paralinguistic features of conversation. Next, we created a participant-labeled corpus of turn-transition opportunities and empirically justified our treatment of such opportunities. Finally, we explored whether Large Language Models (LLMs) can be reliably used to predict turn opportunities in spoken language.
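
As a rough illustration of the kind of timing feature the abstract alludes to (gaps and overlaps between turns), the sketch below computes floor-transfer offsets from a toy timestamped transcript. It is not GailBot and not Umair's code; the function name and timings are made-up assumptions.

```python
# Hypothetical sketch (not GailBot) of one timing feature referenced in the
# abstract: the floor-transfer offset at each speaker change, i.e. a positive
# gap or a negative overlap between consecutive turns, in seconds.
def floor_transfer_offsets(utterances):
    """utterances: list of (speaker, start_sec, end_sec), ordered by start time."""
    offsets = []
    for prev, curr in zip(utterances, utterances[1:]):
        if prev[0] != curr[0]:                  # only measure at speaker changes
            offsets.append(curr[1] - prev[2])   # > 0 means gap, < 0 means overlap
    return offsets

# Toy transcript with made-up timings.
transcript = [("A", 0.0, 1.8), ("B", 2.0, 3.5), ("A", 3.4, 5.0)]
print([round(o, 2) for o in floor_transfer_offsets(transcript)])  # gap of 0.2 s, then overlap of 0.1 s
```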

Research Area: Spoken-Dialogue Systems, Human Interaction, Natural Language Processing

Research Advisor: JP de Ruiter

_______________________________________________________________

Ethan Harvey

Title: Building Effective Neuroimaging Classifiers from Small Pilot Data

Abstract: Dataset size is often the limiting factor for model performance in deep learning. Without a large dataset of labeled images, deep learning models face challenges like overfitting and poor generalization. This talk explores the challenges posed by small dataset sizes encountered while working on a multi-year funded project to build models that can diagnose stroke/dementia from CT and MRI images of the brain. We offer a probabilistic approach based on Gaussian processes to extrapolate accuracy or similar performance metrics as dataset size increases. Our mean functions are carefully chosen to adapt past work on point estimation to accommodate a priori common sense (accuracy or AUROC can never be larger than 1) and domain knowledge (some tasks may not reach perfect accuracy, only 1-ε). Additionally, we have carefully evaluated a recent Bayesian transfer learning strategy and found that we could not reproduce its results, echoing a growing trend in the literature calling for improved benchmarking practices. We suggest strong baselines for use in future evaluations.
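
The talk's approach is a Gaussian process with carefully chosen mean functions; the sketch below is only a rough stand-in for the "bounded mean function" idea, fitting a saturating power law whose asymptote is capped at 1 - ε to a few hypothetical pilot accuracy measurements. The function names, data, and choice of curve are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the talk's actual model) of accuracy extrapolation with a
# ceiling: fit a saturating power law whose asymptote is bounded by 1 - eps,
# then predict accuracy at larger dataset sizes. Pilot numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(n, saturation, scale, gamma):
    """Accuracy approaches `saturation` (<= 1 - eps) from below as dataset size n grows."""
    return saturation - scale * n ** (-gamma)

# Hypothetical pilot results: accuracy measured at a few small dataset sizes.
sizes = np.array([50, 100, 200, 400])
accuracy = np.array([0.62, 0.68, 0.73, 0.77])

eps = 0.02  # domain knowledge: this task may top out at 1 - eps, not 1.0
params, _ = curve_fit(
    saturating_power_law, sizes, accuracy,
    p0=[0.9, 1.0, 0.3],
    bounds=([0.0, 0.0, 0.0], [1.0 - eps, 10.0, 2.0]),  # asymptote can never exceed 1 - eps
)

for n in [800, 1600, 3200]:
    print(f"predicted accuracy at n={n}: {saturating_power_law(n, *params):.3f}")
```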

Research Area: Machine Learning

Research Advisors: Mike Hughes, Rob Jacob

_______________________________________________________________

Jiajing Hu

Improving Traveling Salesman Problem models through generating hard problem-solution pairs

Abstract: The Traveling Salesman Problem (TSP) is a well-studied NP-hard problem in combinatorial optimization. Existing neural methods for this problem mostly fall into two categories: supervised learning, which uses a solver to obtain optimal solutions and learns their distribution; and reinforcement learning. For supervised learning, getting solutions from a solver is computationally expensive because the problem is NP-hard. Reinforcement learning does not need solutions, but exploring the search space is time-consuming. In addition, both methods use randomly generated training data, leaving open the question of how well such models perform on harder datasets. To alleviate the reliance on a solver and to extend the models' capacity on harder datasets, we propose to generate hard problem-solution pairs and use them to help with training.
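
As a toy illustration of what a solver-free problem-solution pair can look like (not the hard-instance generation the abstract proposes), the sketch below builds Euclidean TSP instances from points placed on a circle, where the optimal tour is simply the angular order. Such instances are deliberately easy; generating genuinely hard pairs is the talk's contribution.

```python
# Toy illustration (not the talk's method) of a TSP problem-solution pair that
# needs no solver: for points in convex position, the optimal Euclidean tour
# visits them in convex-hull (angular) order.
import math, random

def convex_position_instance(n, radius=1.0):
    """Sample n points on a circle; the optimal Euclidean tour is the angular order."""
    angles = sorted(random.uniform(0, 2 * math.pi) for _ in range(n))
    points = [(radius * math.cos(a), radius * math.sin(a)) for a in angles]
    optimal_order = list(range(n))   # tour visits the points in hull (angular) order
    return points, optimal_order

def tour_length(points, order):
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

points, optimal_order = convex_position_instance(20)
print("optimal tour length:", round(tour_length(points, optimal_order), 3))
```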

Research Area: Machine Learning

Research Advisor: Liping Liu

_______________________________________________________________

Fox Huston

Title: Synthesizing Programs from Fragments with Mason

Abstract: Object-oriented programmers often use design patterns when synthesizing their code. While these provide a convenient reuse of coding idioms, using them is a manual process, and requires a lot of uninteresting work. We present Mason, a program synthesis tool that takes user-provided fragments, unit tests, and a base program, and generates code that implements these patterns. It prunes the program search space by using a novel type and constraint system with unknowns for types as well as names. By using rules that may also modify the input program to resolve these constraints, Mason is able to successfully assemble a correct program from its inputs.

Research Area: Programming Languages

Research Advisor: Jeff Foster

_______________________________________________________________

Gizem Cicekli

Title: Towards Detecting Anomalous Module Changes in GRNs

Abstract: Gene Regulatory Networks (GRNs) are directed graphs that describe regulatory interactions between genes, conveying how genes can turn each other on (regulate) and off (deregulate). Because GRNs can describe the “observed” state of a biological system as it is, they can play a vital role in understanding disease mechanisms; temporal changes in network structure are important for understanding immune response, infection, and differentiation processes. In addition, identifying such changes can in turn help develop or identify drugs that would appropriately negate the regulatory effects of a disease. Hence, GRN inference from high-throughput bulk gene expression experiments is a well-studied area. However, many studies simply infer a static GRN snapshot of the biological process. Explaining how the network topology evolves over time and characterizing the structural changes would help us understand, for example, how a disease progresses or when a mutation or drug interference happens. We discuss how newer single-cell RNA sequencing data has unique properties that provide a way of inferring a dynamic stream of GRNs, and then pose the problem as detecting anomalous subgraphs in dynamic networks.
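
As a simplified, hypothetical framing of the anomalous-module-change problem (not the talk's method), the sketch below compares two consecutive GRN snapshots and scores each gene module by how many of its internal regulatory edges were rewired; the gene names, modules, and edges are made up.

```python
# Hypothetical sketch of the problem framing: given two consecutive GRN
# snapshots as directed edge sets and a grouping of genes into modules, score
# each module by how many of its internal regulatory edges appeared or
# disappeared between time points. The module with the largest score is a
# candidate anomalous change.
from collections import defaultdict

def module_change_scores(edges_t0, edges_t1, module_of):
    """edges_*: sets of (regulator, target) pairs; module_of: gene -> module id."""
    changed = edges_t0 ^ edges_t1                           # edges added or removed
    scores = defaultdict(int)
    for reg, tgt in changed:
        if module_of.get(reg) == module_of.get(tgt):        # change internal to one module
            scores[module_of[reg]] += 1
    return dict(scores)

# Toy example with made-up genes and modules.
grn_t0 = {("A", "B"), ("B", "C"), ("D", "E")}
grn_t1 = {("A", "B"), ("C", "B"), ("D", "E"), ("E", "D")}
modules = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}
print(module_change_scores(grn_t0, grn_t1, modules))  # {1: 2, 2: 1} -> module 1 rewired most
```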

Research Area: Bioinformatics, ML

Research Advisor: Donna Slonim

_______________________________________________________________

BJ Stubbs

Title: Drug repositioning using LINCS and Larimar

Abstract: We are in an unprecedented era of medical and scientific breakthroughs, and every year new therapies are developed that promise more effective disease management. However, drug research remains expensive and low-throughput. Drug repositioning can help rescue orphaned drugs, elucidate new areas of research, and provide alternatives to patients who cannot tolerate the current standards of care.

A standard strategy for drug repositioning is to use a case-control framework to detect a disease signal, then reverse the direction of this signal and look for a chemical that produces this “nullifying” signal. We propose detecting this nullifying signal using data from the LINCS project.

Unfortunately, while there is a lot of data, LINCS is rather sparse, with most cell-perturbagen combinations missing. This means that finding a relevant subset of data from LINCS requires additional techniques such as imputation or transfer learning. Donna Slonim’s lab has had some success in imputing LINCS data using collaborative filtering, but there is room for improvement.

We propose developing a topic-model implementation of stochastic block models as a pseudo-kNN-based imputation scheme. In the manuscript “A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data,” Valle et al. show that the topic-model metaphor of simultaneously clustering words into topics and documents into clusters can be extended to clustering genes and samples, leading to profiles that provide insights into cancer subtype and prognosis. We believe that, similarly, topics of genes across cells will allow us better resolution in neighbor selection for kNN-based imputation.
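
As a rough sketch of the neighbor-selection idea only (not Larimar and not the proposed stochastic-block-model topic model), the code below uses NMF as a stand-in topic model to embed expression profiles and then imputes a held-out profile from its k nearest neighbors in topic space; the data are random toy values.

```python
# Hypothetical sketch of topic-space neighbor selection for imputation: embed
# observed expression profiles into a low-dimensional "topic" space with NMF
# (a stand-in for the proposed topic model), then impute a held-out profile as
# the average of its k nearest neighbors in that space.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
expression = rng.random((100, 50))        # toy matrix: 100 profiles x 50 genes
observed, held_out = expression[:-1], expression[-1]

topic_model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
topics_observed = topic_model.fit_transform(observed)          # profile-by-topic loadings
topics_query = topic_model.transform(held_out.reshape(1, -1))  # embed the profile to impute

knn = NearestNeighbors(n_neighbors=5).fit(topics_observed)
_, idx = knn.kneighbors(topics_query)
imputed = observed[idx[0]].mean(axis=0)   # kNN imputation using topic-space neighbors

print("mean absolute error of toy imputation:", np.abs(imputed - held_out).mean())
```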

The tools for data extraction, cleaning, simulation, imputation, and evaluation will be available on GitHub, primarily in an R package named Larimar.

Research Advisor: Donna Slonim