PhD Defense: Machine Learning in Health Informatics: Making Better Use of Domain Experts

June 15, 2012
3 pm - 6 pm
Halligan 111A
Speaker: Byron Wallace, Tufts University


The world is awash with data and there are too few experts (with too little time) to make sense of it. Data mining and machine learning methods can potentially mitigate this problem by automatically or semi-automatically filtering, organizing and otherwise processing unstructured data. But if such technologies are to be useful, they must leverage domain expert time and expertise efficiently. I will discuss novel machine learning (ML) research in this direction that has been motivated by applications in health informatics. There is a pressing need for ML methods in this field due to the data deluge in clinical science. Indeed, PubMed (a repository of biomedical literature) now indexes well over 20 million manuscripts, and continues to grow exponentially. Concurrently, there has been a trend toward Evidence-Based Medicine (EBM), which looks to inform medical practice with current best evidence. Systematic reviews, which address a specific clinical question by synthesizing all of the relevant published medical evidence, are central to EBM, and are increasingly used to inform healthcare decisions. But such reviews are extremely laborious to produce: doctors must read through many thousands of (mostly irrelevant) abstracts to find the few tens of papers that are relevant to their clinical question. We aim to mitigate the workload involved in conducting systematic reviews by semi-automating the process using machine learning.

However, due to properties inherent to clinical informatics tasks, 'off-the-shelf' ML technologies tend to perform poorly in this domain. Specifically, if machine learning is to be successful in clinical science in general (and in semi-automating the conduct of systematic reviews in particular), novel methods must be developed to: mitigate the effects of class imbalance during model induction; exploit the wealth of domain knowledge that highly skilled domain experts (e.g., doctors) bring to the task; and induce better models with less effort (fewer labels). In this talk I will present new machine learning methods that address these issues, and demonstrate their efficacy in the task of abstract screening. In particular, I will present new perspectives on class imbalance, novel methods for exploiting 'dual supervision' (i.e., labels on both instances and features), and an 'active learning' framework appropriate for multiple-expert scenarios. Each of these contributions aims to squeeze better classification performance out of fewer labels, thereby making better use of domain experts' time and expertise in real-world machine learning applications.