Research Talk: Semi-Automated Screening of Biomedical Citations for Systematic Reviews
Systematic reviews address a specific clinical question by assessing and analyzing the pertinent biomedical literature in an unbiased manner. In such reviews, investigators must screen a large set of potentially relevant article abstracts to decide which to include. Typically, researchers must evaluate thousands of citations to identify a few tens of eligible articles for a given review. We aim to reduce the burden that citation screening imposes on reviewers by building a classification model that automatically excludes irrelevant citations. This problem lends itself naturally to the pool-based active learning framework: we have access to a large set of unlabeled citations and an expert willing to provide labels at a cost (time). A distinctive requirement in our scenario is that any semi-automated approach must achieve near-perfect recall with respect to relevant citations (i.e., the minority class); otherwise the scientific validity of the systematic review may be compromised. Unfortunately, as we demonstrate, active learning via traditional methods (e.g., uncertainty sampling) under class imbalance tends to induce classifiers with high predictive accuracy but poor recall. We present a novel method for active learning under class imbalance that achieves gains in accuracy (compared to random sampling) without sacrificing recall.
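To make the setting above concrete, the following is a minimal sketch of a generic pool-based active learning loop with uncertainty sampling, together with the accuracy and recall measures the abstract contrasts. It is not the novel method presented in the talk. All names here (`most_uncertain`, `active_learning_loop`, `oracle`, `fit`, `predict_proba`) are hypothetical, the model interface is an assumption, and the 1,000-citation pool with 20 relevant articles is an illustrative figure rather than data from the study.

```python
from typing import Any, Callable, Dict, List, Sequence, Set


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all citations whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of truly relevant citations (label 1) predicted as relevant."""
    relevant = [i for i, y in enumerate(y_true) if y == 1]
    if not relevant:
        return 1.0  # no relevant citations, so none can be missed
    return sum(1 for i in relevant if y_pred[i] == 1) / len(relevant)


def most_uncertain(probs: Sequence[float], candidates: Set[int], k: int) -> List[int]:
    """Return the k candidates whose predicted P(relevant) is closest to 0.5."""
    return sorted(candidates, key=lambda i: (abs(probs[i] - 0.5), i))[:k]


def active_learning_loop(
    pool: Sequence[Any],
    oracle: Callable[[int], int],            # the expert: index -> 0/1 label
    fit: Callable[[list, list], Any],        # hypothetical trainer
    predict_proba: Callable[[Any, Sequence[Any]], Sequence[float]],
    seed_idx: Sequence[int],
    batch_size: int,
    rounds: int,
) -> Dict[int, int]:
    """Generic uncertainty-sampling loop; returns the labels gathered so far."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        unlabeled = set(range(len(pool))) - set(labeled)
        if not unlabeled:
            break
        idx = list(labeled)
        model = fit([pool[i] for i in idx], [labeled[i] for i in idx])
        probs = predict_proba(model, pool)
        # Ask the expert about the citations the model is least sure of.
        for i in most_uncertain(probs, unlabeled, batch_size):
            labeled[i] = oracle(i)
    return labeled


# Illustrative imbalance: 20 relevant citations out of 1,000 (assumed numbers).
y_true = [1] * 20 + [0] * 980
y_exclude_all = [0] * 1000  # a classifier that excludes every citation
acc = accuracy(y_true, y_exclude_all)  # 0.98
rec = recall(y_true, y_exclude_all)    # 0.0
```

The last lines show why accuracy alone is a poor guide under class imbalance: a classifier that excludes every citation is right 98% of the time in this illustrative pool yet recovers none of the relevant articles.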