Surviving the Growing Data Cost of Machine Learning

Date: March 5, 2020
Time: 3:00
Location: Halligan 102
Speaker: Bert Huang
Host: Mike Hughes

Abstract

The world has become intoxicated with supervised deep learning. Demonstrations of successful deep learning on massive datasets impress practitioners, incentivizing them to invest substantial resources in creating datasets of their own. Practitioners' willingness to accept these large costs, in turn, encourages researchers to design complex models that require even more data. This cycle has led to a status quo where practical machine learning is synonymous with expensive annotation of large datasets. In this talk, I’ll present an alternative to this paradigm: weak supervision, which uses inexpensively obtained unlabeled data and low-cost rules that approximately annotate all of it at once. I’ll cover the challenges that come with weak supervision, including accounting for redundancy and bias among these rules, and detail some of our solutions that counteract these issues. I’ll discuss results showing that our weak supervision approaches successfully train models for benchmark tasks and for cyberbullying detection, all without labeled training data.
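
To make the paradigm concrete, below is a minimal sketch of the general weak-supervision idea the abstract describes: cheap, hand-written rules ("labeling functions") vote on unlabeled examples, and their votes are aggregated into approximate training labels. The specific rules, thresholds, and the majority-vote aggregation here are illustrative assumptions for a toy cyberbullying-style task, not the speaker's method.

# Hypothetical illustration of weak supervision via labeling rules.
# Each rule votes 1 (abusive) or 0 (not) on raw, unlabeled text.

def contains_insult(text):
    # Rule 1: fire on a tiny, illustrative insult lexicon.
    return 1 if any(w in text.lower() for w in ("idiot", "loser")) else 0

def all_caps_shouting(text):
    # Rule 2: fire when most letters are uppercase ("shouting").
    letters = [c for c in text if c.isalpha()]
    return 1 if letters and sum(c.isupper() for c in letters) / len(letters) > 0.8 else 0

def second_person_attack(text):
    # Rule 3: fire on direct address ("you") combined with an insult cue.
    t = text.lower()
    return 1 if "you" in t and ("stupid" in t or "ugly" in t) else 0

LABELING_FUNCTIONS = [contains_insult, all_caps_shouting, second_person_attack]

def weak_label(text):
    # Aggregate the rule votes by simple majority into one approximate label.
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    return 1 if sum(votes) > len(votes) / 2 else 0

unlabeled_posts = [
    "YOU ARE SO STUPID",
    "what an idiot, you are so stupid",
    "have a great day everyone",
]
approximate_labels = [weak_label(p) for p in unlabeled_posts]
print(approximate_labels)  # [1, 1, 0] -- noisy labels a classifier could then train on

Note that the naive majority vote above treats each rule as an independent, equally reliable voter, which is exactly where the redundancy and bias problems mentioned in the abstract arise: overlapping rules (like rules 1 and 3 here) get counted twice, and systematically skewed rules pull the labels with them.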