PhD Defense: Kernel Methods and Their Application to Structured Data

August 31, 2009

10:00a-12:00n

Halligan 111B

Speaker: Gabriel Wachman, Tufts University

Abstract

Supervised machine learning is concerned with the study of algorithms that take examples and their corresponding labels and learn a general classification function that can predict the label of future examples. For example, an algorithm may take as input a set of molecules, each labeled "toxic" or "non-toxic", and try to predict the toxicity of new molecules based on the function learned from the input. In the astronomy domain, one might try to predict the type of a star given a series of measurements of the star's brightness, based on a set of known stars and measurements of their brightness. The thesis investigates three aspects of machine learning algorithms that use linear classification functions operating implicitly in feature spaces by means of similarity functions known as kernels.

The first aspect is robustness to noise, that is, learning when some of the labels in the known examples are not reliable. An extensive experimental evaluation reveals a surprising result: the Perceptron Algorithm with Margin is an excellent algorithm in such contexts, and it is competitive with or better than more sophisticated alternatives.
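As an illustration only, the following is a minimal sketch of a kernelized perceptron with a margin threshold, written in dual form so that the linear classifier lives implicitly in the kernel's feature space. It is not the implementation evaluated in the thesis. The function names, the RBF kernel, the margin parameter tau, the unit learning rate, and the toy XOR data are all assumptions made for this example.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel; chosen here only as an example similarity function."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def train_margin_perceptron(xs, ys, kernel, tau=0.1, epochs=100):
    """Dual-form perceptron that updates whenever the margin is at most tau.

    alpha[i] counts how often example i triggered an update; the classifier is
    f(x) = sum_j alpha[j] * ys[j] * kernel(xs[j], x).
    """
    n = len(xs)
    # Precompute the Gram matrix so each kernel value is evaluated once.
    gram = [[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(epochs):
        updated = False
        for i in range(n):
            score = sum(alpha[j] * ys[j] * gram[j][i] for j in range(n))
            # The plain perceptron updates only on mistakes (margin <= 0);
            # the margin variant also updates on correct but low-margin examples.
            if ys[i] * score <= tau:
                alpha[i] += 1.0
                updated = True
        if not updated:
            break  # every training example now has margin greater than tau
    return alpha


def decision_value(x, xs, ys, alpha, kernel):
    """Signed score of x under the learned dual coefficients."""
    return sum(a * y * kernel(xj, x) for a, y, xj in zip(alpha, ys, xs))


def predict(x, xs, ys, alpha, kernel):
    return 1 if decision_value(x, xs, ys, alpha, kernel) >= 0 else -1


# Toy XOR-style data: not linearly separable in the input space,
# but separable in the feature space induced by the RBF kernel.
xs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
ys = [1, 1, -1, -1]
tau = 0.1
alpha = train_margin_perceptron(xs, ys, rbf_kernel, tau=tau)
predictions = [predict(x, xs, ys, alpha, rbf_kernel) for x in xs]
```

Requiring a margin above tau, rather than mere correctness, is one intuitive reason such an algorithm might tolerate some unreliable labels, although this sketch makes no attempt to reproduce the experimental comparison described above.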

The second aspect is producing estimates of the confidence of predictions from such classifiers, especially Support Vector Machines. We explore this topic by proposing new methods and comparing them experimentally to existing approaches.

Finally, we investigate kernels for the two applications mentioned above, time series from astronomy and molecules from biochemistry, where the data are not initially expressed in Euclidean space. In each case we provide an efficiently computable kernel function that captures a natural similarity between pairs of examples. An experimental evaluation shows that our kernels lead to excellent performance when used with Perceptron variants or Support Vector Machines.

The contribution for the astronomy application goes beyond machine learning, providing a complete system for classifying stars from raw data taken in astronomy surveys, a task that typically requires a large amount of domain expert time. In this context the thesis investigates and evaluates several statistical tests and mechanisms for filtering and processing time series data.