Tufts ML Alumni
Gabriel is a PhD candidate working with Roni Khardon. Gabriel earned his bachelor's degree at Georgetown University and worked in network security prior to coming to Tufts. His research interests include kernels for structured data, classification in the presence of noise, and generating probability measures from classifier output.
Associated Publications:
Authors: G. Wachman, R. Khardon, P. Protopapas, and C. Alcock
Proceedings of the European Conference on Machine Learning (ECML)
Authors: G. Wachman and R. Khardon
Proceedings of the International Conference on Machine Learning (ICML)
Authors: R. Khardon and G. Wachman
Journal of Machine Learning Research
Vol. 8, pp. 227–248
Authors: D. Sculley and Gabriel M. Wachman
SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference
Abstract: Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.
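The online filtering setting the abstract describes can be illustrated with a small sketch. This is not the paper's ROSVM: it simply trains a linear model with hinge loss (the SVM objective) one message at a time using scikit-learn's SGDClassifier, and the tiny messages, labels, and feature dimension are invented for the example.

```python
# Illustrative sketch only: an on-line, SVM-style (hinge-loss) filter trained
# one message at a time, in the spirit of the online filtering setting above.
# This is NOT the paper's ROSVM; the "messages" below are made up.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(loss="hinge")  # hinge loss gives a linear-SVM objective

stream = [
    ("cheap meds buy now", 1),        # spam
    ("meeting moved to 3pm", 0),      # ham
    ("win money click here", 1),      # spam
    ("draft of the paper attached", 0),  # ham
]
for text, label in stream:
    X = vec.transform([text])
    # Online protocol: the filter sees each message once, updates, and moves on.
    clf.partial_fit(X, [label], classes=[0, 1])
```

In a real deployment the filter would predict before learning from each message, so that classification performance can be measured on the stream itself.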
Authors: D. Sculley and Gabriel M. Wachman
TREC 2007: The Sixteenth Text REtrieval Conference Proceedings
Abstract: Relaxed Online Support Vector Machines (ROSVMs) have recently been proposed as an efficient methodology for attaining an approximate SVM solution for streaming data such as the online spam filtering task. Here, we apply ROSVMs in the TREC 2007 Spam filtering track and report results. In particular, we explore the effect of various sliding-window sizes, trading off computation cost against classification performance with good results. We also test a variant of fixed-uncertainty sampling for Online Active Learning. The best results with this approach give classification performance near to that of the fully supervised approach while requiring only a small fraction of the examples to be labeled.
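The fixed-uncertainty sampling idea from the abstract can be sketched as follows: request a label only when the current model's margin on an example falls below a fixed threshold. The linear learner, threshold value, and synthetic data here are assumptions for the example, not the paper's exact setup.

```python
# Illustrative sketch of fixed-uncertainty online active learning: the learner
# pays for a label only when its margin |w.x| is below a fixed threshold.
# The hinge-loss linear learner and synthetic stream are assumptions, not the
# paper's ROSVM configuration.
import numpy as np

def active_stream(examples, threshold=0.5, lr=0.1):
    """Run a hinge-loss linear learner, querying labels only near the margin."""
    w = np.zeros(examples[0][0].shape[0])
    labels_requested = 0
    for x, y in examples:
        if abs(np.dot(w, x)) < threshold:   # uncertain: pay for the label
            labels_requested += 1
            if y * np.dot(w, x) < 1.0:      # hinge-loss (SVM-style) update
                w = w + lr * y * x
    return w, labels_requested

# Synthetic linearly separable stream standing in for labeled messages.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
data = []
for _ in range(2000):
    x = rng.normal(size=3)
    data.append((x, 1.0 if np.dot(w_true, x) > 0 else -1.0))

w, n_labels = active_stream(data)   # n_labels is a small fraction of 2000
```

As the model's confidence grows, fewer examples fall inside the uncertainty band, so the label cost drops over the stream, mirroring the abstract's observation that only a small fraction of examples need to be labeled.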
Current Research Topics:
Past Research Topics:
Description: On-line learning is an important paradigm in which training examples arrive one at a time and are not saved for batch processing. This matters for efficiency and, increasingly, for systems that continuously interact with their environment. The well-studied Perceptron algorithm is a classical example of an on-line learning algorithm; SVM is the corresponding batch algorithm. Our work investigates on-line learning algorithms and their theoretical and empirical generalization performance. We have also investigated developing kernels for complex structured data and using them with both on-line and batch algorithms.
This work is partly supported by NSF grants IIS-0803409 and IIS-0099446.
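The on-line protocol above can be made concrete with the Perceptron: each example is seen once, the current hypothesis predicts, and the weights are corrected only on a mistake. The synthetic separable data below is illustrative only.

```python
# A minimal sketch of the classical Perceptron algorithm: examples arrive one
# at a time and are never stored for batch reprocessing. The synthetic,
# linearly separable data here is an assumption for the example.
import numpy as np

def perceptron_update(w, x, y):
    """One on-line Perceptron step: update only when the prediction is wrong."""
    if y * np.dot(w, x) <= 0:   # mistake (or zero margin): correct the weights
        w = w + y * x
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # hidden target hyperplane
w = np.zeros(3)
mistakes = 0
for _ in range(1000):
    x = rng.normal(size=3)
    y = 1.0 if np.dot(w_true, x) > 0 else -1.0
    if y * np.dot(w, x) <= 0:
        mistakes += 1
    w = perceptron_update(w, x, y)
# w now approximates the direction of w_true; mistakes stop accumulating
# once the data's margin is covered, per the classical mistake bound.
```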
Description: In this research, we seek to develop, test, and apply machine learning methods for effective spam filtering in large scale, online settings.
Description: We have multiple projects addressing several aspects of mining and learning with time series data, with direct applications in domains including astrophysics and predictive medicine. Recent projects include detecting anomalous time series using novel and efficient clustering algorithms, clustering and classification of time series data using kernel methods and Gaussian processes, and fast search and analysis algorithms for detecting events within time series.
This work is partly supported by NSF grants IIS-0803409 and IIS-0713259.
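One of the ideas mentioned above, flagging anomalous time series via clustering, can be sketched simply: cluster the series and score each one by its distance to the nearest cluster center. The clustering method (k-means) and the synthetic series below are illustrative assumptions, not the group's specific algorithms.

```python
# Sketch of clustering-based anomaly detection for time series: fit clusters
# to normal data, then score each series by distance to its nearest center.
# k-means and the synthetic 50-point "light curves" are assumptions here.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50, endpoint=False)

# Two families of "normal" series plus 5 injected anomalies (pure noise).
fam_a = [np.sin(2 * np.pi * grid) + 0.1 * rng.normal(size=50) for _ in range(100)]
fam_b = [2.0 * grid + 0.1 * rng.normal(size=50) for _ in range(100)]
anomalies = [rng.normal(size=50) for _ in range(5)]
X = np.array(fam_a + fam_b + anomalies)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[:200])  # fit on normal data
dist = km.transform(X).min(axis=1)      # distance to the nearest cluster center
most_anomalous = np.argsort(dist)[-5:]  # highest distances = most anomalous
```

Here `most_anomalous` recovers the indices of the five injected noise series, since they sit far from both cluster centers.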
The data in this archive includes the "OGLE II" dataset as used by the machine learning group at Tufts University. The dataset includes time series of light measurements from three types of periodic variable stars from the Optical Gravitational Lensing Experiment (OGLE) survey. The data includes a total of 14087 time series, with (3425, 3390, 7272) in the categories (CEPH, RRL, EB), respectively.
The data was generated and kindly provided by other researchers. Please see the official OGLE site for more information on the survey, data, and discoveries. Various queries on this and other astronomy data can be made there, as well as at the VizieR site and the Harvard time series site. If using this data, please cite the original work (Szymanski, 2005, Acta Astron., 55, 43 and Udalski, Kubiak and Szymanski, 1997, Acta Astron., 47, 319) as suggested on the OGLE site.
Our group has made use of this data in machine learning research for anomaly detection, classification, probabilistic modeling and period detection. Please see our time series project page and additional publications therein for more information about the dataset and tasks studied.
We provide the OGLE-II dataset in order to make it more readily accessible to machine learning researchers. To that end, we have packaged three versions of the data, linked below. Please consult our papers for more information about the data, its processing, and the experiments.
(1) The raw data: the original time series, measured at irregular time points and not folded. We also provide the folded versions, as well as the known period (and other properties) as found by the OGLE project: ogle2full.tar.gz
(2) A processed form of the data: each time series is folded according to its known period and then re-sampled via interpolation at 50 regular sampling points. Two versions are provided: the time series "as is" and after "universal phasing". This form of the data can simply be treated as a point in 50-D Euclidean space and used directly by machine learning algorithms, providing an easy starting point for studying the data. ogle50.libsvm and upogle50.libsvm
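The fold-and-resample step in (2) can be sketched as follows. The synthetic sinusoid stands in for a real OGLE-II light curve; only the folding and interpolation mirror the dataset's processing, and details such as "universal phasing" are not reproduced here.

```python
# Sketch of the processing in (2): fold an irregularly sampled light curve on
# its known period, then interpolate onto 50 evenly spaced phase points. The
# synthetic sinusoid below is a stand-in for a real OGLE-II light curve.
import numpy as np

def fold_and_resample(t, mag, period, n_points=50):
    """Fold times to phase in [0, 1) and resample magnitudes on a regular grid."""
    phase = (t % period) / period            # fold: map each time to its phase
    order = np.argsort(phase)
    phase, mag = phase[order], mag[order]
    grid = np.linspace(0.0, 1.0, n_points, endpoint=False)
    # np.interp needs increasing x; tile one cycle on each side so the
    # interpolation wraps correctly at the phase boundaries.
    x = np.concatenate([phase - 1.0, phase, phase + 1.0])
    y = np.concatenate([mag, mag, mag])
    return grid, np.interp(grid, x, y)

# Irregular observation times, as produced by a real survey.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 200.0, size=300))
period = 3.7
mag = 15.0 + 0.4 * np.sin(2 * np.pi * t / period)

grid, curve = fold_and_resample(t, mag, period)  # curve is a point in R^50
```

The resulting `curve` is exactly the kind of fixed-length vector that the ogle50.libsvm files contain, ready for use with standard learning algorithms.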