Displaying publications 1 to 30 of 136 associated with the Machine Learning Group:
Authors: Dan Preston, Carla Brodley, Pavlos Protopapas
Proceedings of the Ninth SIAM International Conference on Data Mining
2009
Abstract: The discovery of events in time series can have important implications, such as identifying microlensing events in astronomical surveys, or changes in a patient's electrocardiogram. Current methods for identifying events require a sliding window of a fixed size, which is not ideal for all applications and could overlook important events. In this work, we develop probability models for calculating the significance of an arbitrary-sized sliding window and use these probabilities to find areas of significance. Because a brute force search of all sliding windows and all window sizes would be computationally intractable, we introduce a method for quickly approximating the results. We apply our method to over 100,000 astronomical time series from the MACHO survey, in which 56 different sections of the sky are considered, each with one or more known events. Our method was able to recover 100% of these events in the top 1% of the results, essentially pruning 99% of the data. Interestingly, our method was able to identify events that do not pass traditional event discovery procedures.
Url: http://www.siam.org/proceedings/datamining/2009/dm09_007_prestond.pdf
Authors: U. Rebbapragada, P. Protopapas, C. E. Brodley, C. Alcock
Machine Learning
Vol. 74, Issue 3, p. 281
2009
Abstract: Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD’s reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
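The scoring idea in this abstract (an anomaly score computed relative to a set of centroids, for light-curves whose phases are not aligned) can be illustrated with a minimal sketch. The abstract does not give PCAD's actual distance or score formula, so everything here is an assumption: the hypothetical `phase_invariant_distance` takes the smallest Euclidean distance over all circular shifts, the score is the distance to the nearest centroid, and the centroids are taken as given rather than produced by the paper's modified k-means.

```python
import math

def shifted(series, k):
    """Return the series circularly shifted by k positions."""
    k %= len(series)
    return series[k:] + series[:k]

def phase_invariant_distance(a, b):
    """Smallest Euclidean distance between a and any circular shift of b.

    Assumption for illustration: not necessarily the measure PCAD uses.
    """
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return min(math.dist(a, shifted(b, k)) for k in range(len(b)))

def anomaly_scores(curves, centroids):
    """Score each curve by its distance to the nearest centroid."""
    return [
        min(phase_invariant_distance(curve, centroid) for centroid in centroids)
        for curve in curves
    ]

def rank_anomalies(curves, centroids):
    """Indices of the curves, most anomalous (highest score) first."""
    scores = anomaly_scores(curves, centroids)
    return sorted(range(len(curves)), key=lambda i: scores[i], reverse=True)
```

Under these assumptions, a curve that is merely a phase-shifted copy of a centroid scores zero, while a curve unlike every centroid rises to the top of the ranked list.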
Authors: Saket Joshi, Kristian Kersting, and Roni Khardon
In the Proceedings of the International Joint Conference on Artificial Intelligence
2009
Authors: U. Rebbapragada, L. Mandrake, K. Wagstaff, D. Gleeson, R. Castaño, S. Chien, and C. E. Brodley
IEEE Aerospace Conference
2009
Authors: G. Wachman, R. Khardon, P. Protopapas, and C. Alcock
Proceedings of the European Conference on Machine Learning (ECML)
2009
Authors: L. Mandrake, K. Wagstaff, D. Gleeson, U. Rebbapragada, D. Tran, R. Castaño, S. Chien, R. Pappalardo
IEEE Aerospace Conference
2009
Authors: X. Wei, L. Cowen, C. Brodley, A. Brady, D. Sculley and D. Slonim
ISBRA 2008: International Symposium on Bioinformatics Research and Applications
2008
Abstract: forthcoming
Url: WeiHGT.pdf
Authors: U. Rebbapragada, P. Protopapas, C. E. Brodley and C. Alcock
Astronomical Data Analysis Software and Systems XVIII
2008
Authors: D. Sculley and Gordon V. Cormack
CEAS 2008: Proceedings of the Fifth Conference on Email and Anti-Spam
2008
Abstract: Recent email spam filtering evaluations, such as those conducted at TREC, have shown that near-perfect filtering results are attained with a variety of machine learning methods when filters are given perfectly accurate labeling feedback for training. Yet in real-world settings, labeling feedback may be far from perfect. Real users give feedback that is often mistaken, inconsistent, or even maliciously inaccurate. To our knowledge, the impact of this noisy labeling feedback on current spam filtering methods has not been previously explored in the literature. In this paper, we show that noisy feedback may harm or even break state-of-the-art spam filters, including recent TREC winners. We then propose and evaluate several approaches to make such filters robust to label noise. We find that although such modifications are effective for uniform random label noise, more realistic “natural” label noise from human users remains a difficult challenge.
Url: http://www.eecs.tufts.edu/~dsculley/papers/noisySpam.pdf
Authors: C. Wang, S. Joshi and R. Khardon
Journal of AI Research
Vol. 31, pp. 431-472
2008
Authors: U. Rebbapragada, R. Lomasky, C. E. Brodley and M. Friedl
Proceedings of the 2008 IEEE International Geoscience and Remote Sensing Symposium
2008
Authors: S. Cabuk, C. E. Brodley and C. Shields
ACM Transactions on Information and System Security
to appear
2008
Abstract: A covert channel can occur when an attacker finds and exploits a shared resource that is not designed to be a communication mechanism. A network covert channel operates by altering the timing of otherwise legitimate network traffic so that the arrival times of packets encode confidential data that an attacker wants to exfiltrate from a secure area from which she has no other means of communication. In this paper, we present the first public implementation of an IP covert channel, discuss the subtle issues that arose in its design, and present a discussion on its efficacy. We then show that an IP covert channel can be differentiated from legitimate channels and present new detection measures that provide detection rates over 95%. We next take the simple step an attacker would take: adding noise to the channel in an attempt to conceal the covert communication. For these noisy IP covert timing channels, we show that our online detection measures can fail to identify the covert channel for noise levels higher than 10%. We then provide effective offline search mechanisms that identify the noisy channels.
Authors: D. Sculley
CEAS 2008: Proceedings of the Fifth Conference on Email and Anti-Spam
2008
Abstract: Internet blogs provide forums for discussions within virtual communities, allowing readers to post comments on what they read. However, such comments may contain abuse, such as personal attacks, offensive remarks about race or religion, or commercial spam, all of which reduce the value of community discussion. Ideally, filters would promote civil discourse by removing abusive comments while protecting free speech by not removing any comments unnecessarily. In this paper, we investigate the use of user flags to train filters for this task, with the goal of empowering each community to enforce its own standards. We find encouraging results on experiments using a large corpus of blog comment data with real user flags. We conclude by proposing several novel deployment schemes for filters in this setting.
Url: http://www.eecs.tufts.edu/~dsculley/papers/blogFiltering07.pdf
Authors: S. Joshi and R. Khardon
Proceedings of the International Conference on Automated Planning and Scheduling
2008
Authors: R. Lomasky, C. E. Brodley, M. Aernecke, D. Walt, and M. Friedl
ECML
2007
Abstract: This paper presents Active Class Selection (ACS), a new class of problems for multi-class supervised learning. If one can control the classes from which training data is generated, utilizing feedback during learning to guide the generation of new training data will yield better performance than learning from any a priori fixed class distribution. ACS is the process of iteratively selecting class proportions for data generation. In this paper we present several methods for ACS. In an empirical evaluation, we show that for a fixed number of training instances, methods based on increasing class stability outperform methods that seek to maximize class accuracy or that use random sampling. Finally we present results of a deployed system for our motivating application: training an artificial nose to discriminate vapors.
Authors: Umaa Rebbapragada, Carla E. Brodley
Proceedings of the 18th European Conference on Machine Learning
2007
Abstract: We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance, and uses the confidence in the current label as a weight during training. The probability vector should be calculated such that clean instances have high confidence in their current labels, while mislabeled instances have low confidence in their current labels and high confidence in their correct labels. Past research focuses on techniques that either discard or correct instances. This paper proposes that discarding and correcting are special cases of instance weighting, and thus, part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy over the original training set. We also demonstrate that instance weighting can outperform discarding.
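To make the weighting idea in this abstract concrete, here is a minimal sketch. The abstract does not say how the clustering yields the probability vector, so this is an assumed stand-in, not the paper's method: each instance's class distribution is estimated from the label frequencies inside its cluster, and its training weight is the probability assigned to its current label. The names `class_probabilities` and `instance_weights` are hypothetical, and the cluster assignments are taken as given.

```python
from collections import Counter, defaultdict

def class_probabilities(cluster_ids, labels):
    """Per-instance label distribution from label frequencies in its cluster.

    Assumption for illustration: the paper's clustering-based estimate
    may be computed differently.
    """
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    classes = sorted(set(labels))
    probabilities = []
    for cluster in cluster_ids:
        total = sum(counts[cluster].values())
        probabilities.append(
            {label: counts[cluster][label] / total for label in classes}
        )
    return probabilities

def instance_weights(cluster_ids, labels):
    """Weight each instance by the confidence in its current label."""
    probabilities = class_probabilities(cluster_ids, labels)
    return [p[label] for p, label in zip(probabilities, labels)]
```

In this sketch, an instance whose label disagrees with most of its cluster receives a low weight. Thresholding the weights to 0 or 1 would recover discarding, which matches the abstract's claim that discarding is a special case of instance weighting.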
Authors: D. Yankov, E. Keogh, U. Rebbapragada
IEEE International Conference on Data Mining
2007
Authors: C. Wang, S. Joshi and R. Khardon
In the proceedings of the International Joint Conference on Artificial Intelligence
2007
Authors: G. Wachman and R. Khardon
In the proceedings of the International Conference on Machine Learning.
2007
Authors: M. Arias, R. Khardon and J. Maloberti
Journal of Machine Learning Research
Vol. 8, pp. 549-587
2007
Authors: R. Khardon and G. Wachman
Journal of Machine Learning Research
Vol. 8, pp. 227-248
2007
Authors: G. Garriga, R. Khardon and L. De Raedt
In the proceedings of the International Joint Conference on Artificial Intelligence.
2007
Authors: D. Sculley
CEAS 2007: Proceedings of the Fourth Conference on Email and Anti-Spam
2007
Abstract: Active learning methods seek to reduce the number of labeled examples needed to train an effective classifier, and have natural appeal in spam filtering applications where trustworthy labels for messages may be costly to acquire. Past investigations of active learning in spam filtering have focused on the pool-based scenario, where there is assumed to be a large, unlabeled data set and the goal is to iteratively identify the best subset of examples for which to request labels. However, even with optimizations this is a costly approach. We investigate an online active learning scenario where the filter is exposed to a stream of messages which must be classified one at a time. The filter may only request a label for a given message immediately after it has been classified. The goal is to achieve strong online classification performance with few label requests. This is a novel scenario for low-cost active spam filtering, fitting for application in large-scale systems. We draw from the label efficient machine learning literature to investigate several approaches to selective sampling in this scenario using linear classifiers. We show that online active learning can dramatically reduce labeling and training costs with negligible additional overhead while maintaining high levels of classification performance.
Url: http://www.eecs.tufts.edu/~dsculley/papers/activeSpam.pdf
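The online scenario described in this abstract (classify each message as it arrives, then decide whether to request its label) can be sketched with a simple linear classifier. The abstract does not specify which selective sampling rules were evaluated, so the rule below is an assumption borrowed from the general label-efficient literature: a label is requested with probability `b / (b + |score|)`, so confident predictions are rarely queried. The class name, the parameter `b`, and the `oracle` callback are hypothetical.

```python
import random

class SelectiveSamplingPerceptron:
    """Perceptron that requests labels more often when its margin is small."""

    def __init__(self, n_features, b=1.0, seed=0):
        if b <= 0:
            raise ValueError("b must be positive")
        self.w = [0.0] * n_features
        self.b = b                      # larger b means more label requests
        self.rng = random.Random(seed)
        self.label_requests = 0

    def score(self, x):
        """Signed distance-like score of x under the current weights."""
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def process(self, x, oracle):
        """Classify x, then possibly request its label from oracle(x).

        oracle(x) is assumed to return +1 or -1. The prediction is made
        before any label is requested, as the online scenario requires.
        """
        s = self.score(x)
        prediction = 1 if s >= 0 else -1
        # Query probability shrinks as the margin |s| grows.
        if self.rng.random() < self.b / (self.b + abs(s)):
            self.label_requests += 1
            y = oracle(x)
            if y != prediction:
                # Standard mistake-driven perceptron update.
                self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
        return prediction
```

Calling `process` once per incoming message and comparing `label_requests` with the stream length gives a rough picture of how many labels such a rule would save.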
Authors: C. Wang and R. Khardon
In the proceedings of the Conference on Uncertainty in Artificial Intelligence
2007
Authors: D. Sculley
KDD 2007: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
2007
Abstract: In many data mining applications, online labeling feedback is only available for examples which were predicted to belong to the positive class. Such applications include spam filtering in the case where users never check emails marked “spam”, document retrieval where users cannot give relevance feedback on unretrieved documents, and online advertising where user behavior cannot be observed for unshown advertisements. One-sided feedback can cripple the performance of classical mistake-driven online learners such as Perceptron. Previous work under the Apple Tasting framework showed how to transform standard online learners into successful learners from one-sided feedback. However, we find in practice that this transformation may request more labels than necessary to achieve strong performance. In this paper, we employ two active learning methods which reduce the number of labels requested in practice. One method is the use of Label Efficient active learning. The other method, somewhat surprisingly, is the use of margin-based learners without modification, which we show combines implicit active learning and a greedy strategy for managing the exploration-exploitation tradeoff. Experimental results show that these methods can be significantly more effective in practice than those using the Apple Tasting transformation, even on minority class problems.
Url: http://www.eecs.tufts.edu/~dsculley/papers/oneSidedFeedback.pdf
Authors: D. Sculley
Proceedings of 2007 SIAM International Conference on Data Mining
2007
Abstract: The problem of combining the ranked preferences of many experts has gained renewed importance in data mining and information retrieval applications. Effective rank aggregation becomes difficult in real-world situations in which the rankings are noisy, incomplete, or even disjoint. We address these difficulties by extending several standard methods of rank aggregation to consider similarity between items in the various ranked lists, in addition to their rankings. We establish evaluation criteria for these algorithms by extending previous definitions of distance between ranked lists to include the role of similarity between items. Experimental results show that this approach improves the performance of several standard rank aggregation methods, particularly when used with noisy, incomplete rankings.
Url: http://www.eecs.tufts.edu/~dsculley/papers/mergeSimilarRankPoster.pdf
Authors: D. Sculley and Gabriel M. Wachman
SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference
2007
Abstract: Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.
Url: http://www.eecs.tufts.edu/~dsculley/papers/emailAndWebSpamSIGIR.pdf
Authors: D. Sculley and Gabriel M. Wachman
TREC 2007: The Sixteenth Text REtrieval Conference Proceedings
2007
Abstract: Relaxed Online Support Vector Machines (ROSVMs) have recently been proposed as an efficient methodology for attaining an approximate SVM solution for streaming data such as the online spam filtering task. Here, we apply ROSVMs in the TREC 2007 Spam filtering track and report results. In particular, we explore the effect of various sliding-window sizes, trading off computation cost against classification performance with good results. We also test a variant of fixed-uncertainty sampling for Online Active Learning. The best results with this approach give classification performance near to that of the fully supervised approach while requiring only a small fraction of the examples to be labeled.
Url: http://www.eecs.tufts.edu/~dsculley/papers/trec.2007.spam.pdf
Authors: Arias, M. and Khardon, R.
Machine Learning
Volume 64, pages 121-144
2006