Description: We are looking at problems related to the generation of training data. We focus on multi-class learning, and are interested in two scenarios. 1) A new class of problems we have defined, Active Class Selection (ACS). ACS addresses the question: if one can collect n additional instances, how should they be distributed with respect to class? 2) Active Learning, in which one requests labels for existing training data.
Specifically, Active Class Selection addresses the tasks for which one can control the classes from which training data are generated. In such cases, utilizing feedback during learning to guide the generation of new training data will yield better performance than learning from an a priori fixed class distribution. In this paper we first ground the problem of ACS with recent theoretical developments in stability and generalization. Motivated by the theory, we then present several methods for ACS, and illustrate them with a multi-armed bandit framework. In an empirical evaluation, we find that for a fixed number of training instances, methods based on increasing class stability outperform methods that seek to maximize class accuracy or that use random sampling.
In regard to active learning, we define an underlying framework that allows us to perform an online analysis of the active learning process. First, this analysis can be applied to help determine when to stop labeling. Second, it can be used to combine active learning methods into an ensemble that outperforms its components. Additionally, we are looking at ways to use ideas from Active Class Selection to improve active learning. In particular, if one can ascertain from which classes it is most beneficial to obtain more labeled data, then one can pay more attention to the labeling suggestions of classifiers in the ensemble that discriminate these classes.
Description: We are working with the Department of Geography at Boston University on the problem of creating land cover maps of the Earth's surface from remotely sensed data. We have several projects: finding mislabeled data, applying multiclass active learning methods, and determining the classes that are supported in feature space.
Description: We focus on the problem of finding and eliminating class label noise, also known as mislabeled training data. We have proposed a novel framework that calculates the probability of class membership for each training instance and uses these probabilities as instance weights. The probabilities can be used to downweight noisy instances or to correct their labels. We have applied our methods to a land cover data set and are currently working on an application to verify labels on soil liquefaction data in earthquake-prone regions.
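As a rough illustration of how class membership probabilities can serve as instance weights, the sketch below assumes the probabilities have already been estimated by some model; how they are computed is not specified here. The function name `weight_and_correct` and the toy land cover labels are hypothetical.

```python
from typing import Dict, List, Tuple


def weight_and_correct(
    labels: List[str],
    class_probs: List[Dict[str, float]],
) -> Tuple[List[float], List[str]]:
    """Return (weights, suggested_labels), one entry per training instance.

    class_probs[i] maps each class to its estimated membership probability
    for instance i. These estimates are assumed to be given.
    """
    weights: List[float] = []
    suggested: List[str] = []
    for label, probs in zip(labels, class_probs):
        # Weight = estimated probability that the observed label is correct.
        weights.append(probs.get(label, 0.0))
        # Suggested label = most probable class under the estimates.
        suggested.append(max(probs, key=probs.get))
    return weights, suggested


# Toy data (hypothetical): the third instance is probably mislabeled.
labels = ["forest", "water", "forest"]
class_probs = [
    {"forest": 0.9, "water": 0.1},
    {"forest": 0.2, "water": 0.8},
    {"forest": 0.3, "water": 0.7},
]
weights, suggested = weight_and_correct(labels, class_probs)
```

A learner that accepts per-instance weights could then train on `weights` to downweight the suspect instance, or a reviewer could inspect the instances where `suggested` disagrees with `labels`.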
Description: Kernel methods give a generic way to apply certain learning methods (maximum margin, perceptron, nearest neighbors) to domains with any structure. The main tool is a kernel function, which computes inner products in some space that serves as a virtual representation space for examples. This idea, known since the 1960s, is being intensively investigated in the machine learning community. Our previous work investigated the scope and limitations of using kernel methods for logic learning. This issue turns out to be tricky: there are nice kernel constructions, but they do not necessarily lead to successful learning. Our current focus is on two issues: kernel learning with relational data, and identifying learning algorithms that are robust to "noise" - a small amount of wrong annotation on training data.
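To make the role of the kernel function concrete, here is a minimal sketch of a kernel perceptron, one of the learning methods named above. The polynomial kernel, the XOR toy data, and all function names are illustrative assumptions and are not taken from the project.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def poly_kernel(x: Vector, z: Vector, degree: int = 2) -> float:
    """Inner product in an implicit polynomial feature space."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def score(x: Vector, X: List[Vector], y: List[int],
          alpha: List[int], kernel: Kernel) -> float:
    """Weighted sum of kernel evaluations against the training examples."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, X, y))


def train_kernel_perceptron(X: List[Vector], y: List[int],
                            kernel: Kernel, max_epochs: int = 1000) -> List[int]:
    """Return per-example mistake counts (the dual weights)."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * score(xi, X, y, alpha, kernel) <= 0:
                alpha[i] += 1  # update only on a mistake
                mistakes += 1
        if mistakes == 0:  # a clean pass means every example is classified
            break
    return alpha


def predict(x: Vector, X: List[Vector], y: List[int],
            alpha: List[int], kernel: Kernel) -> int:
    return 1 if score(x, X, y, alpha, kernel) > 0 else -1


# XOR is not linearly separable in the input space, but it is separable
# in the feature space induced by the degree-2 polynomial kernel.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(X, y, poly_kernel)
predictions = [predict(x, X, y, alpha, poly_kernel) for x in X]
```

The learner never builds the feature vectors explicitly; it only calls `kernel`, which is why the same code applies to any domain for which a kernel can be defined.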
Description: Markov decision processes give a mathematical model for agents acting in a dynamic environment. The agent's actions affect the world, its own state, and whether it is "rewarded" or not. However, the results of actions are not deterministic. In relational models, the state of the world is best described by referring to objects and relations among them. This is a very intuitive setting that matches many problems. Our main interest is in developing agents that learn to act in a useful way by utilizing various sources of information: examples from a teacher, exploration by trial and error, a model of the world, etc. In previous work we have developed a theoretical model for learning from examples, developed algorithms and analyzed them, and developed a system, L2Act, for solving AI planning problems using this framework. Our current work is focused on developing useful knowledge representations for agents' internal representations (e.g. policies) that lead to efficient and robust learning to act.
Description: The problem of "mining frequent patterns" is one of the most widely studied problems in data mining. Here, given a database capturing some activities (e.g. shopping lists in supermarkets), one searches for patterns of activities (e.g. sets of items bought together) that occur with sufficiently high frequency. This idea has been applied in a variety of real-world problems. Our previous work developed formal complexity results for itemset mining by showing an equivalence to some problems of learning by asking questions, and developed a new algorithm that is particularly efficient when the patterns sought are large. Current work is focused on mining in relational data. For example, given a set of molecules described by graphs (capturing a family of interest, e.g. "high solubility" or "effective drug"), one would like to find frequent substructures in this family. Our Learning to Act system includes one of the earliest implementations of relational frequent set mining.
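For readers unfamiliar with the task, the sketch below shows a standard level-wise (Apriori-style) search for frequent itemsets. It is a textbook baseline and not the algorithm developed in this project; the function name and the toy shopping data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[str]],
                      min_support: int) -> Dict[FrozenSet[str], int]:
    """Map each itemset found in at least min_support transactions to its count."""

    def support(itemset: FrozenSet[str]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result: Dict[FrozenSet[str], int] = {}
    while candidates:
        # Keep the candidates of the current size that are frequent.
        frequent = {}
        for c in candidates:
            count = support(c)
            if count >= min_support:
                frequent[c] = count
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-itemset candidates.
        next_candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != len(a) + 1:
                continue
            # Prune: every k-subset of a frequent (k+1)-itemset is frequent.
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, len(a))):
                next_candidates.add(union)
        candidates = list(next_candidates)
    return result


# Toy shopping lists (hypothetical data).
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
patterns = frequent_itemsets(baskets, min_support=3)
```

With `min_support=3`, the pair `{"bread", "eggs"}` appears in only two baskets, so it is dropped and the three-item set is never counted. A level-wise search of this kind visits every frequent subset of a pattern, which becomes expensive when the patterns sought are large.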
Description:
This project focuses on finding scalable solutions for mining large amounts of noisy astronomical survey data for events or unusual objects that may lead to new scientific discoveries. This project is done in collaboration with scientists from the Harvard-Smithsonian Center for Astrophysics and the Time Series Center at Harvard University.
Figure: A typical (left) and anomalous (right) eclipsing binary.
Description: This research addresses how machine learning can be used to model normal behavior to flag anomalous behavior. Applications include user-reauthentication and server authentication.
Description: In this project, we are interested in applying clustering techniques to identify interesting patterns in real world data sets. In previous work, we explored how to perform automatic feature selection with clustering. Currently, we are addressing how cluster ensembles can improve performance.
The data in our driving application area, Earth science, have two distinctive characteristics: they are often high dimensional and they are spatially structured. These two characteristics pose special challenges to the clustering task. First, high dimensionality causes fundamental difficulties for many traditional clustering algorithms. Second, the spatial structure imposes spatial continuity constraints on the clustering solutions, resulting in a constrained clustering problem. The goal of this research is to address these issues and develop new approaches to clustering high dimensional data sets that are spatially structured.
Description: This project investigates the resource complexity of learning formulae
in propositional logic. Resources include run time as well as the
number of examples needed to learn. In models where the learner can
ask questions we measure the number of questions needed to complete
the learning task. The focus is on provably correct algorithms and
their analysis.
Description: Content-based Image Retrieval (CBIR) consists of retrieving the most visually similar images to a given query image from a database of images. CBIR from medical image databases does not aim to replace the physician by predicting the disease of a particular case but to assist him/her in diagnosis. The visual characteristics of a disease carry diagnostic information and oftentimes visually similar images correspond to the same disease category. By consulting the output of a CBIR system, the physician can gain more confidence in his/her decision or even consider other possibilities.
For a detailed explanation of our project please visit the CBIR homepage at Purdue University.
Description: Normalization techniques in today's firewalls can be shown to virtually eliminate the potential of sending hidden information in IP packet headers, a classic example of a covert storage channel. Yet with the myriad of packets traversing the web, the exact arrival times of packets are overlooked. Whilst this timing is typically transparent to applications above the IP layer, by manipulating and observing the pattern of packet arrival times a timing covert channel can be created with arbitrary packets. Our project explores both the techniques involved in establishing reliable timing channels as well as detecting or impeding them.
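The basic idea of a timing channel can be sketched as a simulation without any real network traffic: bits are encoded as short or long gaps between packet arrivals and decoded with a threshold. The delay values, the threshold, the jitter model, and the function names below are illustrative assumptions, not the techniques used in the project.

```python
import random
from typing import List, Optional

SHORT_DELAY = 0.02  # seconds between packets for bit 0 (assumed value)
LONG_DELAY = 0.08   # seconds between packets for bit 1 (assumed value)
THRESHOLD = 0.05    # midpoint used by the receiver (assumed value)


def encode(bits: List[int], jitter: float = 0.0,
           rng: Optional[random.Random] = None) -> List[float]:
    """Return simulated packet arrival times that carry the given bits."""
    rng = rng or random.Random(0)
    times = [0.0]  # the first packet only marks the start of the message
    for bit in bits:
        delay = LONG_DELAY if bit else SHORT_DELAY
        # Simulated network jitter, uniform in [-jitter, +jitter].
        delay += rng.uniform(-jitter, jitter)
        times.append(times[-1] + delay)
    return times


def decode(times: List[float]) -> List[int]:
    """Recover bits by thresholding the gaps between consecutive arrivals."""
    return [1 if (later - earlier) > THRESHOLD else 0
            for earlier, later in zip(times, times[1:])]


message = [1, 0, 1, 1, 0, 0, 1, 0]
arrivals = encode(message, jitter=0.01)
recovered = decode(arrivals)
```

Decoding stays reliable here only because the simulated jitter is smaller than the distance from each delay to the threshold. On a real network the jitter is not bounded this way, which is one reason establishing a reliable channel is a research question; adding random delay to packets is likewise one conceivable way to impede such a channel.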
Description:
This project addresses hardware solutions to computer security. In particular Heatstroke is a method for detecting and preventing denial of service attacks based on overheating hardware.
Our second project addresses buffer overflow attacks on the return address. Our method, which we call SmashGuard, is a hardware-based solution that prevents buffer overflow attacks realized by overwriting the function return address. SmashGuard keeps a copy of each return address that a function call writes to the program stack in a LIFO buffer on the CPU, the hardware stack. When a function returns to its caller, the return addresses in the hardware stack and the program stack are compared. A mismatch signals tampering with the return address in the program stack, which is a sign of a buffer overflow attack. A hardware interrupt is raised and the process is terminated before control is transferred to the modified return address. The design of SmashGuard includes a kernel patch that supports CPUs modified to provide SmashGuard protection.
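The comparison step described above can be illustrated with a small software simulation. This is only a sketch of the idea: SmashGuard itself is implemented in hardware, and the class names, the exception standing in for the hardware interrupt, and the addresses used are all hypothetical.

```python
from typing import List


class ReturnAddressTampered(Exception):
    """Stands in for the hardware interrupt raised on a mismatch."""


class SimulatedCPU:
    def __init__(self) -> None:
        # Program stack: ordinary memory, writable by the program
        # (and therefore by an attacker who overflows a buffer).
        self.program_stack: List[int] = []
        # Hardware stack: LIFO buffer on the CPU, not reachable by
        # ordinary memory writes.
        self.hardware_stack: List[int] = []

    def call(self, return_address: int) -> None:
        """A function call records the return address in both stacks."""
        self.program_stack.append(return_address)
        self.hardware_stack.append(return_address)

    def ret(self) -> int:
        """A return compares both copies before transferring control."""
        from_program = self.program_stack.pop()
        from_hardware = self.hardware_stack.pop()
        if from_program != from_hardware:
            raise ReturnAddressTampered(
                f"expected {from_hardware:#x}, found {from_program:#x}"
            )
        return from_program


cpu = SimulatedCPU()

# Normal call and return: the two copies agree.
cpu.call(0x400100)
normal_target = cpu.ret()

# Simulated attack: the copy on the program stack is overwritten.
cpu.call(0x400200)
cpu.program_stack[-1] = 0xDEADBEEF
try:
    cpu.ret()
    attack_detected = False
except ReturnAddressTampered:
    attack_detected = True
```

In the simulation the exception is raised before `ret` hands back the modified address, mirroring the requirement that the process be stopped before control is transferred.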
Description: We are studying problems where a machine learning system sees examples and produces hypotheses which are represented by logical formulae. This generic framework fits many applications from learning moves for Chess to identifying whether molecules will work well as drugs. Our theoretical work aims to characterize the computational complexity of these problems, i.e. how easy or hard they are to solve. Our work includes developing algorithms which are provably correct and efficient, and proving lower bounds that show limitations on the efficiency of any algorithm for such problems.
Description: The goal in this project is to develop an understanding of systems that learn knowledge and use it for reasoning. Most of our work dealt with
propositional logic as the underlying reasoning language. The work includes developing different representation formalisms, reasoning
algorithms that use them, and algorithms for learning such representations. Our Learning to Reason results demonstrate (in a technical sense) that it can be easier to learn in order to reason (on some things) than learn a model of the world and reason from it.
Description: We have developed a system for learning from relational data (e.g. from graphs). The system is efficient, uses a nontraditional ("bottom up") type of algorithm based on our theoretical results, and is a state-of-the-art Inductive Logic Programming (ILP) system. Current work is focused on solving large-scale problems involving classification of molecules.
Description: In this research, we seek to develop, test, and apply machine learning methods for effective spam filtering in large scale, online settings.