**Current Research: **

- Graphical Models: Theory, Algorithms and Applications [+]

**Description:**Our work is done in the context of expressive Bayesian probabilistic models (a.k.a graphical models), developing inference algorithms for them, developing a learning theory that explains why these algorithms work and applying them in interesting applications. Our theoretical results provide distribution-free guarantees on the risk of approximate Bayesian inference algorithms. Recent models include constrained clustering, multi-task learning, sparse Gaussian processes, mixture of expert models for label discretization, matrix facorization and topic models. Recent applications include land-cover clustering and classification, analysis of time series from Astronomy, and predicting contamination level in environmental engineering.

This work is partly supported by NSF grants IIS-1714440 and IIS-0803409

- Learning to Act in Relational Markov Decision Processes [+]

**Description:**Markov decision processes give a mathematical model for agents acting in a dynamic environment. The actions of the agent affect the world, its own state, and whether it is "rewarded" or not. However, the results of actions are not deterministic.The basics of this framework are well understood but the main challenge is in scalability to problems with large state spaces and/or action spaces. Our main interest is in developing efficient agents that learn and act in such domains. Our solutions take advantage of structure (relational or propositionally factored) in the state and action space to yield effective solutions.

This work is partly supported by NSF grants IIS-0936687, IIS 0964457 and IIS-1616280

**Past Research:**

- Behavioral Modeling for Computer Security [+]

**Description:**This research addresses how machine learning can be used to model normal behavior to flag anomalous behavior. Applications include user-reauthentication and server authentication. - Active Class Selection/Active Learning [+]

**Description:**We are looking at problems related to the generation of training data. We are interested in two scenarios. 1) A new class of problems we have defined, Active Class Selection (ACS). ACS addresses the question: if one can collect n additional instances, how should they be distributed with respect to class? 2) Active Learning, in which one requests labels for existing training data.

Specifically, Active Class Selection addresses the tasks for which one can control the classes from which training data are generated. In such cases, utilizing feedback during learning to guide the generation of new training data will yield better performance than learning from an a priori fixed class distribution. Our methods work within a multi-armed bandit framework.

In regard to active learning, we are investigating several real-world issues. Speficially, how to perform active learning in the context of severe class imbalance, how to adapt to changes in the underlying concept to be learned (concept drift), and how to inject domain knowledge into the AL framework. - Clustering High Dimensional Data [+]

**Description:**In this project, we are interested in applying clustering techniques to identify interesting patterns in real world data sets. In previous work, we explored how to perform automatic feature selection with clustering. Currently, we are addressing how cluster ensembles can improve performance.

Our driving application, Earth science applications, have two distinctive characteristics of the data -- they often have high dimensionality and they are spatially structured. These two features pose special challenges to the clustering task. First, high dimensionalities cause fundamental difficulties to many traditional clustering algorithms. Second, the spatial structures pose spatial continuity constraints on the clustering solutions, resulting in a constrained clustering problem. The goal of this research is to address the above issues and develop new approaches to clustering high dimensional data sets that are spatially structured. - Complexity of Learning Propositional Formulae [+]

**Description:**This project investigates the resource complexity of learning formulae

in propositional logic. Resources include run time as well as the

number of examples needed to learn. In models where the learner can

ask questions we measure the number of questions needed to complete

the learning task. The focus is on provably correct algorithms and

their analysis.

his work is partly supported by NSF grant IIS-0099446. - Constraint-Based Clustering [+]

**Description:**We are working with the Department of Geography at Boston University on the problem of defining the classes for land cover classification. Our method used both the unsupervised data and labels created from a previous classification scheme.

This work is partly supported by NSF grant IIS-0803409 - Content-Based Image Retrieval Applied to Medical Images [+]

**Description:**Content-based Image Retrieval (CBIR) consists of retrieving the most visually similar images to a given query image from a database of images. CBIR from medical image databases does not aim to replace the physician by predicting the disease of a particular case but to assist him/her in diagnosis. The visual characteristics of a disease carry diagnostic information and oftentimes visually similar images correspond to the same disease category. By consulting the output of a CBIR system, the physician can gain more confidence in his/her decision or even consider other possibilities.

For a detailed explanation of our project please visit the CBIR homepage at Purdue University. - Covert channels [+]

**Description:**Normalization techniques in today's firewalls can be shown to virtually eliminate the potential of sending hidden information in IP packet headers, a classic example of a covert storage channel. Yet with the myriad of packets traversing the web, the exact arrival times of packets are overlooked. Whilst this timing is typically transparent to applications above the IP layer, by manipulating and observing the pattern of packet arrival times a timing covert channel can be created with arbitrary packets. Our project explores both the techniques involved in establishing reliable timing channels as well as detecting or impeding them. - Finding and Eliminating Class Label Noise [+]

**Description:**We focus on the problem of finding and eliminating class label noise, also known as mislabeled training data. We have proposed a novel framework that calculates the probability of class membership on each training instance, and uses these probabilities as instance weights. These probabilities can be used to downweight noise or correct to the true labels. We have applied our methods to land cover data set and are currently working on an application that to verify labels on soil liquefaction data in earthquake-prone regions. - Hardware Methods for Computer Security [+]

**Description:**

This project addresses hardware solutions to computer security. In particular Heatstroke is a method for detecting and preventing denial of service attacks based on overheating hardware.

Our second project addresses buffer overflow attacks on the return address. Our method, which we call SmashGuard, is a hardware-based solution to prevent Buffer-Overflow Attacks realized by overwriting the Function Return Address. SmashGuard keeps a copy of each Return address that is written to the program stack by each function call, in a LIFO buffer on the CPU - the Hardware stack. When a function returns to the caller, the Return Addresses in the hardware stack and the program stack are compared. A mismatch signals tampering with the Return Address in the program stack, which is a sign of a Buffer Overflow attack. A hardware interrupt is raised and the process terminated before the control is transferred to the modified return addresss. The design of SmashGuard is a kernel patch that supports CPUs modified to support SmashGuard protection. - Learning Expressions in First Order Logic [+]

**Description:**We are studying problems where a machine learning system sees examples and produces hypotheses which are represented by logical formulae. This generic framework fits many applications from learning moves for Chess to identifying whether molecules will work well as drugs. Our theoretical work aims to characterize the computational complexity of these problems, i.e. how easy or hard they are to solve. Our work includes developing algorithms which are provably correct and efficient, and proving lower bounds that show limitations on the efficiency of any algorithm for such problems.

This work is partly supported by NSF grant IIS-0099446. - Learning to Reason [+]

**Description:**The goal in this project is to develop an understanding of system that learn knowledge and use it for reasoning. Most of our work dealt with propositional logic as the underlying reasoning language. The work includes developing different representation formalisms, reasoning algorithms that use them, and algorithms for learning such representations. Our Learning to Reason results demonstrate (in a technical sense) that it can be easier to learn in order to reason (on some things) than learn a model of the world and reason from it. - LogAn-H: a system for learning from relational data [+]

**Description:**We have developed a system for learning from relational data (e.g. from graphs). The system is pretty efficient, gives a non traditional type of algorithm ("bottom up") based on the theoretical results, and is a state of the art Inductive Logic Programming (ILP) system. Current work is focused on solving large scale problems involving classification of molecules.

his work is partly supported by NSF grant IIS-0099446. - Machine Learning for Predictive Medicine -- Finding Lesions in the MRIs of Epilepsy Patients [+]

**Description:**This project is aimed at the automatic detection of focal cortical dysplastic regions from surface based morphometric data. Focal cortical dysplasia (FCD) is the most common cause of treatment resistant epilepsy in pediatrics and the second most cause for adults. Even with great advances in MRI 60% of FCD cases remain undetected in routine MRI visual inspection by experienced radiologists. Using co-registered cortical surfaces and a number of morphological features for both patients and normal controls we apply machine learning methods to predict potential FCD lesions. - Machine Learning for Predictive Medicine -- Multiple Sclerosis Outcome Prediction [+]

**Description:**We are working with researchers from Harvard Medical School to predict outcomes for multiple sclerosis patients. A focus of the research is how best to interact with physicians to use both human expertise and machine learning methods. - Medical Text Data Mining [+]

**Description:**This project develops machine learning methods to health informatics, specifically with the aim of reducing the (human) workload involved in conducting systematic reviews. The idea is to use machine learning algorithms to induce models that semi-automate the clinical evidence synthesis process, thereby reducing workload. This has led to advances in ML including active learning under class imbalance, multiple expert active learning, incorporating labeled features into the SVM optimization, and how to determine when to stop labeling data. - Mining Frequent Patterns [+]

**Description:**The problem of "mining frequent patterns" is one of the most widely studied problems in data mining. Here, given a database capturing some activities (e.g. shopping lists in supermarkets) one searches for patterns of activities (e.g. sets of items bought together) which occur with sufficiently high frequency. This idea has been applied in a variety of real world problems. Our previous work developed formal complexity results for itemset mining, by showing an equivalence to some problems of learning by asking questions, and developed a new algorithm which is particularly efficient when patterns sought are large. Current work is focused on mining in relational data. For example given a set of molecules described by graphs (capturing a family of interest e.g. "high solubility", or "effective drug") one would like to find frequent substructures in this family. Our Learning to Act system includes one of the earliest implementations of relational frequent set mining.

This work is partly supported by NSF grant IIS-0099446. - On Line Learning Algorithms and Kernel Methods [+]

**Description:**On-line learning is an important paradigm where training examples appear one at a time and are not saved for batch processing. This is important for efficiency, and increasingly for suitability for systems that continuously interact with their environment. The well studied Perceptron algorithm is a classical example of an on-line learning algorithm, where SVM is the corresponding batch algorithm. Our work investigates on-line learning algorithms and their theoretical and empirical generalization performance. We have also investigated developing kernels for complex structured data and using these with on-line and batch algorithms.

This work is partly supported by NSF grants IIS-0803409 and IIS-0099446. - Spam Filtering [+]

**Description:**In this research, we seek to develop, test, and apply machine learning methods for effective spam filtering in large scale, online settings. - Time Series Data Mining [+]

**Description:**We have multiple projects looking at several aspects of mining and learning with time series data with direct applications in several domains including astrophysics and predictive medicine. Recent projects include detecting anomalous time series using novel and efficient clustering algorithms, clustering and classification of time series data using kernel methods and Gaussian processes, and fast search and analysis algorithms detecting events within time series.

This work is partly supported by NSF grants IIS-0803409 and IIS-0713259.