Learning Safe Policies in Sequential Decision-Making Problems
In online advertisement, as well as many other fields such as health informatics and computational finance, we often face the situation in which we are given a batch of data generated by the current strategy (or strategies) of the company (hospital, investor) and are asked to produce a good, or even optimal, new strategy. Although there are many techniques for finding a good policy from a batch of data, few results guarantee that the obtained policy will perform well in the real system without deploying it. On the other hand, deploying a policy can be risky, and thus requires convincing the product (hospital, investment) manager that it is not going to harm the business. This is why it is extremely important to devise algorithms that generate policies with performance guarantees.
In this talk, we discuss three different approaches to this fundamental problem, which we call model-based, model-free, and risk-sensitive. In the model-based approach, we first use the batch of data to build a simulator that mimics the behavior of the dynamical system under study (online advertisement, the emergency room of a hospital, a financial market), and then use this simulator to generate data and learn a policy. The main challenge here is to obtain guarantees on the performance of the learned policy, given the error in the simulator. This line of research is closely related to robust learning and control. In the model-free approach, we learn a policy directly from the batch of data (without building a simulator), and the main question is whether the learned policy is guaranteed to perform at least as well as a baseline strategy.
This line of research is related to off-policy evaluation and control. In the risk-sensitive approach, the goal is to learn a policy that manages risk by minimizing some measure of variability in the performance in addition to maximizing a standard criterion. We present algorithms based on these three approaches and demonstrate their usefulness in real-world applications such as personalized ad recommendation, traffic signal control, and American option pricing.
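The model-free and risk-sensitive ideas above can be illustrated with a toy sketch. Assuming a one-step (bandit-style) log of (context, action, reward, behavior-probability) tuples, importance sampling re-weights logged rewards to estimate a new policy's value without deploying it, and a mean-variance penalty yields a simple risk-sensitive score; the function names, the toy log, and the penalty coefficient are illustrative assumptions, not the algorithms presented in the talk.

```python
def is_estimate(log, target_prob):
    """Importance-sampling estimate of a target policy's value from a log
    of (context, action, reward, behavior_prob) tuples that were collected
    by a different (behavior) policy."""
    # Re-weight each logged reward by pi_target(a|x) / pi_behavior(a|x).
    returns = [target_prob(x, a) / b * r for (x, a, r, b) in log]
    mean = sum(returns) / len(returns)
    var = sum((g - mean) ** 2 for g in returns) / len(returns)
    return mean, var

def risk_sensitive_score(mean, var, lam=0.5):
    # Mean-variance criterion: reward high estimated value,
    # penalize variability of the estimate.
    return mean - lam * var

# Toy log from a uniform behavior policy (behavior_prob = 0.5 everywhere).
log = [
    (0, 1, 1.0, 0.5),
    (0, 0, 0.0, 0.5),
    (1, 1, 1.0, 0.5),
    (1, 0, 0.5, 0.5),
]

# Hypothetical target policy: always choose action 1.
target_prob = lambda x, a: 1.0 if a == 1 else 0.0

mean, var = is_estimate(log, target_prob)
print(mean, var)                        # 1.0 1.0
print(risk_sensitive_score(mean, var))  # 0.5
```

The same re-weighting idea extends to multi-step trajectories by multiplying per-step probability ratios along each trajectory.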
Mohammad Ghavamzadeh received a Ph.D. degree in Computer Science from the University of Massachusetts Amherst in 2005. From 2005 to 2008, he was a postdoctoral fellow at the University of Alberta. He has been a permanent researcher at INRIA in France since November 2008. He was promoted to first-class researcher in 2010, received the "INRIA award for scientific excellence" in 2011, and obtained his Habilitation in 2014. He is currently (since October 2013) on a leave of absence from INRIA, working as a senior analytics researcher at Adobe Research in California on projects related to digital marketing.
He has been an area chair, a senior program committee member, and a program committee member at NIPS, ICML, IJCAI, AAAI, UAI, COLT, and AISTATS. He has been on the editorial board of the Machine Learning Journal (MLJ) and has been a reviewer for JMLR, MLJ, JAIR, JAAMAS, the Journal of Operations Research, IEEE TAC, and Automatica. He has published over 40 refereed papers in major machine learning, AI, and control journals and conferences, and has organized several tutorials and workshops at NIPS, ICML, and AAAI. His research spans machine learning, artificial intelligence, control, and learning theory, with a particular focus on investigating the principles of scalable decision-making and on devising, analyzing, and implementing algorithms for sequential decision-making under uncertainty and reinforcement learning.