Workflow-centric tracing and advanced diagnosis tools for the cloud ecosystem

March 13, 2019
3:00 PM
Halligan 102
Speaker: Raja Sambasivan, Boston University
Host: Jeff Foster

Abstract

We rely on distributed services running in the cloud for almost every aspect of modern society. It is therefore critical that engineers diagnose problems observed within them as quickly as possible. To help engineers with this extremely challenging task, workflow-centric tracing (also called distributed tracing) has emerged as a foundational technology to inform diagnosis and other management tasks. This is because, in contrast to traditional machine-centric instrumentation methods, it coherently captures all of the work done to process requests within and among the nodes of a distributed service.

In this talk, I will describe my past work on Spectroscope, a tool that uses workflow-centric tracing to automatically localize the source of performance degradations in distributed services. This was some of the earliest work both on building workflow-centric tracing infrastructures suited to performance diagnosis and on demonstrating that such tracing is useful for informing automated diagnosis tools. I will then describe some of my research group’s current efforts, which explore advanced use cases of tracing to address important diagnosis-related challenges. One such challenge is: how can we guarantee that the instrumentation needed to diagnose a new performance problem is present when the problem is observed, both within distributed services and in lower layers of the datacenter stack, such as the guest OS?

While the focus of this talk, as well as of much of the research community and industry, is on the adoption of tracing for diagnosing distributed services, our vision is broader. I will conclude by discussing how workflow-centric tracing and our tools could inform an automated management plane across all the parties involved in the cloud ecosystem (e.g., ISPs, cloud providers, cloud tenants). This is a critical requirement to allow the cloud ecosystem to continue to support innovation and to move past the current oligopoly of a few major cloud providers.

Bio

Raja is a Red Hat Visiting Scientist at Boston University. He works with the Mass Open Cloud and leads a research group focused on creating sophisticated diagnosis tools. He has worked on a wide range of technologies related to the cloud ecosystem, including object-based storage, machine learning methods applied to systems problems, inter-domain routing, and Future Internet Architectures. He completed his PhD (on diagnosis tools for distributed services) and postdoc (on incentive-compatible mechanisms for deploying new inter-domain routing protocols) at Carnegie Mellon University.