Identifying and Analyzing Potential Scalability Faults in Large-Scale Distributed Systems

December 8, 2022
3:00-4:15pm ET
Cummings 270, Zoom
Speaker: Haryadi Gunawi, University of Chicago
Host: Raja Sambasivan


Highly scalable systems are prevalent today, and many of them are publicly accessible: one can simply download the code and deploy it in the cloud. These systems are inherently easy to scale, but scale can be a "foe." Many recent works have highlighted the problem of scalability faults, faults that appear only when the system code is deployed at scale and are not necessarily visible in smaller-scale deployments. However, many questions remain unanswered. What are the most common root causes of scalability faults that stem from the source code? Which parts of the source code are potentially "harmful" when deployed at scale? How does each scalable system dimension (#nodes, #files, etc.) impact different parts of the system code?

To address these questions, we first performed a large study of real-world scalability faults over two years. This was a manual, arduous process because developers do not use any special tags to mark scalability faults. We performed the study on 10 large-scale distributed systems and tagged 350 fault reports as scalability faults. We then present SView, a framework for identifying and analyzing potential scalability faults in large-scale distributed systems. SView combines instrumentation and statistical concepts to identify dimensional code fragments (DCFs), pieces of code whose number of executions (e.g., loop iterations) is positively correlated with the size of one or more system dimensions, with static analysis modules that detect faulty code patterns involving the DCFs. We apply SView to 4 popular distributed systems, identify hundreds of DCFs, and use our analysis modules to detect the harmful ones, including both known and unknown scalability faults.
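The core idea behind DCF identification, correlating a code fragment's execution count with a system dimension's size, can be sketched as follows. This is an illustrative assumption about the statistical test, not SView's actual implementation; all names and the threshold are hypothetical.

```python
# Illustrative sketch: flag a code fragment as "dimensional" when its
# execution count grows with a system dimension (e.g., #nodes).
# Names and the 0.9 threshold are hypothetical, not SView's actual API.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def is_dcf_candidate(dimension_sizes, execution_counts, threshold=0.9):
    """A fragment is a DCF candidate if its execution count (e.g., loop
    iterations) is strongly positively correlated with the dimension size."""
    return pearson(dimension_sizes, execution_counts) >= threshold

# Example: loop iteration counts measured at cluster sizes 4..32 nodes.
nodes = [4, 8, 16, 32]
print(is_dcf_candidate(nodes, [4, 8, 16, 32]))  # grows with #nodes -> True
print(is_dcf_candidate(nodes, [7, 7, 7, 7]))    # constant -> False
```

In this sketch, a fragment whose iteration count scales with #nodes is flagged for the static analysis stage, while dimension-independent fragments are filtered out.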


Haryadi S. Gunawi is an Associate Professor in the Department of Computer Science at the University of Chicago, where he leads the UCARE research group (UChicago systems research on Availability, Reliability, and Efficiency). He received his Ph.D. in Computer Science from the University of Wisconsin–Madison in 2009 and was a postdoctoral fellow at the University of California, Berkeley from 2010 to 2012. His current research focuses on cloud computing reliability and new storage technology. He has won numerous awards, including the NSF CAREER Award; Facebook, Google, and NetApp Faculty Research Awards; an NSF Computing Innovation Fellowship; and an Honorable Mention for the 2009 ACM Doctoral Dissertation Award. His research focuses on improving the dependability of storage and cloud computing systems in the context of (1) performance stability, where he is interested in building storage and distributed systems that are robust to latency tails on unpredictable storage; (2) reliability and scalability, where he is interested in combating concurrency and scalability bugs in cloud-scale distributed systems; and (3) interactions of machine learning and systems, specifically how machine learning techniques can address operating/storage system problems.

Please join the meeting in JCC 270 or via Zoom.

Join Zoom Meeting:

Meeting ID: 960 3825 1227

Passcode: see colloquium email

Dial by your location: +1 646 558 8656 US (New York)
