Software faults, failures and their consequences: What can we learn from them and do about them?
Software failures and their underlying bugs are among the most prevalent causes of system outages. New development methodologies, automatic testing, source code checking, and more sophisticated debugging techniques have significantly reduced the number of bugs present in software when it is delivered for operation. Nevertheless, a non-negligible fraction of bugs is still present during operation, with potentially catastrophic consequences. Two main factors are responsible: 1) the market pressure to deploy new services and features as soon as possible and 2) the growing complexity of the software needed to provide the services required by the market.
So, the question arises: What can we do about these software bugs during operation?
Software faults have been classified according to different characteristics. We propose a new classification based on the characteristics of the software faults themselves, rather than on the type of trigger that makes the bug surface or the type of software failure: faults are classified into Bohrbugs and Mandelbugs. This theoretical classification has a practical consequence: each type of bug requires different mitigation techniques. A detailed study of software failures in 8 JPL/NASA missions will be presented, analyzing the software bugs, failures and their mitigations from different aspects. The results of this study and the techniques used to perform it are being piloted at JPL as part of a continuous improvement effort whose goal is to improve the understanding and management of failure behavior for the robotic spacecraft deployed by JPL. Finally, I will discuss aging-related bugs (a subset of Mandelbugs) and present experimental research evaluating several software rejuvenation approaches, such as those that use machine learning algorithms to predict software failures caused by aging-related bugs.
Javier Alonso received the master’s and Ph.D. degrees in Computer Science from the Technical University of Catalonia (Universitat Politècnica de Catalunya, UPC, Spain) in 2004 and 2011, respectively. From 2006 to 2011, he held an assistant lecturer position in the Computer Architecture Department of UPC. Since 2011, he has been a Postdoctoral Associate under the mentorship of Prof. Kishor S. Trivedi in the Electrical and Computer Engineering Department at Duke University. Dr. Alonso’s research interests are in software engineering and distributed systems, with special attention to dependability, availability, resilience and software rejuvenation. His main goal is to develop mechanisms that deal with software faults and their consequent failures during operation, in order to guarantee a high quality of service to end users. He has been involved in projects funded by JPL/NASA, NATO, NEC, Huawei and WiPro.