Understanding and handling Storage System failures
Date
2023-12
Authors
Zhang, Duo
Major Professor
Advisor
Zheng, Mai
Huang, Cheng
Zhang, Hongwei
Gulmezoglu, Berk
Trajcevski, Goce
Committee Member
Abstract
Building a resilient storage system is difficult, even for seasoned professionals. A recent case in point is the incident at the Algolia data center [alg (2015b)], dubbed “When Solid State Drives Are Not That Solid,” in which Samsung SSDs were erroneously blamed for failures that were actually caused by a Linux kernel bug [alg (2015a)]. As system complexity continues to grow, understanding and diagnosing failures is likely to become even harder. To gain a deeper understanding of real-world failures and improve the foundations of robust storage systems, we first conducted three empirical studies: one of 207 bug reports sourced from Bugzilla [Zhang and Zheng (2021)]; one of 59 bug patches for file system-aware applications [Zhang et al. (2022)], accompanied by reproducibility experiments; and one of 1553 bug patches related to persistent memory in the Linux kernel source tree [Zhang et al. (2021b)]. In these studies, we characterized the issues along several dimensions, including resolution time, implicated kernel components, reproducibility, and root causes, which gave us a quantitative view of the challenges. We also examined a subset of these issues in depth and created a data set called “BugBench.” This repository records general bug patterns, consequences, fix strategies, triggering conditions, and reproducibility assessments, along with their implications for building effective tools to address these issues. Next, we identified potential shortcomings of existing related tools and, using virtual machines and static analysis, worked to improve the effectiveness of testing and debugging methodologies [Zhang et al. (b); Zhang et al. (a); Zhang et al. (2020b); Zhang et al. (2020a); Gatla et al. (2023)].
Through our experiments, we identified 29 potential issues in persistent memory kernel drivers and uncovered several unexpected system behaviors under different full-system tests. Our correlation-based debugging technique significantly reduced the search space for root causes, narrowing it to a small fraction (0.06% to 6.2%) of the original kernel function trees generated by FTrace. Beyond refining existing tools, our enhancements improve low-level observability and extend the tools' applicability to state-of-the-art storage systems. We hope that this work and the open-source “BugBench” repository will catalyze further research and contribute to measuring and improving system reliability on a broader scale.
Type
dissertation