Manifesting reliability issues in Storage Systems

Gatla, Om Rameshwar
Major Professor
Zheng, Mai
Duwe, Henry
Jones, Phillip H
Kothari, Suraj C
Rozier, Kristin Yvonne
Committee Member
Computer Engineering
Storage systems are vital for managing the ever-increasing data generated by High Performance Computing (HPC) and cloud-based applications, so ensuring reliability while providing the desired performance is important. However, building reliable storage systems is challenging, and a system may fail for reasons such as power faults, device failures, and software bugs. In such events, storage systems rely on recovery components to bring the system back to a consistent state. Unfortunately, similar failure events may occur during recovery itself and can lead to severe corruption in the file system. At the same time, storage systems are constantly updated to accommodate new storage technologies such as Persistent Memory (PM) devices to satisfy the demand for high performance. PM devices are storage-class memory devices that offer low access latency and data persistence. In addition, these devices offer new features such as Direct-Access (DAX), which bypasses the complex Linux storage stack. However, building new storage systems on PM devices is challenging. First, data on these devices is accessed differently: unlike traditional storage devices, which operate over a block I/O interface, PM devices operate over a memory I/O interface, so system developers need to develop new methods to access data. Second, the Linux kernel had to be modified to include new drivers for these devices, and file systems had to be modified to support the new DAX feature. These modifications can increase the complexity of the storage stack and may hinder the reliability of the storage system. Therefore, as a first step towards building reliable storage systems, this dissertation focuses on manifesting the reliability issues described above. We begin by analyzing the impact of interrupted recovery procedures on the durability of storage systems.
To do this, we build a fault injection framework to systematically interrupt the recovery procedures of four popular Linux file systems (Ext4, XFS, Btrfs, and F2FS). We observe that interrupted recovery not only induces severe corruption in the file system, but that this corruption is permanent and cannot be fixed by another run of recovery. We conclude this part by building a generalized redo log library with transaction support that can be easily integrated with existing recovery components to provide some resilience against interruptions. Second, we analyze the impact of the PM software stack on system reliability by studying PM-related issues reported in the Linux kernel. To do this, we collect all patches submitted to the Linux kernel over the last decade and extract 1,553 PM-related kernel patches. We study these patches in depth and characterize PM-related bugs based on their cause. In addition, we conduct experiments on PM bug reproducibility and evaluate existing bug detection tools to derive multiple insights, such as bug manifestation conditions and remedies. The motivation for this study is to assist future work in building tools that can effectively manifest these bugs. To that end, we have open-sourced our dataset and the workloads used to reproduce a subset of PM bugs.
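The abstract does not describe how the redo log library works internally, so the following is only a minimal sketch of the general redo-logging idea, not the dissertation's implementation. It assumes a hypothetical `RedoLog` class, a JSON log file, and a plain dictionary standing in for on-disk state; the `crash_after` parameter is likewise a hypothetical hook that mimics injecting a fault in the middle of recovery. The intended writes are made durable first, and replay is idempotent, so an interrupted replay can simply be run again.

```python
import json
import os
import tempfile


class RedoLog:
    """Hypothetical redo log: persist intended writes first, apply them later."""

    def __init__(self, path):
        self.path = path

    def commit(self, writes):
        # Make the whole transaction durable before touching the target.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"writes": writes}, f)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: the log is either fully visible or absent.
        os.replace(tmp, self.path)

    def replay(self, target, crash_after=None):
        """Apply logged writes to `target`; return False if there is no log.

        `crash_after` (an assumption for illustration) simulates an
        interruption after that many writes have been applied.
        """
        if not os.path.exists(self.path):
            return False
        with open(self.path) as f:
            record = json.load(f)
        for i, (key, value) in enumerate(record["writes"]):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("simulated interruption during recovery")
            target[key] = value  # overwriting is idempotent, so re-running is safe
        # Discard the log only after every write has been applied.
        os.remove(self.path)
        return True
```

For example, after `log.commit([("a", 1), ("b", 2)])`, a call to `log.replay(disk, crash_after=1)` fails partway, but the log file survives, and a second `log.replay(disk)` brings `disk` to the intended state. A real recovery tool would additionally need to flush the target's updates to stable storage before removing the log.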