Separating Data from Metadata for Robustness and Scalability
When building storage systems that aim to provide robustness, scalability, and efficiency simultaneously, one faces a fundamental tension: higher robustness typically incurs higher costs and thus hurts both efficiency and scalability. My research takes two crucial steps on this difficult road. First, it shows that an approach to storage system design based on a simple principle, separating data from metadata, can yield systems that address that tension elegantly and effectively in a variety of settings. Two observations motivate our approach: first, data in storage systems is usually large (4 KB to several MB) while metadata is comparatively small (tens of bytes); second, metadata, if carefully designed, can be used to validate data integrity. These observations suggest an opportunity: by applying the expensive techniques that guarantee robustness against a wide range of failures only to metadata, which has little effect on scalability, it may be possible to protect data as well at minimal cost. We show how to exploit this opportunity effectively in two very different systems: Gnothi, a storage replication protocol that combines the high availability of asynchronous replication with the low cost of synchronous replication for small-scale block storage; and Salus, a large-scale block store with unprecedented guarantees of consistency, availability, and durability in the face of a wide range of server failures.
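The second observation, that small metadata can validate large data, can be illustrated with a minimal sketch. It assumes a hypothetical `BlockMetadata` record holding a block identifier, a length, and a SHA-256 digest; these names and this layout are illustrative assumptions and are not the metadata formats used by Gnothi or Salus.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMetadata:
    """Hypothetical metadata record: tens of bytes describing one data block."""
    block_id: int
    length: int
    digest: bytes  # SHA-256 of the block contents (32 bytes)


def make_metadata(block_id: int, data: bytes) -> BlockMetadata:
    """Derive the metadata for a block at write time."""
    return BlockMetadata(block_id, len(data), hashlib.sha256(data).digest())


def validate(meta: BlockMetadata, data: bytes) -> bool:
    """Check a block read from an untrusted replica against trusted metadata."""
    return (len(data) == meta.length
            and hashlib.sha256(data).digest() == meta.digest)


# A 4 KB block and a copy with one corrupted byte.
block = bytes(4096)
meta = make_metadata(7, block)
corrupted = b"\x01" + block[1:]

intact_ok = validate(meta, block)        # block matches its metadata
corrupt_ok = validate(meta, corrupted)   # corruption is detected
```

If the metadata is protected by an expensive but robust mechanism, a block fetched through a cheaper path can still be checked against it, which is the opportunity the paragraph above describes.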
The second step we take addresses a basic question faced by all researchers working on scalable storage: how can we validate the scalability claims of our storage system prototypes? Today, the largest academic testbed we are aware of has about 1,000 machines and 1 PB of disk space, which is about two orders of magnitude smaller than the state of the art for large-scale storage in industry, a gap that is likely to widen in the future. To narrow this gap, we have built Exalt, an emulation tool that runs a large storage system on 100 times fewer machines by compressing data on I/O devices and in memory. Once again, the key to our design is separating data from metadata: we have leveraged the observation that the behavior of storage systems often does not depend on the actual data being stored, and we have developed Tardis, a synthetic data format that allows applications to quickly differentiate data from metadata and to achieve high rates of data compression.
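The idea of a synthetic data format that is easy to tell apart from metadata can be sketched as follows. This is an illustration under stated assumptions only: the `MAGIC` marker, the header layout, and the `EmulatedDisk` class are hypothetical and do not reproduce the actual Tardis format or the Exalt implementation.

```python
import struct

# Hypothetical header: an 8-byte marker followed by the total buffer length.
MAGIC = b"SYNTHDAT"
HEADER = struct.Struct(">8sI")


def make_synthetic(length: int) -> bytes:
    """Build a synthetic data buffer of exactly `length` bytes."""
    if length < HEADER.size:
        raise ValueError("buffer too small to hold the header")
    return HEADER.pack(MAGIC, length) + bytes(length - HEADER.size)


def is_synthetic(buf: bytes) -> bool:
    """Recognize synthetic data from its header alone, without scanning it.

    Assumption: the workload generator is trusted to zero-fill the payload.
    """
    if len(buf) < HEADER.size:
        return False
    magic, length = HEADER.unpack_from(buf)
    return magic == MAGIC and length == len(buf)


class EmulatedDisk:
    """Toy device that stores metadata verbatim but only the size of data."""

    def __init__(self) -> None:
        self._blocks = {}

    def write(self, addr: int, buf: bytes) -> None:
        # Synthetic data collapses to its length; anything else is kept as-is.
        self._blocks[addr] = len(buf) if is_synthetic(buf) else bytes(buf)

    def read(self, addr: int) -> bytes:
        stored = self._blocks[addr]
        return make_synthetic(stored) if isinstance(stored, int) else stored

    def footprint(self) -> int:
        """Bytes the emulator would need to keep, counting a header per data block."""
        return sum(HEADER.size if isinstance(v, int) else len(v)
                   for v in self._blocks.values())


disk = EmulatedDisk()
data = make_synthetic(1 << 20)          # 1 MB of synthetic data
meta = b"inode=42;size=1048576"         # small metadata, stored verbatim
disk.write(0, data)
disk.write(1, meta)
```

In this sketch the system under test still sees full-size reads and writes, while the emulated device keeps only a few bytes per data block; metadata, which drives the system's behavior, is preserved exactly.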