Distributed Storage Systems

Warning
These notes are potentially out-of-date. Please see the TODO list at the end of the page for latest links.

Abstract

Distributed storage mechanisms are becoming the de facto method of data storage for the new generation of applications – web applications by companies like Google, Amazon and Yahoo!. There are several reasons that distributed storage mechanisms are preferred over traditional relational database systems, including scalability, availability and performance. In this article, we will examine the range of emerging solutions in the industry and proceed to describe the general architecture used in these distributed storage mechanisms.

Introduction

The new generation of applications requires processing of terabytes and even petabytes of data, which is achieved by distributed processing. This is one of the major reasons for the power of web companies such as Google, Amazon and Yahoo!.

There are several reasons for distributed processing. On one hand, programs should be scalable and should take advantage of multiple systems as well as multi-core CPU architectures. On the other hand, web servers have to be globally distributed for low latency and failover.

Distributed processing implies distributed data. This is a different beast compared to traditional relational database systems. Several researchers have suggested that this is the end of an architectural era and that relational database vendors have to start over. We believe that RDBMS systems still have their place and use cases (such as business intelligence applications). However, the web has changed the storage requirements for the next generation of applications. We hear about several companies fighting traditional databases to meet their requirements. There are several lessons to be learned from the web, including simplicity, scalability, caching, flexibility in handling graphs and allowing simple, flexible queries.

In this article, we will examine several emerging solutions such as Amazon Dynamo, CouchDB, Google BigTable and ThruDB. We will explore the internals of these systems and derive a general architecture from them.

Amazon Dynamo

Amazon Dynamo is a highly-available key-value store. It is one of the main pillars behind Amazon.com, one of the largest e-commerce stores in the world.

Dynamo has a primary-key-only interface. This means that data is stored as key-value pairs and the only way to access data is by specifying the key. Values are expected to be relatively small (less than 1 MB).

The data is partitioned and replicated using consistent hashing. The hash key range is distributed among the available machines. Each key range (and its values) is stored on N machines for redundancy. When data needs to be read (or written), a minimum of R (or W) machines must return (or store) the data. This ensures consistency among the replicas.
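
As a rough sketch of the idea (not Dynamo’s actual implementation, which also uses virtual nodes and other refinements), the following Python snippet with made-up node names shows how consistent hashing maps a key to the list of N machines responsible for it (Dynamo calls this the preference list):

    import hashlib
    from bisect import bisect

    # Hypothetical node names; a real Dynamo ring also uses virtual nodes.
    NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    N = 3  # replication factor

    def ring_hash(value):
        # Map a string onto the ring (a large integer space).
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    # Place each node at a position on the ring, sorted by its hash.
    ring = sorted((ring_hash(node), node) for node in NODES)
    positions = [pos for pos, _ in ring]

    def preference_list(key):
        # Walk clockwise from the key's position and collect N distinct nodes.
        start = bisect(positions, ring_hash(key)) % len(ring)
        return [ring[(start + i) % len(ring)][1] for i in range(N)]

    print(preference_list("cart:12345"))  # e.g. ['node-b', 'node-c', 'node-d']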

Data can be requested from any machine. Each machine has enough local routing information to figure out which machines contain the requested data. If the data is on the same machine, it is returned immediately. Otherwise, the request is forwarded to one of the N machines storing the key range that contains this data. The second hop can be avoided if a client library is used, which maintains the routing table and sends the request to one of the N machines directly.

Latency/performance is measured at the 99.9th percentile of the response-time distribution, since this covers nearly all end-users rather than just the average end-user.

Dynamo is designed to be highly available for writes as opposed to reads, since a failed write inconveniences the end-user of the application. So, any data conflicts are resolved at read time rather than write time.

During a read, if there are conflicts, such as some machines having the latest data while others have stale data, Dynamo will try to resolve them by checking the objects’ version history, which is captured using vector clocks. If the latest data descends from the stale data, then the latest data is returned. If the conflicting data do not share the same ancestry, then all the retrieved versions are returned to the client, which resolves the conflict semantically.
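
The ancestry check can be sketched roughly as follows (a simplified Python illustration, not Dynamo’s code): a version descends from another if its vector clock has seen every update the other has recorded, and versions dominated by a newer one are discarded.

    # A vector clock maps a coordinator node id to an update counter,
    # e.g. {"node-a": 2, "node-b": 1}. Node names here are made up.

    def descends(a, b):
        # True if clock `a` has seen every update recorded in clock `b`.
        return all(a.get(node, 0) >= count for node, count in b.items())

    def dominates(a, b):
        # `a` strictly descends from `b`: it has seen everything `b` has, and more.
        return descends(a, b) and a != b

    def reconcile(versions):
        # Drop versions whose clocks are dominated by some other version.
        # One survivor is a clean winner; several survivors are a semantic
        # conflict that Dynamo hands back to the client application.
        return [(clock, value) for clock, value in versions
                if not any(dominates(other, clock) for other, _ in versions)]

    stale = ({"node-a": 1}, "cart with 1 item")
    latest = ({"node-a": 2, "node-b": 1}, "cart with 2 items")
    print(reconcile([stale, latest]))  # only the newer version survives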

Consistency among replicas during updates is maintained by a quorum-like technique with a decentralized synchronization protocol. Merkle trees are used so that only the ‘diff’ of the tree is exchanged in the synchronization mechanism.
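
A rough sketch of the Merkle tree idea (illustrative Python, not Dynamo’s implementation): each replica hashes its key range into a tree, and two replicas need to synchronize only the branches whose hashes differ.

    import hashlib

    def digest(data):
        return hashlib.sha1(data).digest()

    def merkle_root(leaves):
        # Hash pairs of nodes upward until a single root hash remains.
        level = [digest(leaf) for leaf in leaves] or [digest(b"")]
        while len(level) > 1:
            if len(level) % 2:              # duplicate the last node on odd levels
                level.append(level[-1])
            level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    # Two replicas of the same key range agree iff their roots match; on a
    # mismatch they recurse into the children and transfer only the keys
    # under the differing branches rather than the whole range.
    replica_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
    replica_b = [b"k1=v1", b"k2=v2", b"k3=stale"]
    print(merkle_root(replica_a) == merkle_root(replica_b))  # False -> needs sync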

Failures and membership changes are detected by a lightweight gossip-based protocol. Machines talk to each other every 10 seconds and keep each other updated on membership status. If machine A cannot reach machine B, it simply assumes B is down and proceeds to talk to other machines.

The system is completely decentralized, with minimal need for manual administration. However, new machines have to be added manually, since downtimes are usually temporary and it would be wasteful to redistribute a down machine’s data to other machines in the meanwhile. When a new machine is added to the ring, the system automagically starts sharing responsibility with it by handing over a part of the key range. Notice that this only affects the machines adjacent to the point where the new machine is inserted into the ring, and not the other machines.

Many of the system’s characteristics, such as availability, durability and performance, can be tweaked by changing the values of N, R and W. The typical values are 3, 2 and 2 respectively.
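
The reason these knobs work is that with R + W > N every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the most recent write. A small illustrative check (plain Python, toy replica names):

    from itertools import combinations

    N, R, W = 3, 2, 2                 # the typical Dynamo configuration
    replicas = {"a", "b", "c"}        # the N machines holding one key range

    # Because R + W > N, every possible write quorum overlaps every possible
    # read quorum, so stale reads can be detected and resolved.
    assert R + W > N
    print(all(set(w) & set(r)
              for w in combinations(replicas, W)
              for r in combinations(replicas, R)))   # True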

CouchDB

CouchDB is a document-oriented database server, accessible via REST APIs. Couch is an acronym for “Cluster Of Unreliable Commodity Hardware”, emphasizing the distributed nature of the database. CouchDB is designed for document-oriented applications, such as forums, bug tracking, wiki, email, etc.

CouchDB is ad-hoc and schema-free with a flat address space.

A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or lists of these. A Couch database is a flat collection of these documents. Each document is identified by a unique ID, a DocID. Documents are individual storage elements; they have IDs and revisions, and can have attachments of any size.
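
Since the interface is plain HTTP, working with documents can be sketched with nothing more than a stock HTTP client. The snippet below is a rough illustration against CouchDB’s JSON-over-HTTP interface; the host, database name and document fields are placeholders.

    import json
    import urllib.request

    # Assumes a CouchDB server on the default port and an already-created
    # database named 'forum'; both names are placeholders.
    BASE = "http://localhost:5984/forum"

    def put_doc(doc_id, doc):
        # Create a document by PUTting its JSON body to /db/docid.
        # Updating an existing document requires including its current _rev,
        # otherwise CouchDB rejects the write with a conflict (see the
        # optimistic-locking discussion below).
        req = urllib.request.Request(
            "%s/%s" % (BASE, doc_id),
            data=json.dumps(doc).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def get_doc(doc_id):
        with urllib.request.urlopen("%s/%s" % (BASE, doc_id)) as resp:
            return json.load(resp)

    print(put_doc("post-1", {"title": "Hello", "tags": ["intro", "meta"]}))
    print(get_doc("post-1"))   # the stored fields plus _id and _rev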

CouchDB aims to satisfy the Four Pillars of Data Management by the following methods:

  • Save: ACID-compliant, saves efficiently
  • See: easy retrieval, simple reporting methods, fulltext search
  • Secure: strong compartmentalization, ACLs, connections over SSL
  • Share: shares data in a distributed way

The storage model is a Multiversion Concurrency Control (MVCC) system with optimistic locking. A client sees a snapshot of the data and works with this snapshot even if it is being changed at the same time by a different client.

The uniqueness of the data model is append-only writes. Nothing gets overwritten; the newer version of data is appended to the file (the ‘database’), even if the document is ‘deleted’. This ensures that the data is always consistent and never corrupted. There is a header at the beginning of the database file with a count of the data items, stored twice in two consecutive copies. After data is appended, the first copy is updated with the new count and then the second copy is updated. Keeping these two copies in sync maintains the consistency and integrity of the data across server crashes, hardware failures or power outages, because any mismatch between the two copies is automatically handled by CouchDB. An advantage of append-only writes is that CouchDB starts immediately after the machine is rebooted; there are no startup checks or recovery processes that need to run. A compaction routine can be run to get rid of old versions and save space, but at its heart this is a version control system with a REST frontend.

Views on the data for reporting and data aggregation are defined using the ‘Fabric’ language over XML. However, CouchDB is shifting from XML to JSON because it is considered a more natural fit. The querying language for JSON aims to be even more flexible, with a concept called structural patterns, which is analogous to regular expressions for strings. There are no precedents for this concept and it is still under discussion.

Fulltext search is managed by an Indexer daemon and a Searcher daemon. There is a current implementation using Lucene, but any engine can be plugged in by following these daemon APIs.

CouchDB currently has no authentication system, but the idea is to have it built in. This will relieve frontend applications from having to manage it, and at the same time it could be useful in the database context, such as fetching only the data that is accessible to the current user or group.

Replication is distributed: a server can be taken offline, have its data changed, and update the others when it is brought back. If there are conflicts, CouchDB will select a winner and keep that as the latest version. Users can manually override this automatic choice later. Importantly, the conflict resolution yields the same result everywhere, ensuring that offline updates are handled consistently.

There is also potential to write a storage engine for MySQL based on CouchDB. This means laying an RDBMS interface on top of CouchDB.

Google BigTable

Google BigTable is a distributed storage system for managing structured data. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. It has several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

A Bigtable is a “sparse, distributed, persistent multidimensional sorted map”. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes:

   (row:string, column:string, time:int64) -> string

The data model is optimized for storing multiple versions of content such as webpages and similar kinds of documents. It has a focus on quick reads from columns, not rows. The row keys in a table are arbitrary strings (currently up to 64KB in size). Every read or write of data under a single row key is atomic.
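
The map can be pictured as a nested, sorted dictionary. The toy Python sketch below (not Google’s API; the row and column names are made up in the style of the Bigtable paper) indexes values by (row, column, timestamp) and keeps rows in lexicographic order, which is what makes row-range scans cheap:

    from collections import defaultdict

    # table[row][column] maps a timestamp to an uninterpreted byte string,
    # mirroring the (row, column, time) -> string map described above.
    table = defaultdict(lambda: defaultdict(dict))

    def put(row, column, ts, value):
        table[row][column][ts] = value

    def get(row, column):
        versions = table[row][column]
        return versions[max(versions)]          # most recent timestamp wins

    def scan(start_row, end_row):
        # Range scans are cheap because rows are kept in lexicographic order.
        for row in sorted(table):
            if start_row <= row < end_row:
                yield row, table[row]

    put("com.example.www/index.html", "contents:", 1, b"<html>v1</html>")
    put("com.example.www/index.html", "contents:", 2, b"<html>v2</html>")
    print(get("com.example.www/index.html", "contents:"))   # b'<html>v2</html>'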

Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. Tablets are compressed using the unpublished algorithms BMDiff and Zippy. These algorithms do not have high compression ratios (compared to LZW) but are designed for computational efficiency.

Bigtable is built on top of Google File System. The underlying file format is SSTable. SSTables are designed so that a data access requires, at most, a single disk access. An SSTable, once created, is never changed. If new data is added, a new SSTable is created. Once an old SSTable is no longer needed, it is set out for garbage collection. SSTable immutability is at the core of Bigtable’s data checkpointing and recovery routines.
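
The core idea of an immutable, sorted, indexed file can be approximated as follows (an illustrative Python sketch, not the real SSTable format):

    import bisect

    class ToySSTable(object):
        # Immutable sorted table: built once, then read-only. A real SSTable
        # persists sorted blocks plus a block index to GFS; here only the
        # "sorted keys + single lookup" idea is kept.

        def __init__(self, items):
            self._keys = sorted(items)
            self._values = [items[k] for k in self._keys]

        def get(self, key):
            i = bisect.bisect_left(self._keys, key)
            if i < len(self._keys) and self._keys[i] == key:
                return self._values[i]
            return None

    older = ToySSTable({"row1": b"v1", "row2": b"v2"})
    newer = ToySSTable({"row2": b"v2-updated"})      # new data -> a new SSTable

    # Reads consult the newest table first; obsolete tables are eventually
    # merged away and garbage-collected rather than modified in place.
    print(newer.get("row2") or older.get("row2"))    # b'v2-updated'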

Chubby is the distributed lock server that allows a multi-thousand node Bigtable cluster to stay coordinated. Chubby itself is a cluster app. Chubby is architected to keep lock management traffic very light. Chubby also rules over tablet server life and death, stores access control lists, data schemas and the bootstrap location of Bigtable data.

The master assigns tablets to tablet servers, balances the tablet server load, detects the loss or addition of tablet servers, performs garbage collection and some other chores. Importantly, client data doesn’t move through the master. In fact, the system is architected in such a way that most clients never communicate with the master, which helps keep the master lightly loaded in practice.

Each tablet server typically manages between 10 and 1,000 tablets. Each tablet averages about 100-200 MB. Each Bigtable table commonly consists of multiple tablets. Each tablet contains all the data of a contiguous range of rows. A newly created table consists of one tablet; as it grows, it is dynamically split into multiple tablets. This allows the system to scale horizontally in an automatic fashion. Also, Bigtable’s three-level addressing scheme for locating tablets is what keeps lookups scalable as this namespace grows.
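
Splitting can be sketched like this (a simplified illustration with a made-up size threshold): when a tablet grows past the limit, it is divided at a middle row key into two tablets that can then be served by different tablet servers.

    # Illustrative threshold, in the same ballpark as the sizes quoted above.
    TABLET_LIMIT_MB = 200

    def maybe_split(tablet):
        # `tablet` maps row key -> approximate size of that row's data in MB.
        if sum(tablet.values()) <= TABLET_LIMIT_MB:
            return [tablet]
        rows = sorted(tablet)                        # rows stay sorted by key
        mid = rows[len(rows) // 2]                   # choose a middle row key
        return [{r: tablet[r] for r in rows if r < mid},
                {r: tablet[r] for r in rows if r >= mid}]

    tablet = {"aardvark": 90, "kiwi": 80, "zebra": 70}          # 240 MB total
    print([sorted(t) for t in maybe_split(tablet)])
    # [['aardvark'], ['kiwi', 'zebra']]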

Queries (distributed computations) such as filtering, aggregation and statistics are expressed using the Sawzall language.

After the Google BigTable paper was released, there have been attempts to create similar systems. Prominent ones are Apache HBase and Hypertable. There are also similar systems in progress, such as PNUTS, developed internally at Yahoo!.

ThruDB

ThruDB aims to be a complete system to simplify the management of the modern web data layer (indexing, caching, replication, backup) by providing a consistent set of services: Thrucene for indexing, Throxy for partitioning and load balancing, and Thrudoc for document storage.

ThruDB builds on top of several open source projects – Thrift, Lucene (indexing), Spread (message bus), Memcached (caching), Brackup (backup to disk/S3) and also uses Amazon S3.

Thrift (by Facebook) is a framework for efficient cross-language data serialization, RPC, and server programming. Thrift is a software library and set of code-generation tools designed to expedite development and implementation of efficient and scalable backend services. Its primary goal is to enable efficient and reliable communication across programming languages by abstracting the portions of each language that tend to require the most customization into a common library that is implemented in each language. Specifically, Thrift allows developers to define datatypes and service interfaces in a single language-neutral file and generate all the necessary code to build RPC clients and servers.

Thrudoc comes with several data storage backends: Disk, S3, and an as-yet-undocumented Disk+S3 backend. In the latter, data is persisted on local disk, which gives high throughput, while a slave thread quietly replays all of the commands to the S3 backend as well, thus giving a hassle-free persistence and recovery model for virtual environments such as EC2.

TODO
Insert diagram here?

If Thrift objects are stored (vs strings, JSON or XML), then we have backwards-compatible object versioning, including support for new attributes in newer versions of the object.

Types can be specified in the Thrift IDL, and serialization works the same across multiple languages, whether the underlying containers are C++ STL collections or Python dicts.

The system itself is abstracted across Transport and Protocol and is designed for streaming to avoid performance/buffering issues.

ThruDB is memcached-aware; specifying the memcached servers gives the desired speed and throughput.

More

There are several more systems out in the wild as well as emerging systems. Prominent among them are:

  • Amazon Simple Storage Service (S3) is a simple data storage system with a hash-table-like API. It is a hosted service whose internal architecture details are not available. It is claimed that the design requirements of S3 are Scalable, Reliable, Fast, Inexpensive and Simple.
    • Every object stored in Amazon S3 is contained in a bucket. Buckets partition the namespace of objects stored in Amazon S3 at the top level. Within a bucket, you can use any names for your objects, but bucket names must be unique across all of Amazon S3.
    • Within buckets, there are key-value pairs. The value, i.e. the object, can be anything from 1 byte to 5 gigabytes.
    • Objects contain both data and metadata. The data can be anything. Certain predefined metadata is provided by Amazon; other metadata can be set by the user.
  • Nirvanix
  • Amazon SimpleDB is a hosted web service for running queries on structured data in real time. It has the core functionality of a database – real-time lookup and simple querying of structured data.
    • Requires no schema, automatically indexes your data and provides a simple API for storage and access.
    • You organize your structured data into domains and can run queries across all of the data stored in a particular domain. Domains are comprised of items, and items are described by attribute-value pairs. To understand these elements, consider the metaphor of data stored in a spreadsheet table. An Amazon SimpleDB domain is like a worksheet, items are like rows of data, attributes are like column headers, and values are the data entered in each of the cells. However, unlike a spreadsheet, Amazon SimpleDB allows multiple values to be associated with each “cell”. Additionally, each item can have its own unique set of associated attributes (see the sketch after this list).
    • It has a simple query language.
    • It has been criticized by many as a distributed hash-table and not a real database.
  • MemcacheDB is a distributed key-value storage system designed for persistence. It conforms to the memcache protocol. MemcacheDB uses Berkeley DB as its storage backend, so many features, including transactions and replication, are supported.
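
To make the spreadsheet metaphor from the SimpleDB item above concrete, here is a minimal sketch using plain Python data structures (not the SimpleDB API; the domain, item names and attributes are made up):

    # A SimpleDB-style domain sketched as plain data: items are named,
    # attributes can hold one or several values, and items need not share
    # a schema.
    products = {
        "item-001": {"category": ["clothing"], "color": ["red", "blue"], "size": ["M"]},
        "item-002": {"category": ["books"], "title": ["Distributed Systems Notes"]},
    }

    def query(domain, attribute, value):
        # Return item names where `value` appears among the attribute's values.
        return [name for name, attrs in domain.items()
                if value in attrs.get(attribute, [])]

    print(query(products, "color", "blue"))   # ['item-001']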

There is also a whole wide range of research being done in this area such as papers on “Towards efficient search on unstructured data: an intelligent-storage approach” and “Sinfonia: a new paradigm for building scalable distributed systems”.

General architecture

(a.k.a. What sets these new technologies apart from existing ones?)

After examining many of these systems, we can derive some common principles being followed in
distributed storage systems:

  • Four Pillars of Data Management: Save, See, Secure & Share
  • Decentralization of everything, including responsibilities
  • Cheap commodity hardware, not “big iron” hardware
  • Hardware failure is assumed to be normal, the software layer takes this into account and hence there is redundancy across multiple servers or even multiple data centers, etc.
  • No rigorous schemas. Data is ‘loosely structured’.
  • Speed/latency is achieved by reducing number of network hops and ensuring optimal size of data on each server
  • Number of hops required, network switch speed, caching, etc. are more critical factors than simply disk access speed
  • Keeping track of multiple copies of the same data
  • Maintaining consistency across multiple copies of the same data
  • Ability to “tweak the knob” of the main factors such as number of nodes, redundancy, amount of storage per node, etc.
  • Factors to consider
    • Use case i.e. the actual data models used by the typical applications using this system
      • Notice that the storage mechanism is tuned towards the data model and not the other way around.
    • Hardware
    • Data Structure
    • Failure handling and automatic recovery
    • Query Model
    • ACID (Atomicity, Consistency, Isolation, Durability)
    • Efficiency
    • Performance
    • Cost efficiency
    • Availability
    • Durability
    • Security/Authentication/Authorization
    • Incremental Scalability – Ability to add one host at a time
    • Symmetry – everyone has equal responsibility
    • Decentralization
    • Heterogeneity – more capability means more work
    • Simplicity – With all this complexity, the developer interface should still be simple to use

Conclusion

RDBMS won’t go away; they are still definitely needed. However, storage requirements for the new generation of applications are different. The Semantic Web / Web 3.0 is going to be full of semi-structured data on an even larger scale, so it’s prudent for applications to take advantage of the upcoming technologies as soon as possible. One pundit claims: “Semantic Web will start the long, slow decline of relational database technology. Web 3.0 enables the transition from ‘structure upfront’ to ‘structure on the fly’.” Future killer applications and appliances will have to connect to the cloud and hence will be written with distributed storage in mind, whether the applications run on the desktop or on the web.

Distributed storage has many concerns – scalability, hardware requirements, query model, failure handling, data consistency, durability, reliability, efficiency, etc.

The storage architectures and software we have described in this paper are taking big strides in this direction. We have attempted to describe their designs and approaches, and to summarize the general characteristics and architecture of these solutions.

The future of data storage can be seen in innovations such as CouchDB Integration with Abdera, an Atom store. This is clearly in the direction of Adam Bosworth’s vision of ‘Atom API for databases’ as described earlier.

The future of data is “data in the cloud”.

TODO
Weak conclusion.

Please do send in your comments and feedback so that I can improve this document.
