WritingScalewayScalewaypublished Sep 9, 2022seen 5d

Building a distributed Key-Value store to power Object Storage

Open original ↗

Captured source

source ↗
published Sep 9, 2022seen 5dcaptured 3dhttp 200method plain

Building a distributed Key-Value store to power Object Storage Build • Louis Solofrizzo • 09/09/22 • 7 min read

As a cloud provider, we handle a considerable amount of data on a daily basis , especially with our Object Storage products . So we needed a way to distribute this data globally, with various consistency, replication, and database sharding for linear read and write latency.

We designed the database we needed in-house. A platform that can scale up to millions of different databases with billions of entries, all the while maintaining client separation, good latency, and great performance. It also had to be:

Flexible because in some regions, we can have multi-datacenter replication

Consistent because there are strict operational requirements on our production

To support continuous growth, any platform needs to be highly scalable

Reliable because even the slightest outage has significant repercussions on client trust (and financial consequences)

Meet the reliability and scaling needs of Object Storage

Hive is a database that stores key-value pairs and shards data across many RAFT clusters, with replication to ensure global availability at the highest level of abstraction. We designed it to scale up to thousands of machines across multiple data centers worldwide and billions of database entries.

For any cloud provider, dealing with failures in infrastructure comprising millions of components is a standard mode of operation; there are always a significant number of failures at any given time. So, we designed Hive to treat failures as the typical case without having an impact on availability or performance.

Ensure consistency and data repartitioning

Clients can use Hive to store data safely, with specific optimizations for specific access patterns. A client can choose a consistency per read or write request.

For instance, for a DNS database engine, consistency might be preferable to have low write latencies. But for an Object Storage engine, strong consistency is paramount.

Rather than creating a generic database engine with a query language, we decided to create specific storage engines optimized for their dedicated use cases.

Hive’s main client is Scaleway’s Object Storage. It stores a reference to objects in the database and uses it for specific Object Storage operations (versioning, using delimiters, or prefix listing).

It also uses consistency features to ensure bucket unicity worldwide and strong consistency multi-datacenter replication to ensure safety.

Our main problem with the previous Object Storage database architecture was the database sharding - the databases were growing larger and this impacted latency and sometimes even replication. We solved this by splitting objects lexicographically among many shards, automatically splitting and merging shards when needed.

We used a modified version of the RAFT quorum protocol to ensure consistency and replication for a shard. In order to avoid having one big RAFT cluster, we split all the shards into RAFT groups, which are RAFT state machines dedicated to specific data sets.

This also enabled us to avoid catastrophic failures in a quorum fail or some other internal error.

Design overview

Hive is composed of many different nodes on different machines, data centers, and regions.

A node stores multiple clusters (thousands of clusters per node) and responds to queries for reading and writing on those clusters. A node can also take clients’ requests and redirect operations to the specific nodes holding the information.

So we split the cluster into two logical parts: Storage & API.

Designing the API

Any node can respond to requests regarding whether or not the node has the data. The node does a cluster resolve on each request, caching most of the results for the next one, and then knows which node it needs to talk to in order to fulfill said request.

This is the opposite of the traditional client redirection from RAFT architectures, as the node does the redirection for the client. This approach enables multiple optimizations, but the main ones are that we can cache the path and the nodes of the requests to avoid having to resolve it again, and we can put a simple load balancer in front of Hive without worrying about redirections.

We made multiple APIs for Hive: frontal ones, usually an HTTP API, that clients talk to, and an internal one, using protobuf for internal node communications.

This design allowed us to bind the client-facing server to a private IP address and use IPv6 internet addresses for node-to-node communication. To ensure the safety of communications between nodes, every packet is end-to-end encrypted.

Designing the storage

Each Hive cluster is composed of at least one storage backend. They can have different database engines and maintain an in-memory state machine. A storage backend implements a subset of a RAFT cluster with a log application and its database engine. It does not know what other backends it is paired with, even though a cluster shares the same RAFT log for all of its backends. Each backend can use different storage engines for their storage: backend ’A’ can use SQLite, and backend ’B’ can use LMDB without any issues.

A common RAFT log is shared by the cluster, storing all the cluster operations and some "meta" operations, like adding a node into a cluster or changing the default consistency of a cluster.

Each node maintains a global cache to quickly resolve a cluster when needed. This cache is stored in RAM, with an AVL tree, and stores all the addresses of the nodes composing a cluster and which node was the last known leader. When a leader changes, any node can redirect the caller to it, and the caller then updates its cache. Upon cluster deletion or re-creation, the node will return a "Does not exist" code, resulting in a cache flush for this particular entry on the caller’s side.

Hive: bird’s eye view

Amazon S3 is a key-value store we exposed it to operate on

Amazon S3 is an Amazon API, which is, in essence, a key-value store. The key is an object name, the value is binary content. Some metadata can be set by the user on an object: tags, access control lists, or simple flags. Alongside this, the actual backend has to store the data’s physical position if one does not store the object content alongside its metadata.

Hive was created for this purpose: it exposes a key-value store HTTP API to an Object Storage API gateway and handles most Object Storage…

Excerpt shown — open the source for the full document.

Additional captured pages

Hive: A Globally-Distributed Key/Value Store Michaël Bonfils mbonfils@scaleway.com Patrik Cyvoct pcyvoct@scaleway.com Florian Florensa fflorensa@scaleway.com Eric Gouyer egouyer@scaleway.com Mickaël Grégori mgregori@scaleway.com Stanislas Lange slange@scaleway.com Nicolas Sampré…