
Behind the Scenes of C14 Cold Storage

Build • Scaleway • 06/03/20 • 8 min read

Scaleway Glacier is a storage class of Scaleway Object Storage, built for archiving data at a low cost. It is not a product in itself but an extension of our object storage: it cannot be accessed without the Object Storage API.

In this article, we will go over how the project was born and how it was developed, and share some technical insights on it.

Where It All Started

In 2016, Scaleway Dedibox launched an archiving product: the C14 Classic, still available. The product was very hardware-centric, mainly built around two major aspects:

Competitive SMR disks,

Motherboards, manufactured internally, that can power on/off disks on demand with a SATA bus tree matrix.

On the API side, C14 Classic was built like a vault: one opens a vault, puts data inside it, then closes the vault (archiving it). The main shortcoming with that design was that you needed to unarchive an entire archive to access a single file in it. In other words, to access a 1 GB file inside a 40 TB archive, you first needed to unarchive 40 TB.

The main concern with this design was the “data front ends”, in other words the big servers that held the unarchived data. As you can imagine, multiple clients unarchiving multiple multi-terabyte vaults can fill up a server relatively quickly, thus blocking other clients from using the service.

In 2019, an internal Proof of Concept (PoC) was developed to demonstrate that we were able to use the C14 Classic hardware with the new Scaleway Object Storage. There were limitations, of course, but the PoC was very conclusive. Indeed, the C14 Classic hardware was rock stable, and the product turned out to be reliable, with a solid SLA. In addition, it allowed us to reuse 40 PB of C14 hardware that was already in production.

As a result, huge efforts were deployed to transform the PoC into a production-ready project. A year later, the C14 Cold Storage beta was born at Scaleway.

How C14 Cold Storage Was Born

Integration Within Object Storage

First of all, the project was to be used with the Object Storage API. As a result, we needed a standard way to access the API. Lots of patches were deployed on our Object Storage gateways to ensure compliance with Amazon’s S3 Glacier storage class.

We also had to start working on a complete lifecycle engine, since that feature is required by Scaleway Glacier.

Learning From Our C14 Classic Mistakes

After 3 years of running C14 Classic in production, we had learned a thing or two about archiving data: what works and what does not. The main objective was to build C14 Cold Storage around read optimisation rather than write optimisation. We also decided to use file systems on the SMR disks, since in those 3 years many patches had landed in the Linux kernel to optimise filesystem interaction with SMR disks.

Architecture Overview

Internal Insights

Hardware

MrFreeze

MrFreeze is the name of the board that hosts the archiving solution. The board was made in-house, and we did not modify it for the Scaleway Glacier. It comes with the following:

A very basic 4-core ARM CPU

2 GB of RAM

2 SATA buses

56 disk slots

4x 1 Gbit/s network ports

As you can see, the main feature of the board is to host a lot of disks, with only two SATA buses to access them. Therefore only two disks can be powered on simultaneously. The board itself exposes a SATA-line and power-line API through GPIO ports, and on the software side we keep a “what-to-power” map in order to switch on disk X or Y.
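To make the "what-to-power" idea concrete, here is a minimal sketch of such a map. It is not the actual MrFreeze code: the `slot_wiring_t` and `board_t` types, the GPIO numbers and the "first half of the slots on bus 0, second half on bus 1" wiring are all assumptions made for illustration. Only the figures of 56 slots and 2 SATA buses come from the article.

```c
#include <assert.h>
#include <stdbool.h>

#define DISK_SLOTS 56   /* from the article: 56 disk slots per board */
#define SATA_BUSES 2    /* from the article: 2 SATA buses per board */
#define NO_DISK   (-1)

/* Hypothetical map entry: which bus and which GPIO lines serve a slot. */
typedef struct {
    int bus;         /* SATA bus the slot is wired to (0 or 1) */
    int power_gpio;  /* assumed GPIO number driving the power line */
    int sata_gpio;   /* assumed GPIO number selecting the SATA line */
} slot_wiring_t;

typedef struct {
    slot_wiring_t wiring[DISK_SLOTS];
    int active[SATA_BUSES]; /* slot currently powered on per bus, or NO_DISK */
} board_t;

/* Assumed wiring: slots 0..27 on bus 0, slots 28..55 on bus 1. */
static void board_init(board_t *b)
{
    for (int slot = 0; slot < DISK_SLOTS; slot++) {
        b->wiring[slot].bus = slot / (DISK_SLOTS / SATA_BUSES);
        b->wiring[slot].power_gpio = 100 + slot; /* made-up numbering */
        b->wiring[slot].sata_gpio = 200 + slot;  /* made-up numbering */
    }
    for (int bus = 0; bus < SATA_BUSES; bus++)
        b->active[bus] = NO_DISK;
}

/* Power on a slot; refuse if its bus is already serving another disk. */
static bool board_power_on(board_t *b, int slot)
{
    assert(slot >= 0 && slot < DISK_SLOTS);
    int bus = b->wiring[slot].bus;
    if (b->active[bus] != NO_DISK && b->active[bus] != slot)
        return false;
    /* A real implementation would toggle power_gpio and sata_gpio here. */
    b->active[bus] = slot;
    return true;
}

/* Power off a slot and free its bus if that slot was the active one. */
static void board_power_off(board_t *b, int slot)
{
    assert(slot >= 0 && slot < DISK_SLOTS);
    int bus = b->wiring[slot].bus;
    if (b->active[bus] == slot)
        b->active[bus] = NO_DISK;
}

/* Number of disks currently powered on (never more than SATA_BUSES). */
static int board_powered_count(const board_t *b)
{
    int n = 0;
    for (int bus = 0; bus < SATA_BUSES; bus++)
        if (b->active[bus] != NO_DISK)
            n++;
    return n;
}
```

With this layout, powering on slot 0 and then slot 1 fails on the second call because both sit on bus 0, while slot 0 and slot 30 can be on together. Keeping one "active" entry per bus is what enforces the two-disks-at-a-time limit.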

One funny quirk of this design is that we do not need to cool down the boards that much: at least 54 of the 56 disks are powered down at all times, so heat dissipation works well. In addition, a whole rack with 9.8 PB of storage (22 chassis * 56 disks * 8 TB/disk) consumes less than 600 W.
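The rack figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the article (22 chassis, 56 disks per chassis, 8 TB per disk, 2 SATA buses per chassis, under 600 W per rack). The function names are ours, and the watts-per-terabyte value is an upper bound derived from the "less than 600 W" statement, not a measured figure.

```c
#include <assert.h>
#include <stdint.h>

/* Raw capacity of a rack, in TB. */
static uint64_t rack_capacity_tb(uint32_t chassis,
                                 uint32_t disks_per_chassis,
                                 uint32_t tb_per_disk)
{
    return (uint64_t)chassis * disks_per_chassis * tb_per_disk;
}

/* Upper bound on disks powered on at once: one per SATA bus. */
static uint32_t rack_max_active_disks(uint32_t chassis,
                                      uint32_t buses_per_chassis)
{
    return chassis * buses_per_chassis;
}

/* Power budget per stored terabyte, in W/TB. */
static double watts_per_tb(double rack_watts, uint64_t capacity_tb)
{
    assert(capacity_tb > 0); /* guard the division */
    return rack_watts / (double)capacity_tb;
}
```

Here `rack_capacity_tb(22, 56, 8)` gives 9,856 TB, which is the 9.8 PB quoted above, and `rack_max_active_disks(22, 2)` gives 44, so at most 44 of the rack's 1,232 disks are powered on at any moment. At 600 W, that works out to roughly 0.06 W per terabyte as a ceiling.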

The main caveat, of course, is that a disk needs some time to be powered on: around 15 seconds from power-on to mountable by the userspace.

SMR Disks

The disks that we use to fill those boards are 8 TB Seagate SMR disks. These disks are built for storage density, and thus are perfect for data archiving. They can be quite slow, especially for writes, but that is a downside that comes with all SMR disks.

Location Constraints

The C14 racks are located in our Parisian datacenters, DC2 and DC4, also known as The Bunker. Since one rack is around 1 metric ton of disks, we cannot place them wherever we want; for example, a MrFreeze rack in our AMS datacenter is totally out of the question. So, we need to be able to transfer data from the AMS datacenter (or WAW datacenter) to the ones located in Paris.

Software

Freezer

The Freezer is the software that operates the MrFreeze board. It is responsible for powering disks on and off, and for actually writing data on them. It is a very ‘simple’ piece of software, however, since all the database and intelligence live in the worker, which runs on far more powerful machines.

The Freezer communicates with the worker over a TCP connection, and exposes a basic binary API:

    typedef enum {
        ICE_PROBE_DISK = 0,  /*!< Probe a disk for used and total size */
        ICE_DISK_POWER_ON,   /*!< Power on a disk */
        ICE_DISK_POWER_OFF,  /*!< Power off a disk */
        ICE_SEND_DATA,       /*!< Send data */
        ICE_GET_DATA,        /*!< Get data */
        ICE_DEL_DATA,        /*!< Delete data */
        ICE_SATA_CLEAN,      /*!< Clean all the SATA buses, power off everything */
        ICE_FSCK_DISK,       /*!< Request for a filesystem integrity check on a disk */
        ICE_LINK_DATA,       /*!< Link data (zero copy) */
        ICE_ERROR,           /*!< Error reporting */
    } packet_type_t;

For simplicity, we logically split the freezer in two parts, one per SATA bus. So in reality, two workers are speaking to one freezer, each one with a dedicated bus. We will not go into details about the freezer since it’s mainly dumping data from a socket to an inode (or the opposite for a read), and performing some integrity checks.
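As an illustration of what a "basic binary API" over TCP can look like, here is a sketch of a fixed-size request header built around the packet types above. The article does not describe the real wire format, so the `ice_header_t` structure, its fields (`disk`, `payload_len`), the 10-byte layout and the big-endian length are all assumptions, not the Freezer's actual protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packet types as listed in the article. */
typedef enum {
    ICE_PROBE_DISK = 0,
    ICE_DISK_POWER_ON,
    ICE_DISK_POWER_OFF,
    ICE_SEND_DATA,
    ICE_GET_DATA,
    ICE_DEL_DATA,
    ICE_SATA_CLEAN,
    ICE_FSCK_DISK,
    ICE_LINK_DATA,
    ICE_ERROR,
} packet_type_t;

/* Hypothetical header: the real layout is not given in the article. */
typedef struct {
    packet_type_t type;
    uint8_t disk;          /* assumed: disk slot targeted by the request */
    uint64_t payload_len;  /* assumed: bytes of payload after the header */
} ice_header_t;

#define ICE_HEADER_SIZE 10 /* 1 (type) + 1 (disk) + 8 (length) */

/* Serialise a header into buf, length in big-endian. Returns bytes written. */
static size_t ice_header_encode(const ice_header_t *h,
                                uint8_t buf[ICE_HEADER_SIZE])
{
    assert(h->type <= ICE_ERROR);
    buf[0] = (uint8_t)h->type;
    buf[1] = h->disk;
    for (int i = 0; i < 8; i++)
        buf[2 + i] = (uint8_t)(h->payload_len >> (56 - 8 * i));
    return ICE_HEADER_SIZE;
}

/* Parse a header. Returns 0 on success, -1 on an unknown packet type. */
static int ice_header_decode(const uint8_t buf[ICE_HEADER_SIZE],
                             ice_header_t *h)
{
    if (buf[0] > ICE_ERROR)
        return -1;
    h->type = (packet_type_t)buf[0];
    h->disk = buf[1];
    h->payload_len = 0;
    for (int i = 0; i < 8; i++)
        h->payload_len = (h->payload_len << 8) | buf[2 + i];
    return 0;
}
```

In this sketch, a worker sending 4,096 bytes to disk 12 would encode `{ ICE_SEND_DATA, 12, 4096 }`, write the 10 header bytes to the socket, then stream the payload. Encoding the length byte by byte keeps the format independent of the host's endianness, which matters when an ARM board talks to different worker machines.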

Integration Within the Object Storage Stack

Before we explain how the actual worker works, we need to explain our Object Storage stack: once an object is created, it is split into erasure-coded chunks (6+3), and a rawx is elected to store each chunk. The rawx is a simple binary much like the freezer, without the disk-power bit: it simply dumps a socket into an inode, or the opposite. So for an object, a minimum of 9 rawxs are used to store the actual data.
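The 6+3 scheme has simple consequences for chunk size, storage overhead and fault tolerance, sketched below. This is arithmetic only, not an erasure coder. It assumes an object is split evenly across the 6 data chunks and that the code is a standard MDS code such as Reed-Solomon, where any 6 of the 9 chunks are enough to rebuild the object. The article does not name the code used, and the function names are ours.

```c
#include <assert.h>
#include <stdint.h>

#define EC_DATA_CHUNKS   6  /* from the article: 6 data chunks */
#define EC_PARITY_CHUNKS 3  /* from the article: 3 parity chunks */
#define EC_TOTAL_CHUNKS  (EC_DATA_CHUNKS + EC_PARITY_CHUNKS)

/* Size of each chunk: the object split across data chunks, rounded up. */
static uint64_t ec_chunk_size(uint64_t object_size)
{
    return (object_size + EC_DATA_CHUNKS - 1) / EC_DATA_CHUNKS;
}

/* Total bytes written across all 9 chunks, each on its own rawx. */
static uint64_t ec_stored_bytes(uint64_t object_size)
{
    return ec_chunk_size(object_size) * EC_TOTAL_CHUNKS;
}

/* Assuming an MDS code: readable while at least 6 chunks are reachable. */
static int ec_is_readable(int available_chunks)
{
    assert(available_chunks >= 0 && available_chunks <= EC_TOTAL_CHUNKS);
    return available_chunks >= EC_DATA_CHUNKS;
}
```

For a 6,000-byte object, each chunk is 1,000 bytes and 9,000 bytes are stored in total, a 1.5x overhead. Under the MDS assumption the object survives the loss of any 3 rawxs.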

For hot storage, we do not need to go further than that. We have a rawx per disk…
