Creating Cockpit: From ecosystem tool to observability product
Build • Maxime Besson • 25/05/23 • 5 min read
At Scaleway, we provide our customers with two types of services. On the one hand, we have billed products. On the other hand, we provide certain features for free: our ecosystem tools.
One of the tools most awaited by our customers was Scaleway’s monitoring system. That journey began several years ago and led to the public release of our fully-managed observability solution, Cockpit, on May 9, 2023.
But here’s the thing: Cockpit is actually both an ecosystem tool and a product. Initially, we only provided a monitoring solution for our product teams, an internal tool. But as Cockpit grew, it became a fully-fledged product for our customers.
Now, you can monitor both your Scaleway infrastructure data and your application data. And you can import data from other cloud providers you might be using.
Product-specific monitoring wasn’t enough
Let’s take a trip down memory lane. Before we created a shared infrastructure monitoring tool for our customers, each Scaleway product had to handle monitoring for itself. Teams had to:
Select relevant metrics and logs
Retrieve them
Store them
Process them
Create the necessary APIs for data retrieval
This approach had several challenges:
Information storage was scattered across multiple locations
Data retrieval APIs weren’t standardized
There were discrepancies in metric and log availability between products
It soon became clear that a uniform and comprehensive solution for all our products (and consequently our customers) was needed.
Balancing the needs of multiple stakeholders
Our team started by trying to understand the infrastructure monitoring needs of both our product teams (especially the platform team) and our customers. One thing became clear fairly quickly: while the general metrics for each product were the same for 90% of users, some of the data, the way it is processed, and the collection frequency were very specific to each product.
So we couldn’t just create a one-size-fits-all solution for all products in our ecosystem. Instead, each product team had to remain in charge of the metrics and logs available for their product within our monitoring solution.
We therefore established guidelines and provided support to help the teams with the following:
Key observability concepts
The importance of selecting the appropriate frequency for each metric
Scalability and storage considerations related to metric cardinality
Best practices for creating dashboards
Best practices for setting up alerts
The necessary technical documentation for sending metrics and logs to our internal APIs
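To make the points about frequency and cardinality concrete, here is a minimal sketch of the arithmetic involved. It is not taken from our internal guidelines: the label names, the value counts, and the helper functions `series_count` and `samples_per_day` are all hypothetical, and the estimate is an upper bound that assumes every label combination actually occurs.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each distinct combination of label values is its own series, so the
    bound is the product of the number of distinct values per label.
    """
    return prod(label_values.values())


def samples_per_day(series: int, interval_seconds: int) -> int:
    """Samples stored per day when every series is collected at a fixed interval."""
    return series * (86_400 // interval_seconds)


# Hypothetical label sets, for illustration only.
base = {"region": 3, "instance": 200, "status": 5}
with_user = {**base, "user_id": 10_000}  # one high-cardinality label

print(series_count(base))       # 3000
print(series_count(with_user))  # 30000000

# Same metric collected every 15 seconds versus every 60 seconds.
print(samples_per_day(series_count(base), 15))  # 17280000
print(samples_per_day(series_count(base), 60))  # 4320000
```

A single unbounded label multiplies the series count by its number of values, which is why cardinality tends to matter more for storage than the number of metrics itself.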
Scalability through components
Monitoring infrastructure is already complex enough when it’s just for a single company. We knew that scaling the system and managing the volume of data when making monitoring available to all our customers would be a huge challenge.
We partly addressed this by constructing a unified system — completely separate systems for each product wouldn’t scale well — while also breaking it down into separate components that can scale separately.
Each component of our infrastructure is developed individually to ensure stability, performance, and the ability to scale. As a result, each component can handle a significant workload. And that makes the entire system scale.
Think of it like a bunch of LEGO bricks! If each brick is stable on its own, the whole structure built from the bricks is stable as well, and it can grow.
Taking it from tool to product with Cockpit
The initial tests of our architecture design were highly satisfying and gave us great confidence in our ability to handle the load. This sparked an idea: what if we opened our APIs to our customers, allowing them to push their metrics and logs for applications as well?
This would have significant benefits for them:
The ability to consolidate Scaleway infrastructure data with their application data
Managing their observability solution
A unified, open-source alternative to proprietary options, addressing a previously identified need
So, we began conducting more extensive user research. Several crucial points quickly emerged:
The need to retrieve infrastructure data (our tool was already providing that)
A need for transparency and understanding of observability costs, along with greater predictability of what companies will have to pay
The requirement for an easy-to-use solution
Opening our APIs
Clearly, a more comprehensive observability solution was needed. So we decided to design and develop our infrastructure to receive metrics and logs from our customers. We opened our APIs so that customers could push their data and read it back remotely.
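As a rough illustration of what pushing logs can look like, the sketch below builds a request body in the JSON format used by Grafana Loki's push API. This is an assumption on our part for the example: this article only states that logs are queried with LogQL and does not describe the push format, endpoint, or authentication. The function name `build_loki_push_payload` and the label values are hypothetical, and nothing is sent over the network.

```python
import json
import time
from typing import Optional


def build_loki_push_payload(
    labels: dict[str, str],
    lines: list[str],
    now_ns: Optional[int] = None,
) -> str:
    """Build a Loki-style JSON push body for one log stream.

    Each entry is a [timestamp, line] pair, where the timestamp is a
    string holding nanoseconds since the Unix epoch.
    """
    ts = str(now_ns if now_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {
                "stream": labels,  # labels identifying the stream
                "values": [[ts, line] for line in lines],
            }
        ]
    }
    return json.dumps(payload)


# Hypothetical labels and log line, for illustration only.
body = build_loki_push_payload(
    {"app": "checkout", "env": "staging"},
    ["payment accepted"],
)
print(body)
```

The body would then be sent with an HTTP POST to whatever push endpoint and credentials the service documents.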
Our observability solution now includes the following:
Grafana-as-a-Service: A managed Grafana solution for our customers, entirely developed by our team. It responds and renders quickly, even though it’s built on a serverless architecture that scales to zero when the client isn’t viewing their dashboards. By default, it comes with pre-built dashboards for all clients, allowing them to monitor their Scaleway infrastructure within five minutes of activating the monitoring solution.
A remote read API based on PromQL and LogQL, the open-source, standardized query languages
A managed alert manager, also pre-populated with default alerts for all Scaleway products (which can be activated or deactivated by the client)
An information hub accessible to all product teams and a front end to display information in the Scaleway console (e.g., product metrics, consumption, etc.)
Coming soon: A new version of our graphs in the console allowing product-specific monitoring of service health
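For the read API in the list above, a PromQL query typically travels as URL parameters. The sketch below builds a range-query URL in the style of the Prometheus HTTP API (`/api/v1/query_range` with `query`, `start`, `end`, and `step`). Whether Cockpit exposes exactly this path is an assumption, and the base URL, the metric name, and the helper `build_range_query_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(
    base_url: str,
    promql: str,
    start: int,
    end: int,
    step_seconds: int,
) -> str:
    """Build a Prometheus-style range query URL.

    start and end are Unix timestamps in seconds; step is the resolution
    of the returned series.
    """
    params = {
        "query": promql,
        "start": start,
        "end": end,
        "step": f"{step_seconds}s",
    }
    # urlencode escapes braces, quotes, and brackets in the PromQL expression.
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical endpoint and metric name, for illustration only.
url = build_range_query_url(
    "https://metrics.example.invalid/prometheus",
    'rate(http_requests_total{status="500"}[5m])',
    start=1_700_000_000,
    end=1_700_003_600,
    step_seconds=60,
)
print(url)
```

Because the query language is standard, the same expression can be pasted into a Grafana panel or sent from a script without translation.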
During the private and public betas, we monitored traffic, scale, and how clients used the product. We learned that we could handle significant workloads and scale accordingly to meet the expectations of Scaleway customers and products. We also noticed that we needed to set appropriate limits and best practices to ensure system security and sustainability.
Over time, we developed the right techniques and tools to allow our customers to adapt their usage of our product based on real-world needs. Finally, Cockpit — as both a tool and a product — was born and made available to all Scaleway users in March 2023.
Based on our information about how the product was being used, we also devised a pricing strategy that allows…