K8s security - Episode 5: Lock your data
Deploy • Emmanuelle Demompion • 12/05/21 • 7 min read
In episode 4, we detailed some of the main security issues that are found in software, and it is no surprise that information leakage is one of the most frequent security flaws.
Laws and regulation
Data regulation exists around the world, with various laws, restrictions and agreements. Each country or group of countries inevitably starts making its own rules when it comes to personal information.
Case in point, we have the GDPR (General Data Protection Regulation) in Europe, the LGPD (Lei Geral de Proteção de Dados) in Brazil, the PDP (Protección de Datos Personales) in Argentina, and the PPA (The Privacy Protection Authority, formerly ILITA) in Israel. Of course, there are more, each with different levels of personal data protection, regulation, and authorization.
For businesses working across borders, data regulation has become complicated, to say the least, since the GDPR was adopted in 2016. Many questions have been raised, especially around data pipelines, anonymization, data mining, and machine learning.
Personal information
There are many data protection rules, but if we want to do the best we can to respect them, it really comes down to two very simple concepts:
Anonymization: All personal information should be anonymized if possible. Of course, it is impossible to anonymize something like customer billing information, but when it comes to more generic data, anonymization is rule number one.
Justification: Gathering data may be important for your business, but is all data relevant? Unfortunately, the answer is often no. Keeping customer data because "it might be useful one day" is not a good enough reason, and is not compliant with regulations such as the GDPR. Know your data, know why you need it, and document your justification.
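To make the first concept more concrete, here is a minimal Python sketch of one common approach: dropping fields that are not needed and replacing an identifier with a keyed hash. The field names, the `PSEUDONYM_KEY` value, and both function names are hypothetical and only serve the illustration. Note that a keyed hash is pseudonymization rather than true anonymization: as long as the key exists, the data can in principle be linked back to a person, so it should still be handled as personal data.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real one would be loaded
# from a secret manager and never committed to source control.
PSEUDONYM_KEY = b"replace-with-a-secret-key"


def pseudonymize(value, key=PSEUDONYM_KEY):
    """Return a stable keyed hash (HMAC-SHA256) standing in for an identifier."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def anonymize_record(record, drop=("name", "address"), tokenize=("email",)):
    """Drop fields we cannot justify keeping and tokenize the identifiers."""
    cleaned = {}
    for field, value in record.items():
        if field in drop:
            continue  # not needed for analytics, so it is not kept at all
        if field in tokenize:
            cleaned[field + "_token"] = pseudonymize(value)
        else:
            cleaned[field] = value
    return cleaned
```

Calling `anonymize_record({"name": "Ada", "email": "ada@example.com", "plan": "pro"})` would return a record containing only `plan` and an `email_token`, which is usually enough to count distinct users without storing who they are.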
Of course, these rules might be "simple" to understand, but implementing them can very quickly become a nightmare and raise a lot of questions depending on the data processing you need to carry out.
Let's take an example
With the GDPR, any customer can ask for their personal information to be removed from any database at any time. Now, let's imagine that you have a customer's personal information in a database, and also another database with anonymized additional data.
Without an association table linking the two, you cannot tell which anonymized records belong to that customer, so you cannot remove them. But then, do you actually need to remove them if they cannot be traced back to their original owner?
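The scenario can be sketched with Python's built-in `sqlite3` module. The schema below (`customers`, `usage_events`, `customer_anon_link`) is entirely hypothetical; it only illustrates how an association table makes an erasure request answerable, under the assumption that both datasets live in the same database.

```python
import sqlite3


def create_schema(conn):
    """Create a personal table, an anonymized table, and the link between them."""
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE usage_events (anon_id TEXT, action TEXT);
        CREATE TABLE customer_anon_link (customer_id INTEGER, anon_id TEXT);
    """)


def erase_customer(conn, customer_id):
    """Delete a customer and every anonymized row still linked to them."""
    with conn:  # one transaction: committed on success, rolled back on error
        conn.execute(
            "DELETE FROM usage_events WHERE anon_id IN "
            "(SELECT anon_id FROM customer_anon_link WHERE customer_id = ?)",
            (customer_id,),
        )
        conn.execute(
            "DELETE FROM customer_anon_link WHERE customer_id = ?",
            (customer_id,),
        )
        conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
```

If `customer_anon_link` did not exist, the first `DELETE` would have nothing to join on, which is exactly the situation described above: the anonymized rows stay, and the open question is whether that is acceptable.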
Let's go further, and imagine that this data is used in the training datasets of a machine learning algorithm which takes days to run. Are you supposed to delete this specific data entry from the training datasets, meaning that you will need to re-train your entire machine learning model?
Fortunately, we are given some latitude here since complying with everything at once is impossible for some businesses, and it is often considered that if data cannot be traced back to its original owner, this is "good enough" in terms of compliance.
But will it stay that way? What other regulations lie ahead of us, and what solutions will we need to address them?
If you want to have a look at data regulation around the world, the CNIL (Commission Nationale de l'Informatique et des Libertés) provides a world map showing the degree of data protection in each country.
Data Breaches
Going back to the technical side of things, Veracode issued a whitepaper covering the biggest data breaches of 2020. The number of customers and companies impacted is almost beyond imagination, and most of them are barely known to the public.
It also shows that giant companies such as Microsoft or Nintendo are not immune to data breaches and security flaws, and that from small businesses to IT giants, security should be, more than ever, everyone's concern.
The biggest breaches expose personal information publicly, ranging from personal communications to account credentials, and add up to billions of records over the course of 2020.
The data reveals that information leakage, CRLF injection, cryptographic issues, and code quality are the most common security vulnerabilities plaguing applications today. Fortunately, we know that through secure coding best practices, educational training, and the right combination of testing tools and integrations, developers are able to write more secure code from the start — which means producing innovative applications that avoid cyberattacks and reduce the risk of costly breaches.
Source: Veracode, The Biggest Data Breaches of 2020
Kubernetes and data management
How can data be managed and protected in a Kubernetes environment?
Kubernetes is often described as stateless, meaning that it is not meant to host persistent data directly on the nodes. This is only logical, since nodes managed by Kubernetes can be auto-healed (i.e. replaced) to keep the cluster healthy, and can even be created or deleted by the node auto-scaling feature.
Scalability, cost control, and constant cluster health checks come at a price, and this price is statelessness.
Persistent volumes and encryption
Statelessness does not mean that data cannot be stored while using a Kubernetes cluster, just that using the local filesystem of your cluster nodes is not the way to do it. That is where persistent volumes come into play, allowing you to use remote storage (block storage with ReadWriteOnce / RWO access, or NFS with ReadWriteMany / RWX access) and access it from your pods.
Persistent volumes can be used to store any kind of data, and can even serve as the storage backend of a database running inside the Kubernetes cluster. And as with any Kubernetes object, access to persistent volumes can (and should) be restricted.
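As a rough illustration of how a pod gets access to remote storage, the Python sketch below builds a PersistentVolumeClaim and a pod that mounts it, as plain dictionaries that can be serialized to JSON (which `kubectl apply -f` accepts alongside YAML). The resource names, the `my-block-storage` storage class, the image, and the mount path are placeholders chosen for this example, not values from the article.

```python
import json


def build_pvc(name, storage_class, size="10Gi", access_mode="ReadWriteOnce"):
    """Describe a claim for remote storage with the given access mode."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def build_pod(name, image, claim_name, mount_path="/data"):
    """Describe a pod that mounts the claim instead of using the node's disk."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "volumeMounts": [{"name": "data", "mountPath": mount_path}],
            }],
            "volumes": [{
                "name": "data",
                "persistentVolumeClaim": {"claimName": claim_name},
            }],
        },
    }


if __name__ == "__main__":
    # Placeholder names; adapt them to your own cluster.
    pvc = build_pvc("db-data", "my-block-storage")
    pod = build_pod("db", "postgres:16", "db-data")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [pvc, pod]}, indent=2))
```

Because the data lives in the claimed volume rather than on the node, the pod can be rescheduled onto a replacement node without losing it.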
Additionally, it is always good practice to encrypt the data you store. Most cloud providers' CSI (Container Storage Interface) drivers support the encryption of persistent volumes.
Example of encryption options in a Kubernetes storage class (Scaleway's CSI)
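The original illustration is not available in this capture, so the following is a reconstruction in Python rather than the author's own example. It builds a StorageClass manifest with encryption enabled, plus the Secret holding the passphrase. The provisioner name `csi.scaleway.com`, the `encrypted` parameter, the `csi.storage.k8s.io/node-stage-secret-*` parameters, and the `encryptionPassphrase` secret key are written from memory of the Scaleway CSI driver documentation and should be treated as assumptions to verify against the driver version you run. The resource names are placeholders.

```python
import json


def build_encrypted_storage_class(name, secret_name, secret_namespace="default"):
    """StorageClass asking the CSI driver to encrypt every volume it provisions."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "csi.scaleway.com",  # assumed driver name
        "reclaimPolicy": "Delete",
        "parameters": {
            # Assumed parameter names; check the driver documentation.
            "encrypted": "true",
            "csi.storage.k8s.io/node-stage-secret-name": secret_name,
            "csi.storage.k8s.io/node-stage-secret-namespace": secret_namespace,
        },
    }


def build_passphrase_secret(name, passphrase, namespace="default"):
    """Secret referenced by the StorageClass; the key name is an assumption."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "stringData": {"encryptionPassphrase": passphrase},
    }


if __name__ == "__main__":
    # Placeholder names and passphrase, for illustration only.
    sc = build_encrypted_storage_class("block-encrypted", "enc-secret")
    print(json.dumps(sc, indent=2))
```

Any PersistentVolumeClaim that sets `storageClassName` to this class would then receive an encrypted volume, with no change needed in the pods that mount it.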
Different policies for different data
In software development, we can identify at least three types of data, each of which has its own specificities and requirements, and each of which should be treated according to its nature and purpose. Using the same data storage for all types of data does not make any sense,…