Kubernetes at 10: Building stateful app storage and data protection
Google engineer Michelle Au was involved early in Kubernetes storage, as the container orchestration platform developed support for data protection operations such as snapshots
Kubernetes is 10 years old. Mid-2024 sees the 10th birthday of the market-leading container orchestration platform.
That decade started as containers emerged as a new way to virtualise applications, but storage and data protection functionality was practically non-existent. Now, Kubernetes offers a mature container platform for cloud-native applications, with all that’s required for the storage of stateful data.
We mark the first decade of Kubernetes with a series of interviews with engineers who helped develop Kubernetes and tackle challenges in storage and data protection – including the use of Kubernetes Operators – as we look forward to a future characterised by artificial intelligence (AI) workloads.
Here, Michelle Au, a software engineer at Google focusing on Kubernetes storage development, talks about getting involved in work such as adding support for snapshots and operators, which add functionality and complexity beyond Kubernetes’ core for advanced services such as storage and data protection.
What was the market like when Kubernetes first launched?
Kubernetes was an early entrant in the space. Containers were just becoming popular and companies were starting to explore the area. There were many alternative workload orchestration projects like Mesosphere, Docker Swarm, Cloud Foundry and Nomad.
How did you get involved in storage for Kubernetes?
In 2017, I joined the Kubernetes team at Google and began working on projects as part of the SIG [special interest group] storage community. Before that, I had only heard about Kubernetes through the grapevine but never actually used it. But I thought it would be a great opportunity to be able to participate in an open source project that was reshaping the industry.
When did you realise that Kubernetes was in the leading position in the market?
At the beginning that wasn’t the case. But seeing the exponential growth of workloads year after year, and hearing all the success stories, especially with stateful workloads, is something that makes me very proud of what we’ve built.
When you looked at Kubernetes, how did you approach data and storage?
When I started learning about Kubernetes, the biggest strengths that immediately popped out at me were the declarative API [application programming interface] and reconciliation design paradigm, workload portability across environments, and standardisation of workload deployment best practices. All these strengths of Kubernetes were also goals that guided me when designing Kubernetes storage features.
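The declarative, reconciliation-based design Au describes can be illustrated with a minimal sketch. This is not Kubernetes code, and all names here are invented for the example: the user declares desired state, a control loop compares it with observed state, and the controller computes whatever actions converge the two.

```python
# Illustrative sketch of the declarative reconciliation paradigm.
# Not real Kubernetes code; the names and data are invented for the example.

def reconcile(desired: dict, observed: dict) -> list[str]:
    """Compare desired state with observed state; return actions that converge them."""
    actions = []
    # Create anything declared but not observed; update anything that drifted.
    for name in desired:
        if name not in observed:
            actions.append(f"create {name}")
        elif observed[name] != desired[name]:
            actions.append(f"update {name}")
    # Delete anything observed but no longer declared.
    for name in observed:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}
observed = {"web": {"replicas": 2}, "cache": {"replicas": 1}}
print(reconcile(desired, observed))  # ['update web', 'create db', 'delete cache']
```

Because the loop always works from declared intent rather than a one-off command, it can be re-run at any time, which is what makes the model robust to failures and partial progress.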
What issues first came up around data and storage with Kubernetes for you?
When I joined the team, running stateful workloads was primarily limited to small-scale deployments in the cloud. There were still many friction points in terms of storage ecosystem support as well as scheduling, maintenance and disruption management. Running these workloads successfully required many custom-built processes, especially around day-two challenges.
One of my first projects was to add support for local storage. While working on that, I realised there was a broader issue of data locality awareness and decided to tackle the problem more broadly by introducing the notion of storage topology to the Kubernetes scheduler. This simplified the process of provisioning stateful workloads to be fault-tolerant across failure domains, whether you’re running in the cloud or on-premises.
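The storage topology idea can be sketched simply: a provisioned volume carries a topology label such as a zone, and a pod using that volume can only run on nodes in the same failure domain. The sketch below is illustrative Python, not the real scheduler; the node names and zones are invented. It also shows why delaying volume binding until a node is chosen gives the scheduler more freedom.

```python
# Sketch of topology-aware volume binding. A volume is pinned to a failure
# domain (here, a zone), so pod placement is constrained to matching nodes.
# Names and zones are illustrative, not real Kubernetes objects.

def schedulable_nodes(nodes: dict, volume_zone: str) -> list[str]:
    """Return the nodes that satisfy the volume's topology constraint."""
    return [name for name, zone in nodes.items() if zone == volume_zone]

nodes = {"node-a": "us-east1-b", "node-b": "us-east1-c", "node-c": "us-east1-b"}

# Immediate binding: the volume was provisioned in us-east1-c before the pod
# was scheduled, so only one node qualifies.
print(schedulable_nodes(nodes, "us-east1-c"))  # ['node-b']

# Delayed binding: choose the node first, then provision the volume in that
# node's zone -- the scheduler could have picked any node.
chosen = "node-a"
print(schedulable_nodes(nodes, nodes[chosen]))  # ['node-a', 'node-c']
```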
Beyond that, we started moving from day-one problems to day-two problems. Snapshot support was a major gap in Kubernetes, yet snapshots are critical to any disaster recovery strategy.
What had to change?
Implementing snapshots was surprisingly non-trivial. Mapping an imperative operation to a declarative API required multiple rounds of brainstorming, storage vendors had slightly different semantics, and there was also the question of how to handle application consistency. But in the end we were able to reach consensus and deliver this critical capability that further strengthened Kubernetes’ stateful workload readiness.
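The difficulty Au describes of mapping an imperative operation onto a declarative API can be sketched as follows. This is a simplified illustration, not the real VolumeSnapshot API: the user declares a record describing the snapshot they want, and a controller performs the one-shot imperative action exactly once, recording the result in the object's status so that re-reconciling is safe.

```python
# Sketch of driving an imperative action (taking a snapshot) from a
# declarative record. All names are illustrative, not the real Kubernetes
# VolumeSnapshot API.

def take_snapshot(volume: str) -> str:
    """Stand-in for a storage vendor's imperative snapshot call."""
    return f"snap-of-{volume}"

def reconcile_snapshot(obj: dict) -> dict:
    """Idempotent reconcile: act only if the snapshot does not exist yet."""
    if not obj["status"].get("ready"):
        handle = take_snapshot(obj["spec"]["volume"])
        obj["status"] = {"ready": True, "handle": handle}
    return obj

snap = {"spec": {"volume": "pvc-1"}, "status": {}}
snap = reconcile_snapshot(snap)
snap = reconcile_snapshot(snap)  # re-running is safe: no second snapshot taken
print(snap["status"])  # {'ready': True, 'handle': 'snap-of-pvc-1'}
```

The key design point is idempotency: because the controller checks status before acting, crashes and retries never produce duplicate snapshots, which is how a fire-once operation fits a reconcile-forever model.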
How did you get involved around Kubernetes operators?
As part of the Google Kubernetes Engine (GKE) team, I have worked closely with customers and partners that are looking to run operators on our platform. I also joined the Data on Kubernetes community as a representative of the Kubernetes project and GKE to better understand the pain points that users face today and help relay those issues back to the Kubernetes project.
What happened around operators that made them a success for data and storage?
The advent of Custom Resources really changed the game for operators. It made it possible to develop a Kubernetes-style declarative API for your specific use case. Many stateful workloads have a number of intricate and involved operational processes that cannot be easily abstracted in Kubernetes’ core, such as configuring HA [high availability] and replication, and managing disruption and upgrade processes. Operators are now able to orchestrate this complexity for end-users.
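The operator pattern over a custom resource can be sketched like this. The resource's spec declares intent ("three replicas, version 15") and the operator encodes the workload-specific procedure to get there safely. Everything here is illustrative: the resource shape, field names and upgrade ordering are invented for the example, not taken from any real operator.

```python
# Sketch of the operator pattern: a custom resource declares intent in spec,
# and the operator derives the ordered, workload-specific steps to reach it.
# Resource shape and step names are illustrative.

def plan(resource: dict) -> list[str]:
    """Derive ordered operational steps from a custom resource's spec vs status."""
    spec, status = resource["spec"], resource["status"]
    steps = []
    # Scale up one replica at a time so the workload keeps quorum.
    for i in range(status["replicas"], spec["replicas"]):
        steps.append(f"add replica {i}")
    # Upgrade secondaries before the primary -- the kind of intricate,
    # workload-specific ordering an operator exists to automate.
    if status["version"] != spec["version"]:
        steps.append(f"upgrade secondaries to {spec['version']}")
        steps.append(f"failover and upgrade old primary to {spec['version']}")
    return steps

db = {"spec": {"replicas": 3, "version": "15"},
      "status": {"replicas": 2, "version": "14"}}
print(plan(db))
```

The point is that none of this ordering logic needs to live in Kubernetes' core: the custom resource gives the workload a declarative API of its own, and the operator supplies the domain knowledge.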
How did this support more cloud-native approaches? What were the consequences?
Declarative APIs can easily enable GitOps or config-as-code paradigms. Operators for stateful workloads make it easier to automate provisioning, upgrades and other maintenance operations, with the added benefit of being compatible with how organisations manage all their other Kubernetes workloads.
Kubernetes is now 10. How do you think about it today?
Kubernetes has come a long way from “no way I would run a database on Kubernetes” to “I’m running databases at petabyte scale with automated rolling upgrades”. Kubernetes has become a stable platform to run some of the most demanding workloads, and that has shown as the project has shifted focus from big workload-enabling features to efforts that improve reliability and stability over the past few years.
What problems still exist around Kubernetes when it comes to data and storage?
Ecosystem discovery is a challenge. There are many vendors in the space, and as an end-user, it’s time-consuming and difficult to evaluate all the options. I think this is where the Data on Kubernetes community can help. They have hosted talks and blogs on a wide variety of stateful workloads that range from introductory to advanced topics.
The growth of AI/ML has also introduced new challenges. AI/ML workloads have very different data patterns and requirements than traditional databases, and typically use object storage and file storage instead of block storage.
Also, multi-cloud or hybrid environments are a reality, which adds additional requirements for cross-environment portability. I have seen our Kubernetes storage abstractions continue to hold up to these new use cases, although I think there is still room for enhancements, especially around data lifecycle and management.
Any other anecdotes or information to share?
We are always looking for more contributors and participants in Kubernetes SIG-storage as well as Data on Kubernetes. Please join these communities if you are interested in sharing your stories or contributing to the project.
Read more about Kubernetes and storage
- Kubernetes at 10: Persistent storage matures, helped by Operators. Google engineer Saad Ali remembers the early days of Kubernetes, the storage challenges faced and overcome, and the revolution in application awareness that comes from Operators.
- Container storage platforms – big six approach starts to align: Container storage is a complex but vital task. We survey the big six storage makers and see methods are starting to align around management platforms.