Behind the Scenes: Migrating Duo to Kubernetes
Duo Security has historically been a Python shop hosted on Amazon Web Services (AWS), running on Amazon Elastic Compute Cloud (Amazon EC2) instances. Like many tech companies, we originally adopted a three-tier architecture consisting of load balancers, servers and databases. As we grew, we also added caching to the stack to better meet our customers' needs.
This three-tier architecture has served us well, but it also comes with its own set of challenges, which Duo and many other companies have sought to mitigate with internal tooling.
Here are some of the largest challenges that come with our original approach:
Multiple teams working on the same code base (which has the advantages of making it easier to manage dependencies and facilitating cross-project changes, but also comes with disadvantages, like small changes having a larger ripple effect and making the deployment process more complex)
Lengthy lead time to move app updates from development to production
Impediments to scalability and cost optimization
To address these challenges, our site reliability engineering (SRE) team started investigating microservice architecture, a paradigm popularized by Martin Fowler, who is often described as the “father” of microservices.
The approach goes something like this: Splitting large, monolithic programs into smaller, self-contained, loosely coupled services allows for faster iteration and more independent work to be done by various teams. The technology that enables this is known as containerization — basically, creating lightweight, standalone executable packages of software code that are more nimble and easier to work on independently, compared to a single, massive code base.
A shift to a microservices model comes with its own set of challenges — like additional complexity — but an open-source system known as Kubernetes, or K8s for short, offered promise for automating deployment, scaling and managing containerized applications.
Kubernetes at Duo
While Kubernetes initially had strong competition from several other frameworks, it soon emerged as the industry standard.
As you can imagine, migrating from a classic three-tier stack to a microservices architecture is not an easy task. Our SRE team started discussing a K8s migration proof-of-concept in 2020. We needed to vet K8s to ensure that it was going to meet our high standards for reliability, availability and, most importantly, security.
As an organization, we decided to gradually migrate portions of our codebase to Kubernetes. Rome was not built in a day, as they say, and neither was our infrastructure. We also wanted to avoid the side effects that could arise from migrating everything at once.
To share the knowledge and document all the technical choices made around the migration, we created an architectural decision record (ADR) to capture our decision-making. This also has the advantage of giving context to new engineers as they on-board, and it has drastically decreased the need for future discussions around why individual technological choices were made.
Technical challenges
Our SRE team faced several challenges in implementing Kubernetes, both technological and operational.
The impossible vanilla cluster
When a piece of software is ready to be used without any customization, we call that “vanilla.” But it was obvious that a vanilla K8s cluster did not meet our needs for security, configuration management, policy verification or networking. Several tools have been developed to help manage and simplify the complexities of Kubernetes applications, such as Helm, which bundles various resources into “charts” that can then be installed in one command. We went with kustomize, which offered more granular control over our cluster. Today, we have around 30 add-ons installed on our cluster to make it workload-ready.
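To illustrate the kind of granular control mentioned above, here is a minimal Python sketch of the base-plus-overlay idea behind kustomize: a shared base manifest stays untouched, and each environment supplies only the fields it wants to change. This is a simplified recursive merge written for illustration, not kustomize's actual patching logic, and the manifest contents, the `apply_overlay` helper and the `example-service` name are all hypothetical.

```python
from copy import deepcopy


def apply_overlay(base: dict, patch: dict) -> dict:
    """Return a copy of `base` with `patch` merged in recursively.

    Nested dictionaries are merged key by key; any other value in the
    patch replaces the value in the base. The base is never mutated.
    """
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base manifest shared by every environment.
base_deployment = {
    "kind": "Deployment",
    "metadata": {"name": "example-service"},
    "spec": {"replicas": 1, "template": {"spec": {"hostNetwork": False}}},
}

# Hypothetical production overlay: only the fields that differ.
production_patch = {"spec": {"replicas": 3}}

production = apply_overlay(base_deployment, production_patch)
```

The point of the design is that the overlay stays small and reviewable: the production variant differs from the base only in the replica count, and everything else is inherited unchanged.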
Dealing with legacy and hybrid software
One of the main advantages of containerization, and a key to its success, is its low installation friction. Indeed, “dockerizing” an application is usually trivial, since Docker enables infrastructure engineers to easily replicate the underlying operating system (OS) layer with all the dependencies.
However, I say “usually” because working at a certain scale and with security in mind brings additional dependencies and, sometimes, complicated processes. For us, the challenge was figuring out how to replicate all of our EC2 dependencies inside Kubernetes. That’s why today we have a hybrid stack, where EC2 and Kubernetes workloads are both well-integrated into the same logging and monitoring tools. This was done to ensure technical consistency and to make sure that K8s was not going to be disruptive.
Identity management
Identity management was also a key challenge. Kubernetes presented a whole new infrastructure perimeter, and while we had AWS roles, permissions and account segregation in place, they were hard to replicate on K8s. Since we rely on Amazon Elastic Kubernetes Service (EKS), we combine AWS roles and Kubernetes roles, relying on AWS to manage identity and authorization for access to our clusters. Today our access level is uniform, so one of our remaining challenges is to bring role-based access control into our clusters, which would allow more granular permissions and give our software development teams more ownership.
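As a rough illustration of what more granular, role-based access could look like, the Python sketch below maps a cloud identity to a group and checks a request against that group's rules. It is only a conceptual model under stated assumptions: the role ARN, the group name, the `Rule` type and the `is_allowed` helper are all hypothetical, and this is not how EKS or Kubernetes evaluates permissions internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One permission: which verbs are allowed on a resource in a namespace."""
    namespace: str
    resource: str
    verbs: frozenset


# Hypothetical mapping from an AWS role to a cluster group.
ROLE_TO_GROUP = {
    "arn:aws:iam::111122223333:role/payments-dev": "payments-developers",
}

# Hypothetical rules granted to each group.
GROUP_RULES = {
    "payments-developers": [
        Rule("payments", "deployments", frozenset({"get", "list", "update"})),
    ],
}


def is_allowed(role_arn: str, namespace: str, resource: str, verb: str) -> bool:
    """Return True if the role's group has a rule matching the request."""
    group = ROLE_TO_GROUP.get(role_arn)
    if group is None:
        return False  # unknown identities get no access
    return any(
        rule.namespace == namespace
        and rule.resource == resource
        and verb in rule.verbs
        for rule in GROUP_RULES.get(group, [])
    )


dev_role = "arn:aws:iam::111122223333:role/payments-dev"
can_update_own = is_allowed(dev_role, "payments", "deployments", "update")
can_update_other = is_allowed(dev_role, "billing", "deployments", "update")
```

In this model a team can change deployments in its own namespace but not in another team's, which is the kind of ownership boundary that uniform access cannot express.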
Operational challenges
Code doesn’t exist in a vacuum. In any organization, technological challenges are paired with operational challenges. How do you get a whole engineering organization to learn, adopt, use and grow a new technology?
Knowledge sharing
Earlier I mentioned how developing an ADR was a foundation of the Kubernetes project at Duo. But be mindful: it takes more than a few wiki docs to get an organization moving in a new direction. Over many presentations to everyone from senior leadership to individual contributors, we created excitement around the project and built an understanding of K8s's strengths and drawbacks. We also developed dedicated documentation and workshops to ensure that teams were well-armed to move their services over to Kubernetes.
Training
We fostered internal collaboration by creating a Kubernetes learning group, which would gather every two weeks to ramp up our skills and work toward Certified Kubernetes Administrator certification. We also learned from the mistakes of others through an archive of Kubernetes failure stories. Indeed, learning by understanding the failures of other companies would give us precious insights on how to operate K8s. We ran a Kubernetes book club across the organization to introduce K8s concepts and spark discussions about architectural decision making in the Kubernetes space. Lastly, we would often invite subject-matter experts to present on a particular area of K8s, such as networking, scaling and so on. This helped ensure our training stayed aligned with our actual implementation at Duo.
Recruiting
The last challenge I will talk about is recruiting. While Kubernetes is the industry leader for container orchestration, it’s challenging to find engineers who are experienced with it. It has a steep learning curve and requires a significant time investment to learn.
And while Kubernetes has a strong reputation for scalability, we have found it to be a large leap from operating a cluster with a few nodes to operating dozens of clusters with dozens — or even hundreds — of nodes and high-volume traffic. That’s one of the reasons there are only a handful of companies operating with this kind of infrastructure. That’s why, within Duo, we created a dedicated team of people with Kubernetes experience to help bridge the gap between core engineers working on maintaining the cluster, and software developers who wanted to tap into Kubernetes. This team, SRE Applications, helps maintain and promote K8s knowledge while keeping the software development team’s perspective in mind, too. And, of course, we are always looking for engineers with Kubernetes experience!
Where things stand today
Today at Duo, we are serving our customers from 10 regional data centers across the world and are about to add two more. In our biggest cluster, around 400 pods are currently running across roughly 20 different services to serve our customers.
The advantages of our shift to Kubernetes have been striking. It used to take a dozen hours to redeploy a service after its code was updated. Today it takes about 10 minutes, and simple configuration changes can be deployed in less than five minutes. This allows us to be more nimble in pushing needed updates to our customers as quickly as possible, while still maintaining our focus on security and quality.
The road to Kubernetes implementation was not easy, especially since it introduced a new way of doing things on the application layer as well. Kubernetes has a steep learning curve, so engineers need time to experiment and learn. At Duo, we overcame those operational challenges by investing in training and recruiting and by nurturing collaboration. For the technical challenges, we bridged the gap between traditional and new architectures with hybrid workloads, and we hardened our clusters.
The benefits of Kubernetes are numerous. There are good reasons it has become the standard: it builds on the knowledge of the engineers who worked on Borg, Google's internal cluster manager, and it is the new abstraction layer for infrastructure. Because it is open source, with no fewer than 3,180 contributors today, we can rely on it with confidence and run our own audits if necessary.
Yes, there are more challenges to overcome to fully integrate Kubernetes in our stack, but we are also now leading a proof-of-concept with the new big thing in the infrastructure world: service mesh. Come experiment with us!