You have an app and it needs to retrieve data. Naturally, it makes a network call, and that request travels over the internet until it eventually reaches your infrastructure. In today’s world, how that infrastructure is provisioned, managed, and updated makes a big difference in reliability and scalability. To solve these common and complex problems, the Cloud Native Computing Foundation (CNCF) creates best practices, incubates software, and helps organize information for companies’ journeys into the cloud and out of their own data centers. So let’s take a look at how we got to cloud native.

The Ancient Past

In the before times, companies would purchase hardware and bake/fry servers into existence. At the time, it was pretty much the only option, and as a result, very expensive. As more and more users were onboarded, scalability (and by extension reliability) became problems that were solved by throwing more hardware at them. That approach only goes so far: eventually you run out of room in the data center, or acquiring more hardware becomes too costly. An additional problem is configuration drift. Since each piece of hardware is unique, partial upgrades/downgrades, incomplete deployments, mismatched OS/tool versions, and so on could leave each server in a state where what was tested may not work or where heisenbugs appear. This also exacerbates the dev vs. prod parity problem: dev is typically ahead of prod (and there is no spare hardware available to mirror prod), so reproducing and troubleshooting problems in an unlike environment adds complexity and may mask the issue.

The Past

The next step to avoid the hardware cost problem is to buy a select number of beefy machines and leverage virtualization, so that lots of small virtual machines run the backend applications on the expensive hardware. The application’s architecture greatly influences how successful this strategy is (you may end up paying about the same as you would on bare metal), but in general you save on overall costs. So what do you get with virtualization? You get an isolated environment, so management of the virtualized OS is separate from the host OS, and you get snapshots. The isolated environment gives you an additional security barrier (assuming no critical kernel issue or hypervisor escape bug), so if something gets compromised you can blow it away and rebuild much more easily than cordoning off physical hardware. Snapshots help with that too, but their bigger advantage is that you can snapshot before rolling out OS and application updates, which makes rollbacks very simple. Depending on how changes are managed, there is still the possibility of configuration drift, just not at the scale of physical hardware. There are still some problems, though. For scalability, you can still hit maximum throughput and need more hardware to run all of the virtual machines; for reliability, if you allocate resources wrong (or have a misbehaving application), other virtualized apps can get resource starved and perform poorly. The dev vs. prod parity problem still exists as well, but it can be handled much more gracefully since you can image the VM and move it to the dev cluster for troubleshooting.

The Present

Expanding on virtualization lands us where we are today: containerization. Containers, like virtual machines, offer isolated environments, but with less overhead because no guest OS is needed. The main benefit, though, is that containers are immutable: what is built and tested can be promoted (deployed rather than installed) to every environment with no additional testing for environmental differences other than load, which also eliminates the dev vs. prod parity problem. This is also a big win for scalability (or, more precisely, elasticity), as exact copies can be deployed in minutes and torn down when no longer needed. If you run the containers yourself, you can still run out of resources, but on a cloud provider (e.g. AWS, Azure, or GCP) this should be a non-issue (assuming you have sane auto-scaling policies and can afford the bill next month). However, like virtualized applications, containers share the resources of the host, so if you have sized your containers wrong or another application is using all of the resources, your own applications can suffer. This can be solved by paying a premium for dedicated hosts so your containers don’t share hardware with other customers.
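To make that concrete, here is a minimal sketch of the kind of stateless service that gets baked into an immutable image. The PORT variable and /healthz endpoint are illustrative choices, not requirements of any particular platform; the point is that an orchestrator can probe the health endpoint and spin identical copies up or down as load changes.

```python
# Minimal sketch of a stateless service suited to an immutable container image.
# The port variable and endpoint path are illustrative, not tied to any platform.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # An orchestrator can probe this to decide when to add or remove copies.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # The same image runs everywhere; only the injected environment differs.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), Handler).serve_forever()
```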

The Future?

As you may have noticed, transitioning from bare metal to containers increases complexity: more software pieces are needed to accomplish the same goal. For some, their applications can have all of this complexity abstracted away and become serverless. There are limits to this, though: data is not persisted by default, and the available resources can be more constrained since you are paying by execution time and memory rather than for a dedicated instance. The primary concern is latency, since there is no dedicated instance of your application running and you pay the startup (cold start) cost much more frequently. However, if your application can take advantage of this model and the trade-offs are deemed negligible, it may be the way to go, as you no longer manage a significant portion of what you would in all of the other models.
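As a rough sketch, a serverless function boils down to a handler that the platform invokes per request. The example below assumes an AWS Lambda-style Python handler behind an HTTP trigger; the event shape and field names will differ by provider.

```python
# Sketch of a serverless handler (AWS Lambda-style Python signature).
# The event shape and response format assume an HTTP/API Gateway style trigger.
import json

def handler(event, context):
    # No local state survives between invocations; anything persistent
    # must live in an external store.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```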

Development and Organizational Requirements

Regardless of whether you are virtualizing on a cloud provider, running containers, or going serverless, monolithic and service-oriented architectures alike depend on good engineering and organizational health. Being cloud native, however, really encourages the following: independent, loosely coupled, capability-oriented services that can run in dynamic environments (public, private, and hybrid multi-cloud) in an automated, scalable/elastic, resilient, manageable, and observable way. To get there, you need to modernize the software development life cycle (SDLC), for example with Agile, by creating applications that adhere to the 12 Factors.
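One of the most visible of the 12 Factors is keeping configuration in the environment, so the same build artifact can run unchanged in every environment. A minimal sketch, with illustrative variable names:

```python
# Sketch of 12-factor config: all environment-specific settings come from the
# environment, so the same build runs in dev, staging, and prod.
# The variable names are illustrative.
import os

class Config:
    def __init__(self):
        # Backing services are attached resources, referenced by URL.
        self.database_url = os.environ["DATABASE_URL"]
        self.cache_url = os.environ.get("CACHE_URL", "")
        self.log_level = os.environ.get("LOG_LEVEL", "INFO")

config = Config()
```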

To start embracing the 12 Factors, begin integrating a rugged culture and DevSecOps into your organization. By making security and DevOps first-class concerns in engineering, you improve reliability: security issues are less likely to take you down, and DevOps ensures that configuration and other operational concerns are included in the requirements and development process rather than tacked on later, which increases your productivity and stability.

Next, adopt Site Reliability Engineering (SRE) to improve the stability of releases and reduce the time needed to restore functionality when an incident occurs.
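A core SRE tool is the error budget: the amount of unreliability your SLO allows over a period. A quick back-of-the-envelope calculation (the 99.9% target is just an example, not a recommendation):

```python
# Back-of-the-envelope error budget math, a core SRE tool.
slo = 0.999                      # example availability objective
period_minutes = 30 * 24 * 60    # a 30-day window

error_budget_minutes = (1 - slo) * period_minutes
print(f"Allowed downtime per 30 days at {slo:.1%}: {error_budget_minutes:.1f} minutes")
# ~43.2 minutes; once the budget is spent, prioritize reliability work over features.
```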

Lastly, to get to continuous deployment, roll out GitOps. It expands DevSecOps by fully automating all deployments and changes, ensuring that every change is documented, source controlled, approved, and easy to roll back, so that system management, stability, and reliability are always accounted for.

Now that you’ve done that (or are in the process), how do you measure success? By leveraging the Four Key Metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service), you can create OKRs to track how you are progressing, with data from your service level indicators (SLIs). Additionally, your KPIs should be tied to your service level objectives (SLOs).
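As a small illustration of the SLI/SLO relationship, here is a sketch that turns raw request counts (made-up numbers) into an availability SLI and checks it against an example SLO:

```python
# Sketch of turning raw counts into an SLI and checking it against an SLO.
# The numbers are made up for illustration.
good_requests = 998_700
total_requests = 1_000_000

availability_sli = good_requests / total_requests   # the indicator you measure
availability_slo = 0.999                             # the objective you commit to

print(f"SLI: {availability_sli:.4%}  SLO: {availability_slo:.1%}")
print("Within SLO" if availability_sli >= availability_slo else "Error budget burning")
```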

Putting this all together, you can reach Cloud Native and have a healthy engineering organization and happy customers.


Each of the topics above has far more background than I could possibly cover here, and there are nuances as to which technology or approach is right for a given application. Luckily, the CNCF provides a guide that goes into much more detail about each aspect of Cloud Native.