Kubernetes: Batteries Not Included

If you’ve been around Kubernetes for a while, it’s probably no surprise that Kubernetes is both too much and not enough, depending on who you are and what you need.

Kubernetes feels like it should be useful to everyone. Every company needs a website and a mobile app these days. Every company has internal tools and systems. Software is all migrating to microservices and distributed systems. They all need databases and message brokers and file storage systems. More and more companies are using machine learning and other complex software systems to drive business value. And Kubernetes can handle all of that, right?

But Kubernetes can’t do it all by itself. Anyone who wants to deploy and operate Kubernetes clusters first has to scour the internet for the latest and greatest addons and integrations from vendors and the open source ecosystem.

The Cloud Native Landscape is a good place to start, but it’s not a puzzle box top that shows you what the final picture looks like. It’s more like the window display at the puzzle store. Except, you don’t just need to assemble one puzzle, you need to buy a dozen puzzles and cut and glue them all together into a crazy (hopefully beautiful) mosaic.

But is building an internal cloud platform core to your business? Or is the platform just the means to an end?

Don’t get me wrong, building platforms is a fun challenge! It’s also been the focus of my career and specialization, so I hope it stays in demand. But unless it’s providing a competitive advantage, or you need it and can’t buy it, it’s generally best to focus on your company’s core competencies, right?

Yet over and over again, I see companies small and large building their own platforms, either from scratch or around Kubernetes.

Types of Platform Builders

Over the past few years, I’ve noticed a common pattern in the industry:

  1. You put together a platform with whatever works because all you really care about is deploying your workload.
  2. You put together a platform team to build a production-ready platform, because you realize that software delivery and reliability are crucial to your business, and your previous platform was a mess and didn’t scale beyond the team that built it to host their workload.
  3. You’re selling your platform (or are so large that you have scads of internal customers), so you have multiple teams building pieces of the platform and an integration team gluing them together.

These three groups have vastly different requirements (and they’re all hiring Kubernetes & DevOps experts).


The Platform Cowboy

The first group would really rather be using a PaaS like Heroku or some hosted solution, but Heroku is expensive at scale and a little long in the tooth. It doesn’t support anything but stateless apps, and its curated integrations are limited. So you branch out. You know containers and Kubernetes are the new hotness, and maybe you used them at your last job. You may go with GKE, AKS, EKS, OpenShift, or Rancher, or you may decide to DIY and use kubeadm or kops.

So you brainstorm some requirements and come up with a short list:

  • Cloud hosting
  • High availability (multi-zone)
  • Single sign-on
  • Load balancing
  • Secrets management
  • DNS
  • TLS certificates
  • Logs
  • Metrics
  • CI/CD
  • SaaS databases & message brokers
  • SaaS file servers & blob storage
  • Image registry
  • Persistent storage

It’s a good list! You can do this.

It looks like Kubernetes has integrations for a few of these, but you’re gonna have to install and configure them yourself, if you can track down the doc, tutorial, or Stack Overflow answer with the right sequence of commands. If it’s popular, it’s probably good enough for your proof of concept. No one asked you to build infrastructure anyway; you just need it to deploy your apps.

Six months later you still don’t have all these features dialed in, but you’re already in production, because the business couldn’t wait for perfection.

It’s probably going to take longer than you think.

The Platform Team

So you assemble a crack team of DevOps Kubernetes commandos and throw them at the problem (if you can hire them).

These people know what they’re doing. They know how to do things right. They’ve seen what’s out there and you hired them because of that.

So they go talk to your internal apps teams and SREs and brainstorm and come up with some additional features you’re gonna need:

  • Multiple clusters
  • Automated cluster provisioning
  • API gateways
  • Service mesh
  • Serverless / FaaS
  • Workload isolation
  • Multitenancy
  • Permission management
  • End-to-end encryption
  • Upgrades (node & cluster)
  • Autoscaling
  • Capacity planning
  • Observability
  • Backups
  • Curated service catalog
  • Private VPC networking
  • Private DNS
  • Egress gateways
  • Public key infrastructure
  • Fault domains (zones & regions)
  • Cluster federation
  • Multi-cluster ingress
  • Resource quotas and defaults
  • Usage metering and reporting
  • Vulnerability scanning
  • Image provenance
  • Multiple persistent & ephemeral storage classes
  • Policy enforcement
  • Micro-segmentation

Wow. That’s ambitious.

But commandos don’t just rush into battle. They research and plan and aim to leave no man behind. They’ve been charged with building a platform that isn’t just thrown together. So they have to evaluate the options.

Analysis paralysis sets in.

This platform build is slower, more deliberate, trying to reach the elusive “enterprise production readiness”. But no one ever waits for production readiness before production deployment, and security is more of a dream than something you can check off your todo list. So your best laid plans get waylaid by operational concerns and feature requests before you’ve automated, tested, and secured everything (or maybe anything).

Oh, and even if your boss gave you a year of runway, your coworkers are already using the old Kubernetes platform in production. What’s taking so long?

Your backlog is now years long, a seemingly infinite list of things TODO.

Pretty soon you have a dozen clusters and you’re on-call for all of them, answering support questions left and right, trying to do your incident retrospectives, but falling behind on remediation and feature delivery. You have 10x as many resources but also 10x as many users and you’re hemorrhaging money.

Maybe we shouldn’t have gone from riding horses to riding rockets so quickly. Can we back up?


The Platform Vendor

A few years later…

Hey guys, we got this container platform thing figured out. We should sell this to other people!

By now, your platform team is a platform group or maybe a whole infrastructure organization. Each team is plugging away, adding new tools to your cloud toolbox. You probably have an integration team trying to put the pieces together.

That integration team doesn’t look like the good old platform team, tho. They’re not building for themselves or even people they know anymore. The customer is further away and these platform experts have been specializing in this so long they maybe don’t even know what the customer looks like anymore. You have PMs for that, right?

The more specialized, the more heads down and focused, the more the individual teams start looking like blind men touching different parts of an elephant and “seeing” a completely different big picture.

The more isolated, the more “not invented here” crops up. When the only OSS solutions on the market are half-baked, it’s often easier to build your own competing tool than try to work with another company with different priorities on a project with no governance model yet. More stakeholders, more problems.

Pretty soon your cloud platform is built out of a bunch of Lego pieces your customers have never played with before.

And you _still_ don’t have all those features on the TODO list from when you were just a platform team.

You’ve been too busy selling and dealing with customers. More customers, more problems.

“Instead of an easy yellow brick road leading to Emerald City, cluster and workload operators are faced with a pile of bricks and have to pave their own way.”

The Bigger Picture

Platform building is hard, it turns out.

Every new abstraction is another story built on top of the under-construction-indefinitely software skyscraper in the clouds.

I will say this for Kubernetes as a platform: it’s built to last. In this cloud skyscraper, there are floors that you build on top of and there are floors that you rip out and replace. Kubernetes will probably stick around.

The Kubernetes ecosystem is huge, partly due to early investment in the community and partly due to investment in extensible API machinery. These two factors make it hard to hide the Kubernetes API behind a higher-level abstraction. It’s almost not worth hiding (yet), because your users would lose access to a wealth of flexibility, addons, integrations, and tools.
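
To make “extensible API machinery” a little more concrete, here’s a minimal sketch of a CustomResourceDefinition, the mechanism addons use to teach the Kubernetes API about new resource types. The group `platform.example.com`, the `ManagedDatabase` kind, and its `engine` and `storageGB` fields are all hypothetical names made up for illustration; only the `apiextensions.k8s.io/v1` manifest layout comes from Kubernetes itself.

```python
import json


def build_crd(group: str, kind: str, plural: str) -> dict:
    """Return a minimal CustomResourceDefinition manifest as a plain dict."""
    return {
        "apiVersion": "apiextensions.k8s.io/v1",
        "kind": "CustomResourceDefinition",
        # A CRD's name must take the form <plural>.<group>.
        "metadata": {"name": f"{plural}.{group}"},
        "spec": {
            "group": group,
            "scope": "Namespaced",
            "names": {"kind": kind, "plural": plural, "singular": kind.lower()},
            "versions": [
                {
                    "name": "v1alpha1",
                    "served": True,   # exposed through the API server
                    "storage": True,  # the version persisted in etcd
                    "schema": {
                        "openAPIV3Schema": {
                            "type": "object",
                            "properties": {
                                "spec": {
                                    "type": "object",
                                    # Hypothetical fields, for illustration only.
                                    "properties": {
                                        "engine": {"type": "string"},
                                        "storageGB": {"type": "integer"},
                                    },
                                }
                            },
                        }
                    },
                }
            ],
        },
    }


# Hypothetical group and kind; substitute your own.
crd = build_crd("platform.example.com", "ManagedDatabase", "manageddatabases")
print(json.dumps(crd, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed manifest could be saved to a file and applied to a cluster. The catch is that the definition only registers the type. Something still has to watch for `ManagedDatabase` objects and act on them, and that controller is yet another component someone has to build, install, and operate.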

So instead of hiding Kubernetes, cloud vendors are each building a constellation of components that (hopefully) works together. Fortunately, there’s a lot of competition, which drives innovation and advancement! Unfortunately, each vendor has a bit of a silo problem, making investment decisions based on their previous investments, making it harder for them to collaborate on the bigger picture. There are also a hundred smaller vendors building individual components, each with their own ideas about what the bigger picture looks like.

“The future is already here — it’s just not evenly distributed.”
- William Gibson

Where are we now?

Gartner Hype Cycle (CC BY-SA 3.0)

As you probably know, there’s still a lot of hype around Kubernetes, which means the “trough of disillusionment” is coming, if it’s not already here for you. Maybe in a few years we’ll get to the “plateau of productivity” where most companies won’t need to worry about building their own platform.

But, in the meantime, instead of an easy-to-follow yellow brick road leading to Emerald City, developers are faced with a pile of bricks and have to pave their own way to cloud platform bliss.

Cloud Guy. Anthos Solutions Architect at Google (opinions my own). X-Cruise, X-Mesosphere, & X-Pivotal.