What is a CaaS Platform and why we built one

Every enterprise has a migration story. Ours isn’t about moving to the cloud. It’s about building the thing we wished existed when we started: a Container-as-a-Service platform that makes Kubernetes invisible to the teams shipping code on it.

This isn’t a “we adopted Kubernetes and it was great” post. It’s about why we chose to build a platform, what it looks like under the hood, and what we got wrong along the way.



Why we didn’t just “move to the cloud”

Our starting point was familiar: workloads running on virtual machines in on-premises datacenters. The usual problems.

But these problems compound in ways you don’t see until you’ve lived with them for years:

  • Provisioning delays killed velocity. When a team waits two weeks for an environment, they don’t sit around. They build workarounds — shared staging environments, local hacks, shadow IT. By the time the infrastructure arrives, the team has already shipped something fragile on top of shortcuts. That delay was the root cause of half our technical debt.

  • Every team had invented its own deployment pipeline. There was no standardized path, so teams built their own. Team A used Ansible. Team B had bash scripts. Team C deployed via SSH and prayer. When an incident hit, nobody could help another team debug because no two systems worked the same way.

  • Cost was a black box. VMs running at 10-15% utilization, licenses for software nobody remembered buying. When leadership asked “how much does service X cost us?” the answer was “we don’t know, and we can’t know with our current setup.” That kind of answer erodes trust between engineering and the business fast.

The knee-jerk answer to all of this is “migrate to the cloud.” And we almost made that mistake. Lifting VMs from a datacenter into EC2 instances solves exactly one problem: you no longer manage physical hardware. Every other problem — inconsistent deployments, slow provisioning, cost opacity, security drift — stays exactly the same. You’ve just moved your mess to someone else’s datacenter and started paying by the hour for it.

We needed something higher-level than infrastructure. We needed a platform.



What CaaS actually means (and why the term matters)

The industry talks about “Internal Developer Platform” or “Platform Engineering.” We use Container-as-a-Service because it describes the contract more precisely:

  • As-a-Service means developers don’t manage the underlying infrastructure. They don’t provision clusters, configure CNIs, or debug node issues. The platform team does.
  • Container means the deployment unit is a container image, not a VM, not a JAR file, not a zip. This constraint is intentional — it forces standardization.

The distinction between “running Kubernetes” and “running a CaaS platform” is like giving someone a car engine versus giving them a car. Kubernetes is the engine. Without the rest of the car around it, most people can’t get anywhere.

If your developers need to understand Kubernetes internals to deploy their service, you haven’t built a platform. You’ve built a Kubernetes cluster and called it a platform.

A CaaS platform has a clear interface:

  1. Push code to a Git repository
  2. Open a PR to request infrastructure (namespace, resources, access)
  3. Merge and the platform provisions everything automatically
  4. Observe through built-in dashboards and alerts

Everything else — cluster management, networking, security policies, scaling — is the platform team’s responsibility. This separation is the entire point.



How it compares

| | Traditional DC | Cloud VMs (IaaS) | CaaS Platform |
| --- | --- | --- | --- |
| Provisioning | Days to weeks | Hours (if automated) | Minutes (self-service) |
| Deployment | Team-specific, manual | Semi-automated, still per-team | GitOps, standardized across all teams |
| Scaling | Hardware-bound, slow | Faster but still manual decisions | Automatic (Karpenter, HPA) |
| Cost model | CapEx, opaque | OpEx, per-VM (still coarse) | Per-namespace, per-team, attributable |
| Consistency | None | Possible with effort | Enforced by design |
| Security | Depends on team discipline | Depends on team discipline | Guardrails enforced by the platform |
| Operational burden | High (hardware + OS + app) | Medium (OS + app) | Low for teams (app only) |

Look at the security row. Cloud VMs don’t have weak security because the cloud is insecure — they have inconsistent security because it depends on each team doing the right thing. CaaS makes security enforceable by default instead of optional.



What our CaaS platform looks like

We didn’t try to build everything at once. We started with one cloud provider, one Kubernetes distribution, and expanded once the foundation was solid.



Infrastructure: Terraform on AWS (EKS)

Everything is codified with Terraform. VPC, subnets, IAM roles, IRSA configuration, EKS clusters. No click-ops, no manual steps, no snowflakes.
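
To give a sense of what that codification looks like, here is a trimmed sketch of an EKS cluster definition. It assumes the public terraform-aws-modules/eks module rather than our internal modules, and the names, versions, and sizes are placeholders.

# Hypothetical, trimmed cluster definition.
# Assumes the public terraform-aws-modules/eks module and a "vpc" module
# (terraform-aws-modules/vpc) defined elsewhere in the same stack.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "caas-prod"        # illustrative name
  cluster_version = "1.29"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # IRSA: OIDC provider so pods assume IAM roles instead of node credentials
  enable_irsa = true

  eks_managed_node_groups = {
    platform = {
      instance_types = ["m6i.large"]
      min_size       = 2
      max_size       = 6
      desired_size   = 3
    }
  }
}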

We started with EKS because it’s where we had the most operational experience. Our roadmap includes OpenShift and Tanzu — not because multi-distribution is a goal in itself, but because different parts of the organization have different constraints (compliance, vendor relationships, existing contracts). The challenge is providing the same developer experience regardless of what’s underneath. That’s harder than it sounds, and we’re not there yet.



Delivery: ArgoCD and the App of Apps pattern

Every change goes through Git. We use ArgoCD with the App of Apps pattern as the single source of truth for everything running on the platform.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/jelhaouchi/caas-platform-blueprint.git
    targetRevision: HEAD
    path: gitops/platform
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

This root Application points to a directory of child Applications — each managing a platform addon (cert-manager, ingress, monitoring, network policies). ArgoCD reconciles the desired state continuously.

Why ArgoCD over Flux? Honestly, both would work. We chose ArgoCD because the App of Apps pattern gives us a clean hierarchy (bootstrap → platform → tenants), ApplicationSets make tenant onboarding template-driven, and the UI — while not essential — helps during incident investigation. The decision is documented in an ADR in our blueprint repo.
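
As an illustration of the template-driven onboarding ApplicationSets give us, here is a simplified sketch of a tenant generator. It assumes tenant directories live under gitops/tenants/ in the blueprint repo, which is an illustrative layout rather than the exact structure (that's the next post).

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: tenants
  namespace: argocd
spec:
  generators:
    # One child Application per directory under gitops/tenants/
    - git:
        repoURL: https://github.com/jelhaouchi/caas-platform-blueprint.git
        revision: HEAD
        directories:
          - path: gitops/tenants/*
  template:
    metadata:
      name: 'tenant-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/jelhaouchi/caas-platform-blueprint.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true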



Networking: Cilium as CNI

We chose Cilium over the default AWS VPC CNI. Not because it’s trendy — because we hit real problems with the default.

The practical reasons:

  • eBPF replaces iptables. At scale, iptables-based CNIs degrade. Rule chains grow linearly with the number of services. We were already seeing latency spikes during policy updates on clusters with hundreds of services. Cilium’s eBPF datapath handles this at kernel level without the iptables bottleneck.

  • L7 network policies. Kubernetes-native NetworkPolicy only works at L3/L4 (IP + port). With Cilium, we enforce policies at L7 — allow traffic to /api/v1/health but deny /admin (see the sketch after this list). This matters for zero-trust, and it means teams don’t need a service mesh just for traffic control.

  • Hubble for network observability. Hubble gives us a real-time service map of traffic flows across every namespace. When a developer says “my service can’t reach the database,” we pull up the Hubble flow log and see exactly what’s being denied and why. Before Cilium, that kind of debugging could take hours. Now it takes minutes.
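
To make the L7 point concrete, here is the kind of CiliumNetworkPolicy we mean. It is a simplified sketch with placeholder names and labels, not a policy copied from production: it lets a frontend call GET /api/v1/health on a backend and nothing else.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-health-only   # illustrative name
  namespace: team-a                 # placeholder tenant namespace
spec:
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: GET
                path: "/api/v1/health"
              # no rule matches /admin, so requests to it are denied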

The trade-off: Cilium adds operational complexity. It’s another component to upgrade, monitor, and understand. We accepted that trade-off because the networking problems it solves would have been harder to solve without it.



Tenant model: namespace-as-a-service

Teams don’t get cluster-admin. They get a namespace — pre-configured with RBAC, Cilium network policies, resource quotas, and default monitoring.

Onboarding a new team is a Git pull request:

  1. Add a directory to tenants/ in the GitOps repo
  2. Define the namespace, RBAC roles, network policy, and resource quota
  3. Open a PR. Review. Merge.
  4. ArgoCD provisions everything automatically

No tickets. No Slack messages to the platform team asking “can you create a namespace for us?” The PR is the request. Merge it and everything gets provisioned.
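
For illustration, a tenant directory boils down to a handful of manifests like these. This is a trimmed, hypothetical example (the team name, quota numbers, and group names are placeholders), not the full set the platform generates.

# tenants/team-a/namespace.yaml (hypothetical example)
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    team: team-a
---
# tenants/team-a/quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
---
# tenants/team-a/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-developers
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers          # mapped from the company IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                         # namespace-scoped edit, never cluster-admin
  apiGroup: rbac.authorization.k8s.io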



Observability: OpenTelemetry as the foundation

We went with OpenTelemetry as the instrumentation standard for the entire platform. OTel is still maturing and the collector ecosystem moves fast, so this wasn’t an obvious choice. But we didn’t want vendor lock-in on observability, and we didn’t want teams instrumenting differently depending on which backend happens to be popular that quarter.

The pipeline:

  • OpenTelemetry Collector as a DaemonSet on every node. Collects traces, metrics, and logs from all workloads using a single agent.
  • Prometheus for metrics storage and alerting (OTLP receiver enabled).
  • Grafana for dashboards and visualization.
  • Hubble (from Cilium) for network-level observability — this complements application-level telemetry with infrastructure-level traffic data.

Every tenant namespace gets baseline dashboards and alerting rules automatically. Teams can extend them but can’t remove the baseline. With OpenTelemetry, teams instrument once using open standards. If we decide to swap Prometheus for Mimir or add Tempo for traces, application code doesn’t change.
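
For reference, the collector pipeline is conceptually as simple as the config below. This is a pared-down sketch with placeholder endpoints, assuming an OTLP-enabled Prometheus and a separate trace backend; the real config carries more processors and per-signal tuning.

# Simplified OpenTelemetry Collector config for the node agent (DaemonSet).
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}

exporters:
  # Prometheus with its OTLP receiver enabled (placeholder endpoint)
  otlphttp/prometheus:
    endpoint: http://prometheus.monitoring.svc:9090/api/v1/otlp
  # Hypothetical trace backend (e.g. Tempo); placeholder endpoint
  otlp/traces:
    endpoint: tempo.monitoring.svc:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces]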

One caveat: OpenTelemetry’s logging support is still catching up. Traces and metrics work well. Logs are getting there. We’re running a hybrid approach for now and will consolidate once the ecosystem settles down.

We’re also building a dedicated observability product for data pipelines — a topic for a future post.



Security: enforce, don’t hope

Instead of publishing security guidelines and hoping teams follow them, we enforce them at the platform level:

  • Network policies — Cilium-powered, deny-all by default. Teams explicitly declare what traffic is allowed, including at L7. If you don’t declare it, it’s blocked.
  • RBAC — scoped to namespaces, least-privilege. No team has cluster-wide access.
  • Policy enforcement — OPA/Gatekeeper validates every manifest before it’s applied. No privileged containers. No latest tags. Resource limits required. If a manifest violates policy, it doesn’t deploy — ArgoCD marks it as degraded and the team sees exactly why.

This is the “guardrails, not gates” philosophy. We don’t block teams from shipping. We block them from shipping insecure configurations. There’s a difference.
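
To illustrate the kind of rule Gatekeeper enforces for us, here is a sketch of a ConstraintTemplate that rejects images explicitly tagged :latest. It is simplified and hypothetical rather than our production template, but the structure is representative.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdisallowlatesttag
spec:
  crd:
    spec:
      names:
        kind: K8sDisallowLatestTag
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdisallowlatesttag

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          endswith(container.image, ":latest")
          msg := sprintf("container %v uses the :latest tag", [container.name])
        }
---
# The constraint that applies the template to Pods
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sDisallowLatestTag
metadata:
  name: disallow-latest-tag
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]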



What we learned (including what went wrong)



Start with one cloud, one distribution. Seriously.

Going multi-cloud or multi-distribution from day one is an abstraction trap. You spend months building a compatibility layer for problems that don’t exist yet. We started with AWS and EKS. We built patterns, documented decisions, and validated them in production. Only then did we start planning for OpenShift and Tanzu. If we’d tried to support three distributions from the start, we’d still be building the abstraction layer instead of running workloads.



Developer experience is the product — but adoption is the real battle

Everyone says “developer experience matters.” What they don’t mention is that even with a great platform, teams resist change. They’ve spent years building their own tooling. Their bash scripts work. Their Ansible playbooks are battle-tested. Asking them to drop all of that and trust something they didn’t build is a people problem, not a tech problem.

What worked for us: we didn’t mandate anything. We started with one willing team, made their life visibly better, and let other teams see the result. People came to us. Forcing adoption just creates resentment.



GitOps is non-negotiable — but the repo structure will haunt you

ArgoCD with App of Apps is powerful. It’s also very easy to create a GitOps repo structure that becomes unmaintainable. We restructured ours three times before landing on something sustainable. The lesson: invest time in your repo layout early. Think about how it scales to 50 teams, not just 5. I’ll detail our final structure in the next post.



Cost visibility changed the conversation with leadership

In the datacenter, cost was a CapEx line item nobody questioned. Moving to CaaS made cost visible — and attributable. Per namespace, per team, per workload. Resource requests and limits aren’t just stability controls; they’re the foundation for cost accountability.
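
Concretely, attribution starts with every workload declaring what it needs. A minimal, hypothetical example of the per-container declarations that make usage measurable (the names and numbers are placeholders):

# Hypothetical Deployment fragment: requests drive scheduling and cost
# attribution, limits cap what the workload can consume.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-api          # placeholder service name
  namespace: team-a          # cost rolls up per namespace, per team
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billing-api
  template:
    metadata:
      labels:
        app: billing-api
    spec:
      containers:
        - name: billing-api
          image: registry.example.com/billing-api:1.4.2
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi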

When teams can see what they consume, behavior changes. Over-provisioned services get right-sized — not because we told teams to, but because they could see the waste themselves. If I could go back and give one piece of advice: build cost dashboards into the platform from day one. Nothing earns leadership trust faster.



Platform engineering is product work, not infrastructure work

Your users are developers. If you treat CaaS as an infrastructure project, you’ll build something technically solid that nobody uses. We treat the platform like a product — backlog, user feedback, demos, adoption metrics. The platform team’s success isn’t uptime. It’s how many teams are shipping on it and how fast they can go.



What’s next

This is the first post in a series about building a CaaS platform in production. Coming up:

  • Structuring a GitOps repository for multi-environment Kubernetes — the repo layout we landed on after three rewrites
  • ArgoCD App of Apps at scale — sync waves, health checks, and the gotchas nobody documents
  • DC-to-CaaS migration patterns — the playbook for moving workloads off the datacenter without breaking everything

I’m also open-sourcing the blueprint of our platform architecture:

Production-ready CaaS platform blueprint: multi-cloud Kubernetes (EKS, GKE, AKS) with Terraform, ArgoCD GitOps, and developer self-service

A production-ready Container-as-a-Service platform blueprint for migrating workloads from datacenter to Kubernetes.

Built from real-world experience running CaaS platforms at scale — covering infrastructure provisioning, GitOps delivery, platform addons, and developer self-service.

Target Platforms

| Platform | Status |
| --- | --- |
| AWS (EKS) | In progress |
| OpenShift | Planned |
| Tanzu | Planned |

Architecture

┌─────────────────────────────────────────────────┐
│             Developer Self-Service              │
│         (Namespace provisioning via PR)         │
├─────────────────────────────────────────────────┤
│                 GitOps (ArgoCD)                 │
│          App of Apps · ApplicationSets          │
├─────────────────────────────────────────────────┤
│                 Platform Addons                 │
│   cert-manager · external-dns · ingress-nginx   │
│    karpenter · monitoring · network policies    │
├─────────────────────────────────────────────────┤
│               Kubernetes Cluster                │
│             EKS · OpenShift · Tanzu             │
├─────────────────────────────────────────────────┤
│           Infrastructure (Terraform)            │
│         VPC · IAM · Node Groups · IRSA          │
└─────────────────────────────────────────────────┘

Repository Structure


├── terraform/
│   └── aws/              # EKS cluster + networking + IAM
├── gitops/
│   ├── bootstrap/        # ArgoCD app-of-apps
│   ├── platform/         # Cluster addons (cert-manager, ingress,

Terraform, ArgoCD, Cilium, and tenant self-service. The scaffold we wish we had when we started.

If you’re doing something similar — migrating off a datacenter, building an internal platform, trying to make Kubernetes disappear for your developers — I’d genuinely like to hear how it’s going. Reach out on LinkedIn or follow my blog for the next posts.


