Chapter 1: The Platform Problem
A new team joins your organization. They’ve been hired to build a payments service. They’re experienced engineers — they’ve built distributed systems before, they know Kubernetes, they’ve deployed to production at their previous company. On day three, they ask the question every new team asks:
“How do I deploy?”
The answer takes an hour to explain. There’s a wiki page that describes the “standard deployment process.” It links to a Helm chart maintained by the infrastructure team. The chart has 247 configurable values. There’s a Slack channel called #deploy-help where you can ask questions.
Assume the best case. The wiki is current. The chart is well-maintained. The pipeline is clean. The infrastructure team is responsive. Even in this best case, the payments team must learn which of the 247 values they need to set, which security features to enable, how to configure their secret references in the chart’s expected format, and how to declare their network dependencies. They must make a series of decisions — each small, each reasonable — that collectively determine whether their service is correctly deployed.
The payments team gets their service deployed. Maybe it takes a day; maybe two. (The timeline varies — some organizations have this down to an hour, others take a week. The point is not the duration.) It runs. But nobody can answer basic questions about it with certainty: Is the service’s traffic encrypted? Is it being scraped for metrics? Could it access secrets belonging to other teams if it tried? Is the container image verified? The honest answer to all of these questions is “probably, if the process was followed correctly.” And “probably” is not an answer that survives a security audit.
This is the platform problem. Not the absence of tools — the organization has plenty of tools. The problem is the absence of coherent abstraction. There is no single system that accepts the team’s intent (“I need to run this service with these dependencies”) and produces the full set of infrastructure required to run it correctly, securely, and observably. Instead, there is a collection of tools, documents, and tribal knowledge that the team must assemble into a working deployment, manually, every time.
This chapter examines why this problem exists, how organizations typically try to solve it, and why those solutions fall short. The rest of the book proposes an alternative.
1.1 Kubernetes Is Not a Platform
Kubernetes is the most successful infrastructure project of the last decade. It provides a declarative API for container orchestration: you describe the state you want (Deployments, Services, ConfigMaps), and Kubernetes converges the cluster toward that state through reconciliation. It handles scheduling, basic service discovery, rolling updates, and health checks. Its API is extensible through Custom Resource Definitions, which allow anyone to add new resource types and controllers.
None of this makes Kubernetes a platform.
A platform answers the question “how does a team go from source code to a running, production-grade service?” Kubernetes answers a much narrower question: “given a set of containers and resource descriptions, how do we schedule and run them?” The distance between these two questions is enormous, and it is filled — in most organizations — by a patchwork of scripts, charts, pipelines, and documentation.
Consider what Kubernetes provides and what it does not:
Kubernetes provides container scheduling, service discovery via DNS, declarative resource management, rolling updates, health-check-based restarts, RBAC for API access, and an extensible API.
Kubernetes provides basic versions of many platform capabilities, but not the integrated, enforced versions that production at scale demands. It has Secrets — but no integration with external secret stores, no rotation, and no cross-service access auditing. It has NetworkPolicy — but enforcement requires a CNI that implements it (many default configurations do not), and it covers only L3/L4. It has ResourceQuota — but no compile-time enforcement, no cost estimation, and no team-level governance. It has RBAC — but no workload-level authorization (can this service access this secret? can this image be deployed?). It has none of: service mesh, mutual TLS, identity-based authorization, distributed tracing, compliance verification, or multi-cluster coordination.
This is not a criticism of Kubernetes. Kubernetes is infrastructure — it provides primitives and an extensible API. The gap between those primitives and a production-grade deployment is the gap that a platform fills. Many organizations fill it successfully with combinations of tools. The question this book asks is: can you fill it with guarantees?
To make this concrete: a Deployment resource gets your containers running on a node. It says nothing about whether those containers should be allowed to communicate with other services, whether their secrets are sourced from an appropriate backend, whether their metrics are being collected, or whether the team has budget for the resources they’ve requested. A Deployment is a scheduling primitive. A production-grade deployment requires a dozen more resources — security policies, secret sync, observability targets, availability protections — that Kubernetes does not derive from the Deployment.
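To see how little a Deployment actually commits to, here is roughly the smallest manifest that gets a container scheduled (names are illustrative, not from any real service):

```yaml
# A minimal Deployment: enough to schedule a container, nothing more.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: api
          image: registry.example.com/payments:1.0.0
# Kubernetes will happily run this. Nothing here restricts network
# traffic, syncs secrets, collects metrics, or protects availability
# during node drains.
```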
A platform fills this gap. The developer declares a service, and the platform produces the Deployment and everything else required to run it correctly. Section 1.4 shows this concretely.
The distinction between Kubernetes and a platform is the distinction between primitives and opinions. Kubernetes provides primitives: here are the building blocks, assemble them as you see fit. A platform provides opinions: here is how things are built, and the opinions are enforced by software, not by documentation.
1.2 How Teams Fill the Gap Today
Every organization that adopts Kubernetes eventually confronts the gap between primitives and platform. And every organization fills it somehow. Many fill it successfully — there are organizations running hundreds of services on Helm charts and ArgoCD that deploy reliably, pass security audits, and keep their engineers productive.
This section is not about whether those approaches work. They do. It’s about what they can’t guarantee. Each approach has a ceiling — a point where “the process was followed correctly” stops being good enough, because scale, regulation, or consistency requirements demand something stronger than process.
The wiki-driven platform
The simplest approach: document the deployment process and expect teams to follow it. The wiki describes how to write a Deployment manifest, how to add a NetworkPolicy, how to configure a ServiceMonitor, how to set up secrets. Each page is accurate when written. In aggregate, they describe a correct deployment.
This approach fails for three reasons. First, documentation drifts. The wiki is written against version 1.24 of Kubernetes with Calico for CNI; the cluster is now running 1.29 with Cilium. Nobody updated the NetworkPolicy examples. Second, documentation is ambiguous. “Add a network policy for your service” — what kind? With what selectors? Allowing traffic from where? The wiki can’t answer these questions for every service; it gives a template and hopes for the best. Third, documentation cannot enforce. The wiki says every service must have a NetworkPolicy. There is no mechanism to verify that it does. Services without policies run fine. Nobody notices until the security audit.
Consider a specific failure. The wiki says to add a NetworkPolicy for every service. Team A follows the instructions and creates a policy that allows ingress from the frontend namespace. Six months later, the frontend moves to a different namespace as part of a reorganization. Team A’s NetworkPolicy still references the old namespace. Traffic stops flowing. The debugging takes four hours because the team doesn’t realize the issue is in the NetworkPolicy — they assume the application is broken. Nobody updates the wiki to mention that NetworkPolicies must be updated when namespaces change, because nobody thinks of it as a wiki problem.
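A sketch of that stale policy (namespace and label names are hypothetical) shows why the failure is silent: the old namespace is pinned in a selector, and nothing flags it when the frontend moves.

```yaml
# Team A's NetworkPolicy, written per the wiki. Allowing ingress from
# the "frontend" namespace was correct at the time of writing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # After the reorg the frontend lives in "web-frontend";
              # this selector silently matches nothing, and all
              # frontend traffic is dropped.
              kubernetes.io/metadata.name: frontend
```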
This is not a failure of discipline. It is a failure of architecture. The wiki described a point-in-time correct configuration. It has no mechanism to track the ongoing validity of configurations it inspired. The wiki can tell you what to do. It cannot tell you when what you did is no longer correct.
The wiki-driven platform is governance by honor system. It works in small, high-trust teams where everyone reads the wiki and follows it carefully. It fails at scale, when the number of services exceeds the attention span of the infrastructure team.
The CI pipeline platform
A more sophisticated approach: encode the deployment process in the CI/CD pipeline. Each stage of the pipeline performs a task that the wiki would have described: lint the manifests, inject network policies, scan the container image, validate resource limits, deploy to the cluster. The pipeline is the platform.
This is an improvement over the wiki because the pipeline can enforce. A missing NetworkPolicy isn’t a documentation gap — it’s a failed pipeline stage. Image scanning isn’t optional — it’s a gate. The pipeline makes the deployment process executable and repeatable.
But the pipeline platform has its own failure modes. Pipelines are opaque: a 500-line Jenkinsfile or a 20-stage GitHub Actions workflow is not a reviewable abstraction. When a stage fails, diagnosing the problem requires understanding the pipeline’s internal logic, which is typically written by a different team, in a different language, with different conventions than the application code. Pipelines run at deploy time only: between deploys, the cluster state can drift from what the pipeline produced, and nobody will know until the next deploy (or the next incident). And pipelines are fragile: one broken stage blocks all deployments, and fixing it requires understanding a system that most developers treat as a black box.
There is a subtler problem with pipeline platforms that becomes visible over time: the pipeline encodes institutional knowledge in an opaque, procedural format. When the engineer who wrote the inject-policies stage leaves the organization, the stage becomes a black box. It still runs. Nobody dares modify it. New requirements are added as new stages rather than modifications to existing ones, because modification requires understanding and understanding requires archaeology. After two years, the pipeline is 30 stages long, takes 25 minutes to run, and the team that maintains it spends more time debugging the pipeline than improving it.
Perhaps most importantly, the pipeline platform doesn’t change the abstraction level. The developer still writes Kubernetes manifests (or Helm values, or Kustomize overlays). The pipeline validates and augments those manifests, but the developer is still authoring in the language of Kubernetes primitives. If the developer doesn’t understand what a NetworkPolicy is, they can’t write one — and the pipeline can only inject a generic one or reject the deployment.
Consider the alternative: if a platform controller handles derivation, the CI pipeline simplifies to kubectl apply -f service.yaml (or a git commit that triggers the same). All the complexity — policy evaluation, secret resolution, network policy generation — moves from the pipeline into the controller, where it can be unit-tested, snapshot-tested, and reconciled continuously rather than running once at deploy time.
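Under that model, the deploy stage of a CI workflow could plausibly shrink to something like this (a hypothetical GitHub Actions sketch; the controller, not the pipeline, does the real work):

```yaml
# Hypothetical sketch: the pipeline only delivers the spec.
# Policy evaluation, secret resolution, and network policy
# generation all live in the platform controller, which also
# reconciles continuously between deploys.
deploy:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Apply service spec
      run: kubectl apply -f service.yaml
```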
The Helm chart platform
Helm is the package manager for Kubernetes. A Helm chart is a template that produces Kubernetes manifests from a values file. The Helm chart platform works by providing a “blessed” chart — maintained by the infrastructure team — that encodes the organization’s deployment standards. Teams customize it through values.
This approach has real appeal. The chart encodes opinions (every service gets a NetworkPolicy, every service gets a ServiceMonitor) and teams customize the parts they care about (image, replicas, resource limits). The chart is versioned, so updates to the deployment standard can be rolled out by bumping the chart version.
The failure mode is in the values file. Helm supports JSON Schema validation (values.schema.json), which gives you required fields, type checks, enum constraints, and — with if/then blocks — even simple conditional requirements. This is real validation — better than nothing. But it’s runtime validation of data shapes, not a type system. JSON Schema can tell you replicas must be an integer; it cannot track a dependency declaration through policy evaluation, network policy generation, and bilateral matching. The gap between “this field is the right type” and “this configuration produces correct infrastructure” is the gap a programming language closes and a schema language cannot.
Even with schema validation, the abstraction problem remains. The developer is still filling in infrastructure fields — networkPolicy.egress, serviceMonitor.interval, podDisruptionBudget.maxUnavailable — that describe Kubernetes resources, not service intent. Policy-as-code tools (OPA/Gatekeeper, Kyverno) can block bad configurations at admission time, and they work. But they solve enforcement, not authoring. The developer writes the values file; the admission controller rejects the wrong ones. This is better than no enforcement, but the developer is still responsible for knowing what “correct” looks like in Kubernetes terms.
Furthermore, Helm values compose poorly. When the chart grows to 200+ values to accommodate every team’s needs, the values file becomes its own configuration language — one without types, validation, documentation, or tooling. Two teams with different values files for the same chart may produce radically different Kubernetes manifests, and the only way to understand the difference is to template both and diff the output.
There is also a versioning problem that compounds over time. The infrastructure team releases chart version 2.0 with a new NetworkPolicy model. Teams must update their values files to match the new schema. Some teams update immediately. Some update next quarter. Some never update because their service is “stable” and they don’t want to risk a deployment change. After a year, the organization is running three versions of the chart across its fleet, each with different security properties, and the infrastructure team must maintain backward compatibility with all of them.
The Helm chart platform is a template masquerading as an abstraction. It reduces the amount of Kubernetes YAML a developer must write, but it doesn’t change the fundamental model: the developer is still configuring Kubernetes primitives, just through an indirection layer.
The GitOps platform
GitOps tools (ArgoCD, Flux) synchronize a git repository with a Kubernetes cluster. The repository contains manifests; the tool applies them to the cluster and reconciles drift. If someone manually edits a resource in the cluster, the GitOps tool reverts it to match the repository.
GitOps solves a real problem: deployment delivery and drift detection. It is a sound engineering practice for managing the flow of manifests from source to cluster. What it does not solve is manifest authoring. The manifests in the git repository are still Kubernetes primitives — Deployments, Services, NetworkPolicies — written by developers. GitOps changes how manifests reach the cluster. It does not change what the manifests contain or who is responsible for their correctness.
A team using GitOps with raw manifests still faces every problem described above: they must know to add a NetworkPolicy, they must configure secrets correctly, they must set up observability. The difference is that their mistakes are version-controlled and automatically applied.
GitOps is sometimes described as providing “drift detection” — and it does, for a narrow definition of drift. If someone edits a Deployment in the cluster by hand, ArgoCD will revert it. But GitOps does not detect specification drift: the gap between what the manifests say and what they should say. If the manifests in git are missing a NetworkPolicy, ArgoCD will faithfully ensure that the cluster also has no NetworkPolicy. GitOps enforces the repository as the source of truth. It does not evaluate whether the source of truth is correct.
The common thread
All four approaches are additive. They add tools, process, and automation on top of Kubernetes without changing the abstraction level. The developer is still authoring Kubernetes primitives. The tools help — they validate, template, deliver, and reconcile — but the fundamental unit of work is still the Kubernetes manifest.
This is where they hit their ceiling. The more services you have, the more manifests, the more values files, the more pipeline stages. Complexity grows linearly with service count, because each service requires a human to make the same set of decisions: what network policies, what secret backend, what observability configuration, what scaling parameters.
A platform should eliminate the decisions that don’t need to be made by the developer — while preserving the decisions that do. This distinction is critical. There are decisions that only the developer can make: what image to run, what resources it needs, what other services it depends on, what secrets it reads. These are inherent to the service and cannot be derived. Then there are decisions that are identical for every service: which network policy format to use, which secret backend to target, whether to collect metrics, how to configure mTLS. These are the platform’s decisions. A platform eliminates the second category by making them automatic — not by removing them, but by taking them off the developer’s plate.
1.3 What a Platform Actually Is
The word “platform” means everything from “a Kubernetes cluster” to “a team that manages infrastructure.” Here, it means something specific:
A platform is an opinionated abstraction layer that accepts high-level intent and produces the full set of infrastructure required to satisfy that intent safely, securely, and observably.
Opinionated
The platform makes choices. It selects a secret management backend (Vault, AWS Secrets Manager, or something else). It selects a network policy model (bilateral agreements, unilateral ingress rules, or something else). It selects a metrics stack, a service mesh, a certificate authority, a compliance framework.
These choices are not exposed to the developer as options. The developer does not choose which secret backend to use — they declare that their service needs a secret, and the platform routes the request through its chosen backend. The developer does not choose which network policy model to use — they declare their service’s dependencies, and the platform generates the appropriate policies.
Opinions are what make the platform useful. A platform without opinions is a toolkit — it provides capabilities but requires the user to assemble them. A platform with opinions provides outcomes: declare what you need, and the platform delivers it through a coherent, tested, maintained stack.
This is uncomfortable for some engineers. Opinions feel restrictive. “What if I want to use a different secret backend?” The platform’s answer: the default path uses the chosen backend, and that path is automatic, tested, and maintained. If a team has a genuine need that the default path doesn’t serve — and it happens — they can deviate through an explicit escape hatch (Chapter 11). But deviation is visible, auditable, and requires justification. The default is the opinion. Exceptions are possible but never silent.
The point of being opinionated is not to prevent all flexibility. It is to make the common case effortless and the exceptional case explicit. A developer who follows the default path doesn’t think about secret backends. A developer who needs something different goes through a gate that says “I am deviating from the platform’s default, and here’s why.”
The strength of an opinion is proportional to the investment behind it. A platform that “chooses Vault” but only provides a wiki page on how to set it up has a weak opinion. A platform that compiles secret references in your service spec into the correct Vault-backed ExternalSecret resources, with rotation, authorization, and audit logging, has a strong opinion. Strong opinions reduce the cognitive load on developers because the opinion is backed by automation, not documentation.
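As a sketch of what a strong opinion looks like in practice (the spec fragment, resource names, and backend configuration here are hypothetical), a one-line secret declaration might compile into a complete ExternalSecret:

```yaml
# What the developer declares (hypothetical platform spec fragment):
#   resources:
#     orders-db:
#       type: secret
#       params:
#         keys: [dsn]
#
# What the platform might generate from it:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-orders-db
  namespace: commerce
spec:
  refreshInterval: 1h            # rotation: periodically re-sync from the backend
  secretStoreRef:
    kind: ClusterSecretStore
    name: platform-vault         # the platform's chosen backend, not the developer's
  target:
    name: payments-orders-db     # the Kubernetes Secret the pod will mount
  data:
    - secretKey: dsn
      remoteRef:
        key: commerce/payments/orders-db
        property: dsn
```

The developer never sees the store reference, the refresh interval, or the remote key layout; those are the opinion, encoded in the compiler.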
Abstraction layer
The platform hides complexity that developers should not manage. This is not a statement about capability — a senior infrastructure engineer could write a CiliumNetworkPolicy, an Istio AuthorizationPolicy, and an ExternalSecret. The question is whether they should.
Writing a correct CiliumNetworkPolicy requires understanding Cilium’s endpoint identity model, label selectors, and the distinction between ingress and egress rules. Writing a correct AuthorizationPolicy requires understanding SPIFFE identities (cryptographic service identities, covered in Chapter 15), Istio’s policy evaluation order, and the ambient mesh’s waypoint proxy model (which replaces sidecars with per-node and per-service proxies). Writing a correct ExternalSecret requires understanding the ExternalSecrets Operator’s store model, template syntax, and the five different routing paths for secret data (Chapter 9).
Any one of these is learnable. All of them together, across every service, maintained consistently, updated when the underlying tools change — this is not a reasonable expectation for an application developer. The platform absorbs this complexity. The developer declares intent (“my service talks to the payments service and reads the orders secret”), and the platform generates the correct resources in the correct format for the current versions of the current tools.
This is not dumbing things down. It is separation of concerns. The platform team understands Cilium, Istio, and ESO deeply and encodes that understanding in the compilation pipeline. Application teams understand their services deeply and express that understanding in the service spec. Each team works at the abstraction level appropriate to their expertise.
High-level intent
The input to the platform is not Kubernetes YAML. It is a declaration of what the developer needs, expressed in the platform’s own language — its CRD schema.
A service spec might say: I have a container, it needs 500m CPU and 512Mi memory, it talks to the payments service, it reads from the orders-db secret, I want 3 replicas. This is 15 lines. It contains everything the platform needs to produce a complete deployment.
What the developer does not specify: label selectors, pod template metadata, service account names, network policy rules, authorization policy principals, external secret store references, scrape target configurations, pod disruption budget parameters, or scaling object metrics. These are all derived by the platform from the 15-line spec, the platform’s configuration, and the platform’s opinions.
The art of designing a good CRD schema — the platform’s input language — is the subject of Chapter 4. For now, the principle: the input should capture the developer’s intent at the level of abstraction at which they think about their service. Not lower (Kubernetes primitives) and not higher (a vague “deploy my thing” button that hides all meaningful configuration).
Full set of infrastructure
The platform produces everything, not just the Deployment. This is the property that distinguishes a platform from a template. A Helm chart might produce a Deployment and a Service. A platform produces:
- A Deployment (with the correct resource requests, health checks, and security context).
- A Service (with the correct selectors and ports).
- Network policies at L4 (CiliumNetworkPolicy, restricting which pods can communicate).
- Authorization policies at L7 (Istio AuthorizationPolicy, restricting which identities can make which requests).
- External secrets (syncing credentials from the secret backend into Kubernetes Secrets).
- A metrics scrape target (VMServiceScrape, so the service is observable from the moment it starts).
- A pod disruption budget (protecting the service during node drains and cluster upgrades).
- Autoscaling resources (KEDA ScaledObject or HPA, if the spec includes scaling parameters).
The developer asked for none of these except the Deployment. The platform produced all of them because they are required for a production-grade deployment. If any of them were missing, the deployment would be insecure, unobservable, fragile, or ungoverned. The platform’s job is to produce deployments that are correct by construction — not deployments that are correct if the developer remembered every step.
This extends beyond deployment into Day 2 operations. When the on-call engineer investigates a latency spike at 3 AM, the metrics are already there (the VMServiceScrape was derived at deploy time). When the developer needs to scale, the autoscaling resources exist. When a node is drained during an upgrade, the PDB protects availability. The platform produces not just the deployment but the operational infrastructure that makes the service livable in production.
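One of these derivations, sketched (the naming convention and the derivation rule are hypothetical): from a replica count alone, the platform can compute the disruption budget that protects the service during drains.

```yaml
# Derived from the service spec's "replicas: 3" -- the developer never
# writes this resource. minAvailable = replicas - 1 is one plausible
# derivation rule a platform might adopt.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments
  namespace: commerce
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payments
```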
Safely, securely, and observably
These properties are not features to be added in a later release. They are constraints on the compilation output. Every resource the platform produces satisfies the organization’s security requirements. Every service is observable. Every deployment is protected against common failure modes. These are not optional — they are the minimum acceptable output of the platform.
This is a strong claim, and it has consequences. It means the platform cannot produce a Deployment without a corresponding NetworkPolicy (because the security requirements demand default-deny networking). It means the platform cannot produce a service without a metrics scrape target (because the observability requirements demand universal metrics collection). It means the platform must evaluate authorization policies before producing any output (because the security requirements demand compile-time authorization).
These constraints make the platform harder to build. They also make it worth building. A platform that produces Deployments without security, observability, or governance is a deployment tool, not a platform. The value of a platform is precisely that it guarantees these properties for every service, automatically.
1.4 What Changes When You Have a Platform
Return to the payments team from the opening of this chapter. In the original scenario — even assuming well-maintained tooling — they spent time learning the chart’s 247 values, enabling the right security features, and configuring their secret references. They got it deployed, but nobody could answer the security questions with certainty.
Now imagine they’re deploying to an organization with a platform. They write this:
```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeService
metadata:
  name: payments
  namespace: commerce
spec:
  replicas: 3
  workload:
    containers:
      api:
        image: registry.example.com/payments@sha256:a1b2c3...
        variables:
          DATABASE_URL: "postgres://app:${resources.payments-db.dsn}@db:5432/orders"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 512Mi
  resources:
    orders:
      type: service
      direction: outbound
    payments-db:
      type: secret
      params:
        keys: [dsn]
    stripe:
      type: secret
      params:
        keys: [api-key]
```

They apply it. Within 60 seconds, the platform has:
- Verified that the container image is signed by a trusted key and referenced by digest.
- Evaluated authorization policies: is this service permitted to access the payments-db and stripe secrets? Is it permitted to deploy this image?
- Compiled a Deployment with the correct security context, health checks, and resource requests.
- Compiled network policies: the payments service can talk to the orders service (and the orders service must also declare that it accepts traffic from payments — bilateral agreement). No other traffic is permitted.
- Compiled external secrets that sync the database DSN and Stripe API key from the organization’s secret backend into Kubernetes Secrets.
- Created a metrics scrape target so the service’s request rate, error rate, and latency are collected from the moment the first pod starts.
- Created a pod disruption budget ensuring at least 2 of 3 replicas are available during node drains.
- Reported the result in the CRD’s status: phase Ready, 7 compiled resources, all authorization gates passed.
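The bilateral agreement in the network-policy step implies a matching declaration on the other side. Hypothetically, mirroring the payments spec, the orders service’s own spec would carry something like:

```yaml
# In the orders service's spec (hypothetical shape): orders must
# explicitly accept the connection, or the platform compiles no
# policy and the traffic is denied by default.
spec:
  resources:
    payments:
      type: service
      direction: inbound
```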
The payments team didn’t write a NetworkPolicy. They didn’t configure a ServiceMonitor. They didn’t create an ExternalSecret. They didn’t think about pod disruption budgets. They declared their service — its containers, its dependencies, its secrets — and the platform handled the rest.
An astute reader might ask: “How is this different from a good Helm chart? I could write a chart that also produces all these resources from a values file.”
The difference becomes clear when you look at what the developer didn’t do. Look at the CRD spec again. There is no networkPolicy.enabled: true. There is no serviceMonitor.enabled: true. There is no podDisruptionBudget.maxUnavailable: 1. The developer declared their service — containers, dependencies, secrets — and the platform provided security, observability, and availability as automatic properties. These are not features the developer configured. They are features the platform provides.
This is the critical distinction: a platform delivers features as automatic properties of deployment, not as options the developer must choose.
Notice what’s in the spec and what isn’t. The developer declared orders as a type: service dependency with direction: outbound. They declared payments-db and stripe as type: secret dependencies. They didn’t mention NetworkPolicies — the platform compiles them from the service dependency declarations. They didn’t mention ExternalSecrets — the platform compiles them from the secret declarations. They used ${resources.payments-db.dsn} in a connection string, and the platform routes it through the correct secret path (mixed-content interpolation in this case, because the secret is embedded in a larger string).
Compare this to a Helm chart for the same service. The chart might have 247 values. Among them:
```yaml
networkPolicy:
  enabled: true        # developer must know this exists and set it
  egress:
    - to: orders       # developer must know the correct format
      port: 8080       # developer must know the correct port
serviceMonitor:
  enabled: true        # developer must know this exists and set it
podDisruptionBudget:
  enabled: true        # developer must know this exists and set it
  maxUnavailable: 1    # developer must know the correct value
```

Each of these enabled: true flags represents a feature the developer must explicitly opt into. The chart offers network security — but the developer must request it. The chart offers observability — but the developer must turn it on. The chart offers availability protection — but the developer must configure it. Every enabled flag is a question the developer must answer correctly, and the default answer for anything they don’t know about is “off.”
None of this makes Helm wrong. Many organizations build real platforms on Helm charts and operate them successfully for years. A well-designed chart with sensible defaults, JSON Schema validation, and a GitOps delivery pipeline is a legitimate platform — it works, it ships, it covers 80% of teams’ needs. The argument is not that Helm fails. The argument is that Helm is a package manager, and at some point the platform team hits problems that package management cannot solve.
Helm templates manifests from values and hands them to kubectl apply. It is good at last-mile rendering: taking a set of decisions and producing the YAML that implements them. What it is not is an orchestration layer. It cannot reason across services — it has no access to another service’s spec when generating your NetworkPolicy. It cannot evaluate a dependency declaration against a second service’s inbound policy. It cannot re-derive infrastructure when a platform-wide default changes, because it only runs at install or upgrade time. These are not limitations of a bad chart. They are limitations of templating as a paradigm. A programming language closes this gap not by being “more disciplined” but by being a fundamentally different tool: one that can hold state, enforce invariants at compile time, and implement cross-cutting logic that spans the entire fleet.
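The cross-service reasoning that templating cannot perform can be made concrete. This Go sketch uses invented types — `ServiceSpec`, `EgressRule`, and the `AllowInbound` field are assumptions for illustration, not the reference implementation. The point it demonstrates: deriving one service's egress rules requires reading the *target* service's spec, which a chart rendering a single release in isolation has no access to.

```go
package main

import "fmt"

// Minimal stand-ins for platform spec types; the names are illustrative.
type ServiceSpec struct {
	Name         string
	Port         int
	AllowInbound []string // services permitted to call this one
	Dependencies []string // outbound service dependencies
}

// EgressRule is a simplified stand-in for a NetworkPolicy egress entry.
type EgressRule struct {
	To   string
	Port int
}

// deriveEgress is the step a template cannot take: while compiling the
// caller's policy it consults the target's spec for the port, and checks
// the target's inbound allow-list so one-sided declarations are rejected.
func deriveEgress(caller ServiceSpec, fleet map[string]ServiceSpec) ([]EgressRule, error) {
	var rules []EgressRule
	for _, dep := range caller.Dependencies {
		target, ok := fleet[dep]
		if !ok {
			return nil, fmt.Errorf("%s depends on unknown service %q", caller.Name, dep)
		}
		if !contains(target.AllowInbound, caller.Name) {
			return nil, fmt.Errorf("%s does not allow inbound traffic from %s", dep, caller.Name)
		}
		rules = append(rules, EgressRule{To: dep, Port: target.Port})
	}
	return rules, nil
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	fleet := map[string]ServiceSpec{
		"orders": {Name: "orders", Port: 8080, AllowInbound: []string{"payments"}},
	}
	payments := ServiceSpec{Name: "payments", Dependencies: []string{"orders"}}
	rules, err := deriveEgress(payments, fleet)
	fmt.Println(rules, err)
}
```

Because this is ordinary code operating over the whole fleet's specs, it can also re-run whenever any spec or platform default changes — exactly the continuous re-derivation that install-time templating cannot provide.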
There is also a tooling gap that is widening. LLMs are remarkably good at generating, reviewing, and refactoring code in general-purpose languages — Rust, Go, TypeScript — because their training data contains millions of examples. LLMs are measurably worse at Helm templates, which combine Go’s text/template syntax, Sprig functions, YAML indentation sensitivity, and implicit context (.Values, .Release, .Chart) into a format that is hostile to both humans and models. As AI-assisted development becomes the norm, the gap between “write a Rust function that derives a NetworkPolicy” and “write a Helm template that conditionally includes a NetworkPolicy with correct YAML indentation” will only grow. Chapter 3 covers this in detail.
On the platform, these are not questions. They are facts. Every service has network policies — not because the developer asked for them, but because the platform produces them from the service’s declared dependencies. Every service has metrics collection — not because the developer enabled a flag, but because the platform wires observability into every deployment. Every service with multiple replicas has a disruption budget — not because the developer configured one, but because the platform calculates it from the replica count.
The shift is from opt-in features to automatic properties:
- Opt-in: “I need a service. I also need to enable the NetworkPolicy. I also need to enable the ServiceMonitor. I also need to configure a PDB. I also need to create ExternalSecrets.” Each capability is available but must be explicitly requested. Forgetting one produces a deployment with a gap. (Note: this is still declarative — the developer declares networkPolicy.enabled: true. But the capability is only present if the developer opts in.)
- Automatic: “I need a service with these dependencies and these secrets.” Security, observability, and availability are properties of every deployment on the platform. There are no flags to enable because these capabilities are not optional — they are consequences of deploying to the platform.
A reasonable objection: “Some services genuinely don’t need all of these capabilities. A batch job doesn’t need a ServiceMonitor. An internal dev tool doesn’t need a PDB. Making everything mandatory is over-engineering.”
This is a fair point, and the platform should handle it. The answer is not to make features optional (which returns to the opt-in model and its gaps). The answer is to make features context-sensitive. The platform generates a PDB for services with replicas > 1 and omits it for single-replica services. The platform generates a ServiceMonitor for long-running services and a different observability resource (or none) for batch jobs. The platform adjusts its output to the workload type — but the adjustment is the platform’s logic, not the developer’s configuration choice. The developer doesn’t disable the PDB; the compiler decides a PDB isn’t applicable.
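Expressed as compiler logic, the decision is a few lines. A minimal sketch, assuming a two-field `PDB` stand-in and a simple "allow one disruption" policy; both are assumptions, not the platform's actual rules:

```go
package main

import "fmt"

// PDB is a simplified stand-in for a PodDisruptionBudget.
type PDB struct {
	MaxUnavailable int
}

// derivePDB decides applicability from the workload itself rather than
// exposing an enabled flag. The "allow one disruption" policy and the
// workload-type strings here are assumptions for illustration.
func derivePDB(workloadType string, replicas int) *PDB {
	if workloadType != "service" || replicas <= 1 {
		return nil // not applicable: there is nothing for the developer to disable
	}
	return &PDB{MaxUnavailable: 1}
}

func main() {
	fmt.Println(derivePDB("service", 3)) // &{1}
	fmt.Println(derivePDB("service", 1)) // <nil>
	fmt.Println(derivePDB("batch", 10))  // <nil>
}
```

The nil return is the whole point: omitting the resource is the compiler's decision, encoded once for the fleet, not a configuration choice repeated per service.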
Part VI of this book addresses workload types (batch jobs, GPU workloads, model serving) that need different compilation outputs. The principle is consistent: the platform’s features are automatic, but what “automatic” produces depends on the workload.
This is what it means for the platform to be a set of features rather than a set of tools. The developer doesn’t assemble infrastructure from components. They declare their service, and the platform provides the infrastructure as a package — one in which security, observability, and governance are inseparable from the deployment itself.
If the developer had tried to deploy an unsigned image, the platform would have rejected it before creating any resources, with an error message explaining which trust policy was violated. If they had tried to access a secret they weren’t authorized for, the platform would have reported which policy denied the access. If they had requested more CPU than their team’s quota allows, the platform would have told them the quota, the current usage, and how much was available. These are not checks the developer runs — they are features of the platform that fire automatically.
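An admission check of this shape is ordinary code. The function below is a sketch: the signature, millicore units, and message wording are assumptions, but it shows how a compile-time rejection can carry the quota, the current usage, and the headroom in the error itself.

```go
package main

import "fmt"

// checkQuota sketches a compile-time admission check. On rejection, the
// error is the documentation: it states the quota, what is in use, and
// how much remains. All names and units here are illustrative.
func checkQuota(team string, quotaMilli, usedMilli, requestedMilli int) error {
	if usedMilli+requestedMilli <= quotaMilli {
		return nil
	}
	return fmt.Errorf(
		"team %s: requested %dm CPU, but quota is %dm with %dm in use (%dm available)",
		team, requestedMilli, quotaMilli, usedMilli, quotaMilli-usedMilli,
	)
}

func main() {
	fmt.Println(checkQuota("payments", 4000, 3500, 1000))
}
```

Because the check runs before any resources are created, the developer sees this message at apply time rather than discovering a half-deployed service later.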
The questions from the opening — is traffic encrypted? is the service observable? could it access other teams’ secrets? — all have definitive answers. Yes, because the mesh enforces mTLS on all services deployed to the platform. Yes, because the platform includes observability as a feature of every deployment. No, because the platform includes authorization as a feature of every secret access. The answers are not “yes, if the developer configured it correctly.” They are “yes, because the platform provides it.”
This is the fundamental shift. Without a platform, infrastructure is something the developer builds — imperatively, service by service, feature by feature. With a platform, infrastructure is something the developer receives — as automatic properties of deploying to a system that provides security, observability, and governance as features, not options.
Chapter 2 develops the principle behind this shift — the developer declares intent, the platform derives infrastructure. If compliance requirements change, the developer’s spec doesn’t.
Notice what the payments team did not need to learn. They did not need to know that the organization uses Cilium for L4 network policy. They did not need to know that Istio ambient mesh provides L7 authorization. They did not need to know the ExternalSecrets Operator’s template syntax or which ClusterSecretStore is configured. They did not need to understand SPIFFE identities, pod disruption budget semantics, or scrape interval configuration. They needed to know the platform’s API — the CRD schema — and their own service’s requirements.
This is the separation of concerns that a platform provides. The platform team owns the infrastructure complexity. The application team owns the application complexity. Neither team needs to understand the other’s domain in detail. The CRD schema is the contract between them.
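From the platform team's side, that contract is just a typed API. The Go types below are illustrative — the field names, the two dependency kinds, and the validate rules are assumptions, not the book's actual CRD schema — but they show the shape of the boundary: a spec either satisfies the contract's invariants or is rejected before anything is derived.

```go
package main

import "fmt"

// Dependency and ServiceSpec sketch the contract as Go types; the real
// schema would be a CRD with OpenAPI validation generated from types
// like these. All names here are assumptions for illustration.
type Dependency struct {
	Name      string // e.g. "orders" or "payments-db"
	Type      string // "service" or "secret"
	Direction string // "outbound" for service dependencies
}

type ServiceSpec struct {
	Name         string
	Image        string // referenced by digest; the platform enforces this
	Replicas     int
	Dependencies []Dependency
}

// validate enforces the contract's invariants at the boundary, so a
// malformed spec fails before any infrastructure is compiled from it.
func (s ServiceSpec) validate() error {
	if s.Name == "" || s.Image == "" {
		return fmt.Errorf("name and image are required")
	}
	for _, d := range s.Dependencies {
		if d.Type != "service" && d.Type != "secret" {
			return fmt.Errorf("dependency %q: unknown type %q", d.Name, d.Type)
		}
	}
	return nil
}

func main() {
	spec := ServiceSpec{
		Name:     "payments",
		Image:    "registry.example.com/payments@sha256:abc123",
		Replicas: 3,
		Dependencies: []Dependency{
			{Name: "orders", Type: "service", Direction: "outbound"},
			{Name: "payments-db", Type: "secret"},
		},
	}
	fmt.Println(spec.validate())
}
```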
1.5 The Platform Spectrum
Not every organization needs the platform this book describes. The scope of a platform should match the organization’s scale, risk tolerance, and platform team capacity. Platforms exist on a spectrum.
Lightweight. Blessed Helm charts with admission webhook validation. The chart encodes the organization’s deployment standards; the webhook prevents the most dangerous overrides (disabling network policies, using unscanned images). This catches the worst mistakes and provides a reasonable baseline. It requires a small investment — a few engineers maintaining charts and webhooks. When it’s appropriate: small organizations (under 50 services), high-trust teams that are infrastructure-literate, homogeneous workloads where one chart covers 90% of cases. When you’ve outgrown it: the chart has 200+ values and nobody understands all of them. The webhook is catching misconfigurations weekly. New teams take a week to produce a working values file. The security team can’t verify that all services have correct policies.
Moderate. Custom Resource Definitions with a compilation pipeline, basic policy enforcement, and automated observability. The developer writes a CRD spec, and a controller produces the full set of Kubernetes resources. Policies are evaluated at compile time. Metrics collection is automatic. This is a meaningful platform — it changes the abstraction level and provides real guarantees. It requires a dedicated platform team (3-6 engineers). When it’s appropriate: mid-size organizations (50-200 services), multiple teams deploying independently, regulatory or security requirements that demand consistent enforcement. When you’ve outgrown it: you need multi-cluster management, your compliance team wants continuous evaluation instead of quarterly audits, or you’re running specialized workloads (GPU, batch) that don’t fit the service CRD.
Comprehensive. Multi-cluster lifecycle management, compile-time authorization with a policy engine, layered security enforcement (L4 + L7 + runtime), compliance controllers, GPU workload management, self-managing clusters. This is what this book teaches. It requires a serious platform engineering investment (6+ engineers, ideally more). When it’s appropriate: large organizations (200+ services), regulated industries, multi-cloud or multi-region deployments, organizations where a security incident has material business consequences. When it’s overkill: you have 20 services, one cluster, and a team that deploys once a week.
The purpose of this spectrum is not to rank platforms. A lightweight platform that serves its organization well is better than a comprehensive platform that’s half-built and poorly adopted. The spectrum exists to help the reader evaluate where they are and what signals indicate they need more.
This book describes the comprehensive case in full detail. The principles — automatic features rather than opt-in capabilities, default-deny security, observability as a consequence of deployment — apply at every level. A lightweight platform that understands these principles is better than a comprehensive platform that doesn’t.
You can always choose to implement less. You cannot choose to know less. Understanding the comprehensive case gives you the judgment to decide which parts your organization needs today and which can wait.
If you’re building incrementally, the critical path is: CRD design (Chapter 4) → derivation pipeline (Chapter 8) → secret resolution (Chapter 9) → network policy generation (Chapter 13) → observability wiring (Chapter 19). That’s a working platform. Multi-cluster management (Part II), advanced workloads (Part VI), continuous compliance (Chapter 14), and binary enforcement (Tetragon) are valuable extensions you can add when you need them.
1.6 Who Builds the Platform?
The organizational question deserves more than a note, and we will return to it at length in Chapter 23. But we need to establish one distinction here because it shapes the entire book.
The platform team’s job is not to run Kubernetes. Running Kubernetes is an operations task — keeping the control plane healthy, managing node pools, applying security patches. These are important tasks, but they are not platform engineering.
The platform team’s job is to build a product that other engineers use to deploy and operate their services. The users of this product are the application development teams. The product’s interface is the CRD schema. The product’s documentation is its error messages and status conditions. The product’s SLO is the time from “developer writes a spec” to “service is running in production.”
This distinction matters because it determines how the platform team spends its time. In practice, most platform teams do both operations and product work. The question is not “which type are you?” but “what percentage of your time goes to reactive work (tickets, debugging, manual configuration) versus proactive work (building features, improving automation, reducing toil)?”
A team that spends 80% of its time on reactive work scales linearly with the number of application teams — more teams, more tickets, more manual steps. When the organization grows from 5 teams to 50, the reactive workload grows proportionally.
A team that spends 80% of its time on proactive work scales sublinearly — once the platform supports a capability, every team gets it automatically. Adding the tenth team costs the same as adding the hundredth, because the platform does the work, not the platform team.
The practical difference is visible in how the team responds to a common request: “we need to deploy a new service.” A reactive response: create a ticket, allocate a namespace, send the wiki link, be available on Slack for questions. A proactive response: “write a Service CRD and apply it.” One response creates work for the platform team every time. The other creates work only when the platform needs a new feature.
Both modes of work are necessary. Clusters need to be operated. Incidents need to be responded to. The question is whether the platform team’s investment is primarily in scaling the team (more people to handle more tickets) or scaling the platform (better software that handles more workloads without more people).
This book assumes the reader is on the product side — either as a platform engineer building the platform, or as an infrastructure architect designing one. The implementation details assume a team that writes code, operates controllers, and thinks about API design. If you are currently on the operations side and want to transition, this book can serve as a roadmap for what to build and why. The transition is not easy — it requires the team to stop being a service desk and start being a software team — but the reward is a platform that scales with the organization rather than against it.
1.7 What This Book Covers
The remainder of the book is organized as follows.
Part I (Chapters 1–4) establishes the conceptual foundation. Chapter 2 introduces the intent/infrastructure distinction — the developer declares what they need, the platform derives the infrastructure, and the developer’s declaration survives changes to compliance requirements, tooling, and policy. Chapter 3 argues that the platform’s derivation logic should be code, not configuration — real programming languages with types, tests, and AI-assisted development. Chapter 4 covers CRD schema design: the principles of designing the platform’s input language.
Part II (Chapters 5–7) covers cluster lifecycle: how the platform provisions, manages, and retires the infrastructure it runs on. This includes declarative cluster provisioning, self-managing clusters that survive the loss of their management plane, and outbound-only inter-cluster communication.
Part III (Chapters 8–11) covers workload management: the derivation pipeline that transforms service specs into Kubernetes resources, secret resolution, autoscaling and resource governance, and escape hatches for workloads that don’t fit the standard model.
Part IV (Chapters 12–14) covers security: default-deny as architecture, layered enforcement across four independent layers, compile-time authorization, bilateral network agreements, supply chain security, and compliance as a continuous reconciliation loop.
Part V (Chapter 15) covers networking: the service mesh as identity infrastructure, multi-cluster identity federation, and the edge where mesh identity meets browser HTTPS.
Part VI (Chapters 16–18) covers advanced workloads: batch scheduling with gang semantics, GPU infrastructure and monitoring, and model serving for inference workloads.
Part VII (Chapters 19–21) covers operations: observability by default, backup and disaster recovery, and testing the platform.
Part VIII (Chapters 22–23) covers cross-cutting principles: cryptographic discipline and the platform-as-product mindset.
Throughout, the book uses a reference implementation to illustrate concepts with real code. The reference implementation is one way to build the platform described here — not the only way. The principles are tool-agnostic; the examples necessarily pick specific tools. Where a specific tool is used, alternatives are presented and the underlying principle is identified.
A note on migration. Many readers will arrive at this book with an existing deployment system — 200 services on Helm charts, a mature CI pipeline, a GitOps repository. The book does not pretend you start from scratch. Chapter 11 covers escape hatches and package management — the mechanisms that allow Helm-installed software to coexist with platform-compiled services. The practical migration path is incremental: new services deploy through the platform, existing services migrate when they’re next modified, and third-party software stays on Helm with mesh integration through the LatticeMeshMember CRD. The platform doesn’t replace your existing tooling on day one. It grows alongside it.
Exercises
Exercises are rated by difficulty: [M10] means a moderate exercise taking about 10 minutes. [H30] means a hard exercise requiring 30 minutes or more of serious thought. [R] denotes a research problem with no single correct answer — the value is in the argument you construct, not the conclusion you reach.
1.1. [M10] An organization has 200 microservices deployed to Kubernetes. A security audit reveals that 34 of them have no NetworkPolicy, 12 are using container images referenced by tag rather than digest, and 8 have secrets stored in plaintext ConfigMaps. All 200 services were deployed following the same documented process. Explain how this outcome is possible even when the process is correct. What property must a platform have to make this outcome impossible, not merely unlikely?
1.2. [H30] A platform team proposes the following architecture: application teams write standard Kubernetes Deployments and Services. A MutatingAdmissionWebhook intercepts every Deployment and injects a NetworkPolicy, a ServiceMonitor, and a pod disruption budget based on the Deployment’s labels. A separate controller watches for Deployments and reconciles the injected resources if they drift.
Analyze this architecture. Does this qualify as intent-based (the developer describes what they need) or infrastructure-based (the developer describes what to produce)? What can you achieve by extending it, and what can you not? At what point — if any — does it become equivalent to a CRD-based platform? Construct the strongest possible argument that this approach is sufficient and a purpose-built CRD adds unnecessary complexity. Then identify the weakest point in your own argument.
1.3. [R] This chapter argues that a platform should be opinionated. Consider the counter-position: a platform should provide capabilities (network policy generation, secret management, observability wiring) that teams can compose as they choose, without enforcing a single way to use them. This is the “toolkit” approach.
Construct the strongest version of the toolkit argument. Under what organizational conditions is it superior to the opinionated platform? Now consider: is there a stable equilibrium where a toolkit remains a toolkit, or does organizational pressure inevitably push it toward becoming an opinionated platform (or collapsing back into ad-hoc scripts)? What forces drive the evolution in each direction?
1.4. [H30] Section 1.3 defines a platform’s output as “the full set of infrastructure required to satisfy that intent safely, securely, and observably.” Consider a service that is correctly compiled by the platform — all resources are generated, all policies are in place, all metrics are being collected. Three months later, a new CVE is discovered in the base image. The service is still running the vulnerable image.
Is this a platform failure? The platform produced a correct deployment at compile time. The vulnerability didn’t exist at compile time. Should the platform be responsible for maintaining the correctness of its output over time, or only for producing correct output at the moment of compilation? If the former, what are the implications for the platform’s architecture? If the latter, what is the boundary of the platform’s responsibility and what fills the gap?
1.5. [R] The chapter presents four approaches to filling the Kubernetes-to-platform gap (wiki, CI pipeline, Helm charts, GitOps) and argues they all fail because they are “additive” — they don’t change the abstraction level. Construct a counter-example: describe an additive approach that does provide the properties of a platform as defined in Section 1.3. If you cannot construct one, prove that additivity and platform properties are fundamentally incompatible, or identify the specific property that additivity cannot provide.
1.6. [H30] Two organizations each have 100 engineers and 40 microservices. Organization A invests in a comprehensive platform with a 6-person platform team. Organization B uses a wiki and shared Helm charts with a 2-person infrastructure team. After two years, both organizations have scaled to 300 engineers and 150 microservices. Model the total engineering cost (platform team time + application team time spent on deployment and infrastructure) for each organization over the two-year period. Under what assumptions does Organization A’s investment pay off? Under what assumptions does it never pay off? What is the break-even point, and what variables does it depend on most sensitively?
1.7. [R] This chapter assumes Kubernetes as the substrate. Is the compilation model — high-level intent compiled into low-level infrastructure — specific to Kubernetes, or is it a general pattern? Consider applying it to a serverless platform (AWS Lambda + API Gateway + DynamoDB), a bare-metal fleet (no orchestrator), or a mainframe environment. What properties of Kubernetes make the compilation model natural, and which of those properties are actually necessary? Could you build a compilation-based platform without Kubernetes, and what would you lose?