Introduction

Every platform team I’ve worked with has rediscovered the same lesson: the hard part is not installing the tools. The hard part is deciding what to expose, what to hide, and what to enforce. Kubernetes gives you an extraordinary set of primitives. The question is what you build on top of them — and that question is a design question, not a tooling question.

The model this book argues for: declare intent, derive infrastructure. The 23 chapters that follow work through what that means in practice for cluster provisioning, workload management, security, networking, advanced workloads, operations, and the platform as a product. Every chapter is a set of decisions that platform builders face. The decisions are real. The trade-offs have teeth.

Who This Book Is For

Platform engineers, SREs, and infrastructure engineers building or operating internal platforms on Kubernetes. You’re the person your organization expects to answer the question “how should teams deploy?” — and you’ve discovered that the answer is harder than it looks.

It’s also for senior developers who want to understand what a platform provides and why it works the way it does. If you’ve ever stared at a platform team’s CRDs and wondered why they made the choices they made, this book explains the reasoning.

I assume you know what a Deployment is, that you’ve used kubectl, and that you have a general sense of how Kubernetes reconciliation works. I do not assume you’ve built an operator, designed a CRD, or operated a multi-cluster fleet. If you have, you’ll move faster through Parts I and II. If you haven’t, the book builds up from first principles.

I do not assume you know Rust. The reference implementation is written in Rust, and code appears throughout, but the book explains what the code does in prose. You don’t need to be able to write Rust to follow the arguments. You need to be able to read a struct definition and a match statement, which I’ll walk you through the first time they appear.

What This Book Is Not

Not a Kubernetes tutorial — I assume you know what a Deployment is. Not a tool guide — tools appear throughout, but the question is always “what problem does this solve and why did we choose it?” not “how do I install it.” Not a vendor pitch — every technology choice could be made differently, and I say so when it matters.

The Reference Implementation

The book uses Lattice as its reference implementation throughout. Lattice is a Kubernetes operator written in Rust. It accepts a developer’s intent as a custom resource, compiles that intent into the full set of Kubernetes resources required to run the workload correctly, and reconciles those resources onto clusters.

Lattice is real software. The code examples in this book are drawn from it. The design decisions are ones I actually made, and the trade-offs are ones I actually lived with. When I say “this approach has a problem,” I mean I hit that problem.

But Lattice is an example, not the answer. The architectural model — intent declaration, derivation, reconciliation — is universal. The specific decisions about language, framework, policy engine, and mesh are one valid set of choices among many. If you’re building your platform in Go with kubebuilder, or in Java with the Fabric8 client, or in Python with kopf, you face the same design decisions this book addresses. The chapters on API design, security architecture, authorization policy, and the derivation pipeline apply regardless of your implementation language. Lattice is evidence that the model works. Your platform will make different choices and still benefit from the same principles.
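To make the reconciliation step of that model concrete, here is a minimal sketch in Rust. It is not taken from Lattice: the `Action` enum, the `plan` function, and the use of plain strings to stand in for resource specs are all assumptions made for illustration. It shows only the core comparison any reconciler performs, in any language: look at what should exist, look at what does exist, and compute the steps that close the gap.

```rust
use std::collections::BTreeMap;

/// One step a reconciler would take. Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    Create(String),
    Update(String),
    Delete(String),
}

/// Compare desired state with observed state and return the actions needed
/// to converge. Keys are resource names; values stand in for full specs.
pub fn plan(
    desired: &BTreeMap<String, String>,
    actual: &BTreeMap<String, String>,
) -> Vec<Action> {
    let mut actions = Vec::new();

    // Anything desired that is missing or different must be created or updated.
    for (name, want) in desired {
        match actual.get(name) {
            None => actions.push(Action::Create(name.clone())),
            Some(have) if have != want => actions.push(Action::Update(name.clone())),
            Some(_) => {} // already matches, nothing to do
        }
    }

    // Anything that exists but is no longer desired must be deleted.
    for name in actual.keys() {
        if !desired.contains_key(name) {
            actions.push(Action::Delete(name.clone()));
        }
    }

    actions
}
```

When desired and actual already match, `plan` returns an empty list. That is the property that makes it safe to run the same comparison again and again.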

The Thesis

The thesis of this book fits in one sentence:

Declare intent, derive infrastructure, and ensure that intent survives change.

A developer declares what they need: “run this container with these dependencies, these resource requirements, and access to these secrets.” The platform derives everything else: the Deployment, the NetworkPolicy, the VMServiceScrape, the PodDisruptionBudget, the secret synchronization, the authorization policy. When something changes — a new compliance requirement, a security policy update, a cluster migration — the derivation logic changes. The developer’s intent does not.
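A minimal sketch of that separation, in Rust. Every name here (`WorkloadIntent`, `PlatformPolicy`, `Derived`, `derive`, `kind`) is hypothetical and chosen for illustration; none of it is Lattice's actual schema, and it derives only a subset of the resources listed above. The point is the shape: the developer owns one small struct, the platform owns the function that expands it, and a platform-side requirement can change the output without the intent changing at all.

```rust
/// What the developer declares. Hypothetical fields, for illustration only.
#[derive(Debug, Clone)]
pub struct WorkloadIntent {
    pub name: String,
    pub image: String,
    pub replicas: u32,
    pub depends_on: Vec<String>,
    pub secrets: Vec<String>,
}

/// Platform-owned settings. These can change without touching any intent.
pub struct PlatformPolicy {
    pub require_disruption_budget: bool,
}

/// A simplified stand-in for the Kubernetes resources the platform derives.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Derived {
    Deployment { name: String, image: String, replicas: u32 },
    NetworkPolicy { name: String, allow_egress_to: Vec<String> },
    PodDisruptionBudget { name: String, min_available: u32 },
    SecretSync { workload: String, secret: String },
}

/// Expand one intent into the resources needed to run it.
pub fn derive(intent: &WorkloadIntent, policy: &PlatformPolicy) -> Vec<Derived> {
    let mut out = vec![
        Derived::Deployment {
            name: intent.name.clone(),
            image: intent.image.clone(),
            replicas: intent.replicas,
        },
        // Egress is allowed only to the dependencies the developer declared.
        Derived::NetworkPolicy {
            name: intent.name.clone(),
            allow_egress_to: intent.depends_on.clone(),
        },
    ];

    // A platform-side requirement: the developer never asked for this.
    if policy.require_disruption_budget && intent.replicas > 1 {
        out.push(Derived::PodDisruptionBudget {
            name: intent.name.clone(),
            min_available: intent.replicas - 1,
        });
    }

    // One synchronization resource per declared secret.
    for secret in &intent.secrets {
        out.push(Derived::SecretSync {
            workload: intent.name.clone(),
            secret: secret.clone(),
        });
    }

    out
}

/// Map each derived resource to its Kubernetes kind name.
pub fn kind(resource: &Derived) -> &'static str {
    match resource {
        Derived::Deployment { .. } => "Deployment",
        Derived::NetworkPolicy { .. } => "NetworkPolicy",
        Derived::PodDisruptionBudget { .. } => "PodDisruptionBudget",
        Derived::SecretSync { .. } => "SecretSync",
    }
}
```

For an intent with three replicas, one dependency, and one secret, calling `derive` with `require_disruption_budget` set to `false` yields three resources; setting it to `true` yields four. The intent passed in is identical in both calls.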

This is the thread that runs through every chapter. Cluster provisioning (Part II) is intent-driven. Workload management (Part III) is derivation. Security (Part IV) is enforcement derived from policy. Networking (Part V) is identity derived from intent. The model repeats because the model works.

What’s Inside

Part I: What Is a Platform? (Chapters 1-4) — The thesis: declare intent, derive infrastructure. CRD design as a product interface.
Part II: Cluster Lifecycle (Chapters 5-7) — Provisioning, self-management via pivot, parent-child communication.
Part III: Workload Management (Chapters 8-11) — The derivation pipeline, secret resolution, autoscaling, escape hatches.
Part IV: Security (Chapters 12-14) — Default-deny, four enforcement layers, bilateral agreements, supply chain verification.
Part V: Networking (Chapter 15) — Mesh identity, multi-cluster connectivity, the edge.
Part VI: Advanced Workloads (Chapters 16-18) — Batch scheduling, GPU management, model serving.
Part VII: Operations (Chapters 19-21) — Observability, disaster recovery, testing the platform.
Part VIII: Principles (Chapters 22-23) — Cryptographic discipline, platform as product.
Plus: Glossary, CRD Reference, Capstone Exercises.

How to Read This Book

There are 23 chapters organized into eight parts. You don’t have to read them in order — but you can, and the argument builds if you do.

Cover to cover. The book is structured as a graduate-level study of platform engineering. Part I establishes the thesis. Parts II through VII apply it to every major domain. Part VIII steps back to principles. If you’re building a platform from scratch or rethinking one from the ground up, this is the path I’d recommend. The chapters reference each other, and the later chapters assume you’ve absorbed the mental model from the earlier ones.

The thesis track: Parts I + IV + VIII. If you want the core argument without the implementation depth, read Part I (What Is a Platform?, Chapters 1-4), Part IV (Security, Chapters 12-14), and Part VIII (Principles, Chapters 22-23). Part I establishes the model. Part IV is where the model proves its value most dramatically — security is where “probably, if the process was followed” becomes unacceptable. Part VIII connects the technical decisions back to product thinking. This path gives you the intellectual framework in roughly a third of the pages.

Jump to the chapter you need. Each chapter opens with a summary of the decisions it addresses. If you’re struggling with a specific problem — how to design your CRD schema, how to handle secrets across clusters, how to authorize workload-to-workload communication, how to serve GPU workloads — find that chapter and start there. Cross-references will point you to prerequisite material if you need it. I’ve tried to make each chapter self-contained enough to be useful on its own, while honest enough to tell you when context from an earlier chapter matters.

Exercises

Every chapter ends with exercises. They are rated by difficulty and expected time:

[M10] — Moderate difficulty, approximately 10 minutes. These test comprehension and ask you to apply the chapter’s ideas to a slightly different scenario. If you’ve read the chapter carefully, you can answer them.
[H30] — Hard, approximately 30 minutes. These require deeper analysis, involve trade-offs with no single correct answer, or ask you to design something. Some of them are genuinely hard. A few don’t have clean answers — the point is the reasoning, not the conclusion.
[R] — Research or open-ended. These point you toward topics the chapter raises but doesn’t fully resolve. They’re suitable for team discussions, blog posts, or further study. There is no expected time because the rabbit holes vary.

Answer keys are provided for every chapter. For the moderate exercises, the answer key gives a definitive answer. For the hard exercises, it gives one defensible answer and explains the trade-offs. For the research exercises, it gives pointers and framing rather than conclusions.

The exercises are not filler. If you skip them, you’ll still learn the concepts. If you do them, you’ll internalize the decision-making process — which is the thing that actually matters when you’re standing in front of a whiteboard deciding how your platform should handle secret rotation.

What This Book Does Not Cover

No book covers everything, and this one makes deliberate omissions.

Hard multi-tenancy. Running untrusted workloads from different organizations on shared infrastructure is a distinct problem with distinct solutions (virtual clusters, kernel-level isolation, node-level separation). This book assumes soft multi-tenancy: multiple trusted teams within one organization sharing a platform. Hard multi-tenancy may appear in a future edition.

Cloud provider deep-dives. The book is cloud-aware but not cloud-specific. It discusses managed Kubernetes, cloud networking, and provider-specific features where they affect design decisions, but it does not provide AWS/GCP/Azure implementation guides. Your cloud provider’s documentation is better at that than I am.

Application development patterns. This book is about the platform, not the applications running on it. It does not cover microservice design, API versioning strategies, or database selection. The platform provides the foundation; what you build on it is your business.

Each of these topics deserves its own book. Cramming them in here would dilute the argument.

A Note on Conviction

Platform engineering requires making decisions and living with them. This book has opinions. It argues that derivation is better than templating, that policy should be expressed in a real language, that security enforcement belongs in the platform rather than in documentation, and that your CRDs are a product interface that deserves product-level care.

You may disagree with some of these opinions. Good. The chapters present the reasoning and the evidence. If you reach a different conclusion with the same evidence, you’ve engaged with the material exactly as intended. The worst outcome is not disagreement — it’s building a platform without having thought through the decisions at all.

Let’s begin.