Chapter 4: Designing Your API Surface

The CRD schema is the most consequential design decision your platform team will make. It is the interface between the platform and every developer who uses it. Every service deployed to the platform begins as a CRD spec written against this schema. Every error message the platform reports is in terms of this schema. Every evolution of the platform — new features, changed behavior, deprecated capabilities — is expressed through changes to this schema.

A good schema makes the common case trivial and the complex case possible. A bad schema makes everything equally difficult.

This chapter is about how to make good schemas. The principles apply regardless of implementation language or framework — whether you’re building in Rust with kube-rs, Go with controller-runtime, Python with kopf, or anything else. The hard part is not implementing a CRD. The hard part is designing one that you, your users, and your future self can live with for years.

A Custom Resource Definition is not a configuration file. It is a versioned, validated, documented API that your platform exposes to its users. The tooling makes it easy to treat CRDs as “just YAML” and lose sight of the API discipline they demand.

When you publish a REST API, you think carefully about the endpoint paths, the request and response schemas, the error codes, the versioning strategy, the backward compatibility guarantees. You write documentation. You review changes. You don’t rename a field without a migration plan.

CRDs deserve the same rigor. They are the public interface of your platform. They will be written by hundreds of developers, stored in git repositories, processed by CI pipelines, and depended upon by tooling. Treating them as casual configuration produces the same result as treating a REST API as casual configuration: breakage, confusion, and loss of trust.

What this means in practice:

The CRD’s OpenAPI schema is the first line of defense. Required fields, enums, minimum and maximum values, regex patterns, nullable constraints — everything that can be validated structurally, should be. An invalid spec should be rejected by the Kubernetes API server at submission time, before it ever reaches your controller.

This is analogous to a programming language’s type system. A strong type system catches errors at compile time. A strong CRD schema catches errors at admission time. The further left you push validation — from runtime to compilation, from compilation to admission, from admission to schema — the faster the developer gets feedback and the less the platform must handle invalid input.

Consider the replicas field. It should be an integer. It should have a minimum of 1 (or 0, if you support scaling to zero). It should have a maximum that reflects your cluster’s capacity or your quota system’s limits. All of this can be expressed in the schema:

replicas:
  type: integer
  minimum: 1
  maximum: 100

A developer who submits replicas: -1 or replicas: "three" gets an immediate API server rejection with a clear error message. Your controller never sees the invalid spec. Your compilation pipeline never runs. The feedback loop is measured in milliseconds.

CRDs support multiple versions: v1alpha1, v1beta1, v1. These version labels carry meaning — they are promises to your users about the stability of the schema.

v1alpha1: No stability guarantees. The schema may change without notice. Fields may be added, removed, or renamed between releases. This is the version for early development, prototyping, and internal testing. Tell your users explicitly: if you depend on alpha, you accept breakage.

v1beta1: The schema is stable. Fields will not be removed without a deprecation cycle. New fields may be added (additive changes). The compiled behavior may be refined but should not change semantically in ways that break existing specs. This is the version for production early adopters — teams that want the latest features and accept the responsibility of being on the leading edge.

v1: Committed. You are promising to support this schema for years. Breaking changes require a new API version (v2), a migration path, and a generous transition period. This is the version for the broad user base — teams that update their specs infrequently and expect them to keep working.

The version you choose communicates your confidence level. Don’t promote to v1 prematurely — once you’ve made the commitment, you’re bound by it. And don’t stay on v1alpha1 forever — it signals to your users that the platform isn’t stable enough to trust.
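In the CRD manifest, these versions coexist in a single `versions` stanza. A minimal sketch, assuming the LatticeService CRD used later in this chapter:

```yaml
# Sketch: two served versions of one CRD. Exactly one version is the
# storage version; objects are converted to it before being persisted.
versions:
  - name: v1alpha1
    served: true      # still accepted by the API server
    storage: false
  - name: v1
    served: true
    storage: true     # the version stored in etcd
```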

The status subresource is how the platform communicates back to the developer. It is not an afterthought — it is the primary interface through which developers understand what the platform did with their spec.

A well-designed status answers three questions at a glance:

  1. What phase is the resource in? Pending, Compiling, Ready, Failed, Deleting. One word.
  2. What happened? Conditions — structured booleans with reasons and messages. SecretsResolved: True. PolicyAuthorized: False — AccessSecret denied for secret payments/api-key by policy production-secrets.
  3. Is the platform caught up? The observedGeneration field. If it matches metadata.generation, the platform has processed the latest spec. If it’s behind, the platform hasn’t caught up yet — the developer’s last change is queued but not compiled.

A developer should be able to understand the state of their service by reading the status alone. They should not need to kubectl describe the Deployment, the NetworkPolicy, the ExternalSecret, and the VMServiceScrape to figure out what went wrong. If they’re doing that, your status is too sparse.

Of all the design principles in this chapter, this is the most important and the most counterintuitive: expose the least amount of Kubernetes that you can.

Every field in your CRD schema is a commitment. You must validate it, compile it into correct output, test it, document it, and maintain it across schema versions. Every field is also a question the developer must answer — or at least understand well enough to skip. The more fields you expose, the more your CRD starts to resemble the Kubernetes Deployment spec it’s supposed to abstract over.

Walk through the reasoning with container configuration as an example.

A Kubernetes container spec has dozens of fields: name, image, command, args, workingDir, ports, env, envFrom, resources, volumeMounts, livenessProbe, readinessProbe, startupProbe, lifecycle, terminationMessagePath, terminationMessagePolicy, imagePullPolicy, securityContext, and more.

How many of these should appear in your platform’s CRD?

  • name and image: Yes. The developer must specify what container to run.
  • resources (CPU and memory): Yes. The platform needs resource requests for quota enforcement and scheduling.
  • env: Probably yes, but not as a raw list of Kubernetes EnvVar objects. Your CRD should support environment variables and secret references in a platform-specific format (${secret.name.key}) that the compiler resolves.
  • ports: Maybe. If the platform can infer ports from the container image or from conventions (port 8080 for HTTP), the developer shouldn’t have to specify them. If not, expose them.
  • command and args: Maybe. Most containers have a sensible default entrypoint. Exposing these adds flexibility but also adds a field that most developers leave blank.
  • livenessProbe, readinessProbe: Probably not as raw Kubernetes probe specs. The platform should have sensible defaults (HTTP GET on /healthz, TCP check on the primary port) and allow overrides only when needed.
  • securityContext: No. The platform owns the security context. A developer who needs to modify it (add a capability, run as root) must go through an authorization gate (Chapter 13), not a CRD field.
  • volumeMounts: No. Volumes are managed through the platform’s secret resolution (Chapter 9) and file mount mechanism. Raw volume mounts bypass the platform’s security model.

The result: your container spec might have 5 fields where Kubernetes has 20+. The rest are either inferred by the platform, managed by the compiler, or gated behind an explicit policy override. The developer’s cognitive load is proportional to the fields they must specify — not the fields Kubernetes supports.

The honest caveat: cognitive load is moved, not deleted. When a service fails because of a health check default the developer didn’t configure, or a security context they can’t see, they’re now debugging the platform’s behavior instead of their own Kubernetes manifests. With a raw Helm chart, the “magic” is visible in the templates. In this model, the derivation logic is compiled code. To a senior developer, the platform is a black box — and if the black box is wrong, they’re stuck.

This is real. The mitigation is transparency, not simplicity: the CRD status must report exactly what was derived (Section 4.8), a diagnostic CLI must show the full derived output (platform show compiled checkout), and error messages must reference the specific derived resource that’s causing the problem. The developer shouldn’t need to read the platform’s source code to debug a deployment. But they do need tools that make the platform’s decisions visible on demand. If the platform team becomes a bottleneck for every non-standard request, the abstraction has failed — not because hiding fields is wrong, but because the escape hatches (Chapter 11) and debugging tools are insufficient.

Practical heuristic: If fewer than 20% of your users would ever set a field, it probably doesn’t belong in the base spec. Put it behind an annotation, a policy-gated extension, or a separate CRD that signals “I know what I’m doing and I’ve been authorized to do it.”

The best validation is a schema that can’t express the invalid case. This principle, borrowed from type system design, saves you from writing validation logic for states that the schema prevents.

If a field has a fixed set of valid values, model it as an enum — not as a string with a regex.

# Bad: validated by webhook
protocol:
  type: string
  # Webhook must check: is this "HTTP", "GRPC", or "TCP"?

# Good: validated by schema
protocol:
  type: string
  enum: [HTTP, GRPC, TCP]

With an enum, the API server rejects protocol: websocket immediately. Without an enum, the invalid value reaches your webhook (or worse, your controller), and you must write, test, and maintain the validation logic yourself.

If field B is required when field A has a certain value, express this through structure rather than conditional validation.

# Bad: cross-field dependency validated by webhook
scaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
# Webhook must check: if enabled is true, minReplicas and maxReplicas are required

# Good: structure enforces the dependency
scaling:           # optional block
  minReplicas: 2   # required within scaling
  maxReplicas: 10  # required within scaling
# If scaling is present, minReplicas and maxReplicas are required by schema
# If scaling is absent, no scaling is configured — no validation needed

In the structural version, there is no enabled flag. The presence or absence of the scaling block communicates intent. If the block is present, its required fields must be populated. The schema enforces this without a webhook.
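The structural version can be enforced directly in the CRD's OpenAPI schema. A sketch, using the field names above:

```yaml
# Sketch: presence of the scaling block makes its fields required;
# absence requires no validation logic at all.
scaling:
  type: object
  required: [minReplicas, maxReplicas]
  properties:
    minReplicas:
      type: integer
      minimum: 1
    maxReplicas:
      type: integer
      minimum: 1
```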

This principle is controversial but important for platform CRDs specifically. If a field represents a decision the developer should make consciously, require it rather than defaulting it.

Consider replicas. A sensible default might be 1 — start with one replica unless specified otherwise. But a developer who didn’t set replicas and got 1 made a different decision than a developer who explicitly wrote replicas: 1. The first developer might have intended 3 replicas and forgot the field. The second developer made a deliberate choice.

For a platform — where the compiled output includes PodDisruptionBudgets, autoscaling resources, and quota accounting, all derived from the replica count — the distinction matters. A developer who deploys with replicas: 1 because they forgot to set it might be surprised when their service has no availability protection during node drains. Requiring the field forces a conscious choice.

The counterargument: required fields add boilerplate. Every developer must set replicas even when the answer is obvious. This is real friction — and friction kills adoption, especially for a new platform competing with an existing Helm workflow.

Timing matters. When your platform has zero users and you’re trying to win adoption, minimize required fields. Make the first spec as short as possible. Default generously. The goal is to get teams deploying through the platform, not to force conscious decisions on an audience that hasn’t committed to the platform yet.

As the platform matures and adoption stabilizes, tighten the requirements. Add required fields in new schema versions. Migrate existing specs. At this point, the platform has earned the right to demand explicit decisions — because teams trust it and understand why the decisions matter.

This isn’t a contradiction — it’s a lifecycle strategy. You start permissive and tighten over time. The key is planning for the tightening from the start: use schema versioning (v1alpha1 → v1beta1 → v1) so that new required fields arrive through version bumps, not surprise breaking changes. If you design the migration path early — defaulting a field today but tracking which specs rely on the default — the eventual transition to requiring it is a mechanical migration, not a painful one.

The test for whether to require or default a field is: would a reasonable developer ever choose a different value? If the answer is “almost never” (scrape interval, health check timeout, termination grace period), default it. If the answer is “frequently” (replicas, CPU, memory), require it — but consider deferring the requirement to a later schema version if adoption is the priority.
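In the schema, the two choices are one line apart. A sketch, where scrapeInterval is a hypothetical defaulted field:

```yaml
# Sketch: replicas is required (a conscious decision with no default),
# while scrapeInterval is defaulted (rarely worth overriding).
spec:
  type: object
  required: [replicas]
  properties:
    replicas:
      type: integer
      minimum: 1
    scrapeInterval:
      type: string
      default: "30s"
```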

The CRD has two halves: the spec (what the developer wants) and the status (what the platform reports). Together, they form a feedback loop that replaces the wiki, the Slack channel, and the troubleshooting ticket.

The spec is the input to the compiler. It is written by the developer, read by the platform, and never modified by the platform. (Technically, the platform could mutate the spec through a mutating admission webhook, but this should be avoided — it’s confusing when the spec you submitted isn’t the spec that’s stored.)

The spec should contain exactly the information the developer needs to express their intent and nothing more. It should not contain computed values (those belong in the status), infrastructure details (those are the compiler’s concern), or operational state (that belongs in the status conditions).

A useful mental model: the spec is a function’s arguments. The developer provides the arguments. The function (the compiler) produces the result (the status and the compiled resources). If you find yourself tempted to put a field in the spec that the developer will never set — that’s a sign it belongs in the compiler’s logic, not the schema.

The status is the platform’s feedback to the developer. It is written by the platform, read by the developer, and should never be modified by the developer (Kubernetes enforces this through the status subresource).

Design the status to answer the questions developers actually ask:

“Did it work?” The phase field. Ready means the service is compiled and running. Failed means something went wrong. Compiling means the platform is working on it. This is the first field a developer checks.

“What went wrong?” The conditions array. Each condition is a structured boolean with a type, a status (True/False/Unknown), a reason (machine-readable), and a message (human-readable). The conditions are the primary debugging interface. A developer who sees PolicyAuthorized: False — AccessSecret denied for secret payments/api-key by policy production-secrets knows exactly what happened, which policy denied the access, and what to do about it.

“Is the platform caught up?” The observedGeneration field. If status.observedGeneration == metadata.generation, the platform has processed the latest spec. If it’s behind, the developer knows their most recent change hasn’t been compiled yet. This prevents a common confusion: “I changed my spec but nothing happened” — the answer is usually “the platform hasn’t reconciled yet.”

“What did the platform produce?” A summary of compiled resources — count, types, and optionally names. “8 resources compiled: Deployment (1), Service (1), CiliumNetworkPolicy (2), AuthorizationPolicy (2), ExternalSecret (1), VMServiceScrape (1).” This gives the developer visibility into the compilation output without requiring them to inspect each resource individually.

The design principle: a developer should be able to understand the complete state of their service from kubectl get and kubectl describe on the CRD alone. If they need to inspect the Deployment, the NetworkPolicy, and the ExternalSecret to understand what’s happening, your status is incomplete.

Chapter 2 drew an analogy between admission webhooks and a compiler’s parser: they reject invalid input before it reaches the compilation pipeline. This section makes that concrete.

CRD schema validation catches structural errors: wrong types, missing required fields, values outside enum ranges. This covers a large class of mistakes. But some validation requires logic the schema can’t express.

  • Image reference format. The schema can require image to be a non-empty string. It cannot require the string to be a valid container image reference with a digest. A webhook can parse the string and reject myapp:latest (tag) while accepting registry.example.com/myapp@sha256:abc123 (digest).
  • Secret reference format. If your secret reference syntax is ${secret.name.key}, the schema can require the field to be a string. It cannot validate that the string is a well-formed secret reference. A webhook can parse the references and reject malformed ones.
  • Cross-field dependencies. If the CRD has a scaling block and a replicas field, you need to validate that scaling.maxReplicas >= replicas. Modern Kubernetes (1.25+) supports x-kubernetes-validations with Common Expression Language (CEL) — you can write self.scaling.maxReplicas >= self.replicas directly in the CRD spec, evaluated by the API server without a webhook. CEL bridges the gap between structural schema validation and a full admission webhook: it handles cross-field rules without the operational overhead of running webhook pods. For constraints that CEL can’t express (semantic validation against external state), you still need a webhook.
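The cross-field rule can be written as a short CEL sketch attached to the spec schema, using the field names from the scaling example earlier in this chapter:

```yaml
# Sketch: attached to the spec object, so `self` is the spec.
# The rule is vacuously true when the optional scaling block is absent.
x-kubernetes-validations:
  - rule: "!has(self.scaling) || self.scaling.maxReplicas >= self.replicas"
    message: "scaling.maxReplicas must be at least spec.replicas"
```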

Semantic validation depends on the state of the cluster, not just the contents of the spec.

  • Quota checks. “This service requests 8 CPU. The namespace quota is 16 CPU with 12 CPU in use. The request fits.” This requires reading the quota and current usage from the cluster.
  • Referential integrity. “This spec declares a dependency on service payments. Does a service named payments exist in this namespace?” This requires reading the service list.
  • Policy pre-checks. “This spec references a secret. Does a Cedar policy exist that would permit this access?” This requires evaluating the policy engine. (The full evaluation happens during compilation — the webhook can do a fast pre-check to give early feedback.)

Semantic validation is more complex than syntactic validation because it depends on external state, which can change between validation and compilation. A quota check that passes at admission time might fail at compilation time if another service was admitted between the two. This is acceptable — admission validation provides early feedback, not guarantees. The compiler does the authoritative check.

Admission webhooks should use failurePolicy: Fail. If the webhook is unreachable, the API server should reject the resource rather than admitting it unvalidated.

This is a strong stance with operational consequences: if the webhook pods are down, no CRDs can be created or updated. The platform team must treat the webhook as a high-availability component — multiple replicas, proper health checks, and monitoring. The alternative — failurePolicy: Ignore — means invalid specs are admitted when the webhook is down, which undermines the entire validation model.
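A sketch of the corresponding ValidatingWebhookConfiguration; the webhook service name, namespace, and path are assumptions, while the group and resource names follow this chapter's examples:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: lattice-validation
webhooks:
  - name: validate.lattice.dev
    failurePolicy: Fail            # webhook unreachable => reject, never admit unvalidated
    rules:
      - apiGroups: ["lattice.dev"]
        apiVersions: ["v1alpha1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["latticeservices"]
    clientConfig:
      service:
        namespace: platform-system # assumed namespace
        name: lattice-webhook      # assumed service name
        path: /validate
    admissionReviewVersions: ["v1"]
    sideEffects: None
    timeoutSeconds: 5
```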

Designing a good schema is necessary but not sufficient. Developers must be able to find the schema, understand it, and write valid specs efficiently. A schema that exists only as an OpenAPI definition in the CRD YAML is technically complete and practically useless.

kubectl explain. The built-in tool. kubectl explain service.spec.containers shows the field type, description, and whether it’s required. This works out of the box if your CRD includes description annotations on every field — which it should. Invest in writing clear, concise descriptions for every field. This is your baseline documentation; many developers will never read anything else.
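Descriptions live in the schema itself. A sketch for the replicas field, with wording drawn from this chapter:

```yaml
# Sketch: the description is what kubectl explain shows developers.
replicas:
  type: integer
  minimum: 1
  description: >-
    Number of replicas to run. Required; there is no default. The platform
    derives the PodDisruptionBudget and quota accounting from this value.
```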

Generated reference documentation. Tooling exists to generate HTML or Markdown documentation from CRD OpenAPI schemas. Publish this alongside your platform docs. Include examples for each field, not just type definitions. A developer reading the docs for the resources block should see a complete example of a service dependency declaration and a secret dependency declaration, not just “type: object.”

IDE autocomplete and validation. If developers write CRD specs in an IDE (VS Code, JetBrains), they should get autocomplete and inline validation. This requires publishing the CRD schema where IDE plugins can find it — typically through a Kubernetes connection or a schema file. The investment is small; the productivity impact is significant. A developer who gets autocomplete on spec.workload.resources.<name>.<tab> and sees type, direction, params makes fewer mistakes and asks fewer questions.

Scaffold tooling. A CLI command (platform init checkout) that generates a minimal valid spec with required fields populated and comments explaining each section. This gets developers from zero to a working spec in seconds. The scaffold should produce a spec that compiles and deploys — not a template with placeholder values that fails validation.

Error messages as documentation. Covered in detail in Chapter 23, but worth noting here: when a developer submits an invalid spec, the admission webhook’s error message is documentation. “Field resources.cpu is required” teaches the developer that CPU is required. “Image must use a digest reference (@sha256:…), not a tag” teaches the developer about digest enforcement. Good error messages eliminate the need for many documentation pages.

None of these are optional. A schema without discovery tooling is a schema that developers learn through trial, error, and Slack messages. The investment in kubectl explain descriptions, generated docs, IDE support, and scaffold tooling pays for itself in reduced support load.

Your CRD will change. A new compliance requirement lands. A team discovers the schema can’t express what they need. Someone realizes a field name was wrong from the start. How you handle evolution determines whether your platform improves gracefully or breaks with every release.

Adding a new optional field to the spec does not break existing specs. A spec that doesn’t include the new field continues to work exactly as before. This is your primary evolution mechanism — add new optional fields, compile them when present, ignore them when absent.

Even if you believe nobody uses a field, someone does. In a fleet of 200 services, there is always one spec that relies on the behavior you’re about to remove. Removing a field requires:

  1. Deprecate the field. Add a webhook warning (Kubernetes supports admission response .warnings) that tells developers the field is deprecated and will be removed in the next version.
  2. Release a new schema version. If you’re removing a field from v1beta1, release v1beta2 (or v1) without the field. Keep v1beta1 available with conversion webhooks.
  3. Migrate users. Give them a timeline and a migration guide.
  4. Remove the old version. After the migration period, remove v1beta1 from the CRD’s served versions.

This is slow. It is supposed to be slow. The cost of breakage — angry users, broken deployments, lost trust — far exceeds the cost of a careful deprecation cycle.
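Kubernetes can emit the warning in step 1 natively through per-version fields on the CRD, which complements a webhook warning. A sketch:

```yaml
# Sketch: every request for the deprecated version returns a warning
# that kubectl surfaces to the developer.
versions:
  - name: v1beta1
    served: true
    storage: false
    deprecated: true
    deprecationWarning: "lattice.dev/v1beta1 LatticeService is deprecated; migrate to v1"
  - name: v1
    served: true
    storage: true
```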

A semantic change alters the behavior of the compiler without changing the schema. If replicas: 1 used to mean “one replica, no PDB” and now means “one replica, default PDB,” you’ve changed what the same spec produces. The schema is identical. The output is different. No tooling catches this because no field was added, removed, or modified.

Semantic changes are worse than schema changes because they’re invisible. A developer who reads the changelog and sees “no schema changes” reasonably concludes that their spec will produce the same output. If it doesn’t, they discover the change through a production incident, not a compilation error.

How to handle semantic changes:

  • Feature flag the new behavior. Add an optional field or annotation that opts into the new semantics. Existing specs get the old behavior. New specs can opt into the new behavior. After a transition period, make the new behavior the default and deprecate the old.
  • Epoch-based derivation. The controller reads a platform.lattice.dev/epoch annotation on the CRD. If absent or set to 2025, use the old derivation logic. If set to 2026, use the new logic. This lets a single controller handle two eras of derivation simultaneously — the team migrates service-by-service by bumping the epoch, not by coordinating a fleet-wide cutover.
  • Version bump. If the semantic change is significant, release it as a new schema version. Even if the schema is structurally identical, the version signals to users that behavior has changed.
  • Document relentlessly. Changelog entries, migration guides, webhook warnings. Semantic changes require more communication, not less, because they’re harder to detect.

The mechanics of versioning are necessary but insufficient. A deprecation cycle that exists only in the CRD schema is invisible to most developers. The organizational process matters as much as the technical process:

Communication. When a new field or behavior is added, announce it through the channels your developers actually read — not just a changelog file in a git repo. If your developers use Slack, post there. If they read a weekly engineering newsletter, put it there. Match the importance of the change to the urgency of the communication channel.

Migration tooling. When a version bump requires spec changes, provide a migration tool that transforms v1alpha1 specs into v1beta1 specs automatically. A developer who must manually rewrite 15 lines of YAML will delay the migration. A developer who runs platform migrate checkout --to v1beta1 and reviews the diff will do it in five minutes.

Deprecation dashboards. Track which CRD versions are in use across the fleet. When you deprecate v1alpha1, you should know that 180 of 200 services have migrated and the remaining 20 belong to three teams. Talk to those teams directly. A deprecation timeline without visibility into adoption is a timeline that will slip.

Grace periods. Plan for migration to take longer than you expect. If you announce a 30-day deprecation window, some teams will start on day 29. If you announce a 90-day window, some teams will start on day 89. Build the timeline around your slowest adopters, not your fastest.

The catastrophic case: a conversion webhook bug. A bug in the v1alpha1-to-v1beta1 conversion webhook corrupts specs during migration. The corrupted specs are now in etcd. The derivation pipeline re-derives from corrupted data, producing broken infrastructure. Recovery requires an etcd restore from backup. This is the CRD equivalent of a database migration failure, and it’s why conversion webhooks need the same testing discipline as the derivation pipeline itself — snapshot tests against every existing spec in the fleet, run in CI before the webhook ships. Never deploy a conversion webhook that hasn’t been tested against production data.
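For reference, the conversion webhook is declared in the CRD spec itself; the service names below are assumptions:

```yaml
# Sketch: the API server calls this endpoint to convert objects
# between served versions and the storage version.
conversion:
  strategy: Webhook
  webhook:
    clientConfig:
      service:
        namespace: platform-system # assumed
        name: lattice-webhook      # assumed
        path: /convert
    conversionReviewVersions: ["v1"]
```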

4.8 Worked Example: Designing a Service CRD


Walk through the design of a service CRD from scratch. This is not the final, perfect CRD — it’s a design exercise that applies the principles from Sections 4.1 through 4.6. Different organizations will make different choices at each step.

Starting point: the developer’s mental model


A developer thinks: “I have a service. It runs in containers. It needs CPU and memory. It talks to other services. It reads secrets. I want some number of replicas.”

Translate this directly into a naive first draft:

apiVersion: lattice.dev/v1alpha1
kind: LatticeService
metadata:
  name: checkout
  namespace: commerce
spec:
  replicas: 3
  workload:
    containers:
      api:
        image: registry.example.com/checkout@sha256:abc123
        command: ["/app/server"]
        args: ["--port=8080"]
        variables:
          LOG_LEVEL: info
          DATABASE_URL: "postgres://app:${resources.orders-db.password}@db:5432/orders"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
        readinessProbe:
          httpGet:
            path: /readyz
            port: 8080
    service:
      ports:
        http:
          port: 8080
          protocol: TCP
    resources:
      payments:
        type: service
        direction: outbound
      orders-db:
        type: secret
        params:
          keys: [password]
The first draft uses every available field. But in practice, most of those fields are optional — the schema supports them, but a typical spec omits them. The minimal surface principle says: show the developer only what they need to set. Everything else stays in the schema as optional fields for the cases that need them.

Here’s the same service with only the fields a typical developer would write:

apiVersion: lattice.dev/v1alpha1
kind: LatticeService
metadata:
  name: checkout
  namespace: commerce
spec:
  replicas: 3
  workload:
    containers:
      api:
        image: registry.example.com/checkout@sha256:abc123
        variables:
          LOG_LEVEL: info
          DATABASE_URL: "postgres://app:${resources.orders-db.password}@db:5432/orders"
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 500m
            memory: 512Mi
    resources:
      payments:
        type: service
        direction: outbound
      orders-db:
        type: secret
        params:
          keys: [password]

Fewer fields. Same intent. The developer specifies what matters to them (image, variables, resource requests, dependencies). The platform handles the rest.

Examine what was removed and why:

  • command and args removed. Fewer than 5% of services override the image’s default entrypoint. The fields still exist in the schema (they’re optional), but the worked example omits them to show the minimal case. Most developers never set them.
  • Probes removed. The platform defaults to HTTP GET on /healthz (liveness) and /readyz (readiness). This convention is documented once. Developers who follow it — which is most of them — never think about health checks. The 10% who need a custom probe specify it in their spec. The platform still generates probes — the developer just doesn’t configure them in the common case.
  • limits set equal to requests. The reference implementation requires both, but the convention is to set them equal (guaranteed QoS class). Some teams will disagree and want burstable scheduling. That’s a legitimate need — the schema supports it. The worked example shows the common case.
  • service.ports removed. In the first draft, the port was explicitly declared. In the reduced version, we omit it — the platform can infer it from the container’s exposed ports or default to 8080. In practice, the reference implementation keeps service.ports as an optional block because 20-30% of services use non-standard ports. This is a case where minimal surface and minimal friction conflict. The reference implementation chose to keep the field optional with an explicit port name (e.g., http, grpc) because the name carries semantic meaning for mesh routing.

Notice the variables block: a flat map of name to value, where values support ${resources.name.key} interpolation for secret references. This is simpler than Kubernetes’s name/value pair syntax and removes the need for developers to understand secretKeyRef, configMapKeyRef, or fieldRef. The compiler parses the interpolation syntax and routes through the correct secret path.
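The interpolation parsing is straightforward to sketch. The following Python fragment is illustrative only — the function name and regex are assumptions, not the reference implementation — but it shows how a compiler can extract the secret references to route through the secret path:

```python
import re

# Matches ${resources.<name>.<key>} — the only interpolation form the
# example variables block uses. The character classes are an assumption.
INTERP = re.compile(r"\$\{resources\.([a-zA-Z0-9-]+)\.([a-zA-Z0-9_-]+)\}")

def extract_secret_refs(variables: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (env var, resource name, key) for every interpolated reference."""
    refs = []
    for env_name, value in variables.items():
        for resource, key in INTERP.findall(value):
            refs.append((env_name, resource, key))
    return refs

variables = {
    "LOG_LEVEL": "info",
    "DATABASE_URL": "postgres://app:${resources.orders-db.password}@db:5432/orders",
}
print(extract_secret_refs(variables))
# [('DATABASE_URL', 'orders-db', 'password')]
```

The compiler can then check each extracted (resource, key) pair against the declared resources block and fail compilation on a dangling reference.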

Apply the “make invalid states unrepresentable” principle:

  • replicas: Required integer, minimum 1.
  • resources.cpu and resources.memory: Required strings matching Kubernetes resource quantity format.
  • image: Required string. The webhook validates it’s a valid image reference with a digest (not a tag).
  • workload.resources: A map where each key is a resource name. Values have a type (enum: service, secret, volume, external-service, gpu), and service resources require a direction (enum: inbound, outbound, both). The type field determines which params are valid — secret resources have keys, volume resources have size and accessMode, external services have endpoints. This uses the same field name as container resource requests (containers.<name>.resources), but they’re at different nesting levels — workload.resources for dependencies, workload.containers.<name>.resources for CPU/memory. The reference implementation follows the Score specification (a CNCF project defining a portable workload format) for workload portability, which uses resources for external dependencies.
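The cross-field rules in the last bullet — direction required for services, params constrained by type — cannot be expressed in the OpenAPI subset CRDs support (Exercise 4.7 returns to this), so they land in the webhook. A hypothetical sketch of that check in Python, using the enums listed above (`validate_resource` and its error strings are illustrative, not the reference implementation):

```python
# Structural checks mirroring the rules above: the OpenAPI schema enforces
# the enums; a webhook enforces the cross-field constraints sketched here.
ALLOWED_TYPES = {"service", "secret", "volume", "external-service", "gpu"}
ALLOWED_PARAMS = {
    "secret": {"keys"},
    "volume": {"size", "accessMode"},
    "external-service": {"endpoints"},
}

def validate_resource(name: str, res: dict) -> list[str]:
    """Return a list of validation errors for one workload.resources entry."""
    errors = []
    rtype = res.get("type")
    if rtype not in ALLOWED_TYPES:
        errors.append(f"{name}: type must be one of {sorted(ALLOWED_TYPES)}")
        return errors
    if rtype == "service":
        if res.get("direction") not in {"inbound", "outbound", "both"}:
            errors.append(f"{name}: service resources require a direction")
    elif "direction" in res:
        errors.append(f"{name}: direction is only valid for service resources")
    extra = set(res.get("params", {})) - ALLOWED_PARAMS.get(rtype, set())
    if extra:
        errors.append(f"{name}: params not valid for type {rtype}: {sorted(extra)}")
    return errors

print(validate_resource("payments", {"type": "service", "direction": "outbound"}))  # []
print(validate_resource("orders-db", {"type": "secret", "direction": "outbound"}))
```

A dependency that tries to be both a service and a secret, or a secret with a direction, is rejected before the controller ever sees it.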
status:
  phase: Ready
  observedGeneration: 3
  conditions:
  - type: ImageVerified
    status: "True"
    reason: CosignVerified
    message: "Image signed by key production-signing-key"
  - type: PolicyAuthorized
    status: "True"
  - type: SecretsResolved
    status: "True"
  - type: Compiled
    status: "True"
    message: "13 resources compiled"
  compiledResources:
    count: 13
    types:
      Deployment: 1
      Service: 1
      ServiceAccount: 1
      ConfigMap: 1
      CiliumNetworkPolicy: 2
      AuthorizationPolicy: 2
      ExternalSecret: 2
      VMServiceScrape: 1
      PodDisruptionBudget: 1
      TracingPolicyNamespaced: 1

A developer reading this status knows: the image is verified, all policies passed, secrets are resolved, compilation succeeded, and 13 resources were produced. If any condition is False, the message tells them why. (This example uses raw Kubernetes types like CiliumNetworkPolicy in the status. Exercise 4.6 challenges this — raw types couple the status to the implementation. Abstract categories like networkPolicy: 2 would survive a migration from Cilium to another CNI.)

Now consider what a failed compilation looks like. The developer submits a spec that references a secret they’re not authorized to access:

status:
  phase: Failed
  observedGeneration: 4
  conditions:
  - type: ImageVerified
    status: "True"
    reason: CosignVerified
  - type: PolicyAuthorized
    status: "False"
    reason: AccessSecretDenied
    message: >
      Service commerce/checkout is not permitted to access
      secret payments/stripe-webhook-secret. No Cedar policy
      grants AccessSecret for this principal/resource pair.
      Contact the payments team to request access, or see
      docs.example.com/secrets/authorization for policy examples.
  - type: SecretsResolved
    status: "Unknown"
    reason: BlockedByAuthorization
  - type: Compiled
    status: "False"
    reason: AuthorizationFailed
  compiledResources:
    count: 0

The developer sees: image verification passed, but policy authorization failed. The message tells them which secret, which policy type, and what to do about it. Secrets are Unknown (not yet attempted, because authorization failed first). Compilation produced zero resources — the all-or-nothing rule from Chapter 8.

This is the status working as documentation. The developer doesn’t need to search Slack, read a wiki, or file a ticket. The status tells them what happened, why, and how to fix it. If your platform’s status can do this consistently, you’ve eliminated an entire class of support interactions.
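Tooling can lean on this structure directly. A small illustrative helper — `explain` is hypothetical, not part of the reference implementation — shows how a CLI might summarize a status by surfacing the first failing condition:

```python
def explain(status: dict) -> str:
    """Summarize a workload status the way a `describe`-style CLI view might:
    surface the first failing condition's reason and message."""
    for cond in status.get("conditions", []):
        if cond.get("status") == "False":
            msg = cond.get("message", "").strip()
            return f"{cond['type']} failed ({cond.get('reason', 'Unknown')}): {msg}"
    return f"phase={status.get('phase', 'Unknown')}, all conditions passing"

# Abbreviated version of the failed status above.
failed = {
    "phase": "Failed",
    "conditions": [
        {"type": "ImageVerified", "status": "True", "reason": "CosignVerified"},
        {"type": "PolicyAuthorized", "status": "False",
         "reason": "AccessSecretDenied",
         "message": "Service commerce/checkout is not permitted to access "
                    "secret payments/stripe-webhook-secret."},
    ],
}
print(explain(failed))
```

This only works because the conditions carry machine-readable reasons and human-readable messages in a stable shape; free-form status strings would force every tool to parse prose.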

The schema supports more than the minimal spec shows. A developer who needs a custom health check sets the optional livenessProbe field. A developer who needs a non-standard port adds service.ports. A developer who needs a volume declares a type: volume resource. These are all in the schema — they’re just not in the typical spec.

Where does the CRD actually fall short?

  • A team needs hostNetwork: true for a network monitoring agent. This requires the runtime extension block and is gated by a Cedar policy (Chapter 13) because it has security implications.
  • A team needs to run third-party software (Redis, Prometheus) that doesn’t fit the LatticeService model. They use the package escape hatch — a LatticePackage CRD that orchestrates Helm installation with mesh integration via LatticeMeshMember (Chapter 11).
  • A team needs a workload pattern the service CRD doesn’t model — distributed training with gang scheduling. They use LatticeJob (Chapter 16), which shares the WorkloadSpec for containers and resources but has different top-level fields for batch semantics.

These boundaries are deliberate. The service CRD handles the common case for long-running services. Other CRDs handle other workload types. The package escape hatch handles third-party software. Chapter 11 develops these patterns in detail.

The point of this exercise is not the final schema. It is the process: start with the developer’s mental model, reduce surface, add constraints, design the feedback loop, and identify boundaries. Your organization’s CRD will differ — different defaults, different required fields, different extension mechanisms. The principles are the same.

This chapter has focused on a single CRD — the service. But a real platform has multiple CRDs: services, jobs, models, mesh members, packages, compliance profiles, and more. How do they relate?

Shared workload spec. In the reference implementation, LatticeService, LatticeJob, and LatticeModel all share a common WorkloadSpec type for container definitions and resource declarations. This means the container format (image, variables, resources, files), the resource declaration format (type, direction, params), and the secret reference syntax (${resources.name.key}) are identical across all workload CRDs. A developer who learns to write a service spec already knows how to write the container and dependency sections of a job spec.
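The sharing pattern can be sketched with dataclasses. The type and field names below are assumptions for illustration (the reference implementation's actual types may differ); the point is that each workload CRD embeds the same WorkloadSpec and adds only its own top-level fields:

```python
from dataclasses import dataclass, field

@dataclass
class ContainerSpec:
    image: str                                              # digest-pinned reference
    variables: dict[str, str] = field(default_factory=dict) # flat env map
    resources: dict[str, dict] = field(default_factory=dict)  # requests/limits

@dataclass
class WorkloadSpec:
    """Shared across all workload CRDs: containers plus dependency declarations."""
    containers: dict[str, ContainerSpec]
    resources: dict[str, dict] = field(default_factory=dict)  # type/direction/params

@dataclass
class LatticeServiceSpec:
    workload: WorkloadSpec
    replicas: int = 1        # service-only fields: replicas, autoscaling, ingress...

@dataclass
class LatticeJobSpec:
    workload: WorkloadSpec
    queue: str = "default"   # job-only fields: tasks, schedule, minAvailable...
```

Because the container and dependency sections are the same type, validation and compilation logic for them is written once and reused by every workload controller.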

Different lifecycle, different top-level fields. The CRDs diverge where the workload types diverge. LatticeService has replicas, autoscaling, ingress, and deploy (deployment strategy). LatticeJob has tasks (multiple pod groups with separate specs), queue, schedule (for cron jobs), and minAvailable (for gang scheduling). LatticeModel has roles (predictor/decoder), routing (inference engine configuration), and modelSource (model artifact location). The shared workload spec covers what’s common; the CRD-specific fields cover what’s different.

Peripheral CRDs. Not every CRD is a workload. LatticeMeshMember brings non-compiled workloads (Helm-installed software) into the bilateral agreement model. CedarPolicy defines authorization rules. SecretProvider configures secret backends. TrustPolicy configures image signing keys. These CRDs are operated by the platform team or by team leads, not by every developer. They have different audiences, different change frequencies, and different versioning requirements.

Discovery across CRDs. A developer deploying a new service shouldn’t need to know about CedarPolicy or SecretProvider — those are platform configuration. They should know about LatticeService (for their service) and possibly LatticeMeshMember (if they need to integrate with a non-platform workload). The platform’s documentation should clearly separate “developer CRDs” (what you write to deploy) from “platform CRDs” (what the platform team configures). Chapter 23 returns to this in the context of the platform as a product.

Part I is complete. The reader now has the conceptual foundation for everything that follows:

  • Chapter 1: The platform problem. Kubernetes is not a platform. A platform delivers features — security, observability, governance — as automatic properties of deployment.
  • Chapter 2: Intent over infrastructure. The developer declares what they need; the platform derives the infrastructure. The declaration survives changes to compliance, tooling, and policy.
  • Chapter 3: Code over configuration. The derivation logic should be a real program — with types, tests, and abstractions — not YAML templates or configuration languages.
  • Chapter 4: The input language. The CRD schema that developers write against, designed with minimal surface, structural constraints, and a feedback loop through the status subresource.

Part II shifts from theory to infrastructure: how does the platform manage the clusters it compiles to?

4.1. [M10] A platform team designs a CRD with 45 fields in the spec. When they survey their users, they find that the median service uses 8 of them. The other 37 fields are used by fewer than 10% of services. What is the cost of maintaining these 37 fields? What would you recommend the team do, and what resistance would you expect from the users who rely on those fields?

4.2. [H30] Design a resources (external dependencies) schema for a service CRD that supports the following use cases: (a) outbound dependency on another service, (b) inbound acceptance from another service, (c) secret dependency with specific keys, (d) secret dependency where all keys are needed, (e) outbound dependency on an external host (not a platform service). Your schema must make it impossible to express a dependency that is both a service and a secret, and must make direction required for service dependencies but meaningless for secret dependencies. Show at least two alternative schema designs and argue for your preferred one.

4.3. [R] Section 4.3 argues for required fields over defaults on the grounds that defaults hide decisions. Take the opposing position: argue that a CRD with many required fields is a worse developer experience than one with sensible defaults, and that the “hidden decision” problem is solvable through other means (documentation, CLI tooling, warning messages). Identify the conditions under which each approach is superior. Is there a hybrid strategy that captures the benefits of both?

4.4. [H30] A platform team’s CRD is at v1beta1 with 200 services deployed. They need to make a semantic change: services with replicas: 1 will now get a default PodDisruptionBudget (previously they got none). This improves resilience but changes the behavior of existing services during node drains — a single-replica service with a PDB will block node drains until the pod is rescheduled, which may cause drain timeouts. Design a migration plan that gets the fleet to the new behavior without breaking any existing service. Consider: feature flags, version bumps, opt-in vs. opt-out, timeline, communication.

4.5. [R] The worked example in Section 4.8 reduces the container spec from ~20 Kubernetes fields to ~5 CRD fields. For each field that was removed, identify a scenario where a developer legitimately needs it. Now categorize these scenarios: which ones should be handled by extending the CRD, which by annotation-based overrides, which by a separate CRD, and which by the package escape hatch (Chapter 11)? What principles guide your categorization?

4.6. [H30] The status design in Section 4.8 includes a compiledResources field that lists the types and counts of compiled resources. A security-conscious reader objects: “This leaks implementation details. Now every developer can see that we use CiliumNetworkPolicy instead of native NetworkPolicy, and that we use ExternalSecret instead of native Secrets. When we change implementations, the status changes, and tools that parse the status will break.” Evaluate this objection. Is implementation detail in the status a liability? How much visibility should the status provide? Construct a status design that is transparent enough for debugging but abstract enough to survive implementation changes.

4.7. [R] CRD schemas are limited by the OpenAPI v3 subset that Kubernetes supports. They cannot express: union types (field is either a string or an object), dependent required fields (field B required if field A is present), or recursive structures. How much does this matter in practice? For each limitation, either (a) show that it forces a worse CRD design than would otherwise be possible, with a concrete example, or (b) argue that the limitation is irrelevant because the constraint can be handled by a webhook without meaningful cost.