Chapter 5: Provisioning Clusters Declaratively
Part I established what a platform does for services: the developer declares intent, the platform derives infrastructure. Part II applies the same principle one level down — to the compute pools the services run on.
Part I’s service CRD assumes a cluster exists to deploy to. This chapter addresses what comes before that: where does the cluster come from? The platform must produce a running compute pool from a declarative spec, the same way it produces a running service from a CRD.
A definition before we go further. This book uses “cluster” to mean a pool of compute with a control plane — machines (physical or virtual) managed as a unit, with an API for scheduling workloads onto them. Kubernetes is the most common implementation and the one this book targets. The principles in this chapter — declarative lifecycle, provider abstraction, state machines, bootstrapping — are not Kubernetes-specific, but the examples are. If you’re running Nomad, ECS, or bare metal, the ideas transfer; the tooling does not.
Consider the difference. Without declarative cluster provisioning, standing up a new environment looks like this: an infrastructure engineer runs Terraform to provision VMs and networking, then runs an Ansible playbook (or cloud-init scripts) to bootstrap kubeadm, then runs helm install for Cilium, Istio, cert-manager, the secret backend, and monitoring — each with its own values file — then deploys the platform operator, verifies everything is healthy, and updates a wiki with the new cluster’s endpoint. Even with good automation for each step, the full sequence takes hours and the steps are maintained independently. If the cluster needs to be recreated — after a disaster, for a new region, for a test environment — the engineer runs the same sequence and hopes nothing has drifted between the Terraform state, the Ansible playbook, and the Helm values.
With declarative provisioning, the same engineer applies a LatticeCluster CRD. The platform provisions the infrastructure, bootstraps every component in dependency order, and reports Ready in the status. Under 15 minutes on the Docker provider used for development and testing — cloud providers add variable time (see Section 5.4). If the cluster needs to be recreated, they apply the same CRD. The result is identical.
If your platform can’t provision its own clusters, then somewhere upstream there’s a human performing those manual steps. That human is a dependency your platform can’t reconcile away.
5.1 Clusters as Cattle
The “pets vs. cattle” metaphor is usually applied to pods. It applies to clusters too, and most organizations get it wrong at the cluster level.
A pet cluster is named, hand-configured, and irreplaceable. It has a runbook for patching. The team has an on-call rotation specifically for its control plane. When it has a problem, people gather in a war room. When it needs to be upgraded, there’s a change advisory board meeting. Nobody would casually destroy it and create a new one, because recreating it would take days of manual work.
A cattle cluster is declared in a spec, provisioned by automation, and reproducible. If you need to recover from a disaster, you apply the spec and get the same cluster back. If a cluster is misbehaving and debugging isn’t converging, you can drain its workloads and replace it. Upgrades are rolling replacements, not in-place mutations that require change advisory board approval.
This is true even if you only have one cluster. Cattle is not about how many you have — it’s about whether the cluster can be reproduced from its declaration. A single cluster that’s declared in a spec and provisioned by a controller is cattle. A single cluster that was hand-built over six months is a pet. The question is not “how many do I have?” but “if this one disappeared, how long until I have it back?”
The prerequisites:
- Provisioning must be fast. Minutes, not hours. The reference implementation provisions a full cluster — control plane, workers, CNI, mesh, platform operator — in under 15 minutes on the Docker provider (CAPD); cloud providers add variable overhead depending on instance launch times and API throttling. If creating a cluster takes a day of manual steps, you won’t create a replacement under pressure.
- Provisioning must be reproducible. The same spec must produce the same cluster every time. No manual configuration steps that vary by cluster, no SSH sessions, no one-off patches.
- Workloads must be movable. To replace a cluster, you need to move workloads off it first. For stateless services, this is cordon and drain — standard Kubernetes. For stateful workloads, it’s genuinely hard: PersistentVolumeClaims may be bound to specific availability zones and can’t follow a pod to a new node, StatefulSets have identity requirements, and databases need coordinated failover. A self-managing cluster that decides to replace a node hosting a 10TB database and gets the storage reattachment wrong causes a multi-hour outage. The “cattle” model works cleanly for stateless services. For stateful workloads, the platform should handle node-level operations carefully (drain before replace, wait for PV detach, verify reattachment) and the organization should consider dedicated clusters with longer lifecycle expectations for their most critical stateful workloads. The book is honest that its strongest examples are stateless — stateful lifecycle management at this level of automation is a harder problem that deserves its own treatment. Chapter 20 (Backup and Disaster Recovery) addresses the data protection side.
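The careful node-level sequencing described above can be sketched as an ordering, not an API. This is an illustrative Python sketch; the cluster methods (`cordon`, `wait_for_pv_detach`, and so on) are hypothetical stand-ins for the real drain and storage operations:

```python
# Sketch of drain-before-replace ordering for a node hosting stateful
# workloads. All cluster methods here are hypothetical placeholders;
# the point is the sequencing, not a real client library.
def replace_node(cluster, node):
    cluster.cordon(node)                  # stop new pods from scheduling
    cluster.drain(node)                   # evict pods, respecting PDBs
    cluster.wait_for_pv_detach(node)      # storage must be free before reuse
    replacement = cluster.provision_replacement(node)
    cluster.verify_reattachment(replacement)  # gate deletion on healthy storage
    cluster.delete(node)                  # only now is the old node removed
    return replacement
```

Every step gates the next: the old node is deleted only after the replacement's storage is verified, so a failed reattachment leaves the original node recoverable.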
Part I’s thesis applies here directly, one level down. In Part I, the developer declares intent for a service and the platform derives infrastructure — and the intent survives changes to compliance, tooling, and policy. In Part II, the operator declares intent for a cluster and the platform derives infrastructure — and the intent survives changes to the provider, the machine types, and the bootstrapping process. The cluster spec says “3 control plane nodes, 10 workers, Kubernetes 1.32.” Whether that’s provisioned on Proxmox, AWS, or Docker doesn’t change the spec. The platform derives the correct provider-specific infrastructure from the same intent.
5.2 The Cluster as a CRD
In the reference implementation, a cluster is declared with a LatticeCluster CRD:
```yaml
apiVersion: lattice.dev/v1alpha1
kind: LatticeCluster
metadata:
  name: backend
  labels:
    lattice.dev/environment: homelab
    lattice.dev/role: backend
spec:
  latticeImage: "ghcr.io/evan-hines-js/lattice:latest"
  providerRef: proxmox
  provider:
    kubernetes:
      version: "1.32.0"
      bootstrap: kubeadm
    config:
      proxmox:
        sourceNode: "poweredge-lg"
        templateId: 9000
        controlPlaneEndpoint: "10.0.1.110"
        bridge: "vmbr1"
        ipv4Pool:
          range: "10.0.1.111-120/24"
          gateway: "10.0.1.1"
  nodes:
    controlPlane:
      replicas: 1
      instanceType:
        cores: 4
        memoryGib: 16
        diskGib: 50
    workerPools:
      default:
        replicas: 3
        instanceType:
          cores: 12
          memoryGib: 64
          diskGib: 250
  services: true
  monitoring:
    enabled: false
  backups:
    enabled: false
```

This is a real spec from the reference implementation’s homelab examples — a backend workload cluster running on Proxmox bare metal.
Notice the structure. providerRef points to an InfraProvider CRD that holds credentials. provider.kubernetes specifies the version and bootstrap method. provider.config.proxmox holds provider-specific infrastructure details. nodes declares the control plane and named worker pools with instance sizing. services, monitoring, and backups are platform feature flags — what infrastructure the cluster should include.
Notice that the cluster CRD is a hybrid of intent and infrastructure — unlike the service CRD from Part I, which is pure intent. The nodes block and feature flags (services, monitoring) are intent: what kind of cluster, how big, what capabilities. The provider.config.proxmox block is infrastructure: bridge names, IP ranges, template IDs. This is unavoidable. Cluster provisioning requires infrastructure-level decisions (which network, which IP range, which VM template) that service deployment does not. A service doesn’t care what subnet it runs on. A cluster’s control plane needs a specific endpoint IP.
Yes, the cluster CRD has enabled: false flags — the same pattern Chapter 2 argues against for service specs. The distinction: service features (network policies, observability, disruption budgets) should be automatic because every service needs them. Cluster features (monitoring stack, backup scheduling) are genuinely optional depending on the cluster’s role — a development cluster doesn’t need the same backup discipline as production. The enabled flag is appropriate when the feature is a real choice, not when it’s a security property that should never be off.
The intent/infrastructure boundary in the cluster CRD is at the provider.config block. Everything outside it is portable — if this cluster moved to AWS, the nodes, services, monitoring, and backups fields would stay the same. The provider.config block would change entirely. The platform team manages the provider-specific configuration; the operator who creates the cluster primarily cares about the intent fields.
The platform derives the CAPI resources from this spec — Cluster, ProxmoxCluster, KubeadmControlPlane, ProxmoxMachineTemplate, MachineDeployment — using template rendering that combines the cluster spec with the provider configuration. The operator doesn’t write CAPI resources. They declare what they need, and the platform handles the provisioning layer.
For clusters that accept child connections (management clusters, edge clusters), the spec includes a parentConfig block:
```yaml
parentConfig:
  grpcPort: 50051
  bootstrapPort: 8443
  service:
    type: LoadBalancer
```

This tells the platform to expose the gRPC cell endpoint and bootstrap webhook that child clusters connect to. The key architectural decision: the workload cluster initiates the connection outbound to the parent. The parent never dials into the workload cluster. This means workload clusters need zero inbound ports open — they work behind restrictive firewalls, NAT, and private networks without any network configuration on the workload side. For edge deployments and private clouds, this “reverse-dial” architecture is often the only viable approach. Chapter 7 covers the full outbound-only communication model.
5.3 Provisioning Tools
The reference implementation uses Cluster API because it manages the full Kubernetes cluster lifecycle — control planes, workers, health checks, rolling upgrades — through the same CRD/controller model as the rest of the platform. But CAPI is not the only option, and the choice depends on what you’re managing.
If you use managed Kubernetes (EKS, GKE, AKS): CAPI works fine here. CAPA manages EKS clusters — it configures the managed control plane, node groups, networking, and IAM through the same CRD model as self-managed clusters. The cloud provider handles the control plane internals (etcd, API server HA); CAPI handles the declarative lifecycle on top. Crossplane and Terraform are also valid options. The principles of this chapter — declarative specs, lifecycle state machines, bootstrapping — apply regardless of whether the control plane is self-managed or cloud-managed.
If you manage your own control planes (on-premise, bare metal, Proxmox, private cloud): CAPI is the strongest fit. It handles control plane initialization, etcd management, node provisioning, and rolling upgrades. The reference implementation uses CAPI with Docker (for development), Proxmox (for homelab), and AWS providers.
If you use Terraform or Pulumi today: You can apply the lifecycle state machine and bootstrapping sequence on top of Terraform-provisioned infrastructure. Terraform handles the provisioning phase; a Kubernetes controller handles bootstrapping and ongoing lifecycle. The trade-off is that Terraform runs point-in-time while a controller reconciles continuously — drift between Terraform runs is invisible unless you add detection.
The principle is declarative cluster lifecycle management with provider abstraction. The examples use CAPI. The ideas work with other tools.
5.4 The Lifecycle State Machine
A cluster’s lifecycle is an explicit state machine. Each phase represents a distinct operational state. The controller advances one phase per reconciliation cycle — observable, debuggable, recoverable.
Pending. The CRD is created. The webhook validates: does the provider exist? Are credentials valid? Is the Kubernetes version supported? The controller checks preconditions before committing any cloud resources.
Provisioning. The controller renders CAPI resources from the cluster spec and provider configuration, then creates them. CAPI controllers provision cloud infrastructure — VMs, load balancers, networks. The platform controller watches CAPI status and reflects it: conditions: [ControlPlaneInitialized: True, NodesReady: 1/3].
Bootstrapping. Nodes are up. The Kubernetes API server is reachable. The platform installs its prerequisites — the stack that makes this cluster a platform member. Section 5.5 covers this in detail.
Ready. All prerequisites are healthy. Workloads can be deployed. The platform monitors cluster health continuously and reports conditions.
Scaling. Worker pool replica counts are changing. CAPI provisions or decommissions machines. The platform ensures workloads are drained before node removal.
Upgrading. The Kubernetes version or machine image is changing. CAPI performs rolling replacement: cordon, drain, delete, provision new. The platform ensures ordering (control plane first) and health gating (don’t proceed if the replaced node isn’t healthy).
Deleting. The CRD is deleted. The platform drains workloads, deletes CAPI resources, waits for cloud infrastructure teardown, and cleans up external resources.
Each phase transition is a single reconciliation step. If the controller crashes during Bootstrapping, it resumes at Bootstrapping on restart — not at Pending. The status always reflects the current phase, and the conditions show exactly what’s happening within it.
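The single-phase-per-reconcile behavior can be sketched in a few lines. This is an illustrative Python sketch, not the reference implementation: phase names follow the text above, and the `gates` callables stand in for the real precondition, CAPI-status, and health checks.

```python
# Minimal sketch of the lifecycle state machine: each call to reconcile()
# advances the cluster by at most one phase. Because the phase lives in
# the persisted CRD status, a controller restart resumes at the current
# phase rather than starting over at Pending.
NEXT_PHASE = {
    "Pending": "Provisioning",        # gate: preconditions validated
    "Provisioning": "Bootstrapping",  # gate: CAPI reports infra up
    "Bootstrapping": "Ready",         # gate: all prerequisites healthy
}

def reconcile(status, gates):
    """status: the CRD's .status dict; gates: phase -> health callable."""
    phase = status.setdefault("phase", "Pending")
    nxt = NEXT_PHASE.get(phase)
    if nxt is not None and gates[phase]():
        status["phase"] = nxt  # one transition per reconciliation cycle
    return status["phase"]
```

A failed gate simply leaves the phase unchanged; the next reconciliation retries the same check, which is what makes a mid-bootstrap crash recoverable.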
Here’s what provisioning looks like in practice. The timings below are from the reference implementation’s E2E tests using the Docker provider (CAPD), which provisions containers instead of VMs — the fastest possible provider, since there’s no hypervisor or cloud API in the loop. On Proxmox (the provider in the CRD example above), VM cloning from a template produces similar timings. On AWS, add 3-5 minutes for EC2 instance launch and variable time for ELB provisioning and ENI attachment. The phase sequence and bootstrapping order are consistent across providers; the wall-clock time is not.
- t=0. Operator applies the LatticeCluster CRD. Webhook validates. Status: phase: Pending.
- t=5s. Controller renders CAPI resources, creates them. Status: phase: Provisioning.
- t=3m. Infrastructure is up. Control plane and worker nodes are running. CAPI reports the cluster as provisioned. Status: phase: Bootstrapping.
- t=4m. Controller installs Cilium. Waits for the DaemonSet to be running on all nodes and pod-to-pod connectivity to work. conditions: [CNI: True].
- t=6m. Installs cert-manager. Waits for it to issue its first certificate. conditions: [CNI: True, CertManager: True].
- t=7m. Installs Istio ambient mesh (ztunnel). Waits for ztunnel running on all nodes. conditions: [CNI: True, CertManager: True, Mesh: True].
- t=9m. Installs External Secrets Operator, metrics agent, and the platform operator. Each step gates on the previous component’s health.
- t=12m. All prerequisites healthy. Agent establishes outbound gRPC connection to the parent. Status: phase: Ready.
Under 15 minutes from spec to functioning platform member.
Every phase is recorded in the status. If something goes wrong at minute 7, the status says phase: Bootstrapping, conditions: [CertManager: True, CNI: True, Mesh: False, reason: ztunnel DaemonSet 3/4 ready]. The operator sees exactly where bootstrapping stalled and why.
5.5 Bootstrapping: What Goes on the Cluster
A bare cluster is not a platform member. It needs a stack of prerequisites before the platform’s features work. The bootstrap sequence is a dependency graph, and getting the order wrong produces a cluster that’s “up” but not functional.
The order, and why:
- Container networking (CNI). Cilium. First, because everything else runs as pods, and pods need a network. Nothing works until the CNI is running.
- cert-manager. Second, because many subsequent components need TLS certificates. The platform operator’s webhook, ingress endpoints, and Hubble UI all need certs. cert-manager itself runs as pods — which is why it needs the CNI first.
- Service mesh. Istio ambient mesh (ztunnel). Without it, L7 policy enforcement and mTLS don’t work. The mesh provides SPIFFE identity — the cryptographic identity layer that the platform’s authorization model depends on. Ztunnel must be running before the platform operator starts, so the operator’s own traffic is meshed from the beginning.
- Secret sync. External Secrets Operator with the appropriate ClusterSecretStore. Without it, services can’t resolve ${resources.name.key} references.
- Metrics. VMAgent or Prometheus agent, pointed at the metrics backend. Without it, no observability.
- Platform operator. The platform’s own controller. Once running, the cluster can process LatticeService, LatticeJob, and LatticeModel CRDs.
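The dependency graph above can be sketched as an ordered chain with health gating. This is an illustrative Python sketch under the assumption of a strictly linear order; the `install` and `healthy` callables stand in for real Helm installs and readiness probes:

```python
# Sketch of one bootstrap reconcile pass. Components are attempted in
# dependency order; the pass stops at the first unhealthy component so
# nothing downstream of a broken dependency is installed. Condition
# names mirror Section 5.5.
ORDER = ["CNI", "CertManager", "Mesh", "SecretSync", "Metrics", "Operator"]

def bootstrap_pass(install, healthy, conditions):
    """Returns True when every component is healthy; otherwise records
    the failing condition and leaves the rest for the next pass."""
    for name in ORDER:
        if not conditions.get(name):
            install(name)          # idempotent install/upgrade
        ok = healthy(name)
        conditions[name] = ok      # surfaces as e.g. CNI: False in status
        if not ok:
            return False           # retry from here on the next reconcile
    return True
```

Because the function is re-run on every reconciliation, a component that fails at minute 4 is simply retried — and conditions already marked healthy are skipped rather than reinstalled.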
What goes wrong. Bootstrapping is the most fragile phase because it has the most dependencies and the least infrastructure in place.
- CNI fails to start. Every subsequent component (which runs as pods) can’t install. The controller retries with backoff. Status: CNI: False, reason: DaemonSet cilium not ready (2/10 nodes). Common cause: node networking misconfiguration, particularly on bare metal where bridge interfaces and IP pools must be correct.
- Mesh can’t reach peers. Usually means CNI is reporting healthy but pod-to-pod networking has a subtle issue (MTU mismatch, routing table problem). The controller reports the condition. The dependency ordering means mesh isn’t attempted until CNI passes health checks — but “healthy CNI DaemonSet” doesn’t guarantee “working pod network.”
- cert-manager can’t issue. DNS or network problem — ACME challenge can’t complete, CA endpoint unreachable. The controller retries. The cluster stays in Bootstrapping.
- Platform operator can’t reach the management cluster. The outbound gRPC connection (Chapter 7) isn’t established. Usually a firewall rule, a missing LoadBalancer IP, or mTLS certificate mismatch. The agent reports the failure; the cluster stays in Bootstrapping until the connection succeeds.
Each failure is a condition on the CRD. The controller retries at each step. If a component that was previously healthy becomes unhealthy after Ready, the controller detects the regression — the cluster stays Ready (it’s still operational) but the degraded condition is visible. This is the operational advantage of a reconciliation loop over a script: continuous detection, not one-shot success-or-failure.
5.6 Provider Abstraction
The platform supports multiple infrastructure providers behind a single cluster CRD. The operator writes providerRef: proxmox or providerRef: aws and gets a cluster. The infrastructure differences are the platform’s problem.
Each CAPI provider has different infrastructure resources. AWSCluster has VPC and subnet configuration. ProxmoxCluster has node and storage pool configuration. The cluster CRD abstracts over these through two mechanisms:
The InfraProvider CRD. A separate resource that the platform team manages, holding credentials and provider-specific defaults. One per provider per environment. The cluster CRD references it by name.
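A hypothetical shape for such a resource, to make the split concrete — the field names below are illustrative assumptions, not the reference implementation’s actual schema:

```yaml
# Hypothetical InfraProvider sketch: credentials and defaults live here,
# managed by the platform team; cluster CRDs reference it by name.
apiVersion: lattice.dev/v1alpha1
kind: InfraProvider
metadata:
  name: proxmox
spec:
  type: proxmox
  credentialsSecretRef:
    name: proxmox-credentials   # API token, never inlined in cluster specs
  defaults:
    storagePool: local-lvm
```

The point of the split: credentials and environment-wide defaults change rarely and are owned by the platform team, while cluster CRDs reference them with a single line.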
Template rendering. The cluster controller renders provider-specific CAPI manifests through templates (minijinja in the reference implementation) that consume both the cluster spec and the provider configuration. The provider-specific block in the cluster spec (provider.config.proxmox, provider.config.docker, provider.config.aws) provides the values the templates need.
Adding a new provider means writing new templates and adding a new config variant. The cluster CRD’s core schema — nodes, services, monitoring, feature flags — doesn’t change. The intent is the same; the derived infrastructure is different.
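The rendering step can be sketched in a few lines. This is an illustrative Python sketch with `string.Template` standing in for minijinja; the template body and the `TEMPLATES` registry are simplified assumptions, not the real CAPI manifests:

```python
# Sketch of provider-abstracted template rendering: cluster intent and
# provider config merge into one context, and the provider type selects
# the template. Adding a provider means adding a TEMPLATES entry; the
# render() interface (and the cluster CRD schema) stays the same.
from string import Template

TEMPLATES = {
    "proxmox": Template(
        "apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1\n"
        "kind: ProxmoxCluster\n"
        "metadata:\n"
        "  name: $name\n"
        "spec:\n"
        "  controlPlaneEndpoint:\n"
        "    host: $controlPlaneEndpoint\n"
    ),
}

def render(cluster_spec, provider):
    tmpl = TEMPLATES[provider["type"]]
    ctx = {**provider["config"], "name": cluster_spec["name"]}
    return tmpl.substitute(ctx)
```

Note what varies and what doesn’t: the context always carries the intent fields, the template carries everything provider-specific.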
This is the same pattern as service derivation from Part I. The cluster spec is intent. The CAPI resources are derived infrastructure. If the platform team changes how Proxmox clusters are provisioned — new VM template, different network layout — they update the templates. No cluster spec changes. The intent survives.
5.7 The Hub-and-Spoke Risk
Section titled “5.7 The Hub-and-Spoke Risk”Everything described so far has a structural problem: the management cluster is a single point of failure.
CAPI controllers run on the management cluster. The bootstrap controller runs on the management cluster. The template rendering, the provider abstraction, the lifecycle state machine — all of it executes on the management cluster. Every workload cluster depends on the management cluster for scaling, upgrades, and node replacement.
At 3 workload clusters, this is manageable. At 30, it’s a significant blast radius. The management cluster’s API server becomes a bottleneck — every cluster’s reconciliation competes for API capacity. The resource fan-out is concrete: 30 workload clusters with 10 worker nodes each means the management cluster is tracking 300+ Machine objects, 30+ Cluster objects, 30 kubeconfig Secrets, and hundreds of MachineHealthCheck evaluations. Every CAPI controller watches every resource of its type. An outage on the management cluster means no workload cluster can self-heal.
This is the problem Chapter 6 solves. The pivot pattern transfers infrastructure ownership to the workload cluster itself, so every cluster manages its own lifecycle independently. The management cluster provisions; the workload cluster takes over. The management cluster can be deleted and the workload cluster continues operating.
But the pivot only works if the cluster was provisioned declaratively in the first place. You can’t transfer ownership of infrastructure that was hand-configured — there’s nothing to transfer. The declarative lifecycle in this chapter is a prerequisite for the self-management in the next.
Exercises
5.1. [M10] A team provisions clusters with Terraform. They want to adopt the lifecycle state machine described in Section 5.4 without switching to CAPI. Which phases can be implemented with Terraform, and which cannot? What’s missing, and how would you bridge the gap?
5.2. [H30] Section 5.5 describes a bootstrap dependency graph. Design a bootstrap controller that installs components in order, handles failures at each step, and reports progress in the cluster CRD’s status. What is the correct behavior when a component that was previously healthy becomes unhealthy after the cluster reaches Ready? Should the cluster phase change?
5.3. [R] The chapter defines “cluster” as “a pool of compute with a control plane” and acknowledges this is Kubernetes-focused. Consider an organization using ECS. Which principles from this chapter apply (declarative provisioning, lifecycle state machines, bootstrapping)? Which don’t (self-management via pivot, CAPI)? Is there a general framework for “declarative compute pool management” that transcends Kubernetes?
5.4. [H30] The provider abstraction uses template rendering. Design the template interface: what variables does the template receive? How does it access provider-specific configuration? How do you test that a template produces valid CAPI resources? What happens when a provider adds a new required field — does every cluster need to be reprovisioned, or can the update be applied non-disruptively?
5.5. [R] Section 5.1 argues that even a single cluster should be cattle — reproducible from its declaration. Consider an organization running stateful workloads (databases, message brokers) on clusters with specific storage configurations and compliance requirements. Can these clusters be cattle? What properties make a cluster a pet, and can those properties be captured in a spec? Where is the boundary?
5.6. [M10] The lifecycle state machine has a Deleting phase that cleans up external resources. What happens if cleanup fails — the cloud API returns an error deleting a load balancer? Design the error handling. Should the CRD remain in Deleting indefinitely? Should it be deleted with the orphaned resource? What are the trade-offs?