Chapter 6: The Self-Managing Cluster
Chapter 5 ended with a problem: every workload cluster depends on the management cluster that provisioned it. The CAPI controllers that manage its infrastructure — scaling, upgrades, node replacement — run on the management cluster. If the management cluster goes down, every workload cluster loses the ability to manage itself. The blast radius of a management cluster failure is the entire fleet.
This chapter eliminates that dependency.
The idea: after provisioning, transfer ownership of the CAPI resources from the management cluster to the workload cluster. The workload cluster’s own CAPI controllers take over reconciliation. The workload cluster manages its own infrastructure — scaling, upgrades, node health — locally. The management cluster can be deleted. The workload cluster continues operating.
This is the pivot pattern. It is the architectural idea that separates a platform from a fleet manager.
6.1 The Problem
Chapter 5 provisioned clusters declaratively — but all the lifecycle management (scaling, upgrades, node health) runs on the management cluster. If the management cluster goes down, no workload cluster can replace a failed node or roll an upgrade. The more clusters you manage, the larger the blast radius. At 30+ clusters, the management cluster is an existential single point of failure.
This is the hub-and-spoke model. It works at small scale where the management cluster failure risk is low and the complexity of an alternative isn’t justified. It becomes a liability at fleet scale or at the edge, where network partitions between hub and spokes are common.
6.2 What “Self-Managing” Means
A self-managing cluster is one that owns its own infrastructure lifecycle. Precisely:
- Owns its CAPI resources. The Cluster, Machine, MachineDeployment, and InfrastructureMachineTemplate resources live on the workload cluster itself. The CAPI controllers reconciling them run locally.
- Scales independently. Adding or removing nodes is a local operation. Change the MachineDeployment replica count on the workload cluster; local CAPI provisions or decommissions machines.
- Upgrades independently. Update the Kubernetes version or machine image in the local MachineDeployment. Local CAPI performs rolling replacement. No external coordination.
- Self-heals. MachineHealthCheck runs locally. A node that goes NotReady for 5 minutes is detected and replaced — locally, without contacting any parent.
- Survives total isolation. Sever every network connection to every other cluster. The workload cluster continues operating. Workloads run. Infrastructure reconciles. Nothing degrades.
What self-managing does not mean:
- Not unmanaged. The cluster still has a declarative spec. Changes to that spec trigger reconciliation. Self-managing means self-operating, not self-governing. A human or automation system still decides when to scale, when to upgrade, and what version to target.
- Not disconnected by preference. Self-managing clusters typically maintain connections to parent clusters for coordination, observability, and policy distribution (Chapter 7). The point is that these connections are useful, not required. If they drop, the cluster keeps running.
- Not self-provisioned. Something must create the cluster in the first place — a parent cluster, a bootstrap script, an operator with kubectl. Self-management begins after provisioning is complete.
6.3 The Pivot
The pivot is the operation that makes a cluster self-managing. It transfers CAPI resource ownership from the management cluster to the workload cluster.
```mermaid
graph LR
subgraph Before Pivot
MC1[Management Cluster<br/>CAPI Resources:<br/>Cluster, MachineDeployment,<br/>Machine ×N, InfraTemplates]
WC1[Workload Cluster<br/>No CAPI resources<br/>Cannot self-manage]
MC1 -->|manages| WC1
end
subgraph After Pivot
MC2[Management Cluster<br/>CAPI resources deleted]
WC2[Workload Cluster<br/>CAPI Resources:<br/>Cluster, MachineDeployment,<br/>Machine ×N, InfraTemplates<br/>Self-managing]
end
MC1 -.->|pivot transfers<br/>CAPI resources| WC2
```
Why not `clusterctl move`? CAPI ships a tool for migrating resources between clusters. It works for simple cases but has three limitations that make it unsuitable for platform automation: it requires direct network connectivity from source to target (the workload cluster must accept inbound connections from the management cluster), it’s a one-shot CLI operation with no reconciliation-based retry, and it doesn’t handle the bootstrapping problem — CAPI controllers must be running on the target before resources are moved to it.
The distributed pivot protocol:
1. Parent provisions the workload cluster. Standard CAPI provisioning as described in Chapter 5. The parent owns all CAPI resources.
2. Workload cluster is bootstrapped. The platform installs prerequisites including CAPI controllers and the agent component (Chapter 5, Section 5.5). The workload cluster now has the machinery to manage CAPI resources — it just doesn’t have any yet.
3. Agent establishes outbound gRPC stream to the parent. The workload cluster’s agent component initiates an outbound TCP connection to the parent’s cell component and opens a bidirectional gRPC stream, authenticated with mTLS. The connection is outbound from the workload cluster — no inbound ports are opened. Chapter 7 covers the full communication model, but the key point for the pivot is: the stream exists, it’s bidirectional, and the parent can send commands over it. A security note: in a standard `clusterctl move`, the management cluster needs the target’s kubeconfig — a highly sensitive credential stored on the management cluster. In this model, the workload cluster pulls its state over an authenticated tunnel it initiated. The parent never holds the workload cluster’s kubeconfig.
4. Parent sends the pivot command. Over the gRPC stream, the parent serializes the CAPI resources — Cluster, InfrastructureCluster, KubeadmControlPlane, MachineDeployment, Machine, InfrastructureMachineTemplate — and sends them to the agent as a structured message.
5. Agent imports resources locally. The agent creates the CAPI resources on the workload cluster’s API server. This requires careful ordering: resources are imported in topological order (dependency-free resources first, then resources that reference them), with owner references remapped to local UIDs.
6. Local CAPI controllers begin reconciling. The imported resources are now local. The CAPI controllers on the workload cluster detect them and begin reconciliation. The cluster is now self-managing.
7. Parent cleans up. The parent deletes its copy of the CAPI resources. The pivot is complete.
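The ordering requirement in the import step can be made concrete with a topological sort over resource references. A minimal Python sketch, assuming a simple "resource → what it references" map (the resource names are illustrative, not the platform's real API):

```python
from collections import deque

def import_order(deps):
    """Kahn's algorithm over resource references: dependency-free
    resources (templates) come first, then the resources that
    reference them. `deps` maps each resource to what it references."""
    pending = {r: len(refs) for r, refs in deps.items()}
    referenced_by = {r: [] for r in deps}
    for r, refs in deps.items():
        for dep in refs:
            referenced_by[dep].append(r)
    ready = deque(sorted(r for r, n in pending.items() if n == 0))
    order = []
    while ready:
        r = ready.popleft()
        order.append(r)
        for dependent in referenced_by[r]:
            pending[dependent] -= 1
            if pending[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(deps):
        raise ValueError("reference cycle: cannot order imports")
    return order

# A miniature version of the pivot's dependency graph.
deps = {
    "ProxmoxMachineTemplate": [],
    "Cluster": [],
    "MachineDeployment": ["Cluster", "ProxmoxMachineTemplate"],
    "Machine-0": ["MachineDeployment"],
}
print(import_order(deps))  # templates and Cluster first, Machines last
```

A cycle check matters here: a reference cycle would mean the batches can never be ordered, so failing loudly before sending anything is the right behavior.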
What the pivot looks like in practice:
1. t=0. The cluster is in phase `Bootstrapping`. The agent has just connected to the parent and sent `AgentReady` with `state: Provisioning`. The parent’s cell receives the connection and recognizes the cluster by its mTLS certificate CN.
2. t=1s. The agent sends a `Heartbeat` with `state: Provisioning` and `capiControllersReady: true`, indicating CAPI controllers are installed and reconciling on the workload cluster. The parent’s cluster controller sees this flag and begins the pivot.
3. t=2s. The cell builds the dependency graph of CAPI resources on the parent: `Cluster` → `ProxmoxCluster`, `KubeadmControlPlane` → `ProxmoxMachineTemplate`, `MachineDeployment` → `ProxmoxMachineTemplate`, and all `Machine` → `ProxmoxMachine` pairs. It sorts topologically — templates first, then the resources that reference them.
4. t=3s. The cell sends the first `MoveObjectBatch` over the gRPC stream. Each batch contains serialized CAPI resources as JSON manifests with their source UIDs and owner references. The batch index and total batch count let the agent track progress.
5. t=4s. The agent receives the batch, creates each resource on the local API server with the correct owner references, remapping UIDs from the parent’s namespace to the local namespace. It responds with `MoveObjectAck` containing the UID mappings (source UID → local UID) so the cell can track which resources landed.
6. t=5–8s. More batches arrive and are acknowledged. The agent creates all CAPI resources locally. The resources are created in paused state — CAPI controllers see them but don’t reconcile yet, because the infrastructure already exists and we don’t want CAPI to try to provision machines that are already running.
7. t=9s. The cell sends `MoveComplete`. This signals: all batches are transferred. The message also carries distributable resources — Cedar policies, SecretProvider CRDs, TrustPolicy CRDs, OIDC providers — that the workload cluster needs to operate as a platform member.
8. t=10s. The agent unpauses the CAPI resources. Local CAPI controllers detect the unpaused resources and begin reconciling. They discover that the machines described by the resources already exist — the reconciliation converges immediately without provisioning anything new.
9. t=12s. The agent verifies: are the local CAPI controllers reconciling? Are the Machine objects reporting healthy status? Once verified, the agent sends `MoveCompleteAck` and transitions to `state: Ready`. The cluster CRD status on the parent shows `pivotComplete: true`.
10. t=13s. The parent deletes its copy of the CAPI resources. The pivot is complete. The workload cluster is self-managing.
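The batch import has two properties worth making concrete: reimporting an already-created resource is a no-op, and owner-reference UIDs are remapped as each batch lands. A Python sketch under stated assumptions: `LocalAPI` and the manifest shape are illustrative stand-ins, not the platform's real types.

```python
import uuid

class LocalAPI:
    """Stand-in for the workload cluster's API server: create-if-absent
    keyed by deterministic resource name, assigning a fresh local UID."""
    def __init__(self):
        self.by_name = {}

    def create_or_get(self, manifest):
        name = manifest["name"]
        if name not in self.by_name:  # reimporting an existing resource is a no-op
            self.by_name[name] = dict(manifest, uid=str(uuid.uuid4()))
        return self.by_name[name]

def import_batch(api, batch, uid_map):
    """Create each serialized resource locally, remapping owner-reference
    UIDs from the parent's UIDs to local UIDs. Batches arrive in
    topological order, so every owner's UID is already mapped. Returns
    the ack mapping (source UID -> local UID) for this batch."""
    ack = {}
    for manifest in batch:
        owners = [dict(o, uid=uid_map[o["uid"]])
                  for o in manifest.get("owners", [])]
        created = api.create_or_get(dict(manifest, owners=owners))
        uid_map[manifest["uid"]] = created["uid"]
        ack[manifest["uid"]] = created["uid"]
    return ack
```

Replaying a batch after a crash re-acknowledges the same local UIDs instead of creating duplicates, which is what makes retry of an interrupted pivot safe.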
Total pivot time: ~13 seconds. This is the CAPI resource transfer — the final step of the provisioning sequence. The full sequence from Chapter 5 (provision infrastructure → bootstrap components → pivot) takes under 15 minutes. The pivot itself is the fast part because it’s transferring metadata (resource definitions), not provisioning infrastructure. The cluster was running workloads the entire time. No pods were restarted. No traffic was interrupted. The only thing that changed is which cluster owns the CAPI resources.
Design requirements that make this work:
- Idempotent. If the pivot fails midway — agent crash, network drop, partial import — rerunning it completes the transfer. Each resource has a deterministic name; reimporting an existing resource is a no-op. The move ID tracks which pivot attempt is in progress.
- Ordered. The dependency graph ensures templates are imported before the resources that reference them. Owner references are remapped using the UID mapping from each batch acknowledgment.
- No downtime. The CAPI resources describe infrastructure that already exists. Importing them doesn’t provision new machines. Unpausing them lets CAPI reconcile against existing state.
- Verifiable. The agent doesn’t acknowledge completion until local CAPI controllers are reconciling and reporting healthy. The parent doesn’t clean up until it receives the acknowledgment.
- No split-brain. Between step 8 (agent unpauses) and step 10 (parent deletes), both clusters have CAPI resources. To prevent dual-controller reconciliation of the same machines, the parent pauses its copies before signaling MoveComplete. The handoff sequence is: parent pauses → agent imports and unpauses → agent acknowledges → parent deletes. At no point are both sides actively reconciling.
- Reversible. The protocol supports unpivot — if the cluster is deleted, the agent sends `ClusterDeleting` with all CAPI objects back to the parent. The parent imports them (same UID remapping process in reverse) and resumes management. Self-management is not a one-way door.
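The no-split-brain requirement reduces to an ordering invariant: at every instant, at most one side is actively reconciling. A toy Python check contrasting the safe handoff order with a naive one (the state flags are illustrative):

```python
def run(steps, parent_active, agent_active):
    """Apply handoff steps in order; return True if the split-brain
    invariant (never both sides reconciling) holds throughout."""
    for actor, active in steps:
        if actor == "parent":
            parent_active = active
        else:
            agent_active = active
        if parent_active and agent_active:
            return False  # two controller sets reconciling the same machines
    return True

# Safe ordering: the parent pauses before the agent unpauses.
safe = [("parent", False), ("agent", True)]
# Naive ordering: the agent unpauses while the parent still reconciles.
naive = [("agent", True), ("parent", False)]

assert run(safe, parent_active=True, agent_active=False) is True
assert run(naive, parent_active=True, agent_active=False) is False
```

The same check explains why resources are created paused in the first place: the import itself must not flip the agent's side to active.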
6.4 Life After Pivot
After the pivot, the workload cluster’s day-2 operations are entirely local.
Scaling. An operator (or autoscaler) changes the MachineDeployment’s replica count on the workload cluster. Local CAPI provisions new machines (for scale-up) or cordons, drains, and deletes machines (for scale-down). The parent is not involved. The parent doesn’t even know about the change unless it queries through the API proxy (Chapter 7).
Upgrades. The Kubernetes version or machine image is updated in the local MachineDeployment. Local CAPI performs rolling replacement: cordon a node, drain its workloads, delete the machine, provision a new one with the updated image, wait for it to be healthy, then proceed to the next. The platform ensures ordering — control plane before workers — and health gating — don’t proceed if the replaced node isn’t healthy.
Node failure. A MachineHealthCheck running on the workload cluster detects an unhealthy node (NotReady for a configurable duration). CAPI deletes the Machine and provisions a replacement. This is automatic and local — no parent intervention required.
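The self-healing loop in miniature: a hedged Python sketch of MachineHealthCheck-style remediation. The machine records, callbacks, and timeout are illustrative stand-ins, not CAPI's actual API.

```python
NOT_READY_TIMEOUT = 300  # seconds NotReady before remediation (configurable)

def remediate(machines, now, delete_machine, provision_machine):
    """One local reconciliation pass: record when each machine's node
    went NotReady, and replace any machine past the timeout. No parent
    cluster is involved at any point."""
    for m in machines:
        if m["ready"]:
            m["not_ready_since"] = None      # healthy: clear the timer
        elif m["not_ready_since"] is None:
            m["not_ready_since"] = now       # just went NotReady
        elif now - m["not_ready_since"] >= NOT_READY_TIMEOUT:
            delete_machine(m)                # delete the unhealthy Machine...
            provision_machine()              # ...and provision a replacement

# A node that stays NotReady across the timeout gets replaced.
machines = [{"name": "worker-1", "ready": False, "not_ready_since": None}]
replaced = []
remediate(machines, now=0,
          delete_machine=lambda m: replaced.append(m["name"]),
          provision_machine=lambda: None)
remediate(machines, now=400,
          delete_machine=lambda m: replaced.append(m["name"]),
          provision_machine=lambda: None)
print(replaced)  # ['worker-1']
```

Note the two-phase behavior: the first pass only starts the timer, so a brief NotReady blip never triggers a replacement.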
Cluster-level failure. If the control plane itself is lost — all control plane nodes down, etcd quorum unrecoverable — the cluster cannot self-heal. Self-management handles node-level failures, not cluster-level catastrophes. This is an honest limitation. Recovery requires external intervention: restore from backup (Chapter 20) or reprovision from a parent. The blast radius of this failure is one cluster — which is the point. In the hub-and-spoke model, a management cluster failure affects every cluster. In the self-managing model, a cluster failure affects only that cluster.
etcd is the most critical component post-pivot. After the pivot, the workload cluster’s etcd holds the CAPI resources that define its own infrastructure. Lose etcd quorum and the cluster can’t scale, upgrade, or replace nodes — the self-management machinery is gone. Before the pivot, etcd held application state (Deployments, Services); after the pivot, it holds infrastructure state too (Cluster, MachineDeployment, Machine). The backup strategy (Chapter 20) must account for this: etcd snapshots on self-managing clusters are infrastructure backups, not just application backups.
The safety net is gone. In hub-and-spoke, a dangerous `kubectl delete machinedeployment` on the management cluster is a serious mistake — but the platform team controls access to the management cluster. After pivot, that same command on the workload cluster deletes the cluster’s own worker nodes. The CAPI controllers will happily comply — they manage what they’re told to manage. Local RBAC on the workload cluster must restrict who can modify CAPI resources, because there’s no parent to backstop a bad command.
6.5 The Independence Test
How do you know your architecture is actually self-managing? You test it.
- Provision a workload cluster from a management cluster.
- Deploy workloads to the workload cluster.
- Complete the pivot.
- Delete the management cluster.
- Verify: workloads continue running. Nodes can be scaled. Failed nodes are replaced. Upgrades proceed.
If any of these fail, you have a dependency you haven’t eliminated. This test should be in your E2E test suite.
The reference implementation runs this test as part of its E2E suite. It provisions a management cluster and a workload cluster using the Docker provider, completes the pivot, destroys the management cluster entirely, then verifies the workload cluster can scale its MachineDeployment and replace a deliberately killed node — all without any parent. It’s the definitive test for self-management.
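The five steps above are mechanical enough to sketch as an automated test. A hedged outline against a toy cluster model; a real run would drive CAPI and kubectl, and every name here is illustrative:

```python
class Cluster:
    """Toy stand-in for a cluster under test: just enough state to
    express the independence assertions."""
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.apps = set()
        self.nodes = {"worker-0", "worker-1", "worker-2"}
        self.owns_capi = parent is None  # parent owns CAPI until pivot

    def pivot(self):
        self.owns_capi, self.parent = True, None

    def scale_to(self, n):
        assert self.owns_capi, "no local CAPI resources: cannot scale"
        while len(self.nodes) < n:
            self.nodes.add(f"worker-{len(self.nodes)}")

    def replace(self, node):
        assert self.owns_capi, "no local CAPI resources: cannot self-heal"
        self.nodes.discard(node)
        self.nodes.add(node + "-replacement")

def independence_test():
    mgmt = Cluster("mgmt")
    workload = Cluster("workload", parent=mgmt)  # 1. provision
    workload.apps.add("sample-app")              # 2. deploy workloads
    workload.pivot()                             # 3. complete the pivot
    del mgmt                                     # 4. delete the parent
    assert "sample-app" in workload.apps         # workloads still run
    workload.scale_to(4)                         # scaling is local
    assert len(workload.nodes) == 4
    workload.replace("worker-1")                 # self-healing is local
    assert "worker-1-replacement" in workload.nodes
    return True

assert independence_test()
```

The asserts inside `scale_to` and `replace` are the point: before the pivot, the same operations fail, which is exactly the dependency the test exists to expose.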
Variations worth testing:
- Network partition instead of deletion. Tests graceful degradation — the parent is unreachable but not gone.
- Scale during pivot. Tests that the transition doesn’t create a window where neither cluster is managing the infrastructure.
- Multiple child clusters pivoted, then parent deleted. Tests that pivot is complete for all children, not just one.
6.6 The Trade-Offs
Self-management buys you blast radius containment, no single point of failure, and independence under network partition. These are real gains, especially at the edge or at fleet scale. They are not free.
Complexity. The pivot protocol is real engineering. Serializing CAPI resources, remapping UIDs, importing in topological order, verifying local reconciliation, handling partial failure and retry — this is months of work to build and test. Hub-and-spoke doesn’t require any of it. You provision a cluster, CAPI manages it from the hub, and you’re done. The operational model is simpler because it’s centralized.
Distributed fragility. The pivot doesn’t eliminate the single point of failure — it distributes it. In hub-and-spoke, one etcd cluster is infrastructure-critical. After pivot, N etcd clusters are infrastructure-critical — one per self-managing cluster. At 100 clusters, that’s 100 etcd instances whose corruption doesn’t just lose application state but loses the cluster’s ability to manage its own infrastructure. This is a real operational tax. The mitigation is automated etcd backups (Chapter 20), monitoring for etcd health, and the discipline to test restore procedures. The blast radius of any single etcd failure is smaller (one cluster, not the fleet), but the probability of some etcd failure across the fleet is higher. You’re trading centralized risk for distributed responsibility.
Distributed ownership. After pivot, each cluster manages its own upgrades, its own scaling, its own node health. There is no single place that knows the state of the fleet. Fleet-wide operations (upgrade every cluster to 1.33) require coordinating across independent clusters rather than issuing one command on the hub. Monitoring is federated — you aggregate from each cluster through the outbound stream (Chapter 7) rather than querying one API server. This works, but it’s more infrastructure to build and operate.
Debugging is harder. When something goes wrong on a self-managing cluster, the debugging context is local. The parent doesn’t have the CAPI resources anymore — they were transferred. If the local CAPI controller has a bug, the cluster must diagnose and fix it locally. In hub-and-spoke, the platform team debugs CAPI issues from one cluster with one set of tools.
When hub-and-spoke is the right call. If you run 3 clusters, the management cluster failure risk is low, and the complexity of the pivot protocol is hard to justify. Hub-and-spoke is simpler, faster to build, and easier to operate at small scale. The risk is real but contained — you have one management cluster to keep healthy, and if you do, the model works.
When self-management is the right call. If you run 30+ clusters, if clusters are at the edge with unreliable network connectivity to the hub, if you’re in a regulated environment where blast radius containment matters, or if you simply don’t want a single point of failure at the infrastructure level — self-management is worth the complexity. The pivot is hard to build. The operational model it produces is more resilient.
This is a real trade-off. Complexity is a cost. Blast radius containment is a benefit. The right answer depends on your fleet size, your network topology, and how much you trust a single management cluster.
Exercises
6.1. [M10] The pivot protocol transfers CAPI resources from parent to child. What happens if the parent crashes after sending the pivot command but before cleaning up its copy of the CAPI resources? Now both clusters have the same CAPI resources. Two sets of CAPI controllers are reconciling the same machines. What happens? How should the protocol prevent this?
6.2. [H30] Section 6.4 says self-management handles node-level failures but not cluster-level failures. Define precisely what “cluster-level failure” means. Is losing 1 of 3 control plane nodes cluster-level? Is losing 2 of 3? Is losing all 3 but with a healthy etcd backup? For each scenario, determine whether the self-managing cluster can recover without external intervention, and what the recovery path is.
6.3. [R] The independence test (Section 6.5) deletes the management cluster to verify self-management. An adversary argues: “This test only proves the cluster works without the parent right now. It doesn’t prove it works after a month of independent operation — node replacements accumulating, certificates expiring, provider credentials rotating.” Design a long-running independence test that validates self-management over time. What failures would this test catch that the deletion test misses?
6.4. [H30] Section 6.6 acknowledges the loss of centralized visibility. Design the aggregation layer that restores it. How does a fleet dashboard query node counts across 50 self-managing clusters? What is the latency of the query? What happens when 5 of the 50 clusters are unreachable — does the dashboard show stale data, partial data, or an error? What is the correct answer?
6.5. [R] The pivot pattern assumes CAPI-managed infrastructure. Consider an organization using EKS (managed Kubernetes). The control plane is AWS-managed — there are no CAPI control plane resources to transfer. Can you still make an EKS cluster “self-managing” in the sense of this chapter? What does self-management mean when the control plane is operated by a cloud provider? Is the concept still meaningful?
6.6. [M10] The pivot transfers CAPI resources in topological order. A Machine references a MachineDeployment. An InfrastructureMachine references an InfrastructureMachineTemplate. Draw the dependency graph for a cluster with 1 KubeadmControlPlane, 2 MachineDeployments, 3 Machines per MachineDeployment, and their associated InfrastructureMachine and InfrastructureMachineTemplate resources. What is the correct import order?