Outbound-Only Cluster Communication
Every inbound port on a workload cluster is a firewall ticket, an attack surface, and a dependency on network connectivity you don’t control. I eliminated all of them.
The problem
Count the inbound ports on a typical managed workload cluster. The API server (6443). Prometheus scraping metrics from the management cluster. The management cluster’s CAPI health checks. Webhook callbacks. CI/CD deployment triggers. That’s at least 5 inbound ports, each requiring authentication, authorization, TLS, and firewall rules.
Now multiply by your fleet size.
You know this architecture is wrong when adding a cluster requires a 3-week change request. Open port 6443 — ticket. Open the metrics scrape path — ticket. CI webhook — ticket. Each one goes through the security review queue, which is backed up because every other team is filing the same tickets for their clusters. By the time the firewall rules are in place, the project timeline has slipped and someone is asking why “spinning up a new cluster” takes a month.
Ten clusters, five inbound ports each — that’s 50+ firewall rules to maintain. The rule set grows with every cluster and every management tool you bolt on. Nobody budgets for this. Everyone pays for it.
The constraint
Management clusters need to interact with workload clusters — push configurations, collect health data, proxy API requests. The question is who initiates the connection.
If the parent initiates (inbound to workload): the workload cluster must expose ports, maintain firewall rules, and be network-reachable from the parent. This is the model everyone defaults to, and it’s the source of all those tickets.
If the child initiates (outbound from workload): the workload cluster needs one outbound connection. No inbound ports. No firewall rules allowing traffic to the cluster. One rule — allow outbound TCP to the parent’s endpoint — replaces the entire matrix.
The catch: the connection must be bidirectional. The parent needs to send commands and requests to the child, not just receive data. An outbound-only model only works if you can multiplex bidirectional communication over a connection the child initiated.
The solution: bidirectional gRPC over an outbound stream
The workload cluster’s agent establishes an outbound gRPC connection to the parent cluster’s cell component. mTLS on both sides — certificates issued during provisioning. Once the outbound TCP connection is established, the gRPC stream is fully bidirectional.
```mermaid
flowchart LR
  subgraph "Workload Cluster"
    A[Agent]
  end
  subgraph "Parent Cluster"
    C[Cell]
  end
  A -- "outbound gRPC + mTLS" --> C
  C -. "commands, API requests" .-> A
  A -. "heartbeats, API responses" .-> C
```
What flows over the stream:
- Agent to Cell: Ready signal, heartbeats with health status, pivot acknowledgments, API responses.
- Cell to Agent: Pivot commands (transfer CAPI resources), Kubernetes API requests (proxied through the tunnel), policy updates.
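The multiplexing model can be sketched with plain Go channels standing in for the two directions of the gRPC stream. This is a simulation of the pattern, not the actual agent code; the `Message` envelope and its `Kind` values are illustrative, not the real Protobuf API.

```go
package main

import "fmt"

// Message is a stand-in for the Protobuf envelope carried on the
// stream. Kind values here are illustrative ("ready", "heartbeat",
// "api_request", "api_response", "pivot", "pivot_ack").
type Message struct {
	Kind    string
	Payload string
}

// runAgent models the child side: it announces readiness, then answers
// whatever the cell sends, all over a connection the agent itself
// opened. The parent "initiates" requests only in the application-level
// sense; at the transport level every byte rides the child's connection.
func runAgent(fromCell <-chan Message, toCell chan<- Message) {
	toCell <- Message{Kind: "ready"}
	for msg := range fromCell {
		switch msg.Kind {
		case "api_request":
			// Execute against the local API server, send the result
			// back through the tunnel.
			toCell <- Message{Kind: "api_response", Payload: "pods: 12 running"}
		case "pivot":
			toCell <- Message{Kind: "pivot_ack"}
		}
	}
	close(toCell)
}

func main() {
	fromCell := make(chan Message) // cell -> agent direction
	toCell := make(chan Message)   // agent -> cell direction
	go runAgent(fromCell, toCell)

	fmt.Println((<-toCell).Kind) // ready

	// The cell proxies an operator's request through the tunnel.
	fromCell <- Message{Kind: "api_request", Payload: "get pods -n default"}
	fmt.Println((<-toCell).Kind) // api_response
	close(fromCell)
}
```

In the real system the two channels are the two directions of one HTTP/2 stream, so request/response pairs from multiple operators can interleave without extra connections.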
As an operator on the parent, you can kubectl get pods on a workload cluster that has zero inbound ports. The request is serialized over the gRPC stream, executed locally on the child’s API server, and the response comes back through the tunnel. It feels like direct access. It’s not. The workload cluster never accepted an inbound connection. The parent never opened a port on the child. Every byte traveled over a connection the child chose to establish.
That distinction matters when your security team asks “what ports are open on the workload cluster?” The answer is none. Not “well, technically 6443 but it’s behind mTLS” — none.
Why gRPC over alternatives: HTTP/2 bidirectional streaming with multiplexing, typed message definitions via Protobuf, mature TLS support. A message broker (NATS, Kafka) adds a dependency — if the broker is down, management stops. The gRPC stream is point-to-point with no intermediary.
Failure mode: If the stream drops, the agent reconnects with exponential backoff. During the gap, the workload cluster continues operating — it’s self-managing. The parent loses visibility but the cluster is unaffected. Communication failure degrades management, not operation. That’s the right failure mode. The wrong failure mode is the one where a network blip between two data centers takes down a healthy cluster because the management plane can’t reach it.
Key takeaways
- Every inbound port is an attack surface and a firewall ticket. Multiply by fleet size and the management burden compounds faster than anyone budgets for.
- Child-initiated connections eliminate inbound ports entirely. One outbound gRPC stream replaces the entire matrix of inbound rules. N rules per cluster becomes exactly 1.
- Communication failure degrades management, not operation. If the stream drops, you lose visibility, not availability. The workload cluster keeps running.