
Chapter 22: Cryptographic Discipline

The mesh CA issues certificates for every service. The platform CA signs agent certificates for every cluster. cert-manager provisions TLS for every ingress endpoint. A single weak algorithm, misconfigured library, or expired certificate can undermine every security guarantee from Part IV.

The decisions:

  • How do you choose a crypto stack? (Validation status, memory safety, FIPS — and why “just use OpenSSL” is a decision you should make deliberately, not by default.)
  • How do you manage three CAs with different lifetimes and blast radii? (Mesh CA, platform CA, and cert-manager each have different compromise implications.)
  • How do you make the insecure path impossible? (If plaintext is an option, someone will use it. If weak ciphers are accepted, legacy clients will negotiate them.)

The reference implementation uses rustls with aws-lc-rs:

Cargo.toml
rustls = { version = "0.23", default-features = false, features = ["aws-lc-rs", "std"] }
aws-lc-rs = { version = "1.12", features = ["fips"] }

This is one valid choice. The decision framework matters more than the specific choice:

Validation status. Is the library FIPS 140-2/140-3 validated? aws-lc-rs has FIPS validation with the fips feature. ring (another Rust crypto library) does not. OpenSSL has a FIPS module but it’s operationally complex. If FIPS isn’t required today, the restricted algorithm set (no MD5, no SHA-1, no non-approved curves) is good hygiene regardless.

Memory safety. rustls itself is pure Rust — no C code in the TLS implementation, no buffer overflows, no use-after-free. (Note: the crypto backend aws-lc-rs has a C core, discussed below in the trade-off analysis.) OpenSSL is C with a long CVE history (Heartbleed, etc.). BoringSSL (Google’s OpenSSL fork) is better maintained but still C.

Maintenance and audit history. Who maintains it? How quickly are CVEs patched? Has it been independently audited? rustls has public audit reports. ring has been formally verified in key areas.

Language ecosystem:

  • Rust: rustls + aws-lc-rs. The reference implementation’s choice.
  • Go: crypto/tls with BoringCrypto (via GOEXPERIMENT=boringcrypto). FIPS-validated.
  • JVM: Conscrypt (Google’s Java security provider) or BouncyCastle FIPS.
  • Python: pyca/cryptography backed by OpenSSL or aws-lc.

Design for FIPS from the start. Retrofitting FIPS is expensive — different libraries, different APIs, different algorithm sets, different test suites. If there’s any chance your platform will serve FIPS-requiring organizations (government, finance, healthcare), use FIPS-validated implementations from day one. The upfront cost is near zero; retrofitting later costs months of work.

What the evaluation looks like in practice. The reference implementation evaluated three options for TLS:

| Criterion | OpenSSL | rustls + aws-lc-rs | rustls + ring |
| --- | --- | --- | --- |
| FIPS validated | Yes (module) | Yes (aws-lc-rs) | No |
| Memory safety | No (C) | Yes (Rust + C core) | Yes (Rust + asm) |
| CVE history | Extensive | Minimal | Minimal |
| Ecosystem | Universal | Growing | Mature |
| Audit reports | Many | aws-lc audit | ring formal verification |

The choice was rustls + aws-lc-rs: FIPS validation (required for potential government users), memory safety (eliminates a class of vulnerabilities), and acceptable ecosystem maturity (kube-rs supports rustls natively).

The trade-off the table doesn’t capture. aws-lc-rs has a C core (from BoringSSL). rustls + ring is pure Rust — no C, no buffer overflows in the crypto layer at all. The question is: Is FIPS validation worth reintroducing C code? The answer depends on your threat model. If your adversaries are sophisticated enough to exploit memory corruption in a crypto library, pure Rust matters more than FIPS. If your adversaries are auditors, FIPS matters more than the language the primitives are written in. Most platform builders face auditors more often than nation-state exploit developers. The reference implementation chose accordingly — but if FIPS isn’t in your compliance requirements, ring is the better security posture.

Pure Rust with FIPS doesn’t exist yet. When it does, the decision framework changes — you get both properties without the trade-off. Design for that day by keeping the crypto backend pluggable (Section 22.5).

If the reference implementation were in Go, the choice would be crypto/tls with GOEXPERIMENT=boringcrypto — FIPS-validated, maintained by Google, and the standard Go TLS library. If in Java, Conscrypt or BouncyCastle FIPS. The evaluation criteria are the same regardless of language.

The platform operates three certificate authorities. Each has a different scope, different lifetime, and different compromise blast radius.

The platform CA. Signs agent certificates for the gRPC stream (Chapter 7). Generated by the platform operator during installation. ECDSA P-256 key pair stored as a Kubernetes Secret on the management cluster. Agent certificates are short-lived (24-72 hours, configurable) and rotated through the gRPC stream before expiry.
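The rotation timing can be sketched as a small predicate. A minimal sketch, assuming the agent rotates once two-thirds of the certificate lifetime has elapsed — a common heuristic, not a value the platform mandates:

```rust
/// Hypothetical rotation check for an agent certificate. Rotating at
/// two-thirds of the lifetime leaves a third of the window to retry
/// through the gRPC stream if the first attempt fails.
pub fn rotation_due(issued_at: u64, expires_at: u64, now: u64) -> bool {
    let lifetime = expires_at.saturating_sub(issued_at);
    // Compare elapsed/lifetime >= 2/3 without floating point.
    now.saturating_sub(issued_at) * 3 >= lifetime * 2
}
```

A 24-hour certificate issued at t=0 becomes due at the 16-hour mark; a 72-hour certificate at 48 hours.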

Compromise blast radius: Every agent certificate is compromised. An attacker can impersonate any workload cluster. This is the most sensitive credential on the management cluster.

The mesh CA. Signs SPIFFE SVIDs for service identity (Chapter 15). Managed by Istio’s istiod. SVIDs are short-lived (24 hours) and auto-rotated by the mesh data plane. The workload never manages its own certificate.

Compromise blast radius: Every service identity is compromised. An attacker can impersonate any service. Detection: monitor istiod’s issuance logs and alert on anomalous patterns (burst of issuance, issuance for unknown service accounts).

cert-manager. Signs application TLS certificates (ingress endpoints, webhooks). Managed by cert-manager with configurable issuers (Let’s Encrypt for public TLS, internal CA for private endpoints). Lifetimes vary (90 days for Let’s Encrypt, shorter for internal).

Compromise blast radius: Depends on the issuer. A compromised Let’s Encrypt account affects public TLS. A compromised internal CA affects internal webhooks.

Monitoring. Alert when any certificate is within N days of expiry. The platform manages three rotation schedules — mesh SVIDs (24 hours), agent certificates (24-72 hours), application TLS (30-90 days). Each needs monitoring. An expired mesh SVID breaks mTLS between services. An expired agent certificate breaks the gRPC stream. An expired application certificate breaks HTTPS for users.
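The per-schedule alert rule can be expressed directly. A sketch with illustrative thresholds — the specific values are assumptions, not platform defaults:

```rust
/// The three rotation schedules from the text, each with its own
/// expiry-alert threshold.
pub enum CertClass {
    MeshSvid,  // 24-hour SVIDs
    AgentCert, // 24-72-hour agent certificates
    AppTls,    // 30-90-day application TLS
}

fn alert_threshold_secs(class: &CertClass) -> u64 {
    match class {
        CertClass::MeshSvid => 4 * 3600,     // alert with 4 hours left
        CertClass::AgentCert => 12 * 3600,   // alert with 12 hours left
        CertClass::AppTls => 14 * 24 * 3600, // alert with 14 days left
    }
}

/// True when the certificate is close enough to expiry to page someone.
pub fn should_alert(class: &CertClass, expires_at: u64, now: u64) -> bool {
    expires_at.saturating_sub(now) <= alert_threshold_secs(class)
}
```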

Revocation. Short-lived certificates are a form of implicit revocation — a compromised 24-hour SVID is useless after 24 hours. But “useless after 24 hours” means “useful for 24 hours.” For the platform CA and mesh CA, the question is whether short lifetimes are sufficient or whether explicit revocation is needed.

CRL (Certificate Revocation Lists) require every verifier to check the list — operationally expensive in a mesh where every mTLS handshake is a verification event. OCSP (Online Certificate Status Protocol) requires a responder to be reachable — another dependency in a system that already has three CAs. OCSP stapling moves the check to the presenter, but if the OCSP responder is down, the presenter can’t staple a fresh response and the verifier must decide: reject (safe but fragile) or accept (available but vulnerable).

The reference implementation relies on short lifetimes rather than explicit revocation. The reasoning: a 24-hour SVID limits the compromise window. A compromised platform CA requires re-provisioning agent certificates (which is a manual rotation operation regardless of revocation mechanism). And the operational cost of CRL/OCSP infrastructure across a fleet of self-managing clusters is substantial — every cluster needs a reachable responder.

This is a defensible choice, not an obviously correct one. If your threat model includes real-time certificate compromise (not just stale credentials), explicit revocation is worth the operational cost. The decision should be revisited as the fleet grows.

Protecting the root. The platform CA’s private key is stored as a Kubernetes Secret. An attacker with RBAC read access to the platform namespace can read it. This is the most dangerous credential on the management cluster — and it’s stored in the least secure place.

The alternatives, with their real trade-offs:

  • Cloud KMS (AWS KMS, GCP Cloud KMS). The key never leaves the HSM. Signing latency increases (~10-50ms per signature). Works for cloud deployments. Doesn’t work for air-gapped or edge deployments without cloud connectivity. Cost is per-operation — at scale, the signing volume for agent certificates generates real bills.

  • HashiCorp Vault (Transit engine). Key material stays in Vault. Signing is an API call. Adds a hard dependency — if Vault is down, no agent certificates can be issued. Vault itself needs HA, backups, and its own key management. You’re solving the “where does the key live” problem by adding another system whose keys also need protection.

  • Hardware Security Module (on-premise HSM, e.g., Thales Luna, YubiHSM). Strongest protection. Highest cost. Requires physical access for provisioning. Doesn’t scale horizontally without multiple HSMs. Makes sense for the root CA of a large fleet — not for every workload cluster.

The reference implementation stores the key as a Kubernetes Secret because the alternative infrastructure doesn’t exist on every deployment target (especially bare-metal and edge). For production fleets, this is not optional: use Cloud KMS when cloud is available, Vault when it’s not. A compromised platform CA key means every cluster’s agent identity is compromised — this is the one credential that justifies the operational cost of external key management. Short-lived agent certificates (24 hours) are a necessary mitigation regardless, limiting the window a stolen key is useful, but they are not a substitute for protecting the key itself.
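Keeping the key location swappable is cheap if all issuance goes through one interface. A hypothetical sketch — the trait and type names are illustrative, and the sign bodies are placeholders for a real local ECDSA signature or a KMS API call:

```rust
/// Illustrative abstraction over where the platform CA key lives.
pub trait CaSigner {
    fn sign(&self, csr: &[u8]) -> Vec<u8>;
    fn description(&self) -> &'static str;
}

/// Key material read from a Kubernetes Secret, signed in-process.
pub struct SecretKeySigner;
/// Key material that never leaves the KMS/HSM; each signature is a
/// network round-trip (10-50 ms, per the trade-off above).
pub struct KmsSigner;

impl CaSigner for SecretKeySigner {
    fn sign(&self, csr: &[u8]) -> Vec<u8> { csr.to_vec() } // placeholder
    fn description(&self) -> &'static str { "local key (Kubernetes Secret)" }
}

impl CaSigner for KmsSigner {
    fn sign(&self, csr: &[u8]) -> Vec<u8> { csr.to_vec() } // placeholder
    fn description(&self) -> &'static str { "external KMS" }
}

/// Issuance code depends only on the trait, so moving the key from a
/// Secret to KMS is a wiring change, not a rewrite.
pub fn issue_agent_cert(signer: &dyn CaSigner, csr: &[u8]) -> Vec<u8> {
    signer.sign(csr)
}
```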

Walk through what happens when certificate management fails — because it will.

Scenario. The mesh CA (istiod) is restarted during a Kubernetes upgrade. During the restart window (~30 seconds), SVID rotation requests from services are rejected. Services whose SVIDs are expiring during this window can’t rotate.

What happens:

  1. t=0. istiod restarts. SVID rotation endpoint is unavailable.
  2. t=10s. A service’s SVID (issued 23 hours ago, expires in 1 hour) tries to rotate. Request fails. The mesh data plane (ztunnel) retries with backoff.
  3. t=30s. istiod is back. The retry succeeds. New SVID issued. Crisis averted for this service.
  4. What if istiod had been down for 2 hours? Services whose SVIDs expire during the outage lose mTLS identity. mTLS handshakes fail because the certificate is expired. Traffic between those services is denied by the mesh. L4 (Cilium) still allows the packet — but the packet arrives without a valid mTLS handshake, and Istio’s STRICT PeerAuthentication rejects it.

Detection. The compliance controller (Chapter 14) probes certificate health. The DCGM-style monitoring for certificates: track issuance rate, rotation success rate, and time-to-expiry across the fleet. Alert when: issuance failures exceed threshold, any SVID is within 10% of expiry without a rotation scheduled, or the CA itself is unreachable.
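The “within 10% of expiry without a rotation scheduled” rule translates directly into a predicate. Field names here are illustrative:

```rust
/// Per-SVID health snapshot.
pub struct SvidStatus {
    pub issued_at: u64,
    pub expires_at: u64,
    pub rotation_scheduled: bool,
}

/// Alert when the SVID is within 10% of expiry and no rotation is
/// scheduled: the detection rule described in the text.
pub fn needs_alert(s: &SvidStatus, now: u64) -> bool {
    let lifetime = s.expires_at.saturating_sub(s.issued_at);
    let remaining = s.expires_at.saturating_sub(now);
    remaining * 10 <= lifetime && !s.rotation_scheduled
}
```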

The lesson. Short-lived certificates (24 hours) are safer against compromise but more operationally fragile than long-lived ones (90 days). A 24-hour SVID means every service must successfully rotate every day. If the CA is unreachable for 25 hours, every service loses identity. A 90-day certificate tolerates a day-long CA outage without impact. The reference implementation uses 24-hour SVIDs because the compromise risk (stale certificate used by an attacker) outweighs the operational risk (CA outage during rotation window) — but this choice requires robust CA monitoring.

If the platform makes plaintext internal traffic possible, someone will use it.

  • mTLS everywhere. The mesh enforces mTLS on all service-to-service traffic. No “disable mTLS” option.
  • Restricted cipher suites. TLS 1.2 minimum, TLS 1.3 preferred. Prefer GCM over CBC-mode ciphers. No RSA key exchange (use ECDHE). No SHA-1. Note: NIST SP 800-52 Rev. 2 still permits AES-CBC with HMAC in TLS 1.2 — banning CBC goes beyond NIST guidance. The reference implementation bans it anyway because GCM is simpler (no padding oracle risk) and universally supported in modern stacks. If your compliance framework requires strict NIST alignment, CBC-with-HMAC is permissible.
  • No manual certificate management. Developers don’t generate, install, or rotate certificates. The mesh handles service identity. cert-manager handles application TLS.
  • Dependency auditing. Periodically scan the dependency tree for non-approved cryptographic libraries. CI checks that fail if a non-approved crypto dependency appears.
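A minimal sketch of such a CI check, assuming the dependency names have already been extracted from Cargo.lock. Both crate lists are illustrative, not an official policy:

```rust
/// Illustrative policy lists: which crypto crates are approved, and
/// which crate names the scanner recognizes as cryptographic at all.
const APPROVED_CRYPTO: &[&str] = &["rustls", "aws-lc-rs"];
const KNOWN_CRYPTO: &[&str] = &["rustls", "aws-lc-rs", "ring", "openssl", "md-5", "sha1"];

/// Return every recognized crypto dependency that is not approved; a
/// CI job fails the build when this list is non-empty.
pub fn unapproved_crypto(deps: &[&str]) -> Vec<String> {
    deps.iter()
        .filter(|d| KNOWN_CRYPTO.contains(d) && !APPROVED_CRYPTO.contains(d))
        .map(|d| d.to_string())
        .collect()
}
```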

The legacy compatibility problem. “No CBC-mode ciphers, no RSA key exchange, TLS 1.2 minimum” is correct as a target. In practice, the platform will encounter: monitoring agents that speak TLS 1.0, legacy load balancers that can’t negotiate ECDHE, compliance scanners that require RSA key exchange for their own TLS client. Saying “reject them all” is easy to write and hard to enforce when the monitoring agent is the one that pages you at 3am.

The resolution is segmentation, not compromise. Internal mesh traffic (service-to-service) enforces the strict cipher suite — no exceptions. Edge traffic (ingress) enforces TLS 1.2+ with the restricted suite — clients that can’t negotiate are clients that shouldn’t be connecting. Out-of-band infrastructure traffic (monitoring, backup agents, legacy integrations) gets a separate network path with its own, documented, cipher policy. The separate path has an expiration date and an owner. “Legacy cipher support for monitoring agent X, owned by the observability team, expires within 12 months.” Without an owner and a deadline, the exception becomes permanent.
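The owner-and-deadline discipline can be enforced mechanically: record each exception as data and fail CI when one lapses. A hypothetical sketch:

```rust
/// One exception to the strict cipher policy, recorded as data so it
/// can be audited. All fields are illustrative.
pub struct CipherException {
    pub component: &'static str,
    pub owner: &'static str,
    pub expires_epoch: u64, // unix seconds; after this, CI fails
}

/// Exceptions past their deadline; a non-empty result fails the build,
/// so an exception cannot silently become permanent.
pub fn expired(exceptions: &[CipherException], now: u64) -> Vec<&'static str> {
    exceptions
        .iter()
        .filter(|e| e.expires_epoch <= now)
        .map(|e| e.component)
        .collect()
}
```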

Certificate rotation at 24-hour intervals generates more CA operations than longer lifetimes would. FIPS-validated libraries may have slightly different performance characteristics than their non-validated counterparts. These are real costs, but they are operational costs with known mitigations, not security compromises.

Certificates rotate automatically — the mesh CA issues a new SVID every 24 hours, cert-manager renews before expiry, the platform CA issues new agent certificates through the gRPC stream. This is well-understood and well-automated.

Key rotation is harder. The CA’s private key — the key that signs all certificates — rotates rarely and carefully. A new key means every certificate signed by the old key is no longer verifiable against the new key. The rotation must overlap: both keys are trusted during a transition window, new certificates are signed with the new key, and old certificates are valid until they expire naturally.

Mesh CA key rotation. Istio’s istiod supports root CA rotation through a trust bundle that includes both old and new root certificates. During the transition, services accept certificates signed by either root. After all SVIDs have rotated (24 hours for 24-hour SVIDs), the old root is removed from the trust bundle. If a service misses its rotation window, it holds an SVID signed by the old root — which is still trusted during the overlap. After the old root is removed, that service’s mTLS fails.
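The overlap mechanics can be simulated with string identities standing in for real chain verification: a leaf is accepted exactly while its signing root remains in the bundle.

```rust
/// Toy verification: a leaf certificate is accepted iff the root that
/// signed it is present in the current trust bundle.
pub fn verify(leaf_signed_by: &str, trust_bundle: &[&str]) -> bool {
    trust_bundle.contains(&leaf_signed_by)
}
```

During the overlap, a leaf signed by `root-old` verifies against the bundle `["root-old", "root-new"]`; once the old root is removed, only leaves re-signed under `root-new` pass — which is exactly the failure mode for a service that missed its rotation window.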

Platform CA key rotation. The platform CA signs agent certificates for the gRPC stream (Chapter 7). Rotating this key requires: generate a new key pair, distribute the new CA certificate to all connected agents (through the existing gRPC stream), begin signing new agent certificates with the new key, and after all agents have received certificates signed by the new key, retire the old key. A disconnected cluster during rotation doesn’t receive the new trust bundle — when it reconnects, its agent certificate (signed by the old key) may be rejected. The transition window must exceed the maximum expected disconnection time.
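The sizing constraint can be written down explicitly. A sketch — the text only requires the window to exceed the maximum disconnection time; also covering the old-root certificate a reconnecting agent still holds, plus a safety margin, is an extrapolation:

```rust
/// Minimum overlap window for platform CA key rotation: long enough for
/// the longest-disconnected cluster to reconnect, receive the new trust
/// bundle, and age out its old-root agent certificate.
pub fn min_overlap_secs(max_disconnect: u64, agent_cert_lifetime: u64, margin: u64) -> u64 {
    max_disconnect + agent_cert_lifetime + margin
}
```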

The stance on CA key storage. For development and edge deployments, the CA key as a Kubernetes Secret is acceptable — the convenience outweighs the risk when the cluster is ephemeral or non-production. For production fleets, KMS or Vault is not a recommendation — it is a requirement. This is the highest-value key in the system. A compromised platform CA compromises every cluster’s agent identity. Treat this key with the same gravity as a root CA in a traditional PKI: it belongs in hardware or a dedicated secrets engine, not in etcd.

22.6 Trust Bootstrap and Clock Synchronization


How does a new cluster trust the CA? During provisioning (Chapter 5), the bootstrap process installs the platform CA’s public certificate into the agent’s trust store. The agent uses this certificate to verify the cell’s mTLS identity during the initial gRPC handshake. The trust root is distributed through the provisioning channel — kubeadm’s postKubeadmCommands write the CA certificate to the node before the agent starts. This is the “bootstrap trust” problem: you need a secure channel to distribute the trust root, and that channel is the provisioning process itself.

Clock synchronization. Certificates have validity windows — notBefore and notAfter. If a node’s clock is wrong, a valid certificate appears expired or not-yet-valid. This is a classic crypto failure mode. All nodes must maintain accurate time via NTP. A clock skew of more than the certificate’s validity window causes mTLS failures that look like certificate expiry but aren’t. The platform should monitor clock skew as an infrastructure health signal — alert when any node’s clock differs from the control plane’s by more than 30 seconds.
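Verifiers commonly tolerate a small skew when checking validity windows. A sketch using the 30-second figure from the alert threshold above (the allowance itself is an assumption):

```rust
/// Validity check with a bounded clock-skew allowance. Written to avoid
/// unsigned underflow when `now` is just before `not_before`.
pub fn valid_with_skew(not_before: u64, not_after: u64, now: u64, skew: u64) -> bool {
    now + skew >= not_before && now <= not_after + skew
}
```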

Cryptography has real CPU and latency costs.

mTLS handshake overhead. Every new connection between meshed services requires a TLS handshake. With short-lived connections (HTTP/1.1 without keepalive), the handshake overhead is significant — 1-3ms per connection on modern hardware. With connection pooling and HTTP/2 (which most mesh data planes use), the handshake happens once per connection and is amortized over many requests. The platform should default to connection reuse.

Certificate rotation storms. With 24-hour SVIDs and 200 services, istiod processes ~200 CSR/sign operations per day. At 2,000 services, that’s 2,000 per day — distributed over 24 hours, it’s ~1.4 per minute. Not a problem. But if istiod restarts and all SVIDs rotate simultaneously (they were all issued at the same time), the burst can overwhelm istiod. Staggered issuance (randomize the rotation offset per service) prevents this.
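Staggering can be as simple as a deterministic per-service offset, so the schedule stays spread out without any coordination. A sketch:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministic per-service rotation offset in [0, max_jitter): the
/// same service always gets the same offset, so schedules remain spread
/// out even after a mass restart re-issues every SVID at once.
pub fn rotation_jitter_secs(service: &str, max_jitter: u64) -> u64 {
    let mut h = DefaultHasher::new();
    service.hash(&mut h);
    h.finish() % max_jitter
}
```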

KMS signing latency. If the platform CA key is in Cloud KMS, every agent certificate signing operation is a network round-trip to the KMS service. At 10-50ms per signing, this is acceptable for agent certificates (signed once per 24-72 hours). It would be unacceptable for mesh SVIDs (signed every 24 hours per service) — which is why the mesh CA uses a local intermediate key, not the platform CA.

Section 22.3 covers the istiod restart scenario. Two additional failure modes complete the operational picture.

Expired platform CA. The platform CA certificate itself has a validity period (typically 1-10 years). If it expires without rotation, every agent certificate renewal fails — the CA can’t sign. Clusters remain connected (existing agent certificates are valid until they expire), but once the agent certificate expires, the gRPC stream drops and the cluster is disconnected from the parent. At scale, this is a fleet-wide disconnection event. Monitor the platform CA’s expiry date and alert at 90 days remaining.

cert-manager failure. cert-manager manages TLS certificates for ingress endpoints. If cert-manager’s pods are down, certificate renewal fails. Let’s Encrypt certificates (90-day lifetime) have a 30-day renewal window — a cert-manager outage of less than 30 days is tolerable. Internal CA certificates with shorter lifetimes (30 days) have a tighter window. If renewal fails and the certificate expires, the Gateway serves expired TLS — browsers show warnings, API clients reject the connection. The mitigation: monitor cert-manager’s Certificate resources for Ready: False conditions and alert immediately.

Post-quantum cryptography is on the horizon. NIST has standardized ML-KEM and ML-DSA. TLS 1.3 supports hybrid key exchange (classical + post-quantum).

Design for it now by choosing pluggable backends. rustls abstracts the crypto backend — swapping from aws-lc-rs to a post-quantum-capable backend is a configuration change, not a rewrite. Hardcoding algorithm choices deep in the codebase makes future migration expensive.

Don’t implement post-quantum today unless your threat model specifically requires it (government, long-lived secrets that might be retrospectively decrypted). Do ensure your stack can be swapped.

22.1. [M10] The platform uses three CAs. Which should have the shortest-lived certificates? Which has the largest compromise blast radius? Are they the same CA?

22.2. [H30] The platform CA’s private key is a Kubernetes Secret. An attacker with RBAC read access to the platform namespace can read it. Design the mitigation: HSM? Vault? Cloud KMS? Trade-offs of each in complexity, cost, and signing latency?

22.3. [R] Section 22.4 says “make the secure path the only path.” But escape hatches exist (Chapter 11). A LatticeMeshMember can set peerAuth: Permissive. A LatticePackage can install plaintext HTTP endpoints. Should the platform forbid permissive mTLS? Where is the line between mandatory security and necessary escape hatches?

22.4. [M10] A monitoring alert fires: “certificate expiring in 12 hours.” Three CAs with different schedules. How does the operator determine which CA issued it? What information should the alert include?

22.5. [H30] Design the root CA rotation process. The platform CA signs agent certificates for 20 clusters. How do you rotate the root without invalidating every agent certificate simultaneously? What is the migration window? What happens if a cluster is disconnected during rotation?