
Chapter 12: Security Architecture

The derivation pipeline from Part III produces Deployments, Services, network policies, and secrets. But how do you know the infrastructure it produces is secure? How do you know a compromised service can’t reach everything in the cluster? How do you know a missing policy doesn’t silently open a path?

Three design decisions shape the security architecture:

  • What’s the default posture? (Everything allowed until denied? Or everything denied until allowed?)
  • How many enforcement layers? (One layer that does everything? Or multiple independent layers that each catch what the others miss?)
  • Should the layers be independent or derived from each other? (One controller generating all policies? Or separate systems with separate failure modes?)

These decisions are not Lattice-specific. Any platform that derives infrastructure must answer them. The wrong answers produce a platform that looks secure in the dashboard and fails under attack.

Every layer of the platform starts closed.

Network (L4). A CiliumClusterwideNetworkPolicy with empty ingress: [] and empty egress: [] — deny all in both directions. Both are required. A policy with only ingress: [] denies inbound traffic but leaves egress unconstrained — any pod can still reach the internet. The platform installs both rules during bootstrapping. No pod can send or receive traffic unless a per-service CiliumNetworkPolicy explicitly allows it.

Note: this is Cilium-specific behavior. In the native Kubernetes NetworkPolicy spec, an empty ingress: [] denies ingress, but egress is only affected if an egress block is present. If you’re using a different CNI, the deny-all semantics may differ. The examples in this chapter use Cilium’s CRDs throughout.
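As a concrete sketch, the cluster-wide deny-all baseline can be written like this — the policy name is illustrative, and the system-namespace exemptions discussed later in the chapter are handled separately:

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: default-deny        # illustrative name
spec:
  endpointSelector: {}      # select every endpoint in the cluster
  ingress: []               # deny all inbound traffic
  egress: []                # deny all outbound traffic
```

In Cilium, once a policy selects an endpoint in a given direction, only explicitly allowed traffic passes in that direction — so the empty rule lists above close both directions for every endpoint.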

Identity (L7). An Istio AuthorizationPolicy with empty spec: {} — deny all requests. No service identity can make a request to any other service unless a per-service AuthorizationPolicy permits it.
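A sketch of that mesh-wide deny-all, with an illustrative name. An AuthorizationPolicy with an empty spec is an ALLOW policy with no rules: nothing matches, so everything is denied; placing it in the mesh root namespace applies it mesh-wide:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all            # illustrative name
  namespace: istio-system   # the mesh root namespace, so it applies mesh-wide
spec: {}                    # no rules: nothing is matched, all requests denied
```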

Authorization (Cedar). Every gate defaults to deny. If no policy matches a request — including DeployImage, AccessSecret, OverrideSecurity, AccessExternalEndpoint, AllowTagReference, or AccessVolume — the answer is deny.
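For illustration, a permit rule satisfying one such gate might look like this in Cedar — the entity type names (Lattice::Service and so on) are assumptions, not the reference implementation’s actual schema:

```
permit (
    principal == Lattice::Service::"commerce/checkout",
    action == Lattice::Action::"AccessSecret",
    resource == Lattice::Secret::"payments/api-key"
);
```

Absent a matching permit, Cedar’s default answer is deny — no explicit forbid rule is needed to maintain the closed posture.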

Admission. Invalid specs never enter the pipeline. Missing required fields, malformed references, unknown values — all are rejected at submission.

The accumulation principle: every capability a service has was explicitly granted. The audit question “what can service X access?” is answered by reading its dependency declarations and matching policies — a bounded query. In a default-allow model, the same question requires scanning every deny rule in the cluster and assuming everything else is allowed — unbounded.

The cost. Default-deny is operationally expensive. Every dependency must be declared. Every secret must be authorized. New services deploy with zero access until declarations and policies are in place. “Why can’t my service reach X?” becomes the most common debugging question. The platform absorbs this cost through the derivation pipeline — the developer declares dependencies, the platform generates policies — but the cost is real for the platform team (more support questions, more debugging) and for developers during onboarding (nothing works until you declare everything).

Why the cost is worth paying. Blast radius containment: a compromised service can only reach its declared dependencies, not the entire network. Compliance: NIST 800-53, SOC 2, and CIS benchmarks require explicit access controls, which default-deny provides structurally. Drift resistance: removing a permit rule closes access rather than opening it. The drift direction is toward security.

No single mechanism catches every threat. The design principle is independent enforcement — multiple layers, each operating at a different point in the stack, each failing independently. The question isn’t “how many layers?” but “what properties does your enforcement architecture need?”

The properties:

  • Pre-runtime and runtime enforcement must be independent. Catching a bad image at derivation time is different from catching a bad network path at runtime. If one system handles both, a single bug opens both.
  • Network-level and request-level enforcement must be independent. A packet filter can’t see HTTP methods. A proxy can’t survive its own crash. Each covers what the other can’t.
  • No single failure should open all access. If the mesh goes down, network policies still deny. If the network policy engine has a bug, identity authorization still checks.

The reference implementation uses four layers. Three layers work if you merge admission and policy authorization (both are pre-runtime). Five work if you split runtime binary enforcement (Tetragon) into its own layer. The number is a design choice — the independence property is the requirement.

A note on terminology: these are platform enforcement layers, not OSI layers. The numbering reflects the order a request encounters them, not the network stack.

```mermaid
graph TD
    Dev[Developer applies spec] --> EL1[Admission Webhook<br/>Validates schema, rejects malformed specs]
    EL1 --> EL2[Cedar Authorization<br/>Evaluates gates at derivation time<br/>DeployImage, AccessSecret, etc.]
    EL2 --> Derive[Derivation Pipeline produces resources]
    Derive --> EL3[Network Enforcement — Cilium eBPF<br/>Kernel-level packet filtering<br/>Default-deny CiliumNetworkPolicy]
    EL3 --> EL4[Identity Authorization — Istio Ambient<br/>Request-level authorization<br/>SPIFFE identity + AuthorizationPolicy]
    EL4 --> App[Application receives request]
    EL1 -->|invalid spec| Reject1[Rejected at admission]
    EL2 -->|gate denies| Reject2[Rejected at derivation<br/>Zero resources created]
    EL3 -->|no matching policy| Drop[Packet dropped in kernel]
    EL4 -->|identity not authorized| Deny[Connection reset or HTTP 403]
```

Admission webhooks. Catch malformed specs — wrong types, missing fields, invalid references. Operate at CRD submission time. Stateless (or lightly stateful). Can’t enforce runtime behavior.

Cedar policy authorization. Catch unauthorized actions — deploying unsigned images, accessing secrets without permission, escalating privileges. Operate at derivation time, before Kubernetes resources exist. Can’t enforce runtime traffic patterns.

Network enforcement (Cilium, network-level). Catch unauthorized network paths — pod A tries to reach pod B without a policy allowing it. Operate in the kernel via eBPF. See packets (source, destination, protocol, port). Can’t see request content. Hard to bypass from userspace — even an attacker with root inside a container can’t modify the eBPF programs or the netfilter rules that Cilium manages, because those are loaded in the host kernel, outside the container’s namespace.
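A per-service policy of the kind the derivation pipeline emits might look like this — the policy name and labels are illustrative, not the reference implementation’s exact output:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-egress     # illustrative name
  namespace: commerce
spec:
  endpointSelector:
    matchLabels:
      app: checkout         # applies to checkout's pods
  egress:
    - toEndpoints:
        - matchLabels:
            app: payments   # the declared dependency
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
```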

Identity-based authorization (Istio, request-level). Catch unauthorized requests on authorized paths — pod A can reach pod B (network policy allows it), but A’s SPIFFE identity isn’t authorized for the request. Operate in userspace via waypoint proxies. See requests (method, path, headers, cryptographic identity) for HTTP/gRPC. For raw TCP connections, Istio AuthorizationPolicy still evaluates — it checks the SPIFFE identity and port, even without HTTP semantics. Can be bypassed if the mesh is down.
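The corresponding request-level permit might be sketched as follows — the selector labels and policy name are assumptions; note that Istio writes SPIFFE principals without the spiffe:// prefix:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-auth       # illustrative name
  namespace: commerce
spec:
  selector:
    matchLabels:
      app: payments         # applies to the payments workload
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/commerce/sa/checkout  # checkout's SPIFFE identity
```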

The mesh also provides encryption. All traffic between meshed services is mTLS — encrypted in transit with mutual authentication. This isn’t an enforcement layer (it doesn’t deny traffic based on content), but it’s a security property the four layers depend on. Without mesh encryption, an attacker who can observe network traffic between pods sees plaintext requests — even if L4 and L7 policies are correct. Chapter 15 covers the mesh’s identity and encryption model in detail; Chapter 22 covers the cryptographic choices.

| Threat | Admission | Cedar | L4 (Cilium) | L7 (Istio) |
|---|---|---|---|---|
| Malformed spec | Catches | | | |
| Unauthorized image | | Catches | | |
| Unauthorized network path | | | Catches | |
| Authorized path, unauthorized request | | | | Catches |
| Unauthorized secret access | | Catches | | |
| Privilege escalation | | Catches | | |
| Compromised pod, lateral movement | | | Limits | Limits |

No column catches everything. Remove any layer and a threat goes unaddressed. A platform with fewer layers (e.g., no L7 identity authorization) accepts that “authorized path, unauthorized request” is undetected. That may be acceptable for your threat model — the table helps you decide.

Layer independence. Each layer must be independently configured and independently auditable. If L7 policies are derived from L4 policies, a bug in the derivation compromises both. In the reference implementation, L4 policies (CiliumNetworkPolicy) and L7 policies (AuthorizationPolicy) are generated by the same controller from bilateral declarations — but using different identity models (Cilium endpoint identity vs. SPIFFE ID) and different policy formats. They share a source, not a derivation path. Cedar operates independently of both network layers. Admission webhooks operate independently of everything.

The trade-off. More layers means more systems to deploy, monitor, and debug. When a request is denied, determining which layer denied it requires checking each layer’s logs independently. The reference implementation provides unified debugging (platform debug connectivity checkout payments) that checks all layers and reports which one denied the traffic. Without unified debugging, independent layers become independent debugging nightmares.

Fewer layers. A platform without a service mesh has no L7 identity authorization. Network policies (L4) are the runtime enforcement layer. This works for platforms where services don’t need request-level access control — every allowed network path is fully trusted. A platform without Cedar has no pre-runtime policy authorization — admission webhooks are the only check before derivation. This works for platforms where the derivation logic itself encodes all the rules and there’s no need for external policy evaluation. Each layer you remove simplifies operations and removes a class of protection. The right number depends on your threat model, your compliance requirements, and the operational capacity of your platform team.

To ground this in reality, trace a request from the checkout service to the payments service through all four layers.

Setup. Checkout’s LatticeService spec declares payments: type: service, direction: outbound. Payments’ spec declares checkout: type: service, direction: inbound. Both are deployed and Ready.

The derivation pipeline (before any traffic flows):

  1. Admission. Both specs are validated by the webhook — valid resource types, valid direction enums, valid image references.
  2. Cedar. The DeployImage gate verifies both images are signed. The AccessSecret gate authorizes both services’ secret references. Both pass.
  3. Derivation. The pipeline produces Deployments, Services, ConfigMaps, Secrets, ExternalSecrets for both services. It also produces a LatticeMeshMember for each.
  4. Mesh-member controller. Matches the bilateral declarations. Produces: CiliumNetworkPolicy on checkout (egress to payments:8080), CiliumNetworkPolicy on payments (ingress from checkout), AuthorizationPolicy on payments (permit checkout’s SPIFFE identity).
  5. Both services reach Ready. Network policies are in place before any pod handles traffic.
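A hypothetical shape for the LatticeMeshMember produced in step 3 — the API group, version, and every field here are assumptions for illustration, not the actual CRD schema:

```yaml
apiVersion: lattice.example.dev/v1alpha1  # illustrative group/version
kind: LatticeMeshMember
metadata:
  name: checkout
  namespace: commerce
spec:
  dependencies:
    - peer: payments
      direction: outbound
      port: 8080
```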

A request at runtime:

  1. Checkout’s pod sends an HTTP request to payments:8080.
  2. L4 (Cilium). The eBPF program on checkout’s node checks the egress policy. The destination matches payments’ endpoint identity on port 8080. L4 permits the packet.
  3. L7 (Istio). The ztunnel on checkout’s node initiates mTLS to payments’ ztunnel. The waypoint proxy on payments’ namespace evaluates the AuthorizationPolicy. The source SPIFFE identity (spiffe://cluster.local/ns/commerce/sa/checkout) matches the permit policy. L7 permits the request.
  4. Payments’ pod receives the request and responds.

What happens if checkout tries to reach a service it didn’t declare:

  1. Checkout’s pod sends an HTTP request to inventory:8080.
  2. L4 (Cilium). No egress rule for inventory. Packet is dropped. Checkout sees a connection timeout.
  3. The request never reaches L7 because L4 already blocked it.

What happens if checkout declared the dependency but payments didn’t:

  1. Both L4 and L7 policies are absent (the bilateral match failed — only one side declared).
  2. Checkout’s egress rule doesn’t exist (the mesh-member controller only generates policies for matched agreements).
  3. Checkout sees a connection timeout. The status reports: “outbound dependency payments declared but no matching inbound declaration found on service payments.”
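The bilateral-match rule driving these outcomes can be sketched as a small function — `Declaration` and `matched_pairs` are illustrative names, not the reference implementation’s API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Declaration:
    service: str    # the service making the declaration
    peer: str       # the service named in the declaration
    direction: str  # "outbound" or "inbound"


def matched_pairs(declarations):
    """Return (client, server) pairs where both sides declared each other.

    An outbound declaration (client -> server) produces policy only when
    the server has a matching inbound declaration naming the client.
    One-sided declarations produce nothing: fail-closed.
    """
    outbound = {(d.service, d.peer)
                for d in declarations if d.direction == "outbound"}
    inbound = {(d.peer, d.service)
               for d in declarations if d.direction == "inbound"}
    return sorted(outbound & inbound)
```

Policies exist only for pairs in the intersection; a one-sided declaration yields nothing, which is why the failure mode above is a connection timeout rather than an open path.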

What happens if Istio’s waypoint proxy crashes:

  1. L7 enforcement is gone for the affected namespace.
  2. L4 enforcement (Cilium) is unaffected — it runs in the kernel, independently.
  3. In ambient mode, ztunnel still attempts to route L7 traffic through the waypoint. If the waypoint is unreachable and PeerAuthentication is set to STRICT, the behavior is fail-closed — ztunnel can’t forward the request and the connection fails. Checkout cannot reach payments until the waypoint recovers. (In PERMISSIVE mode — sometimes used during mesh migration — traffic may fall back to plaintext, which is fail-open. The reference implementation requires STRICT mode precisely to maintain this fail-closed property. See Chapter 15 for the implications of each mode.)
  4. This is actually safer than fail-open (no unauthorized requests slip through), but it means a waypoint crash causes an availability disruption for affected services. The trade-off: L7 failure becomes an availability problem, not a security problem.
  5. L4 is still the independent backstop — even if ztunnel were somehow bypassed, Cilium’s kernel-level enforcement remains.

Default-deny with multiple enforcement layers sounds like operational hell. In practice, the derivation pipeline makes it manageable — the developer doesn’t interact with the layers directly.

The developer writes:

```yaml
workload:
  resources:
    payments:
      type: service
      direction: outbound
```

From this single declaration, the platform:

  • Evaluates Cedar authorization (is this dependency permitted?)
  • Generates a CiliumNetworkPolicy egress rule (L4)
  • Generates an AuthorizationPolicy permit (L7, after matching with payments’ inbound declaration)
  • Reports the result in the CRD status

The developer didn’t write a NetworkPolicy, an AuthorizationPolicy, or a Cedar policy. They declared a dependency. The platform derived the security infrastructure across all four layers. This is the same intent-over-infrastructure principle from Part I, applied to security. Default-deny is the platform’s concern, not the developer’s burden.

System namespace exemptions. Infrastructure namespaces (kube-system, istio-system, cilium-system, cert-manager) are exempt from default-deny. The platform’s own components must communicate freely. The lattice-system namespace is not exempt — it uses MeshMember coverage like any application namespace. The exemption list is small, explicit, and audited.

The bootstrap TOCTOU window. Default-deny depends on the CiliumClusterwideNetworkPolicy and the mesh-wide AuthorizationPolicy being installed before any workload traffic flows. During cluster bootstrapping (Chapter 5), there’s a window between “Cilium is running” and “the default-deny policies are applied.” If a workload is scheduled before the deny-all policies exist, it has unrestricted network access.

This is a time-of-check-to-time-of-use problem: the platform assumes default-deny is in place, but the assumption is only true after a specific bootstrapping step completes. The reference implementation handles this by making the platform operator a prerequisite for workload scheduling — the operator installs the default-deny policies during bootstrapping, before any LatticeService CRD can be processed. Workloads that bypass the platform (raw Deployments applied directly) are not protected during this window.

The harder version: what about non-platform pods that exist before the platform operator is ready? System Pods in kube-system (CoreDNS, kube-proxy) are running before Cilium. These pods have unrestricted access until Cilium’s policies take effect. The exemption list for system namespaces makes this explicit rather than hiding it — but it means the first minutes of a cluster’s life are not default-deny. For environments that require default-deny from first boot, the bootstrapping order must be: install Cilium and default-deny policies → then install all other components.

Debugging. The platform provides:

  • CRD status conditions: PolicyAuthorized: False, reason: AccessSecret denied for secret payments/api-key
  • ztunnel logs: kubectl logs -n istio-system -l app=ztunnel | grep RBAC
  • Hubble: L4 packet drops with the denying policy
  • A diagnostic CLI that checks all layers:
```
$ platform debug connectivity checkout payments
Admission: PASS (spec valid)
Cedar: PASS (AccessSecret permitted, DeployImage permitted)
L4 (Cilium): PASS (CiliumNetworkPolicy checkout-egress allows payments:8080)
L7 (Istio): DENY (AuthorizationPolicy payments-auth: identity
  spiffe://cluster.local/ns/commerce/sa/checkout not in
  allowed principals)
Result: BLOCKED at L7. Bilateral inbound declaration on payments
does not include checkout.
```

12.1. [M10] A developer deploys a new service with no resources block. The service starts but all outbound requests fail. Walk through what the developer sees at each step: pod status, application logs, and CRD status. What is the first thing they should check? What change fixes it?

12.2. [H30] Section 12.3 traces a request through all four layers. Extend this: trace a request where the Cedar AccessExternalEndpoint gate permits egress to api.stripe.com, but no bilateral agreement exists (it’s an external service, not a platform service). What resources does the platform produce for external egress? How does L4 handle FQDN-based egress rules?

12.3. [R] The four-layer model means a request that should be denied might be allowed if any layer has a misconfiguration. But a request that should be allowed might be denied if any layer is misconfigured. The failure modes are asymmetric: false-denials are common (developer forgot a declaration), false-permits are rare (requires a bug in the policy). Is this asymmetry desirable? What would it take to make false-permits as likely as false-denials, and would that be worse?

12.4. [H30] The mesh-member controller generates both L4 and L7 policies from the same bilateral declarations. A potential concern: this violates layer independence because a bug in the controller affects both layers simultaneously. Design an alternative where L4 and L7 policies are generated by separate controllers consuming the same LatticeMeshMember CR. What changes? What do you gain? What do you lose?

12.5. [M10] A compromised checkout service can reach payments (bilateral agreement exists) and access orders-db (secret authorized). The attacker has access to checkout’s declared dependencies. Is default-deny’s blast radius containment meaningful here, or does the attacker already have everything they need?

12.6. [R] Default-deny with four layers is operationally expensive. An organization runs 20 services on one cluster with a small team. Is this architecture justified, or is it over-engineering? Where is the threshold — in fleet size, team size, regulatory requirements, or threat model — where four layers become necessary? Can you run fewer layers and still call it “defense in depth”?