Chapter 21: Testing the Platform
The platform makes strong guarantees: default-deny, bilateral enforcement, derivation-time authorization, self-managing clusters. Can you prove they hold?
Not “we deployed it and nothing broke” — that’s absence of observed failure. Can you prove that a service without a bilateral agreement gets denied? That an unsigned image is rejected? That a self-managing cluster survives its parent’s deletion?
The decisions:
- How do you split fast tests from slow tests? (Unit tests take milliseconds, E2E tests take 25 minutes. The developer needs feedback in 2 minutes. The fleet needs full validation nightly.)
- How do you test cross-service guarantees? (Bilateral agreements span services. Fixed test scenarios miss edge cases. Randomized testing catches them — but how many random topologies is enough?)
- How do you prevent flaky tests from destroying trust? (A 5% flake rate across 50 tests means 2.5 failures per run. The team ignores all failures. Real bugs hide in the noise.)
21.1 Two Testing Layers
Integration tests. Assume infrastructure exists. A cluster is running, the operator is deployed, the mesh is configured. Tests validate behavior against existing infrastructure. The reference implementation has ~45 integration test modules covering: mesh bilateral agreements (fixed and randomized), secret routing (all five paths), Cedar authorization (each gate), image trust (cosign verification, unsigned rejection), compliance (NIST, CIS, SOC 2 probes), packages (Helm install, values interpolation, dependency ordering), webhooks (service, job, model, Cedar policy validation), Tetragon (binary allowlists), gateway and route discovery, cost estimation, quotas, OIDC, and observability.
Integration tests run against existing clusters:
```shell
LATTICE_WORKLOAD_KUBECONFIG=/tmp/workload-kubeconfig \
cargo test --features provider-e2e --test e2e test_mesh_standalone \
  -- --ignored --nocapture
```

Fast — seconds to minutes per test module. Run repeatedly against the same cluster during development.
E2E tests. Assume nothing. The reference implementation’s E2E suite (unified_e2e.rs) provisions a management cluster and a workload cluster from scratch, completes the pivot, deploys services, runs 34+ test suites, and tears down. The full suite takes 20-30 minutes.
```shell
cargo test --features provider-e2e --test e2e unified_e2e -- --nocapture
```

The E2E test composes integration tests. It provisions infrastructure, then calls the same test functions at appropriate lifecycle phases:
- Provision management cluster.
- Provision workload cluster (Chapter 5 lifecycle).
- Complete the pivot (Chapter 6).
- Run mesh tests (bilateral agreements, randomized topologies).
- Run secret tests (all five routing paths, Cedar authorization).
- Run image trust tests (signed acceptance, unsigned rejection).
- Run compliance tests (probe evaluation).
- Run package tests (Helm install, MeshMember integration).
- Run webhook validation tests.
- Tear down.
Same test code in both contexts. Adding an integration test automatically adds it to the E2E suite.
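The composition can be sketched in a few lines. This is a hypothetical shape, not the reference implementation's actual API: each integration test is a plain function taking a cluster handle, so the E2E harness can call the same functions after provisioning.

```rust
// Sketch (names hypothetical): integration tests are plain functions over a
// cluster handle; the E2E suite provisions clusters, then composes them.

struct ClusterHandle {
    kubeconfig_path: String,
}

// An integration test: validates behavior against an existing cluster.
fn test_mesh_bilateral(cluster: &ClusterHandle) -> Result<(), String> {
    // ... deploy services, generate traffic, verify enforcement ...
    if cluster.kubeconfig_path.is_empty() {
        return Err("no cluster".into());
    }
    Ok(())
}

// The E2E suite assumes nothing: it provisions infrastructure first,
// then calls the same test functions at the right lifecycle phase.
fn unified_e2e() -> Result<(), String> {
    let cluster = provision_workload_cluster()?; // slow: minutes
    test_mesh_bilateral(&cluster)?;              // same function as above
    // ... secret tests, image trust tests, compliance tests ...
    teardown(cluster)
}

fn provision_workload_cluster() -> Result<ClusterHandle, String> {
    Ok(ClusterHandle { kubeconfig_path: "/tmp/workload-kubeconfig".into() })
}

fn teardown(_c: ClusterHandle) -> Result<(), String> {
    Ok(())
}

fn main() {
    assert!(unified_e2e().is_ok());
    println!("e2e composition ok");
}
```

Because the integration function has no knowledge of how its cluster came to exist, adding a new integration test automatically extends the E2E suite.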
21.2 Testing Bilateral Agreements
The strongest tests in the suite — and the most architecturally interesting.
Fixed agreement tests. Define known services with specific bilateral agreements. Deploy them. Run traffic generators that attempt all connections. Verify: allowed pairs succeed, denied pairs fail with policy denial (not timeout — a timeout might mean something else).
Randomized agreement tests. Generate N services with random bilateral agreements — random dependency graphs with random inbound/outbound declarations. Deploy them. Run traffic generators that attempt every pair. Verify enforcement. Each run tests a different topology. Edge cases that fixed tests miss — unusual graph shapes, namespace boundary interactions, circular dependencies — appear naturally.
The reference implementation generates 20+ random topologies per E2E run. Each topology has 5-10 services with random bilateral agreements.
Cycle-based verification. Don’t use flat timeouts (“sleep 30 seconds, then check”). Traffic generators emit cycle markers:
```rust
info!("===CYCLE_START===");
// attempt all outbound connections
info!("===CYCLE_END===");
```

Tests wait for N complete cycles (typically 3-5), then read logs. This ensures the test checks results after traffic generators have actually run, not after an arbitrary wait. Slow environments (overloaded CI, resource-constrained clusters) just take more time per cycle — the test still passes because it waits for work, not time.
Cycle-based verification catches failures faster too: if the first cycle shows a denied connection that should be allowed, the test fails immediately — it doesn’t wait the full 30-second timeout.
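The waiting logic can be sketched with nothing but string matching. This is a minimal stdlib-only illustration — `complete_cycles` and `wait_for_cycles` are hypothetical helpers, not the reference implementation's harness:

```rust
// Sketch: count complete cycles in a traffic generator's log, so the test
// waits for work done rather than wall-clock time.

fn complete_cycles(log: &str) -> usize {
    // A cycle is complete once both its START and END markers have appeared.
    let starts = log.matches("===CYCLE_START===").count();
    let ends = log.matches("===CYCLE_END===").count();
    starts.min(ends)
}

// Poll until the generator has finished `want` cycles, then return the log.
// A real harness would re-read pod logs and sleep between polls.
fn wait_for_cycles<F: FnMut() -> String>(mut read_log: F, want: usize) -> String {
    loop {
        let log = read_log();
        if complete_cycles(&log) >= want {
            return log;
        }
    }
}

fn main() {
    let mut polls = 0;
    let log = wait_for_cycles(
        || {
            // Simulated log that grows by one complete cycle per poll.
            polls += 1;
            "===CYCLE_START===\nout: ok\n===CYCLE_END===\n".repeat(polls)
        },
        3,
    );
    assert_eq!(complete_cycles(&log), 3);
    println!("waited {polls} polls for 3 cycles");
}
```

The early-failure property falls out of the same loop: verification can run after each poll, so a wrong result in cycle one fails the test immediately.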
21.3 A Randomized Bilateral Test, Step by Step
The E2E suite generates a random bilateral topology:
1. Generate. Create 8 services with random names. For each service, randomly assign 1-3 outbound dependencies (other services in the set) and the corresponding inbound declarations on the targets. Some services get no outbound dependencies. Some get cyclic references (A→B, B→A — both outbound, both inbound). The generator records the expected traffic matrix: which pairs should work and which should be denied.
2. Deploy. Apply all 8 LatticeService CRDs. Wait for all to reach Ready (derivation succeeded, mesh member Ready, policies in place).
3. Generate traffic. Each service runs a traffic generator container (a lightweight HTTP client) that attempts to connect to every other service in the set on each cycle. Connections to declared outbound dependencies should succeed (HTTP 200). Connections to non-declared services should fail with a connection reset (L4 deny) or HTTP 403 (L7 deny).
4. Wait for cycles. The test waits for 3 complete cycles from every traffic generator. Each generator logs `===CYCLE_START===` and `===CYCLE_END===` with the results of every connection attempt between them.
5. Verify. Parse the logs. For each service pair: if the pair is in the expected-allowed set, verify the connection succeeded. If not, verify it failed — and verify the failure is a policy denial (not a timeout, not a DNS error, not an application error). A timeout could mean the policy is correct but the mesh is slow. A policy denial means the bilateral model is enforcing correctly.
6. Cleanup. Delete all 8 services. Garbage collection removes all derived resources.
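The generation step can be sketched concretely. This is a stdlib-only illustration with a toy linear congruential generator standing in for a real seeded RNG; `Lcg` and `generate_topology` are hypothetical names, not the reference implementation's:

```rust
// Sketch: seeded random topology generation. The same seed always yields
// the same topology, so a failing run can be replayed exactly.

struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // Constants from Knuth's MMIX LCG; any fixed LCG works for a sketch.
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
    fn below(&mut self, n: u64) -> usize {
        (self.next() % n) as usize
    }
}

/// Generate `n` services, each with 0-3 random outbound dependencies, and
/// record the expected traffic matrix: allowed[i][j] == true iff service i
/// declared an outbound dependency on service j. Everything else must deny.
fn generate_topology(seed: u64, n: usize) -> Vec<Vec<bool>> {
    let mut rng = Lcg(seed);
    let mut allowed = vec![vec![false; n]; n];
    for i in 0..n {
        let deps = rng.below(4); // 0..=3 outbound dependencies
        for _ in 0..deps {
            let j = rng.below(n as u64);
            if j != i {
                allowed[i][j] = true;
            }
        }
    }
    allowed
}

fn main() {
    let seed: u64 = 0x1a2b3c4d;
    println!("Randomized bilateral test seed: {seed:#x}");
    let a = generate_topology(seed, 8);
    let b = generate_topology(seed, 8); // same seed => identical topology
    assert_eq!(a, b);
}
```

Cycles (A→B, B→A) fall out naturally: nothing prevents the generator from picking i as one of j's dependencies on a later iteration.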
This test runs 20+ times per E2E suite with different random topologies. Each run tests a different shape of dependency graph. Over hundreds of runs, the test has caught: label selector bugs in CiliumNetworkPolicy generation, SPIFFE identity formatting errors in AuthorizationPolicy, race conditions between mesh-member controller reconciliation and traffic generator startup, and namespace-boundary edge cases in cross-namespace bilateral agreements.
The randomized approach finds bugs that hand-crafted test scenarios miss because the test author’s assumptions about “interesting” topologies are a subset of the topologies that actually occur in production.
21.4 Testing the Independence Guarantee
The independence test. Provision a management cluster and a workload cluster. Complete the pivot. Delete the management cluster. Verify: workloads continue running, nodes can be scaled, a deliberately-killed node is replaced by MachineHealthCheck.
The reference implementation runs this as docker_independence_e2e.rs — a separate E2E test specifically for the self-management guarantee. It’s the definitive test for Chapter 6’s architecture.
Pivot resilience. Interrupt the pivot midway (kill the agent process). Restart. Verify the pivot completes — the idempotent protocol (Chapter 6, Section 6.3) handles retry without duplicating resources.
21.5 Testing Supply Chain
Section titled “21.5 Testing Supply Chain”- Deploy a service with an unsigned image. Verify:
DeployImagegate rejects, status reportsImageVerified: False, no Deployment is created. - Deploy a service with a signed image. Verify: gate passes, Deployment is created.
- Deploy a service with a tag reference (not a digest). Verify:
AllowTagReferencegate rejects. - Add an
AllowTagReferenceCedar policy. Verify: tag is accepted. - Rotate the signing key in the TrustPolicy. Verify: images signed with the old key are rejected on the next derivation cycle.
Each of these is both a unit-testable derivation phase and an integration test that verifies the end-to-end behavior on a real cluster.
21.6 Unit Testing the Operator
Section titled “21.6 Unit Testing the Operator”Integration and E2E tests prove the platform works on a real cluster. Unit tests prove the derivation logic is correct without one.
The operator’s core job is compilation: take a LatticeService spec as input, produce Kubernetes resources as output. That function is pure — no API server calls, no network, no state. Which means it’s unit-testable in milliseconds.
Testing derivation directly. Construct a LatticeService spec in code. Call the derivation function. Assert the output: the Deployment has the right image, the Service has the right ports, the CiliumNetworkPolicy allows exactly the declared bilateral pairs, the ExternalSecret references the right store. No cluster. No reconciliation loop. Just input and output.
```rust
let spec = lattice_service_spec("my-service")
    .with_image("registry.example.com/app:v1.2.3@sha256:abc...")
    .with_dependency("other-service")
    .with_secret("db-password", SecretSource::Vault("secret/db"));

let resources = derive(&spec, &platform_config);

assert_eq!(resources.deployment.spec.containers[0].image, spec.image);
assert!(resources.network_policies.iter().any(|p| p.spec.egress_allows("other-service")));
assert!(resources.external_secrets.iter().any(|es| es.spec.data[0].remote_ref.key == "secret/db"));
```

This test runs in under a millisecond. A developer changing derivation logic gets feedback instantly.
Snapshot testing. Render the full set of derived resources for a known spec to YAML or JSON. Store the output as a snapshot file checked into the repository. On each test run, re-derive and compare against the stored snapshot. When derivation logic changes intentionally, the snapshot diff shows exactly what changed across every derived resource — Deployments, Services, NetworkPolicies, ExternalSecrets, AuthorizationPolicies, all of it. Review the diff, update the snapshot, commit. When derivation logic changes unintentionally, the snapshot test fails and the diff shows exactly what broke. This catches the class of bugs where fixing one resource’s derivation silently changes another.
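A minimal snapshot check can be built from the standard library alone. This sketch is illustrative — a real suite might use a dedicated crate such as insta — and `render_resources` stands in for the actual derive-and-serialize step:

```rust
// Sketch: render derived resources to a stable string, compare against a
// stored snapshot file, and surface the diff-able mismatch on drift.

use std::fs;

fn render_resources() -> String {
    // Stand-in for "derive the spec, serialize every resource to YAML".
    "kind: Deployment\nimage: registry.example.com/app:v1.2.3\n---\nkind: Service\nport: 8080\n"
        .to_string()
}

fn check_snapshot(path: &str, rendered: &str) -> Result<(), String> {
    match fs::read_to_string(path) {
        Ok(stored) if stored == rendered => Ok(()),
        Ok(stored) => Err(format!(
            "snapshot drift:\n--- stored\n{stored}\n--- rendered\n{rendered}"
        )),
        // First run: write the snapshot, then fail so a human reviews it.
        Err(_) => {
            fs::write(path, rendered).map_err(|e| e.to_string())?;
            Err(format!("new snapshot written to {path}; review and commit"))
        }
    }
}

fn main() {
    let path = "/tmp/my-service.snap";
    let rendered = render_resources();
    let _ = fs::remove_file(path); // simulate a first run
    assert!(check_snapshot(path, &rendered).is_err()); // writes the snapshot
    assert!(check_snapshot(path, &rendered).is_ok());  // now it matches
    println!("snapshot stable");
}
```

The failure message carries both versions, so an intentional change reviews as a diff and an unintentional one is immediately visible.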
Property-based testing. Fixed specs test known scenarios. Property-based tests generate random specs and assert invariants that must hold regardless of input:
- Every derived Deployment has CPU and memory limits set.
- Every service that declares outbound dependencies produces CiliumNetworkPolicies for exactly those dependencies.
- Every secret reference produces an ExternalSecret with the correct store binding.
- Every service produces a ServiceAccount with a name matching the service.
- No derived resource references a namespace other than the service’s own namespace (unless it’s a cross-namespace bilateral agreement).
Generate hundreds of random specs per test run. Each run explores a different region of the input space. Over time, property-based tests find edge cases that hand-written specs never cover: empty dependency lists, services with 50 dependencies, secret names containing special characters, image references without tags.
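The loop structure can be sketched without any framework. A real suite would likely use a crate such as proptest; here a toy seeded generator produces random specs and a single invariant is asserted over every one. All names (`random_deps`, `derive_policy_targets`) are hypothetical stand-ins:

```rust
// Sketch: a hand-rolled property test. Generate many random inputs from a
// logged seed and assert an invariant that must hold for every input.

struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

// Stand-in for generating a random spec: 0-4 random dependency names.
fn random_deps(rng: &mut Lcg) -> Vec<String> {
    (0..(rng.next() % 5)).map(|i| format!("dep-{i}")).collect()
}

// Stand-in for derivation: the invariant under test is that the derived
// policy set targets exactly the declared dependencies.
fn derive_policy_targets(deps: &[String]) -> Vec<String> {
    deps.to_vec()
}

fn main() {
    let seed = 42; // log the seed so any failure is reproducible
    let mut rng = Lcg(seed);
    for case in 0..200 {
        let deps = random_deps(&mut rng);
        let targets = derive_policy_targets(&deps);
        assert_eq!(
            targets, deps,
            "case {case} (seed {seed}) violated the invariant"
        );
    }
    println!("200 random specs satisfied the invariant");
}
```

The assertion message names the case index and the seed — the same reproducibility discipline Section 21.9 demands of the randomized integration tests.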
Testing Cedar policies. Cedar’s authorization evaluator is a pure function: policy set + request → permit or deny. No cluster needed. Construct a policy set from the platform’s Cedar files, submit known authorization requests, and assert the decision:
```rust
let policies = PolicySet::from_str(include_str!("../policies/platform.cedar"))?;
let request = Request::new(
    r#"LatticeService::"frontend""#.parse()?,
    r#"Action::"DeployImage""#.parse()?,
    r#"Image::"unsigned-image""#.parse()?,
    context,
);
let response = authorizer.is_authorized(&request, &policies, &entities);
assert_eq!(response.decision(), Decision::Deny);
```

Test every gate: DeployImage, AccessSecret, AllowTagReference, OverrideSecurity, AccessExternalEndpoint, AccessVolume. Test the positive case (policy permits) and the negative case (policy denies). Test edge cases: what happens when no policy matches (default deny)? What happens when two policies conflict? Run these in CI on every commit. Policy regressions are caught before they reach a cluster.
Mock API server vs. real cluster. Unit tests should never talk to a real API server. The API server introduces latency, flakiness, and ordering dependencies that destroy the speed and reliability that make unit tests valuable. Use an in-process test API server (a minimal etcd and API server spun up for testing) for tests that need to exercise the reconciliation loop, or use a mock client that records created, updated, and deleted resources for tests that only need to verify what the operator would do. Integration tests (Section 21.1) are where real clusters belong.
The split is clear: unit tests validate logic, integration tests validate behavior. Both are necessary. Unit tests run in seconds across the entire derivation surface. Integration tests run in minutes against real infrastructure. The developer runs unit tests on every save. CI runs integration tests on every PR.
21.7 Coverage Discipline
- 80% minimum. Development halts below this. The reference implementation enforces this in CI.
- 90% target. The standard the platform team aims for.
- 95% on critical paths. Provisioning, pivot, Cedar evaluation, secret resolution, network policy derivation, image verification.
Why strict coverage for platform code: untested application code fails visibly — the feature doesn’t work. Untested platform code fails invisibly — a policy isn’t enforced, a secret isn’t authorized, a network path isn’t denied. The consequences are security gaps, not user-facing bugs.
21.8 Avoiding Flaky Tests
Flaky tests in a platform suite are worse than no tests — they train the team to ignore all failures.
Cycle-based verification (Section 21.2) over flat timeouts. Wait for work, not time.
Wait for conditions (resource status, pod readiness, policy enforcement) instead of sleeping.
Diagnostic dumps on failure. When a test fails, automatically collect: pod status, CRD status, event logs, ztunnel logs, Cilium policy audit, Cedar evaluation logs. The reference implementation’s test harness dumps diagnostics for every failed test — turning a “flaky test” into a “test that failed with enough context to diagnose why.”
Deterministic resource names with test-specific prefixes to avoid collisions between concurrent runs.
Cleanup on failure. Tests that fail must clean up their resources. Leaked resources from a failed test cause cascading failures in subsequent tests.
The stance on flaky tests: fix immediately or delete. Quarantine is a short-term measure with a 1-week SLA — if the root cause isn’t found in a week, delete the test and file a bug to rewrite it. A quarantined test is a silently dead test. A test suite with 5 quarantined tests is a test suite with 5 blind spots that nobody is actively fixing. The discipline is uncomfortable but necessary: if you can’t make a test reliable, you don’t have a test — you have noise.
21.9 Deterministic Reproduction
Every randomized test must log its seed. This is non-negotiable. A randomized bilateral test that generates 8 services with random dependencies and fails on one topology is useless if you can’t reproduce it. The test harness logs the seed at the start of each randomized run:
```rust
info!("Randomized bilateral test seed: 0x1a2b3c4d");
```

Re-running with `--seed 0x1a2b3c4d` reproduces the exact topology that failed. Without seed logging, debugging randomized failures becomes “run it again and hope it fails the same way” — which it usually doesn’t.
Determinism is a first-class design goal. The platform depends on reconciliation, eventual consistency, and distributed state — all sources of non-determinism. Tests must fight this:
- Use deterministic resource names (not `generateName`).
- Wait for specific conditions, not time.
- Log enough context that a failure can be replayed from the log alone.
- Assert on stable properties (policy exists, connection denied), not transient state (pod count during scaling).
If a test can’t be reproduced from its output, the test is incomplete.
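The replay path is trivial to implement but easy to forget. A stdlib-only sketch (the `--seed` flag is from the text above; `parse_seed` is a hypothetical helper):

```rust
// Sketch: parse the seed logged by a failed run so the exact topology can
// be regenerated.

fn parse_seed(s: &str) -> Option<u64> {
    let hex = s.strip_prefix("0x").unwrap_or(s);
    u64::from_str_radix(hex, 16).ok()
}

fn main() {
    // The harness logged: "Randomized bilateral test seed: 0x1a2b3c4d"
    let seed = parse_seed("0x1a2b3c4d").expect("bad seed");
    assert_eq!(seed, 0x1a2b3c4d);
    // Re-seed the topology generator with this value to rebuild the
    // exact dependency graph that failed.
    println!("replaying topology for seed {seed:#x}");
}
```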
21.10 Fault Injection and Chaos Testing
Correctness tests verify the platform works. Fault injection tests verify it recovers.
The reference implementation includes chaos tests (chaos.rs) that inject failures during critical operations:
- Kill the controller mid-reconcile. The derivation pipeline is processing a service — halfway through Layer 1, the controller process is killed. On restart, the controller sees a partially-derived service (some resources exist, some don’t). Does it recover cleanly (re-derive from scratch) or leave orphaned resources?
- API server latency injection. Add 500ms latency to every API server call. Does the controller time out? Do watch events back up? Does the reconciliation queue grow unbounded? The reference implementation should handle this gracefully — slower derivation, not broken derivation.
- Kill ztunnel during a bilateral test. A mesh failure during traffic verification. Does the test correctly distinguish “policy denied” from “mesh down”? Does the platform’s compliance controller detect the ztunnel failure?
- Network partition between agent and cell. The gRPC stream drops. Does the workload cluster continue operating? Do cached policies persist? Does the agent reconnect with exponential backoff? This validates the Chapter 7 “useful, not required” principle.
- Interrupt the pivot. Kill the agent process mid-pivot. Restart. Does the idempotent protocol complete without duplicating resources? This is already covered in Section 21.4 but belongs in the chaos testing framework.
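Latency injection, the simplest of these, can be sketched as a decorator around the API client. Everything here is hypothetical — a toy `ApiClient` trait stands in for a real Kubernetes client:

```rust
// Sketch: a chaos wrapper that adds a fixed delay to every API call, so a
// test can observe whether the controller degrades gracefully under latency.

use std::thread;
use std::time::{Duration, Instant};

trait ApiClient {
    fn get(&self, name: &str) -> String;
}

struct FakeClient;

impl ApiClient for FakeClient {
    fn get(&self, name: &str) -> String {
        format!("resource:{name}")
    }
}

/// Chaos wrapper: same interface, but every call pays `delay` first.
struct SlowClient<C: ApiClient> {
    inner: C,
    delay: Duration,
}

impl<C: ApiClient> ApiClient for SlowClient<C> {
    fn get(&self, name: &str) -> String {
        thread::sleep(self.delay); // injected fault
        self.inner.get(name)
    }
}

fn main() {
    let slow = SlowClient {
        inner: FakeClient,
        delay: Duration::from_millis(50),
    };
    let start = Instant::now();
    let out = slow.get("my-service");
    assert_eq!(out, "resource:my-service"); // behavior unchanged
    assert!(start.elapsed() >= Duration::from_millis(50)); // latency paid
    println!("call survived injected latency");
}
```

Because the wrapper implements the same trait, the controller under test needs no changes — the fault is injected purely at the composition boundary.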
The gap this fills. Integration tests prove the platform works when everything is healthy. Chaos tests prove it recovers when things break. For a platform that manages its own infrastructure (self-managing clusters, Chapter 6), resilience under failure is as important as correctness under normal conditions.
CAPD limitations. The reference implementation runs E2E tests on Docker-based clusters (CAPD). This is fast and reproducible — but it hides entire classes of bugs that only appear on real infrastructure: cloud networking (VPC routing, security groups), IAM and identity propagation, storage latency (EBS vs local NVMe), and load balancer provisioning. CAPD is necessary for fast iteration but insufficient for production confidence. Organizations should run a subset of E2E tests on cloud infrastructure periodically (weekly or before releases) to catch provider-specific regressions.
21.11 Cost and Confidence
Testing is not free. Each tier has a cost:
- Unit tests: Near-zero. CPU time on the developer’s machine or a CI runner. Run thousands per minute.
- Integration tests: Moderate. Requires a long-lived cluster. The cluster costs money (or requires someone to maintain a shared test environment). Randomized tests multiply the cost — 20 topologies × 8 services each = 160 service deployments per run.
- E2E tests: Expensive. Provisioning a cluster from scratch takes 15-20 minutes of cloud resources. Nightly runs cost ~$5-15 per run depending on the provider. Cloud-based E2E tests cost more.
- Chaos tests: Variable. Some (kill a process) are cheap. Others (network partition simulation) require infrastructure tooling (Chaos Mesh, Litmus) that adds operational overhead.
The trade-off: more testing buys more confidence but costs more money and developer attention. The reference implementation’s approach: maximize unit tests (cheap, fast, high coverage), run integration tests on every PR (moderate cost, catches behavior bugs), run E2E nightly (expensive, catches lifecycle bugs), and run chaos tests weekly (catches resilience bugs). This is not the only valid structure — but the principle is: spend the most on the cheapest tests and the least on the most expensive tests.
21.12 The CI Pipeline
The test suite is only useful if it runs automatically. The reference implementation structures CI into three tiers based on feedback speed and blast radius.
Every commit: unit tests. Derivation logic, Cedar policy evaluation, snapshot comparison, property-based tests. These run without a cluster in under 2 minutes. A developer pushing a branch gets pass/fail before they context-switch. Unit tests catch the majority of regressions — wrong resource output, broken policy evaluation, unintended snapshot drift — at near-zero infrastructure cost.
Every PR: integration tests. The full integration suite runs against an existing long-lived cluster. Tests validate bilateral agreements (fixed and randomized topologies), secret routing, image trust, compliance probes, webhook validation, and mesh behavior on real infrastructure. Target: under 10 minutes. PRs are blocked until both unit and integration tests pass. No exceptions — a green CI badge means the derivation logic is correct (unit) and the behavior is correct on a real cluster (integration).
Nightly: full E2E. The complete lifecycle test: provision a cluster from scratch, complete the pivot, deploy services, run the full test suite (34+ modules), tear down. Target: under 30 minutes. Nightly runs catch infrastructure-level regressions — bootstrapping failures, pivot edge cases, provider-specific issues — that integration tests against a stable cluster cannot detect. Nightly failures trigger investigation but do not block daytime development. If a nightly failure persists for two consecutive runs, it escalates to a blocking issue.
The reference implementation runs this pipeline but does not prescribe a CI system. GitHub Actions, GitLab CI, Jenkins, and Buildkite all work. The pipeline structure — what runs when, what blocks what, what feedback latency the developer experiences — matters more than the tool. The key constraint is the 2-minute unit test target: if unit tests slow past that threshold, developers stop running them locally and the fast feedback loop breaks.
21.13 Part VII Complete
The operational layer:
- Chapter 19: Observability by default — metrics, tracing, and logging as automatic platform properties.
- Chapter 20: Backup and disaster recovery — the derivation pipeline as a recovery mechanism.
- Chapter 21: Testing — proving security guarantees and policy enforcement across unit, integration, E2E, and chaos tests.
Part VIII addresses two cross-cutting concerns: cryptographic discipline (Chapter 22) and the platform as product (Chapter 23).
Exercises
21.1. [M10] A developer changes the network policy derivation logic. The E2E suite takes 25 minutes. Design the feedback loop: what runs on every commit? What runs on PR merge? What runs nightly?
21.2. [H30] Randomized bilateral testing generates random dependency graphs. How many random configurations provide meaningful coverage? Is there a formal bound? How do you ensure reproducibility — if a random test fails, can you replay the exact graph?
21.3. [R] The platform has 90% test coverage. The untested 10% includes: error handling for malformed Cedar syntax that passes the webhook, edge cases in mixed-content secret interpolation, and the unpivot path when the parent’s API server is slow. Is 90% enough for critical-path platform code? What is the risk of the untested 10%?
21.4. [M10] A flaky test fails 5% of the time. The E2E suite has 50 tests. On average, 2.5 tests fail per run. The team ignores all failures. How do you fix this — quarantine flaky tests, delete them, or fix the root cause?
21.5. [H30] The E2E suite uses Docker-based clusters (CAPD). Production runs on AWS/Proxmox. What does the Docker test catch that’s production-relevant? What does it miss? Should the platform also run E2E on cloud infrastructure? What’s the cost?