
Chapter 19: Observability by Default

A team deploys a service. It runs for three weeks. Then it has an incident. The on-call engineer opens the dashboards and discovers: there are no dashboards. Nobody configured observability.

The decisions are universal: automatic vs. opt-in observability, platform vs. application instrumentation, and the resilience of the observability layer itself. The reference implementation uses VictoriaMetrics; the same patterns apply to Prometheus, Thanos, Mimir, or any metrics stack that supports service discovery. Three decisions shape the design:

  • Should observability be opt-in or automatic? (Opt-in means some services have no metrics. Automatic means every service is observable from deploy — but you pay the storage cost for services nobody looks at.)
  • What does the platform instrument vs. the application? (Infrastructure metrics vs. business metrics — who configures what?)
  • What happens when observability itself fails? (The metrics agent is a pod. It can crash. Should the platform depend on its own observability?)

The derivation pipeline’s extension phases (Chapter 8) include VMServiceScrape generation. For every compiled service, the pipeline produces:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: checkout
  namespace: commerce
  ownerReferences:
    - apiVersion: lattice.dev/v1alpha1
      kind: LatticeService
      name: checkout
spec:
  selector:
    matchLabels:
      lattice.dev/service: checkout
  endpoints:
    - port: http
      interval: 30s
      path: /metrics

The developer’s spec doesn’t mention metrics. There is no metrics.enabled: true. The scrape target exists because the service exists. It’s owned by the LatticeService CRD through owner references — delete the service and the scrape target is garbage-collected.

VMAgent discovers this VMServiceScrape and begins scraping the service’s /metrics endpoint every 30 seconds. From the moment the first pod starts responding to health checks, its metrics are collected.

What the platform collects automatically:

  • Request rate, latency distribution, error rate (RED metrics) — from the mesh data plane, not the application.
  • Resource utilization per pod (CPU, memory, network) — from kubelet metrics.
  • Network policy decisions — from Cilium Hubble.
  • Derivation status — compilation success/failure rates, derivation latency.

What the application adds: Business metrics (orders/second, revenue, model accuracy). The application exposes these on the same /metrics endpoint. The VMServiceScrape already targets it — no additional configuration needed. The platform provides the plumbing; the application provides the meaning.

Collection. VMAgent (VictoriaMetrics) or Prometheus agent runs on each cluster. It discovers scrape targets through VMServiceScrape and PodMonitor resources. The scrape interval defaults to 30 seconds — configurable in the platform’s observability configuration, overridable per service through the observability block in the LatticeService spec.
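A per-service override might look like the following fragment of the LatticeService spec. The field names under observability are an assumption for illustration; the chapter specifies only that an observability block exists.

```yaml
# Hypothetical LatticeService fragment; the observability.metrics field names
# are illustrative -- the spec defines an observability block, not this schema.
apiVersion: lattice.dev/v1alpha1
kind: LatticeService
metadata:
  name: checkout
  namespace: commerce
spec:
  observability:
    metrics:
      interval: 10s   # overrides the platform-wide 30s default
```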

Storage. Each cluster runs its own VictoriaMetrics instance for fast local queries. Long-term storage can be centralized (VictoriaMetrics cluster mode with S3 backend, or Prometheus + Thanos) for cross-cluster dashboards and historical analysis.

Multi-cluster. Self-managing clusters (Chapter 6) must be observable locally — they can’t depend on a central metrics backend on another cluster. If the outbound gRPC stream (Chapter 7) drops, the cluster continues collecting and storing metrics locally. Cross-cluster dashboards use the centralized storage, which aggregates from each cluster’s local store.
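One way to wire this up with the VictoriaMetrics operator is a per-cluster VMAgent with two remote-write targets: the local store for fast queries and the central aggregator for cross-cluster dashboards. The hostnames below are placeholders; vmagent buffers to disk while the central link is down.

```yaml
# Sketch of a per-cluster VMAgent writing to both the local store and the
# central aggregator; URLs are placeholders.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: cluster-agent
  namespace: monitoring
spec:
  remoteWrite:
    - url: http://vmsingle-local.monitoring.svc:8429/api/v1/write
    - url: https://vm-central.example.com/api/v1/write  # buffered on disk during partitions
```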

GPU metrics. DCGM metrics (Chapter 17) are collected through the same infrastructure. The DCGM exporter runs as a DaemonSet on GPU nodes. The platform generates a scrape target during cluster bootstrapping (Chapter 5). GPU metrics and service metrics live in the same backend.
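The bootstrap-generated scrape target could take the following shape. The namespace, label selector, and port assume the NVIDIA gpu-operator's default dcgm-exporter deployment and may differ per install.

```yaml
# Illustrative scrape target for the DCGM exporter DaemonSet; labels and port
# assume gpu-operator defaults (dcgm-exporter listens on 9400 by default).
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMPodScrape
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```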

Cost. Automatic observability means every service is scraped. At 200 services, 30-second intervals, ~5-10KB per scrape (a minimal service with only RED metrics might produce ~1KB, but services with rich instrumentation — histograms, custom counters, per-endpoint breakdowns — routinely produce 10-50KB): ~2-4MB/minute, ~3-6GB/day, ~90-180GB/month. Manageable, but not negligible. At 2,000 services with high-cardinality custom metrics, storage becomes a real cost. The platform should provide retention policies (how long to keep high-resolution data), downsampling (reduce resolution for older data), and per-team storage quotas.
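A platform-level configuration for these three levers might look like the following. Every field name here is hypothetical; the chapter names the levers (retention, downsampling, quotas) but not a schema.

```yaml
# Hypothetical platform observability config; all field names are illustrative.
observability:
  metrics:
    retention:
      fullResolution: 30d   # keep raw 30s-resolution samples for a month
      downsampled: 12mo     # lower-resolution rollups for older data
    quotas:
      defaultPerNamespace: 50GiB
      overrides:
        commerce: 200GiB    # a high-cardinality team pays for more
```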

Waypoint proxies inject W3C Trace Context headers (traceparent, tracestate) into proxied requests. Ztunnel operates at L4 and cannot modify HTTP headers. Traces span service calls within a cluster. Cross-cluster traces work when both clusters have waypoint proxies and direct mesh connectivity between clusters (Chapter 15). The sending cluster’s waypoint proxy adds trace headers, ztunnel encrypts for transit, and the receiving cluster’s ztunnel terminates mTLS. The receiving waypoint proxy sees the HTTP request — including trace headers — and produces its own spans. The parent is not involved in the data path; it only compiled the Istio multi-cluster configuration that makes the direct connection possible.
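For reference, a W3C traceparent header carries four dash-separated fields: version, 128-bit trace ID, 64-bit parent span ID, and trace flags (01 = sampled). The value below is the example from the W3C Trace Context specification:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```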

What the platform traces automatically: L4 connection metrics come from ztunnel (connection duration, bytes transferred). HTTP-level request spans — method, path, status code, latency — come from waypoint proxies. Together these show inter-service latency and where time is spent in the mesh, but namespaces without a waypoint proxy only get L4 metrics, not HTTP-level traces.

What the application traces: Internal spans — database queries, cache lookups, external API calls, business logic. The platform provides the propagation context (the traceparent header via the waypoint proxy). But the application must propagate it — if a service receives a request with a traceparent header but doesn’t include it in its outbound call to a dependency, the trace breaks at that service. The mesh can’t fix this because it only sees the connection, not the application’s HTTP client code. Most frameworks (Spring, Express, Flask with OpenTelemetry middleware) auto-propagate. Custom HTTP clients need explicit header forwarding.

Sampling. Full tracing (every request) is expensive. The platform provides a default sampling rate (e.g., 1% of requests, or all requests with errors). Teams can override per service. Head-based sampling (decide at trace start) is simpler. Tail-based sampling (decide after trace completion) captures errors but requires buffering all spans.
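A per-service sampling override could sit in the same observability block; the field names below are illustrative, not part of the documented spec.

```yaml
# Hypothetical tracing override in the service spec; field names are assumptions.
observability:
  tracing:
    sampling:
      strategy: head   # decide at trace start; tail-based requires span buffering
      rate: 0.01       # sample 1% of requests
```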

Distributed tracing requires the most effort per unit of value. The platform provides mesh-level spans automatically. Application-level spans require the developer to instrument their code. Teams that don’t instrument get mesh-level traces (useful for inter-service latency debugging) but no application-level traces (useless for business logic debugging). The platform can encourage instrumentation but can’t enforce it.

Metrics tell you what happened. Traces tell you where. Logs tell you why. The platform’s logging story is simpler than metrics or tracing — but it has its own decisions.

The platform doesn’t derive log pipelines. Unlike metrics (VMServiceScrape is auto-generated) and tracing (mesh injects headers automatically), the platform doesn’t create per-service logging configuration. Logging is infrastructure-level: a DaemonSet (Fluent Bit, Vector, or a similar agent) runs on every node, tails container stdout/stderr from the kubelet’s log directory, and ships to the configured backend (Loki, Elasticsearch, CloudWatch).

Every container that writes to stdout gets its logs collected. No per-service configuration. No opt-in. The developer writes println or uses their language’s logger; the platform ships it.

Structured vs. unstructured. The platform can’t enforce structured logging — it can’t control what the application writes to stdout. But it can encourage it: the service spec’s observability block can declare logging.format: json, and the platform can validate that the service’s container logs parse as JSON during integration tests. Services that emit unstructured logs still get collected, but they’re harder to query and filter.
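In context, the declaration from the text looks like this; logging.format: json is stated in the chapter, and the surrounding nesting follows the same observability block.

```yaml
# The logging.format declaration shown in the service spec's observability block.
observability:
  logging:
    format: json   # validated by parsing container logs during integration tests
```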

Log levels and volume. A service emitting 10KB/s of logs at DEBUG level consumes ~26GB/month per replica. At 200 services, that’s 5TB/month — a real storage cost. The platform should provide default log retention (7 days at full resolution, 30 days at reduced/sampled resolution) and per-namespace volume quotas. When a service exceeds its logging quota, the platform drops logs rather than crashing the log agent — the service keeps running, but the excess logs are lost.

Correlation. The mesh injects trace IDs (traceparent header) into proxied requests. If the application logs include the trace ID (most logging frameworks support this with OpenTelemetry integration), logs and traces can be correlated: click a trace span, see the logs from that request. Without trace ID in logs, the correlation requires timestamp matching — imprecise and fragile.

What the platform logs automatically:

  • Platform operator events (derivation success/failure, policy changes, Cedar evaluation results).
  • Mesh proxy logs (connection attempts, authorization decisions, mTLS handshake failures).
  • Kubernetes events (pod scheduling, container starts/restarts, health check failures).

These platform logs are the first place an on-call engineer looks when a service is failing. They don’t require application instrumentation and they cover the infrastructure layer that the application can’t see.

The metrics agent is a pod. The metrics backend is a service. Both can fail.

VMAgent down. Scrapes stop. The gap is lost — time-series data that wasn’t collected can’t be recovered. The platform should monitor VMAgent health through Kubernetes liveness probes and a separate lightweight monitor that alerts if VMAgent stops reporting.

Metrics backend down. VMAgent buffers locally (configurable). When the backend recovers, the buffer is flushed. If the outage exceeds the buffer, data is lost. Size the buffer for the expected maximum outage.
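With vmagent, the on-disk buffer per remote-write target is bounded by a flag; one way to size it through the operator CRD is the extraArgs passthrough shown below (a sketch, not the reference implementation's config). At the ~4MB/minute estimated earlier for 200 services, 20GB covers roughly a three-day backend outage.

```yaml
# Sketch: cap vmagent's on-disk remote-write buffer via the operator's extraArgs.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAgent
metadata:
  name: cluster-agent
  namespace: monitoring
spec:
  extraArgs:
    remoteWrite.maxDiskUsagePerURL: "20GB"
```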

The platform shouldn’t depend on its own observability. Self-managing clusters operate without the parent. They should also operate without their metrics backend. A metrics outage affects visibility, not workload health.

Alerts split into two categories with different owners.

Infrastructure alerts are the platform team’s responsibility. The platform installs these during bootstrapping — they exist before any service is deployed:

  • Node NotReady for > 5 minutes
  • Control plane component unhealthy
  • ztunnel DaemonSet not fully ready
  • Default-deny CiliumClusterwideNetworkPolicy missing
  • Certificate expiring within 10% of lifetime
  • Compilation failure rate > threshold
  • VMAgent scrape failures

These alerts fire to the platform team’s on-call rotation. They indicate the platform itself is degraded — not a specific service.

Service alerts are derived from automatically collected RED metrics. The reference implementation does not yet derive alert rules from the service spec — this is a gap. The current model: the platform team configures global alert rules (e.g., “alert when any service’s error rate exceeds 5% for 5 minutes”) that apply to all services. Per-service thresholds are not configurable in the spec.

The natural evolution is an observability.alerts block in the service spec:

observability:
  alerts:
    errorRate:
      threshold: "5%"
      window: "5m"
    p99Latency:
      threshold: "500ms"
      window: "5m"

The pipeline would derive alert rules (VMRule or PrometheusRule) from this block, the same way it derives VMServiceScrape from the service’s existence. This is not implemented — it’s the obvious next step for a platform that derives everything else.
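The derived output for the checkout spec might look like the VMRule below. This is a sketch of what the pipeline could emit; the PromQL assumes Istio's standard istio_requests_total metric from the waypoint proxies.

```yaml
# Illustrative VMRule derived from observability.alerts; the expression assumes
# Istio's standard istio_requests_total metric and its label names.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: checkout-alerts
  namespace: commerce
spec:
  groups:
    - name: checkout.red
      rules:
        - alert: CheckoutErrorRateHigh
          for: 5m
          expr: |
            sum(rate(istio_requests_total{destination_service_name="checkout",response_code=~"5.."}[5m]))
              /
            sum(rate(istio_requests_total{destination_service_name="checkout"}[5m])) > 0.05
```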

Alert routing. Who gets paged? Infrastructure alerts go to the platform team. Service alerts should go to the service owner. The platform can derive routing from the service’s namespace or team label — but the routing destination (PagerDuty service, Slack channel, email group) must be configured by the team, not derived by the platform. The reference implementation uses namespace-level annotations for alert routing targets.
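The namespace-level annotations might look like this; the annotation keys are assumptions for illustration, since the chapter specifies the mechanism but not the schema.

```yaml
# Illustrative alert-routing annotations on a namespace; key names are assumptions.
apiVersion: v1
kind: Namespace
metadata:
  name: commerce
  labels:
    team: commerce
  annotations:
    lattice.dev/alert-pagerduty-service: "commerce-oncall"
    lattice.dev/alert-slack-channel: "#commerce-alerts"
```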

The boundary. The platform alerts on what it can observe without application knowledge: request rate, error rate, latency, resource utilization, certificate health, policy presence. The application alerts on what only it knows: business metrics (orders/second below threshold), queue depth, model accuracy degradation. The platform provides the alerting infrastructure (VMAlert, Alertmanager); the application provides the rules for business-specific conditions.

Scenario: silent metrics gap. VMAgent crashes on a cluster at 2 AM. Kubernetes restarts it in 30 seconds. During those 30 seconds, one scrape cycle is missed. No data is collected. No alert fires — the metrics system can’t alert about its own failure because the metrics system is down.

At 2:15 AM, a service has a spike in error rate. At 9 AM, the on-call engineer investigates. They open the metrics dashboard and see: the error rate graph has a 30-second gap at 2:00 AM, then the spike at 2:15 AM. Did the spike start during the gap? Was the VMAgent crash related to the service issue? They can’t tell — the data from those 30 seconds is gone.

The lesson. Meta-monitoring (monitoring the monitoring) is not optional. A separate lightweight process — not VMAgent itself — must watch VMAgent’s health and alert through a different channel (Kubernetes events, a separate alerting pipeline) when scraping stops. The reference implementation uses VMAgent’s own liveness probe plus a compliance controller probe that verifies scrape freshness: “is the most recent data point for this service less than 2 minutes old?”
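A freshness check along these lines could run as a rule evaluated by the independent monitor (a sketch, not the reference implementation's probe). One PromQL subtlety: once a series has gone stale (roughly five minutes with no samples), it disappears from queries, so the age check is paired with an absent() arm.

```yaml
# Sketch of the freshness probe as a rule for the separate monitor.
# timestamp(up) is the time of the newest `up` sample; after staleness (~5m
# with no samples) the series vanishes, hence the absent() arm.
groups:
  - name: meta-monitoring
    rules:
      - alert: ScrapeFreshnessLost
        expr: (time() - timestamp(up)) > 120 or absent(up)
        for: 1m
```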

Scenario: dashboard shows stale data during network partition. A multi-cluster setup with centralized dashboards. Cluster-west loses connectivity to the central VictoriaMetrics. Local metrics continue collecting (the cluster’s own VMAgent + local VM still work). But the central dashboard shows cluster-west’s metrics frozen at the time of the partition.

The on-call engineer sees “cluster-west: 10 nodes, P99 latency 200ms” on the dashboard — but that data is 2 hours old. Cluster-west actually has 3 nodes down and P99 at 5 seconds. The dashboard doesn’t indicate staleness.

The lesson. Cross-cluster dashboards must show data freshness timestamps. “Last updated: 2 hours ago” is more useful than showing stale numbers without context. The central VictoriaMetrics should track the last-received timestamp per remote-write source and expose it as a metric that the dashboard consumes.
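Assuming each cluster's remote-write stream attaches a cluster external label, the central store can record per-cluster data age with a rule like this sketch, which dashboards then render next to each cluster's panels.

```yaml
# Illustrative recording rule on the central store: seconds since the newest
# sample per cluster, assuming a `cluster` external label on each stream.
groups:
  - name: cluster-freshness
    rules:
      - record: cluster:last_sample_age:seconds
        expr: time() - max by (cluster) (timestamp(up))
```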

19.8 Debugging an Incident with Platform Observability


Walk through what happens when a service has a latency spike.

t=0. The checkout service’s P99 latency jumps from 200ms to 2 seconds. The platform’s alerting fires: P99 latency for service commerce/checkout has exceeded its 500ms threshold.

t=1m. The on-call engineer opens the service’s metrics dashboard (auto-generated from the VMServiceScrape). They see:

  • Request rate is steady — no traffic spike.
  • Error rate jumped from 0.1% to 15%.
  • P99 latency correlates with the error rate spike.

t=2m. The engineer checks the mesh-level traces. The automatic mesh spans show: requests from checkout to payments are taking 1800ms. Requests from checkout to orders-db are normal (5ms). The latency is in the payments dependency, not in checkout itself.

t=3m. The engineer checks payments’ metrics. Payments is healthy — low latency, low error rate for most callers. But the traces show checkout’s requests to payments are being queued at the waypoint proxy. The AuthorizationPolicy evaluation is slow.

t=5m. Root cause: a Cedar policy change added a complex forbid rule. The derivation pipeline re-compiled the bilateral agreements and produced a more complex AuthorizationPolicy — the waypoint proxy evaluates this derived policy on every request. The evaluation time increased from 1ms to 50ms, and under checkout’s request volume (500 RPS), the queuing cascades.

What the platform provided without anyone configuring it:

  • The latency alert (derived from RED metrics automatically collected by the VMServiceScrape).
  • The per-dependency latency breakdown (HTTP-level spans from waypoint proxies).
  • The correlation between checkout’s latency and payments’ authorization evaluation (trace propagation through the bilateral agreement path).

What the platform couldn’t provide: The root cause — that a Cedar policy change caused the latency. The engineer needed to correlate the timing of the policy update (from CedarPolicy CRD events) with the latency spike. The platform could improve by exposing policy evaluation latency as a metric on the waypoint proxy.

This incident took 5 minutes to diagnose because the observability was already in place. Without automatic metrics and tracing, the engineer would have started with “something is slow” and spent 30 minutes instrumenting before they could begin diagnosing.

The platform derives metrics scrape targets automatically but not dashboards. A developer deploys a service, VMServiceScrape is created, metrics flow into VictoriaMetrics — and then the developer has to build their own Grafana dashboard manually. This is a gap in the “observable from deploy” promise.

The natural extension: for every VMServiceScrape, derive a GrafanaDashboard resource (or a ConfigMap-based Grafana provisioner) that shows RED metrics for that service — request rate, error rate, and latency distribution. The design is a dashboard template parameterized by service name and namespace. The derivation pipeline fills in the parameters at compile time. The developer gets a working dashboard the moment their service is deployed, the same way they get a scrape target.

The reference implementation does not do this yet — it is listed in the roadmap appendix as a not-yet-implemented feature. The template approach is straightforward (Grafana’s JSON model supports variable substitution), but the details matter: which panels, which time ranges, how to handle services with custom metrics alongside the standard RED panels. The goal is a useful default that covers 80% of debugging scenarios, not a comprehensive dashboard that tries to show everything.
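One concrete shape for the derived artifact, assuming the widely used Grafana sidecar pattern that loads ConfigMaps carrying a grafana_dashboard label (an assumption; the chapter mentions a ConfigMap-based provisioner without naming one). The JSON body is a stub standing in for the parameterized RED template.

```yaml
# Sketch of a derived dashboard as a ConfigMap, assuming a Grafana sidecar
# that watches for the grafana_dashboard label. The JSON body is a stub.
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkout-red-dashboard
  namespace: commerce
  labels:
    grafana_dashboard: "1"
data:
  checkout-red.json: |
    {"title": "RED: commerce/checkout", "panels": []}
```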

19.1. [M10] A service exposes custom metrics on port 9090, not the default 8080. The VMServiceScrape targets port http (the service’s primary port). The custom metrics are never collected. How should the developer configure a non-default metrics port?

19.2. [H30] Design the multi-cluster metrics architecture for 50 clusters. Each has VMAgent + local VictoriaMetrics. A central cluster aggregates. What’s the data flow? What happens when 5 clusters are disconnected? Stale data, gaps, or errors on the dashboard?

19.3. [R] Automatic observability collects RED metrics for every service. A security team argues that request rate and latency reveal business activity patterns and collecting them without consent violates data classification policy. Can observability conflict with security? How should the platform handle it?

19.4. [M10] The platform defaults to 30-second scrape intervals. A team wants 5-second scrapes for a latency-sensitive service. What is the impact on VMAgent and storage? Should the platform allow per-service overrides? What guardrails prevent abuse?

19.5. [H30] Storage grows to 1.8TB/month at 2,000 services. Design retention and downsampling: how long at full resolution? When does downsampling start? How do you balance query accuracy against storage cost?