
Chapter 10: Autoscaling, Quotas, and Cost

The derivation pipeline sees every resource request before it becomes a pod. This is a unique enforcement point — between the developer’s intent and the scheduler’s allocation. Three decisions arise at this point:

  • When do you enforce resource limits? (At scheduling time, when it’s too late to give useful feedback? Or at derivation time, when you can reject with a clear message?)
  • What do you scale on? (CPU utilization is the default and often wrong. The metric you choose determines whether scaling actually helps.)
  • What do you protect automatically? (PodDisruptionBudgets, cost estimates — should these be developer configuration or platform features?)

In the standard Kubernetes model, resource governance happens at scheduling time. The developer creates a pod, the scheduler tries to place it, and if the namespace’s ResourceQuota is exceeded, the pod sits in Pending. The developer discovers the problem by inspecting pod events — not by reading the service spec’s status.

In the derivation model, the pipeline checks resources before any pod exists. The LatticeService spec says replicas: 3 with cpu: 500m each — that’s 1500m CPU. The pipeline checks this against the team’s budget. If it exceeds the budget, derivation fails:

status:
  phase: Failed
  conditions:
  - type: QuotaExceeded
    status: "True"
    reason: InsufficientBudget
    message: >-
      Service checkout requests 1500m CPU
      (commerce team budget: 2000m, 1200m in use, 800m available)

The developer sees the problem in the CRD status immediately — not in a pod event hours later. The pipeline rejected the spec before creating any Kubernetes resources. The scheduler never sees a pod that can’t fit.

How quotas are configured. The platform team sets budgets per namespace or per team through the platform’s quota system. The derivation pipeline sums the resource requests of all services in the namespace and checks whether the new service would push the total over budget. This is the same pattern as the rest of the pipeline — the developer declares what they need, the platform evaluates it against constraints.
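The check itself is simple arithmetic over the namespace's declared requests. A minimal sketch (the function name, the dict-of-services input, and the message format are illustrative, not the reference implementation's API):

```python
def check_quota(namespace_requests_m, budget_m, new_service, new_request_m):
    """Sum existing CPU requests (millicores) in the namespace and test
    whether the new service's request fits the remaining budget."""
    in_use = sum(r for name, r in namespace_requests_m.items()
                 if name != new_service)  # updating a service frees its old request
    available = budget_m - in_use
    if new_request_m > available:
        return False, (f"Service {new_service} requests {new_request_m}m CPU "
                       f"(budget: {budget_m}m, {in_use}m in use, "
                       f"{available}m available)")
    return True, "ok"

# The example from the text: checkout asks for 3 x 500m against a 2000m budget
# with 1200m already in use -- derivation fails with a clear message.
ok, msg = check_quota({"cart": 700, "catalog": 500}, 2000, "checkout", 1500)
```

On failure, the returned message is written into the CRD status rather than leaving a pod stuck in Pending.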

The trade-off. Derivation-time quota enforcement catches overcommitment early. But it creates a new failure mode: a service that was within budget yesterday might fail derivation today because another service in the same namespace consumed more resources overnight. The developer didn’t change their spec — something else changed. The status message must be clear about this: “your service hasn’t changed, but the namespace’s resource usage has. Contact the platform team for a budget increase or reduce usage elsewhere.”

The race condition. Two developers apply specs simultaneously. Each requests 500m CPU. The namespace has 800m available. Each derivation checks quota independently, sees 800m available, and proceeds. Both succeed — but the namespace is now 200m over budget (1000m requested, 800m budgeted). This is the classic TOCTOU problem: the time-of-check (quota evaluation) and time-of-use (resource creation) are not atomic.

Mitigation options: pessimistic locking (serialize all derivations per namespace — correct but slow, blocks concurrent deployments), optimistic locking (derive concurrently, check quota again at apply time, retry on conflict — faster but complex), or soft quotas (allow temporary overcommitment, alert when exceeded, rely on the next derivation cycle to detect and report — simplest to implement but weakens the guarantee). The reference implementation uses optimistic locking with a namespace-scoped resource version check at apply time. If two derivations race, one succeeds and the other retries with updated quota state.
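The optimistic-locking variant can be sketched as a check-then-commit loop guarded by a version number. This is a self-contained illustration with an in-memory quota record; the real quota state, its storage, and the retry policy are the platform's, not shown here:

```python
import dataclasses

@dataclasses.dataclass
class QuotaState:
    version: int   # namespace-scoped resource version
    used_m: int    # CPU currently requested, in millicores
    budget_m: int

class Conflict(Exception):
    pass

class QuotaStore:
    """In-memory stand-in for the platform's quota record (illustrative)."""
    def __init__(self, state):
        self.state = state

    def read(self):
        return dataclasses.replace(self.state)

    def commit(self, expected_version, new_used_m):
        # Reject the write if anyone committed since we read the snapshot.
        if self.state.version != expected_version:
            raise Conflict("quota state changed since check")
        self.state = QuotaState(expected_version + 1, new_used_m,
                                self.state.budget_m)

def reserve(store, request_m, retries=3):
    """Check quota, then commit with a version guard; retry on conflict."""
    for _ in range(retries):
        snap = store.read()
        if snap.used_m + request_m > snap.budget_m:
            return False                                 # genuinely over budget
        try:
            store.commit(snap.version, snap.used_m + request_m)
            return True
        except Conflict:
            continue            # another derivation won the race; re-check
    return False
```

When two derivations race, both read the same version; the first commit bumps it, the second raises `Conflict` and re-checks against the updated usage — closing the TOCTOU window described above.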

When the service spec includes scaling parameters, the pipeline generates a KEDA ScaledObject:

spec:
  autoscaling:
    minReplicas: 2
    maxReplicas: 10
    metrics:
    - type: cpu
      target: 70

The pipeline produces:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout
  namespace: commerce
spec:
  scaleTargetRef:
    name: checkout
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
  - type: cpu
    metadata:
      type: Utilization
      value: "70"
  cooldownPeriod: 300
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300

The developer wrote 5 lines of scaling intent. The pipeline produced a ScaledObject with sensible defaults for cooldown (5 minutes) and stabilization (5 minutes of sustained low utilization before scaling down). These defaults prevent flapping — rapid scale-up/scale-down cycles that waste resources and destabilize the service.
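That expansion step can be sketched as a pure function from the spec's autoscaling block to a ScaledObject manifest, with the platform defaults hard-coded. The field names follow the two examples above; the function itself is illustrative:

```python
def derive_scaled_object(name, namespace, autoscaling):
    """Expand a few lines of autoscaling intent into a KEDA ScaledObject
    manifest, filling in the platform's anti-flapping defaults."""
    triggers = [
        {"type": m["type"],
         "metadata": {"type": "Utilization", "value": str(m["target"])}}
        for m in autoscaling["metrics"]
    ]
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": name},
            "minReplicaCount": autoscaling["minReplicas"],
            "maxReplicaCount": autoscaling["maxReplicas"],
            "triggers": triggers,
            "cooldownPeriod": 300,  # platform default: 5 minutes
            "advanced": {"horizontalPodAutoscalerConfig": {"behavior": {
                "scaleDown": {"stabilizationWindowSeconds": 300}}}},
        },
    }
```

Because the defaults live in the derivation function rather than in each spec, the platform team can change the anti-flapping policy fleet-wide in one place.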

KEDA is the scaling engine in the reference implementation because its ScaledObject model maps cleanly to the derivation pipeline. Native HPA is also viable for CPU/memory-only scaling. The pattern is the same regardless of engine: the developer declares scaling intent, the platform derives the autoscaling resource.

The decision: what to scale on. CPU utilization is the default metric and it’s often wrong for microservices. A service that calls a database spends most of its time waiting for the response — CPU is low, but the request queue is growing and latency is degrading. Scaling on CPU sees “20% utilization, no scaling needed” while users experience 5-second response times.

Better metrics for most services:

  • Request rate (requests/second) — scales proportional to load. The platform can derive this from the mesh’s RED metrics without the application exporting anything.
  • Request queue depth — scales when the service can’t keep up. Requires the application or the mesh to expose queue length.
  • Custom application metrics — the application exports a metric (e.g., pending_orders) and the ScaledObject targets it.

The platform should default to a metric that works without application instrumentation (CPU or request rate from the mesh) and let developers override with custom metrics when they know better. The reference implementation’s autoscaling.metrics block supports both KEDA built-in triggers (cpu, memory) and custom metrics from the application or mesh.

Scaling to zero. KEDA supports scaling to zero replicas — useful for dev/staging environments and event-triggered workloads. The pipeline supports this but it’s a significant decision: a service at zero replicas has no availability. The first request after scale-down waits for a pod to start. The spec must explicitly set minReplicas: 0 — there’s no accidental scaling to zero.

Quota interaction. If a service has maxReplicas: 10 at cpu: 500m each, the potential resource consumption is 5000m CPU. Should the quota check validate against the minimum (1000m) or the maximum (5000m)?

The reference implementation validates against the maximum. This is conservative — it reserves more quota than the service typically uses. But the alternative is worse: the service deploys fine at minReplicas, then fails to scale when it needs to because the quota is exhausted. Discovering a quota problem during a traffic spike is more expensive than discovering it at derivation time.

The trade-off. Validating against maxReplicas means teams must have budget for their worst-case resource consumption, even if they rarely use it. A service with maxReplicas: 50 at 500m CPU reserves 25 vCPU (25000m) of quota even if it usually runs at 2 replicas. This incentivizes teams to set realistic maxReplicas rather than padding them “just in case.” Whether this is the right incentive depends on the organization — some teams prefer padding for safety, and the platform should let the platform team decide the quota policy.
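The two validation policies differ only in which replica count they multiply by. A sketch of the arithmetic (the `policy` parameter is illustrative; the reference implementation hard-codes the max policy):

```python
def quota_reservation_m(cpu_per_replica_m, min_replicas, max_replicas,
                        policy="max"):
    """Millicores reserved at derivation time under a given policy."""
    replicas = max_replicas if policy == "max" else min_replicas
    return cpu_per_replica_m * replicas

# The section's examples:
quota_reservation_m(500, 2, 10)          # 5000m reserved (conservative)
quota_reservation_m(500, 2, 10, "min")   # 1000m reserved (risks scale failure)
quota_reservation_m(500, 2, 50)          # 25000m for a padded maxReplicas
```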

Every service with more than one replica gets a PDB. This is a platform feature, not a developer configuration.

The logic:

  • 1 replica: no PDB. There’s nothing to protect — the single pod must be evicted for node drains.
  • 2 replicas: maxUnavailable: 1. At least one pod is always running during voluntary disruptions.
  • 3+ replicas: maxUnavailable: 1. At most one pod offline at a time during drains and upgrades.

The developer doesn’t configure PDBs. They set replicas: 3 and the pipeline derives the appropriate disruption budget. If the developer needs custom PDB behavior (higher maxUnavailable for services that can tolerate more disruption), the spec can override — but the default protects availability.
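The derivation logic for the three cases above fits in a few lines. A sketch, assuming the pods carry an `app: <name>` label for the selector (the label convention and override parameter are illustrative):

```python
def derive_pdb(name, namespace, replicas, max_unavailable_override=None):
    """Derive a PodDisruptionBudget from the replica count.
    1 replica: no PDB (nothing to protect; don't block node drains).
    2+ replicas: maxUnavailable 1, unless the spec overrides it."""
    if replicas <= 1:
        return None
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "maxUnavailable": max_unavailable_override or 1,
            "selector": {"matchLabels": {"app": name}},  # assumed label scheme
        },
    }
```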

PDBs interact with cluster upgrades (Chapter 6). When the platform drains a node during a Kubernetes upgrade, PDBs constrain how many pods can be evicted simultaneously. Aggressive PDBs (maxUnavailable: 1 on every service) slow down drains. The platform team balances availability guarantees against operational velocity — this is a policy decision encoded in the derivation logic.

The trade-off. Automatic PDBs protect availability but can block node drains if a service is already degraded. A 3-replica service with 1 pod in CrashLoopBackOff has 2 healthy pods. The PDB says maxUnavailable: 1, but one pod is already unavailable. The drain controller can evict at most 0 more pods — the drain is blocked. The platform should detect this (PDB blocking drain with pre-existing unhealthy pods) and either override the PDB or alert the operator.

The pipeline knows resource requests, replica counts (or scaling bounds), and the platform’s pricing model. It can estimate cost per service.

status:
  cost:
    estimatedMonthlyCost: "$142.80"
    breakdown:
      cpu: "$86.40"
      memory: "$56.40"
    basis: "maxReplicas (10) × requests"

This appears in the CRD status. It’s informational, not blocking — don’t reject a deployment because it’s expensive. Quotas are the enforcement mechanism. Cost is a signal to the developer and their manager.

The pricing model is a configurable input — a ConfigMap, a CRD, or an external API. Cloud pricing changes. Internal chargeback rates change. Stale pricing is worse than no pricing because it creates false confidence.

The trade-off. Derivation-time cost estimates are approximate. Cloud pricing is per-second with spot, reserved, and on-demand tiers. GPU instances cost 10-100x more than CPU instances. The estimate uses a simplified model (hourly rate × hours/month × resource quantity). This is accurate enough to tell a developer “your service costs ~$150/month” and wrong enough that the finance team shouldn’t use it for billing. It’s a guardrail, not an invoice.
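The simplified model is just rate × hours × quantity per resource. A sketch that produces a status block in the shape shown above (the hourly rates passed in are illustrative, not real cloud prices):

```python
HOURS_PER_MONTH = 730  # common cloud billing convention

def estimate_monthly_cost(cpu_m, memory_mi, max_replicas,
                          cpu_rate_per_core_hr, mem_rate_per_gib_hr):
    """Simplified hourly-rate cost model: a guardrail, not an invoice.
    Ignores spot/reserved tiers, per-second billing, and idle time."""
    cpu_cost = (cpu_m / 1000) * max_replicas * cpu_rate_per_core_hr * HOURS_PER_MONTH
    mem_cost = (memory_mi / 1024) * max_replicas * mem_rate_per_gib_hr * HOURS_PER_MONTH
    return {
        "estimatedMonthlyCost": f"${cpu_cost + mem_cost:.2f}",
        "breakdown": {"cpu": f"${cpu_cost:.2f}", "memory": f"${mem_cost:.2f}"},
        "basis": f"maxReplicas ({max_replicas}) × requests",
    }
```

The rates themselves come from the configurable pricing input discussed above, so a pricing change updates every service's estimate on the next derivation cycle.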

The Vertical Pod Autoscaler watches pod resource usage and adjusts CPU and memory requests to match actual consumption. The derivation pipeline also sets resource requests — it reads them from the LatticeService spec and writes them to the Deployment. These two systems conflict: the pipeline sets cpu: 500m, VPA observes 80m and patches the Deployment to cpu: 120m, the next derivation cycle overwrites VPA’s change back to 500m.

The reference implementation does not use VPA. The developer sets requests in the spec, the pipeline derives them, and no one changes them at runtime. HPA (via KEDA) handles horizontal scaling; node autoscaling (Section 10.6) handles cluster capacity. Vertical right-sizing is the developer’s responsibility.

The cost is real: requests drift from reality. A service deployed at 500m CPU six months ago might use 80m today, wasting 420m of quota and cluster capacity. The developer must manually right-size, and they rarely do.

VPA can work alongside a derivation pipeline — but the pipeline must be designed for it. The naive conflict (pipeline writes cpu: 500m, VPA patches to 120m, next derivation overwrites back to 500m) only occurs if the pipeline blindly overwrites the resources field on every reconciliation. If the pipeline uses server-side apply with a named field manager and doesn’t claim the resources field when VPA is active, VPA’s patches survive. Alternatively, the pipeline can run VPA in recommendation-only mode and surface suggestions in the CRD status for the developer to adopt manually. Either approach works — the design choice is whether the pipeline or VPA owns the resource requests field. The reference implementation hasn’t built either integration yet.
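The field-ownership idea can be sketched by the shape of the pipeline's apply payload: when VPA is active, the pipeline simply omits the resources field so its field manager never claims it, leaving VPA's manager to own it. This is a simplified illustration of the payload construction only — real server-side-apply ownership transfer has more nuance, and the `vpa_active` flag is a hypothetical input:

```python
def deployment_apply_body(name, namespace, image, resources, vpa_active):
    """Build the (partial) Deployment body the pipeline would server-side
    apply under its own field manager. Omitting resources means this
    manager does not claim that field."""
    container = {"name": name, "image": image}
    if resources is not None and not vpa_active:
        container["resources"] = resources  # pipeline owns vertical sizing
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": {"containers": [container]}}},
    }
```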

When KEDA scales a Deployment from 2 to 8 replicas and the cluster doesn’t have capacity for 6 new pods, those pods sit in Pending. Cluster Autoscaler (or Karpenter) sees Pending pods, evaluates their resource requests, and provisions nodes to fit them. The derivation pipeline doesn’t control this — node autoscaling is a cluster-level concern managed by the platform team, not by service owners.

But the pipeline’s resource requests are the input to node provisioning decisions. A pod requesting 500m CPU and 512Mi memory needs a node with at least that much allocatable capacity. Cluster Autoscaler picks the cheapest node type that fits. Karpenter is more flexible — it bins multiple pending pods together and selects an instance type that fits the batch. Either way, the resource requests on the pod determine what gets provisioned.

Over-requesting wastes nodes. A service requesting 500m CPU but using 80m gets scheduled onto a node with 500m reserved. The node’s allocatable capacity is consumed on paper, but the actual utilization is 16%. Cluster Autoscaler provisions more nodes to fit more over-requesting pods. The cluster grows to 20 nodes when 4 would suffice at actual utilization. The cloud bill reflects the requested capacity, not the used capacity.

Under-requesting risks instability. A service requesting 100m CPU but bursting to 800m during load spikes gets scheduled onto a node that’s packed tightly based on requests. When multiple co-located pods burst simultaneously, the node’s CPU is saturated. Pods experience throttling. Memory under-requesting is worse — the kernel OOM-kills the pod, and it restarts on the same overcrowded node.

The derivation pipeline sits between these extremes. It knows the declared requests (from the spec). If the platform connects derivation to metrics (Section 10.7), the pipeline can also surface how far requests diverge from actual usage — giving the developer the information they need to right-size before node waste becomes a cost problem.

Pod overhead matters. The derivation pipeline calculates quota against container resource requests. But the scheduler sees more: CNI agent overhead, mesh proxy memory (ztunnel doesn’t add per-pod overhead in ambient mode, but waypoint proxies consume namespace-level resources), and Kubernetes system reservations. A service requesting 500m CPU may need 550-600m of node capacity once overhead is accounted for. The platform team should factor this into quota budgets — either by adding a percentage buffer to quota calculations or by documenting that the budget represents container requests, not total node consumption.
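The percentage-buffer option reduces to one multiplication. A sketch — the 15% default is an assumed figure for illustration; the right number depends on the cluster's CNI, mesh mode, and system reservations and should be measured:

```python
def node_capacity_needed_m(container_request_m, overhead_fraction=0.15):
    """Container CPU request plus an assumed per-pod overhead buffer
    (CNI agent, proxy, system reservations). 0.15 is illustrative."""
    return round(container_request_m * (1 + overhead_fraction))

# The text's example: a 500m request lands in the 550-600m range of
# actual node capacity once overhead is included.
node_capacity_needed_m(500)        # 575m at a 15% buffer
```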

The reference implementation does not modify Cluster Autoscaler or Karpenter configuration per service. Node autoscaling is cluster policy. The pipeline’s contribution is ensuring that the resource requests it derives are accurate — because those requests cascade into every node provisioning decision the cluster makes.

The derivation pipeline knows what a service requests. The metrics backend knows what the service actually uses. Connecting these two data sources is the obvious next step: surface recommendations in the CRD status so developers can adjust their requests based on observed consumption.

The reference implementation doesn’t do this yet. It’s a natural extension — the pipeline has the request, the metrics backend has the usage, and the CRD status is the feedback channel. A future version could query p95 CPU and memory usage over a rolling window and report: “service checkout requests 500m CPU but uses 82m. Consider reducing to 200m.”

The design constraints for anyone building this:

  • CPU over-provisioning wastes money but doesn’t crash anything. A generous buffer (2-3x observed) is safe.
  • Memory under-provisioning kills pods. The buffer must be tighter but the floor must be higher.
  • New services have no data. Skip recommendations until enough metrics exist (7+ days).
  • Don’t auto-apply. Surfacing recommendations respects the developer’s intent. Auto-applying bypasses it.
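Those four constraints translate almost directly into code. A sketch of what the recommendation logic might look like — the function, its buffer default, and the message wording are hypothetical, since the reference implementation hasn't built this:

```python
def recommend_cpu_request_m(current_request_m, observed_p95_m, days_of_data,
                            buffer=2.5):
    """Surface (never auto-apply) a CPU right-sizing suggestion.
    buffer=2.5 sits in the 2-3x range that is safe for CPU, where
    over-provisioning wastes money but crashes nothing."""
    if days_of_data < 7:
        return "insufficient data; no recommendation"   # new service: skip
    suggested = round(observed_p95_m * buffer)
    if suggested >= current_request_m:
        return "requests look right-sized"
    return (f"requests {current_request_m}m CPU but uses "
            f"{observed_p95_m:.0f}m at p95. Consider reducing to {suggested}m.")
```

A memory variant would follow the same shape but with a tighter buffer and an explicit floor, since under-provisioning memory OOM-kills pods rather than merely throttling them.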

This is mentioned because it’s the natural evolution of derivation-time resource governance — and because the gap between declared and actual resource usage is one of the largest sources of waste in Kubernetes clusters.

Scenario: quota exhaustion from within the same team. The commerce team’s quota is 4000m CPU. Three of their services already use 3200m. A fourth service is deployed requesting 1200m — the total would be 4400m. The pipeline rejects it with QuotaExceeded: "True" and the message: "Service checkout requests 1200m CPU (commerce team budget: 4000m, 3200m in use, 800m available)."

The developer didn’t change anything — their spec is the same as yesterday. But yesterday the namespace had 2800m in use and today it has 3200m because another team member scaled up a different service. The error message must make this clear: the problem is the namespace’s total usage, not this spec.

Scenario: KEDA scales past the budget. A service has maxReplicas: 10 at 500m CPU. The quota check validated against max (5000m) at derivation time — and passed, because the team had budget. Six months later, other services have consumed more quota. KEDA scales from 2 to 8 replicas during a traffic spike. The 7th and 8th replicas create pods that exceed the namespace’s Kubernetes ResourceQuota (if one is configured) and sit in Pending. Or worse — if no ResourceQuota exists, the pods are scheduled and the namespace exceeds its budget without anyone noticing until the monthly cost report.

The gap: derivation-time quota checks validate at deploy time. KEDA scales at runtime. The pipeline doesn’t re-validate quotas on every scale event. The reference implementation addresses this by setting maxReplicaCount on the ScaledObject to the quota-safe maximum at derivation time — but if the team’s available budget shrinks after derivation, the ScaledObject’s max may exceed the current budget.
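Computing that quota-safe maximum is a clamp between the requested bounds and what the budget can absorb. A sketch (illustrative names; the real implementation reads the budget from the quota system):

```python
def quota_safe_max_replicas(requested_max, min_replicas, cpu_per_replica_m,
                            available_budget_m):
    """Clamp the ScaledObject's maxReplicaCount to what the budget can
    absorb at derivation time. This is a snapshot: if the budget shrinks
    afterwards, the clamp is stale until the next derivation cycle."""
    affordable = available_budget_m // cpu_per_replica_m
    return max(min_replicas, min(requested_max, affordable))

# A team with 3000m available can afford only 6 replicas at 500m each,
# so a requested maxReplicas of 10 is clamped to 6.
quota_safe_max_replicas(10, 2, 500, 3000)
```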

The honest answer: derivation-time quotas catch the common case (deploying a service that’s too large). Runtime scaling quotas require Kubernetes-native ResourceQuota as a backstop — or a custom KEDA scaler that checks the platform’s quota system before scaling. Neither is free.

Chapters 8-10 covered the derivation pipeline: structure, secrets, and resource governance. Together they handle the common case — a LatticeService spec derived into a complete, policy-compliant, observable deployment.

Chapter 11 addresses the uncommon case: what happens when something doesn’t fit?

10.1. [M10] A service has minReplicas: 2 and maxReplicas: 10 with cpu: 500m per replica. The quota check validates against max (5000m). A developer objects: “I’m reserving 5000m of quota but only using 1000m. This wastes our team’s budget.” Design a quota model that addresses this — how do you balance worst-case reservation against typical usage? Is there a middle ground between “validate against min” and “validate against max”?

10.2. [H30] Section 10.3 says services with 1 replica get no PDB. A platform team wants to change this — single-replica services should get a PDB to prevent eviction during drains. This is a semantic change (same spec, different behavior). Design the migration: feature flag, opt-in period, default flip, and what happens to services that block node drains because their single pod has a PDB.

10.3. [R] Cost estimation uses a simplified pricing model. A team runs GPU workloads — 8 A100 GPUs at ~$3.50/hour each. The cost estimate says “$20,160/month.” The actual bill is $95,000 because they use spot instances and the GPUs are idle 40% of the time. Is the estimate useful or harmful? At what accuracy threshold does cost estimation become net-negative — creating more confusion than clarity?

10.4. [H30] The quota check runs at derivation time, but KEDA scales the Deployment at runtime. If KEDA scales from 2 to 8 replicas and the team’s quota only allows 6, what happens? The pods are created by KEDA, not by the derivation pipeline, so the pipeline’s quota check doesn’t fire. Design runtime quota enforcement that closes this gap without blocking legitimate scaling.

10.5. [M10] A 3-replica service has one pod in CrashLoopBackOff. The PDB says maxUnavailable: 1. A node drain needs to evict one of the 2 healthy pods. The drain is blocked because 1 pod is already unavailable. What should the platform do — override the PDB, alert the operator, or both? What information does the alert need to include?