
Mastering Kubernetes Horizontal Pod Autoscaler

May 1, 2026 · CloudCops

horizontal pod autoscaler
kubernetes scaling
platform engineering
devops
prometheus

One of the most expensive Kubernetes mistakes looks harmless at first. A service runs with a fixed replica count because nobody wants surprise scaling behavior in production. Then traffic changes, pods saturate, alerts fire, and the team starts editing replica counts by hand while users feel the slowdown. The opposite failure is just as common. Teams pin replicas high enough to survive peak load and then pay for idle capacity the rest of the time.

That’s the operating gap the horizontal pod autoscaler is meant to close. Not as a checkbox feature, but as a control mechanism that keeps workloads responsive without forcing platform teams into constant manual intervention. In production, though, HPA only works well when the surrounding system is healthy: requests and limits are sensible, metrics are trustworthy, startup behavior is understood, and node capacity can follow pod demand.

Basic CPU autoscaling gets you started. Production-grade HPA requires more. It requires choosing metrics that reflect real load, tuning behavior so replicas don’t flap, and coordinating with Vertical Pod Autoscaler and Cluster Autoscaler so the control loops help each other instead of fighting.

Why Automatic Scaling Is Mission-Critical

A common production failure starts the same way. Traffic climbs faster than expected, pods hit their limits, latency stretches, and the on-call engineer starts raising replica counts by hand while trying to keep the service alive. A week later, the same team keeps replicas artificially high to avoid a repeat, and the cloud bill reflects that decision.

Static replica counts fail in both directions. They fail during demand spikes, and they fail during normal low-traffic periods when expensive capacity sits idle. HPA exists to close that gap, but its primary importance is operational: it turns scaling from a human reaction into a control loop.

That shift changes how platform teams run Kubernetes.

In a live cluster, load is uneven. Releases change resource usage. Sidecars consume headroom. Background jobs compete with user-facing services. Manual scaling can work for a handful of workloads, but it does not hold up across dozens of services with different traffic patterns and different failure modes. Someone reacts late, or someone leaves too much buffer in place, and both choices cost money.

The two failure modes teams keep repeating

The first failure mode is under-scaling. Request volume rises, pods saturate, queues build, retries increase, and the symptom spreads beyond the original service. What looks like an application problem often starts as a capacity problem that was handled too slowly.

The second is over-provisioning as a safety habit. Teams set a high replica baseline because fixed excess capacity feels safer than automation. At small scale, that decision is easy to justify. Across a shared platform, it turns into a steady tax on node capacity, cluster growth, and finance reviews.

A production-ready HPA setup helps in three concrete ways:

  • It reduces incident-driven scaling work: engineers stop changing replica counts during live events and spend that time finding the actual bottleneck.
  • It aligns capacity with real demand: workloads add pods during sustained pressure and release them when pressure drops.
  • It exposes weak platform assumptions: bad requests, noisy metrics, and long startup times become visible quickly because autoscaling depends on getting those details right.

Practical rule: If replica count changes only during incidents or budget reviews, scaling is still manual. The work is just delayed.

Why HPA matters beyond basic CPU scaling

HPA has been part of Kubernetes since the early days, which is why many teams enable it early and then stop there. That is usually enough to get basic CPU-based scaling running. It is not enough to make autoscaling reliable in production.

The hard part is not turning HPA on. The hard part is making sure it reacts to signals that match real demand, stays stable during noisy traffic, and works with the rest of the platform instead of creating new contention. A CPU target can be fine for a stateless web service. It can also be the wrong signal for queue workers, event consumers, or services where latency degrades before CPU rises.

That is where mature HPA practice starts to separate from the default setup. Teams need multi-metric scaling, custom metrics from Prometheus, and behavior tuning that prevents replica flapping during short spikes. They also need to coordinate HPA with VPA and the Cluster Autoscaler, because pod-level scaling without node capacity, or with badly tuned requests, creates a different class of outage.

If you want a quick orientation before getting into those production patterns, this overview of what is HPA in Kubernetes is a useful starting point.

For platform teams, automatic scaling is less about convenience and more about control. It protects response times during growth, limits waste during quiet periods, and forces the engineering discipline required to run shared Kubernetes infrastructure well.

How the Horizontal Pod Autoscaler Really Works

A checkout service hits a traffic spike, latency starts to creep up, and the deployment still shows the same replica count. That gap between real demand and available pods is the problem HPA is built to close. In production, the important detail is not that HPA scales. It is how it decides, how quickly it reacts, and where that decision can go wrong.

The HPA controller runs as a Kubernetes control loop. On each evaluation, it reads the target metric, compares current value to desired value, and computes a new replica count with desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The formula is simple. The operational consequences are not.
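
Worked through once: four replicas averaging 80 percent CPU against a 50 percent target gives ceil(4 * 80 / 50) = ceil(6.4) = 7 replicas. The fraction always rounds up, never down, so the controller errs toward extra capacity.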

A diagram illustrating how the Kubernetes Horizontal Pod Autoscaler collects metrics and adjusts deployment pod counts automatically.


What happens during each control loop

HPA starts with scaleTargetRef, finds the workload, and uses that workload’s selector to identify the pods that belong to it. It then requests metrics from the appropriate API. CPU and memory usually come through Metrics Server. Custom and external metrics come through adapters that expose Kubernetes metrics APIs, often backed by Prometheus.

That dependency chain is where many production incidents start. HPA does not inspect your application directly. It trusts the metrics pipeline. If Metrics Server is stale, if the custom metrics adapter drops series, or if your label mapping is wrong, HPA makes a scaling decision from partial information.

That is one reason platform teams spend time validating metric pipelines, not just HPA manifests. If you already use Prometheus for cluster observability, kube-state-metrics for Kubernetes object telemetry helps teams correlate replica changes, deployment state, and autoscaler behavior during incident review.

How the math turns metrics into replicas

The controller compares the observed metric against the target you set. If the current value is above target, it increases replicas. If the value is below target, it reduces them, subject to the limits and behavior rules in the policy. The ceil() step rounds up, which biases the outcome toward preserving capacity instead of risking an undersized result.

That behavior is easy to underestimate. A target of 50 percent average CPU utilization does not mean every pod sits neatly at 50 percent. It means HPA keeps adjusting toward that level over time while respecting minReplicas and maxReplicas. In a noisy workload, the live system is always moving around that target, not resting on it.

Why HPA often feels conservative

Teams usually notice scale-down behavior first. HPA is intentionally cautious when metrics are missing or inconsistent. If some pod metrics are unavailable, the controller assumes missing data should not trigger an aggressive reduction in capacity. That protects availability, but it also leaves extra pods running longer than people expect.

The same principle shows up in downscale stabilization. HPA keeps a history of recent recommendations and uses that history to avoid dropping replicas too quickly after a short-lived spike. The trade-off is straightforward. You pay for some temporary overprovisioning in exchange for fewer oscillations, fewer cold starts, and lower risk of scaling in just before the next burst.

For production services, that is usually the right bias.

What changed with autoscaling v2

The v2 API changed HPA from a basic CPU scaler into a policy engine. You can evaluate multiple metrics and let the autoscaler choose the replica count required by the most demanding signal. In practice, that means a frontend can scale on CPU and request rate, while a worker service can scale on CPU and queue depth, with one metric acting as the floor when the other lags.

HPA starts to fit real systems instead of demo workloads. CPU still matters, but many services fail on latency, backlog, or connection pressure before CPU looks high enough to trigger a response. Multi-metric HPA closes that gap, provided the metrics are clean and the scaling behavior is tuned carefully.

Mastering HPA Metrics for Smart Scaling

A service can look healthy on CPU and still be falling behind. The pattern shows up in production all the time. Request latency climbs, queues build, or connection counts pin a pod at its practical limit while CPU stays moderate. Teams that scale only on CPU usually discover this during an incident, not in a load test.

That is why metric choice determines whether HPA behaves like a safety mechanism or a source of false confidence.

The autoscaling v2 API gives you four useful metric families: Resource, Pods, Object, and External. Each answers a different question about workload pressure. The right choice depends on what fails first in your service, how quickly the signal appears, and whether the metric stays trustworthy during partial outages.

A hand-drawn diagram illustrating how a horizontal pod autoscaler monitors metrics to scale Kubernetes pods automatically.

Resource metrics work when requests reflect reality

Resource metrics are the default starting point. CPU and memory come from the resource metrics API, usually via Metrics Server. They are easy to enable and easy to explain to application teams.

CPU is still the best first metric for many stateless APIs, but only if requests are credible. HPA calculates utilization against requested CPU, not node capacity. Understated requests make pods look overloaded and push unnecessary scale-outs. Inflated requests hide real saturation and delay scaling until user-facing latency is already visible.
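
As a concrete anchor, that utilization math runs against the requests block on each container. A minimal sketch, with illustrative numbers rather than recommendations:

resources:
  requests:
    cpu: "250m"      # HPA divides observed CPU usage by this value
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

With a 250m request and a 50 percent utilization target, HPA steers toward roughly 125m of actual CPU per pod.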

Memory needs more skepticism. Some services hold memory by design because they cache aggressively, maintain large heaps, or warm expensive state on startup. In those cases, high memory does not always mean more replicas will help. It may point to a sizing problem, a leak, or a workload that needs larger pods instead of more pods.

A simple decision frame helps:

Metric type | Best fit | Main risk
CPU | Stateless APIs, compute-heavy handlers | Bad request values distort utilization
Memory | Intentionally memory-bound workloads | High usage may reflect caching, not demand

If you need more context before choosing a signal, kube-state-metrics for Kubernetes observability gives platform teams a clearer view of replica counts, rollout state, and workload conditions that often explain odd HPA behavior.

Pods metrics capture pressure per replica

Pods metrics scale on an average value across pods. That makes them useful when each replica carries a measurable share of work and infrastructure metrics lag behind the bottleneck.

Good examples include:

  • Requests per pod for APIs where throughput scales roughly with replica count
  • Jobs processed per pod for worker pools
  • Active sessions per pod for services that hit connection or concurrency limits before CPU gets hot

This is often the point where HPA starts matching how the application performs. A frontend might struggle on concurrent connections long before CPU looks urgent. A worker deployment might need to scale on queue drain rate because CPU stays low while I/O waits dominate.

The trade-off is operational overhead. Pods metrics usually depend on Prometheus plus a custom metrics adapter. That introduces more moving parts, more failure modes, and more ownership questions. In exchange, you get a signal that is much closer to user demand.

If platform and application teams keep debating whether CPU reflects load, use that as a warning sign. The workload probably needs an application metric.

Object metrics fit shared demand signals

Object metrics come from a Kubernetes object instead of individual pods. They work well when demand is best measured at a shared entry point or coordination layer.

Ingress traffic is the common example. So are custom resources that track pending work, active tenants, or shared session state. In these cases, measuring pressure at the object level can produce cleaner scaling behavior than waiting for average pod utilization to catch up after the surge has already hit.

Object metrics also help when pods are too far downstream from the source of demand. For bursty internet-facing traffic, edge-level request rate can give HPA an earlier and more stable signal than CPU alone.
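
A sketch of the shape, assuming a metrics adapter exposes a requests-per-second series for the Ingress. The names here are illustrative, and this is only the metrics stanza; the surrounding HPA object looks like the examples in the next section.

metrics:
- type: Object
  object:
    metric:
      name: requests-per-second
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: main-route
    target:
      type: Value
      value: "10k"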

External metrics cover work that starts outside Kubernetes

External metrics represent demand from systems outside the cluster. Queue depth in a managed messaging service is the classic case. Consumer lag, third-party event volume, and cloud service backlog fit the same pattern.

These metrics are often the most useful for asynchronous systems because they measure pending work directly. They are also the easiest to get wrong. Metric semantics need to stay stable, the adapter needs to stay available, and everyone involved needs to agree on what a rise or drop in the metric means.

A worker that scales on external queue depth can respond faster than one waiting for CPU to rise. It can also overreact if the queue includes poison messages, delayed retries, or traffic that another consumer group is supposed to handle. Metric design matters as much as metric plumbing.

Multi-metric scaling is the production pattern

Single-metric HPA works for simple services. Production systems usually need more than one signal because different failure modes show up in different metrics.

A practical setup might use CPU as a guardrail and request rate per pod as the primary driver. For a queue consumer, CPU can stay in place to catch expensive message processing, while queue depth handles backlog growth. HPA evaluates each metric and applies the highest replica recommendation. That behavior is useful because it preserves capacity for the most stressed dimension instead of averaging signals into something misleading.

This is the difference between demo autoscaling and production autoscaling.

The mature pattern is not "pick the perfect metric." It is "combine a fast, workload-aware metric with a conservative fallback, then validate how both behave under failure." That usually means Prometheus-backed custom metrics for real demand, CPU for coverage, and regular testing to confirm that bad telemetry does not trigger expensive or unstable scaling.

Configuring HPA with Practical Examples

A production HPA should be predictable under stress, boring during incidents, and explicit about what it is allowed to do. Good YAML does not just describe a target metric. It encodes operating assumptions about demand, startup time, and cost.

Before applying any of these examples, make sure your deployment process does not keep writing a static replica count back onto the workload. Teams often diagnose “HPA not working” when the underlying issue is a GitOps rule, Helm value, or rollout job resetting replicas on every deploy. If you are tightening that workflow too, review this guide to deploying workloads safely to Kubernetes.

A straightforward CPU-based HPA

CPU is still the cleanest starting point for a stateless API, assuming requests are set correctly and CPU usage tracks user demand closely enough to be useful.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-cpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
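
Applying the object and watching its status is usually enough to confirm the control loop is alive (the file name is whatever you saved it as):

kubectl apply -f api-cpu-hpa.yaml
kubectl get hpa api-cpu-hpa --watch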

Three details matter more than teams expect:

  • averageUtilization: 50 targets half of the pod’s requested CPU. It does not refer to node-level CPU usage.
  • minReplicas and maxReplicas define the safety rails. Set them from real capacity limits, not guesswork.
  • autoscaling/v2 gives you room to add behavior controls and more metrics later without reworking the object.

This works well for APIs where CPU rises with traffic and new pods become ready quickly. It works poorly when JVM warmup, cache priming, or one noisy sidecar distorts the signal. In those cases, CPU often stays as a fallback metric, not the primary one.

A Prometheus-backed custom metric HPA

For many customer-facing services, request pressure shows up in application metrics before CPU becomes a problem. That is why platform teams often scale on a Prometheus-backed Pods metric such as requests per second, in-flight work, or per-pod throughput.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-traffic-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: "1k"

The YAML is the easy part. The hard part is making the metric worth trusting in production.

You need an application metric that reflects real load, a Prometheus scrape path that stays healthy during deploys, and a custom metrics adapter that exposes the series in the shape HPA expects. You also need agreement on metric semantics. If one team defines traffic to include retries and another excludes them, replica recommendations will drift from reality and incident reviews will get messy fast.
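
One way to check the adapter end of that chain is to query the custom metrics API directly and confirm the series exists in the shape HPA will request. A sketch, assuming the packets-per-second metric from the manifest above and pods in the default namespace:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/packets-per-second"

If that call fails or returns no items, HPA cannot see the metric either, no matter how correct the manifest is.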

A practical rule is simple. Scale on a metric the service owner can explain during an outage.

An external metric HPA for asynchronous workers

Queue consumers, batch processors, and event-driven workers usually care more about backlog than CPU. A worker can sit idle at low CPU while latency to drain a queue gets worse by the minute. In that case, an external metric is usually the right control signal.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-queue-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_length
      target:
        type: Value
        value: "100"

This pattern is common, but it carries operational risk. Your scaling path now depends on the queue system, the metrics export path, the adapter, and the HPA controller all behaving well at the same time.

Kubernetes handles missing or uncertain metrics conservatively, which reduces the chance of an unsafe downscale during partial visibility. That helps, but it does not make the dependency chain safe by default. If backlog is the metric that protects your SLO, monitor the adapter and the source metric like production infrastructure, not like supporting telemetry.
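
A minimal sketch of that stance, assuming the queue depth reaches Prometheus as a series named queue_length and alerts are deployed through the Prometheus Operator:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-telemetry
spec:
  groups:
  - name: hpa-external-metrics
    rules:
    - alert: QueueMetricMissing
      expr: absent(queue_length)   # if this fires, the HPA is scaling blind
      for: 5m
      labels:
        severity: critical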

A multi-metric example for a production API

Single-metric HPA is a good starting point. Multi-metric HPA is the pattern that holds up better in production because it covers different failure modes at once.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-multi-metric-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: "1k"

HPA evaluates each metric independently and uses the highest replica recommendation. That is exactly what you want for a production API. Throughput can drive scale-out early, while CPU still catches expensive code paths, bad cache behavior, or noisy requests that do not show up cleanly in traffic metrics.

This is also where cost control starts to matter. Every extra metric can protect availability, but every bad metric can trigger expensive over-scaling. The right pattern is usually one fast workload-aware signal, one conservative fallback, and enough testing to confirm both still behave during deploys, telemetry gaps, and partial outages.

Tuning HPA for Stability and Cost Efficiency

A common production failure looks like this. Traffic spikes for a few minutes, HPA adds replicas, demand drops, and the workload starts shedding pods before the new baseline is clear. Ten minutes later the next spike hits, pods are cold again, latency climbs, and the platform team pays twice: once in user-facing errors, and again in node hours from bad scaling decisions.

That pattern usually comes from treating HPA as a default-on feature instead of an operational control loop. Real systems have startup lag, uneven traffic, delayed metrics, connection draining, and caches that need time to warm. A stable HPA setup accounts for those behaviors. A cheap-looking setup often becomes expensive because it churns pods, misses SLOs, and forces the cluster to react to noise.

The first guardrail is the built-in downscale stabilization window. HPA holds onto the highest recent recommendation before reducing replicas, which is exactly the bias most production services need. Premature scale-down is usually harder to recover from than a short period of extra capacity.

Why defaults still fall short

Defaults are conservative, but they are not workload-aware. A JVM API with a long warm-up curve, a queue worker that drains unevenly, and a Go service that starts in seconds should not share the same scaling posture. If they do, one of them will waste money and another will still scale too late.

The behavior field in autoscaling/v2 gives you the controls that matter in practice. You can tune scaleUp and scaleDown separately, which is how platform teams stop replica thrash without making scale-out sluggish.

A production stance is usually simple:

  • Scale up fast enough to protect the SLO. If request latency or queue age is already rising, elegant ramp curves do not help.
  • Scale down slowly enough to survive noise. Traffic drops are often less trustworthy than traffic spikes.
  • Set policy based on pod readiness time. If pods need time to become useful, HPA has to react before users feel the load.

What to tune first

Start with the controls that change replica movement, not the ones that look neat in YAML.

  1. Downscale stabilization window
    Keep it long enough to absorb normal traffic wobble. Shortening it often saves little and increases churn fast.

  2. Scale-down policies
    Limit how many replicas can disappear in one period. This matters for services with long-lived requests, connection draining, or caches that get expensive when capacity drops too quickly.

  3. Scale-up policies
    Put a ceiling on sudden growth. This protects the cluster from bad metrics, adapter failures, or a temporary signal spike that would otherwise create a large and unnecessary scale-out.

A trimmed example looks like this:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60
  scaleUp:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 15
    - type: Percent
      value: 100
      periodSeconds: 15

Those numbers are placeholders, not a template. The right values depend on three things: how quickly pods become ready, how expensive an extra replica is, and how much instability your service can tolerate during scale-down. For internet-facing APIs, I usually accept slower cost recovery if it avoids repeated pod terminations. For batch workers, I am more willing to reclaim capacity aggressively if the queue signal is clean and startup is predictable.

The expensive mistake is optimizing for the fastest possible contraction. Fast scale-down looks efficient in graphs. In production it often increases restart churn, cache misses, and follow-on scale-ups that erase the savings.

Cost efficiency starts before HPA

HPA cannot fix bad resource requests. If CPU or memory requests are inflated, every scale event multiplies waste. If requests are too low, the workload may pack tightly, hit contention, and trigger scaling behavior that looks smart but is really compensating for poor sizing. Teams that want predictable HPA behavior should review Kubernetes resource limits best practices alongside autoscaling policy, not as a later cleanup task.

Multi-metric HPA makes this even more important. A fast custom metric can trigger scale-out early, while CPU or memory serves as a backstop. That combination works well in production, but only if pod sizing is grounded in reality and scale-down behavior is tuned to resist noise. Stable autoscaling is not about reaching the smallest replica count. It is about holding enough capacity to absorb real demand without paying for constant correction.

Coordinating HPA with VPA and Cluster Autoscaler

A production Kubernetes platform usually needs more than one autoscaler. HPA handles replica count. Vertical Pod Autoscaler (VPA) adjusts CPU and memory requests for pods. Cluster Autoscaler adds or removes nodes when the cluster itself needs more or less capacity. Each solves a different problem, and each can create trouble if you treat it as independent.

The clean mental model is simple. HPA decides how many pods you need. VPA informs or changes how big each pod should be. Cluster Autoscaler decides whether there are enough nodes to place those pods.

A diagram illustrating how Cluster Autoscaler, VPA, and HPA components manage resources in a Kubernetes cluster.

Where teams get into trouble

The most common coordination failure is between HPA and VPA. If HPA is scaling on CPU or memory utilization while VPA is also changing CPU or memory requests automatically, the target itself keeps moving. That makes the autoscaling signal unstable because utilization is measured relative to requests.

For example, if VPA raises CPU requests, apparent CPU utilization can fall without a reduction in the application's workload. HPA may then see less pressure and choose fewer replicas. The inverse can also happen. Engineers experience this as odd autoscaling behavior when the underlying issue is competing control loops.

The safer pattern is usually to keep responsibilities separate:

  • Use HPA for horizontal elasticity: especially for stateless services and worker pools.
  • Use VPA in recommendation mode: let it surface better request values without continuously mutating the workload.
  • Apply sizing changes deliberately: review recommendations, then roll them out through normal delivery workflows.

Recommended pattern: Let VPA help you size the pod. Let HPA decide how many copies of that pod should exist.
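
A recommendation-mode VPA is a small object. A sketch, assuming the VPA CRDs are installed and targeting the api Deployment from the earlier examples:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # surface recommendations only; never evict or resize pods

The recommendations then show up in the object's status, where kubectl describe vpa api-vpa makes them easy to review before any manifest change.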

HPA and Cluster Autoscaler depend on each other

HPA can ask for more pods, but it can’t create node capacity. If the cluster has no room, new pods stay pending until Cluster Autoscaler reacts. That means a healthy HPA setup still fails users if node scaling is slow, constrained by policy, or blocked by incompatible requests.

This interaction changes how you think about “successful scaling.” An HPA event is only useful when the rest of the system can honor it. Platform teams should always inspect these together:

Controller | Primary job | Failure symptom
HPA | Adds or removes pods | Replica count changes but performance doesn’t improve
VPA | Recommends or adjusts requests | Pods are still poorly sized or disruptive changes occur
Cluster Autoscaler | Adds or removes nodes | Pods remain Pending after HPA scales out

A common production smell is blaming HPA for latency while the underlying issue is unschedulable pods. HPA did its part. The cluster couldn’t place the new replicas.

A practical operating model

A stable coordination model looks like this in day-to-day operations:

  1. Right-size workloads first
    Use observation and VPA recommendations to get requests close to reality.

  2. Choose HPA metrics that reflect demand
    CPU is acceptable for some services. Throughput, backlog, or ingress-driven metrics are often better.

  3. Make sure node scaling can follow pod scaling
    Pending pods during scale-out should be treated as part of the autoscaling design, not a separate incident.

  4. Review rollout behavior
    During rolling updates, HPA still manages the target workload’s scale while the workload controller manages underlying pods. Startup time, readiness, and image pull time all affect the actual outcome.

What works well together

The combination that tends to work best for general platform engineering is:

  • HPA active on production workloads
  • VPA in recommendation mode
  • Cluster Autoscaler enabled and monitored
  • Prometheus and Grafana watching scaling events, pending pods, and request sizing drift

That arrangement keeps each control loop in its lane. It also makes troubleshooting far easier because you can ask a clear question each time: was the problem pod count, pod size, or cluster capacity?

HPA Best Practices for Platform Engineering Teams

Running HPA well is less about enabling it and more about operating it as part of the platform. Teams that succeed with autoscaling usually standardize observability, set clear ownership for metric pipelines, and treat HPA events as first-class operational data instead of background noise.

The runbook below is the practical baseline.

Build dashboards around scaling decisions

Teams commonly monitor CPU and memory and stop there. That’s not enough. You need to see what HPA decided, what it was looking at, and whether the workload could act on the recommendation.

At minimum, make dashboards and alerts around:

  • Current and desired replica counts: this shows whether HPA wants movement and whether the workload is converging.
  • Metric target versus current metric: otherwise you can’t tell whether scale actions are justified.
  • HPA conditions and events: these often explain why scaling is blocked or skipped.
  • Pending pods and scheduling failures: this separates HPA issues from cluster capacity issues.
  • Deployment rollout state: rolling updates can change perceived scaling behavior.

A useful dashboard pairs HPA state with service telemetry. If replicas rise while latency remains poor, you may have a startup, scheduling, or application bottleneck rather than an autoscaling issue.
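
If kube-state-metrics is already scraped, the replica-convergence panel needs only two series. A sketch of a Prometheus recording rule that tracks the gap, assuming kube-state-metrics v2 metric names (the rule name is illustrative):

groups:
- name: hpa-visibility
  rules:
  - record: hpa:replica_gap
    expr: kube_horizontalpodautoscaler_status_desired_replicas - kube_horizontalpodautoscaler_status_current_replicas

A gap that stays positive means HPA wants pods the cluster is not delivering, which points at scheduling or capacity rather than the autoscaler.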

Lock down the dependencies HPA needs

HPA is a controller, so it needs the cluster APIs behind it to function correctly. That means RBAC and aggregated metrics APIs have to be treated as platform dependencies, not optional extras.

Check these first when a new cluster or tenant environment is built:

  • Metrics Server availability: CPU and memory scaling depend on it.
  • Custom or external metrics adapters: required for Prometheus-backed or off-cluster metrics.
  • Access to the scale subresource: the target workload must support scaling.
  • Namespace conventions and selectors: HPA acts on the pods selected by the target workload, so broken selectors lead to confusing outcomes.

Security-wise, keep the permissions narrow and deliberate. The goal is to let controllers and metrics components do their job without turning observability infrastructure into an over-privileged side path.

Troubleshoot in the right order

When someone says “HPA isn’t working,” they usually mean one of a few specific failure patterns.

Start with evidence, not assumptions. Most HPA incidents are either bad metrics, bad requests, or no room to schedule.

Use this order:

  1. Describe the HPA object
    Inspect status, events, current metrics, and conditions.

  2. Validate the metric pipeline
    If resource metrics are missing, check Metrics Server. If custom metrics are missing, inspect the adapter and the underlying Prometheus query path.

  3. Check pod requests
    CPU utilization scaling is meaningless without valid CPU requests.

  4. Confirm pods can schedule
    If desired replicas increase but pods stay Pending, look at node capacity and autoscaler behavior.

  5. Inspect readiness and startup behavior
    New pods that aren’t becoming Ready quickly can make HPA look ineffective even when it’s scaling correctly.
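
The first two steps map to a handful of commands. A sketch using the api-cpu-hpa example from earlier (the app=api label is an assumption about how the workload is labeled):

kubectl describe hpa api-cpu-hpa                        # status, events, current metrics, conditions
kubectl get apiservice v1beta1.metrics.k8s.io           # is the resource metrics API registered?
kubectl get apiservice v1beta1.custom.metrics.k8s.io    # is the custom metrics adapter registered?
kubectl top pods -l app=api                             # is Metrics Server actually returning data?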

Watch for these recurring anti-patterns

The following mistakes show up repeatedly in production reviews:

  • Static spec.replicas kept in deployment workflows: delivery tools keep resetting the workload to a fixed replica count (a GitOps-side fix is sketched after this list).
  • Scaling on CPU for queue workers: the autoscaler reacts to pod stress instead of actual backlog.
  • No stabilization tuning: replicas flap when demand jitters.
  • Poorly sized requests: HPA scales clones of a bad pod specification.
  • Treating HPA as a cost-only feature: over-aggressive downscaling often creates the exact instability teams were trying to avoid.
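
For the first anti-pattern, the fix depends on the delivery tool. If it is Argo CD, one documented approach is to drop spec.replicas from the Deployment manifest and tell the Application to ignore drift on that field (fragment of an Application spec):

spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas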

A hand-drawn infographic illustrating four best practices for configuring a Kubernetes Horizontal Pod Autoscaler for platform engineering.

The strongest platform teams treat HPA as part of service design. They choose metrics intentionally, tune behavior for the workload’s startup profile, and verify that cluster capacity can follow horizontal growth. That’s what turns autoscaling from a demo feature into production infrastructure.


If your team is designing or fixing autoscaling on Kubernetes, CloudCops GmbH can help you build a production-grade setup that connects HPA, observability, GitOps delivery, and cluster capacity planning without adding unnecessary operational complexity.

