Tuning Kubernetes Deployments: Perfect Reliability and Minimum Cost
by Caleb Geene, Director, Site Reliability Engineering
Introduction
Optimizing a Kubernetes deployment to maximize reliability while minimizing cost is a surprisingly complex challenge. While tools like Goldilocks can provide decent recommendations, they only get you about 60% of the way—leaving much more fine-tuning to be done.
In this post, we’ll focus on tuning an API hosted in Kubernetes to achieve 99.999% reliability while keeping costs low. Specifically, we will:
- Handle traffic spikes efficiently – ensuring fast scaling and avoiding 5xx errors
- Optimize deployment resource requests and limits to prevent over-provisioning
- Scale up and down quickly to minimize cloud costs
Key Areas of Optimization
We’ll achieve our goals by tuning the following Kubernetes deployment parameters while performing load testing:
- Deployment Resources: Requests and Limits
- Liveness & Readiness Probes
- HPA (Horizontal Pod Autoscaler) Scaling Policies
Before diving in, let’s clarify what resource requests and limits do and why they are critical for Kubernetes performance tuning.
Understanding Kubernetes Resource Requests and Limits
One of the most fundamental aspects of Kubernetes cost and performance optimization is correctly configuring resource requests and limits. Improper settings can lead to wasted cloud spend, poor scaling, or even downtime.
Resource Requests
Resource requests have two primary functions:
- Pod Scheduling: Kubernetes uses requests to determine if a node has enough available CPU and memory to schedule a pod.
- Autoscaling: When using an HPA (Horizontal Pod Autoscaler) to scale based on CPU or memory, Kubernetes calculates utilization percentages based on the requested resources—not the limits.
For example, if a pod has a CPU request of 500m (0.5 vCPU) and consumes 300m (0.3 vCPU), the HPA sees it as 60% utilized.
Resource Limits
Limits define the maximum amount of CPU or memory a pod can consume. This helps prevent noisy neighbor issues, but improper settings can impact performance.
- Hitting the CPU limit → the pod gets CPU throttled, causing latency spikes but continuing to run.
- Hitting the memory limit → the pod gets killed with an OOMKilled (Out of Memory) error.
Key Takeaway: Requests define the baseline allocation, while limits enforce an upper boundary.
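To make this concrete, here is a minimal, illustrative container spec. The name, image, and values are placeholders, not recommendations:

containers:
- name: api                      # placeholder container name
  image: example.com/api:latest  # placeholder image
  resources:
    requests:
      cpu: 500m      # HPA utilization is calculated against this value
      memory: 512Mi  # the scheduler uses requests to place the pod
    limits:
      cpu: "1"       # exceeding this causes CPU throttling (latency, not restarts)
      memory: 1Gi    # exceeding this gets the pod OOMKilled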
Finding Maximum Resource Utilization (“Terminal Velocity”)
A key concept in Kubernetes performance tuning is what I call “Terminal Velocity.” This represents the maximum amount of CPU and memory a process can actually use given unlimited resources and unlimited workload.
Unless you are managing a supercomputer or an AI training cluster, your application will have a natural limit to how much compute it can effectively utilize. A couple of examples:
- The maximum number of goroutines a single Go process can effectively manage
- The number of concurrent IO operations a single Node.js event loop can handle before you hit diminishing returns
- The number of workers configured in a Uvicorn Python API
Your goal? Find that limit before setting any resource requests and limits.
How to Find the Terminal Velocity for an API
- Scale your deployment down to a single replica.
- Set an extreme request value (e.g., 5 CPU & 10Gi memory) to ensure it’s not constrained (see the sketch after this list).
- Run a high-traffic load test using tools like Siege or Vegeta.
- Monitor CPU & Memory usage using a metrics dashboard or simply run:
kubectl top pod
- After observing the resource utilization, set limits to the observed maximum values.
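As a rough sketch of the first two steps, the deployment spec might temporarily look something like this (the values mirror the example above, and the container name is a placeholder):

spec:
  replicas: 1                 # single replica so one pod absorbs the full load
  template:
    spec:
      containers:
      - name: api             # placeholder container name
        resources:
          requests:
            cpu: "5"          # deliberately oversized so the pod is never constrained
            memory: 10Gi
          # omit limits (or set them equally high) while you measure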
Example Siege command to load test an API:
siege -c 100 -t 5M 'https://non-production-api.com/endpoint' --header="Authorization: Bearer auth-header"
Important Considerations:
- Use an API endpoint that accurately represents production traffic.
- Ensure database dependencies are not the bottleneck. The API should be the limiting factor.
- Check resource utilization at startup. Some applications, particularly in Python, have high initialization costs.
Tuning Autoscaling
Now that we have determined the Resource Limits, we need to:
- Set the right resource requests
- Tune the HPA (Horizontal Pod Autoscaler) for optimal scaling
A good starting point for Resource Requests is 30% of the Resource Limits. At first glance this seems low, but it's very strategic for the following reasons:
- Setting a lower request ensures that we are scaling up quickly enough to handle the traffic spikes.
- It ensures we get good compute density on the nodes. We are targeting average node CPU utilization of 50-80 percent, which significantly reduces the cost of the cluster.
- If you notice your nodes running out of CPU or memory, increase the request and decrease the HPA target utilization.
Finally, set the HPA target utilization to 80%.
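Put together, a minimal HPA manifest might look like the following; the names are placeholders and the replica bounds are just a starting point:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                  # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # placeholder deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # percentage of the CPU *request*, not the limit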
Approaching Perfect Reliability with Load Testing
To validate our tuning, we run the same load test from earlier, this time with the HPA free to scale the deployment.
- Set HPA minimum replicas to 2 and maximum to 10.
- Run the load test and monitor scaling behavior.
- Ensure 100% success rate (no 5xx errors) and tolerable latency.
- Verify scaling up and down works efficiently.
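While the test runs, you can watch the autoscaler and the pod count in real time with:
kubectl get hpa -w
kubectl get pods -w
Comparing the timing of new pods coming online against any error spikes in the load test output makes it obvious where scaling is falling behind.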
This likely will not work perfectly the first time. Adjust the resource requests and the HPA target utilization until you achieve the right balance. Let's also look at some common issues you may run into.
Common Pitfalls & Fixes
Problem: While scaling up, if readiness probes are misconfigured, Kubernetes may send traffic to a new pod before it’s fully initialized, causing 502 errors.
Solution: Set initialDelaySeconds to the minimum startup time for your pod, and set successThreshold to 1.
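For example, a readiness probe along these lines, where the path, port, and timings are placeholders to adapt to your service:

readinessProbe:
  httpGet:
    path: /healthz          # placeholder health endpoint
    port: 8080              # placeholder port
  initialDelaySeconds: 10   # roughly the minimum observed startup time
  periodSeconds: 5
  successThreshold: 1       # route traffic after the first successful check
  failureThreshold: 3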
Problem: I have a low request and a low HPA target, but I am still getting timeouts.
Solution: Decrease the time it takes for a new pod to come online by removing init containers or moving unnecessary startup logic into the Docker build. If all else fails, increase the minimum replica count to buy yourself more time when a traffic spike happens.
Chaos Testing
Pod termination is a very common source of 5xx errors. Unless configured correctly, your pods will continue to receive traffic even after they have been sent a termination signal. When the pod does finally terminate, regardless of your terminationGracePeriodSeconds, any requests currently in flight through that pod will fail with a 5xx error.
Test this by running the same load test and deleting a few pods:
kubectl delete pod <pod-name>
Watch the logs for any 5xx errors.
Solution: Handle SIGTERM in your application and set an appropriate terminationGracePeriodSeconds. I'm not going to get into details here; there are plenty of other blogs that cover this topic.
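That said, a minimal sketch of the Kubernetes side might look like this. The application still has to catch SIGTERM and drain in-flight requests, the container name is a placeholder, and the preStop sleep is a common complementary pattern rather than something covered above:

spec:
  terminationGracePeriodSeconds: 30   # time allowed between SIGTERM and SIGKILL
  containers:
  - name: api                         # placeholder container name
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]     # brief pause so endpoints update before shutdown begins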
Final Thoughts: The Path to Optimized Kubernetes Deployments
Achieving perfect reliability and minimum cost in Kubernetes deployments is an iterative process. No one-size-fits-all solution exists—but by following these best practices, you will get much closer to an optimized, cost-efficient, and highly available system.
Key Takeaways:
- Find the Terminal Velocity – determine the true CPU & memory needs
- Set requests intelligently – aim for 30% of the resource limit
- Optimize readiness probes – ensure fast & accurate startup checks
- Handle SIGTERM correctly – shut down gracefully to avoid 5xx errors
- Use load testing – validate that autoscaling is fast and effective