Introduction
When deploying applications in Kubernetes, understanding how memory requests and limits work is essential for using node resources efficiently and avoiding unwanted pod terminations. This blog will walk you through how memory requests and limits function, not only within Kubernetes but also from the Linux perspective using control groups (cgroups), the mechanism Kubernetes relies on under the hood to enforce resource constraints.
What are Control Groups (cgroups)?
Control groups, or cgroups, are a Linux kernel feature that Kubernetes uses to manage and limit resources like CPU and memory for containers. By placing each pod and container in specific cgroups, Kubernetes enforces resource requests and limits, ensuring that containers don’t consume more than their allocated resources. These groups also allow us to inspect memory usage patterns and behavior under different conditions.
In this blog, we will explore memory requests and limits in Kubernetes, using cgroups to observe how these configurations behave internally. We will also include outputs from my lab environment to demonstrate how different configurations impact the pod’s resource usage and stability.
Overview of Memory Requests and Limits in Kubernetes
Memory requests are the amount of memory the scheduler reserves for a container on a node. If a node doesn't have this amount of unreserved memory available, the pod won't be scheduled there.
Memory limits define the maximum amount of memory a container is allowed to use. If a container tries to exceed its limit, it is terminated in an Out of Memory (OOM) event, keeping the node safe from excessive resource usage.
The Linux cgroup system helps enforce these requests and limits at the kernel level, giving precise control over memory allocation and handling OOM situations for containers.
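Both settings are declared per container in the pod spec. Below is a minimal sketch of where they live; the pod name, container name, image, and sizes are illustrative and not taken from the lab:
apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # hypothetical name
spec:
  containers:
  - name: app              # hypothetical container name
    image: nginx           # assumed image; any image works
    resources:
      requests:
        memory: "256Mi"    # the scheduler reserves this much memory on the node
      limits:
        memory: "512Mi"    # the cgroup memory.max is set to this value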
Example Scenarios and Observations in cgroups
Scenario 1: No Memory Request or Limit
If no resource requests or limits (neither CPU nor memory) are specified for any container in a pod, it falls into the BestEffort QoS class. BestEffort pods are the first to be terminated if the node experiences memory pressure, since they have no guaranteed resources.
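A manifest along these lines produces a BestEffort pod. The pod name matches the lab output below; the image is an assumption:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-pod
    image: nginx           # assumed image; note there is no resources section at all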
Output from the lab
root@master01:~# kubectl get pod -o wide test-pod
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-pod 1/1 Running 0 20s 10.52.34.44 worker01.pbandark.com <none> <none>
root@worker01:~# crictl ps | grep -i test-pod
c600f2f0746ec 27a71e19c9562 40 seconds ago Running test-pod 0 994421c46791b test-pod
root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod7a27dce8_e232_46a1_a0af_be03968ce0b3.slice/cri-containerd-c600f2f0746ec66addb428ac411e0b74dabe49b0cdfec998641cea0bf77f514e.scope/memory.max
max
Here, the memory.max file shows max, meaning there is no memory limit set at the cgroup level. Since no request or limit was defined, this pod has minimal priority and will be the first to be terminated if the system experiences memory pressure.
Scenario 2: Memory Request < Limit
When a memory request is defined and is less than the memory limit, the pod falls into the Burstable QoS class. Its requested memory is guaranteed, and it can burst above the request up to its limit if memory is available on the node.
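A container resources block along these lines produces what is shown below. The 512Mi limit matches the memory.max value from the lab; the 256Mi request is an illustrative choice, and any request smaller than the limit behaves the same way:
    resources:
      requests:
        memory: "256Mi"    # illustrative request, smaller than the limit
      limits:
        memory: "512Mi"    # 512 MiB = 536870912 bytes in memory.max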
Output from the lab
root@master01:~# kubectl get pod -o wide test-pod-limits
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-pod-limits 1/1 Running 0 3s 10.52.34.37 worker01.pbandark.com <none> <none>
root@worker01:~# crictl ps | grep -i limits
620f0b4d524a6 27a71e19c9562 18 seconds ago Running test-pod-limits 0 462ca1d0d8695 test-pod-limits
root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0703b242_cc42_414f_bf76_6fee978ee13d.slice/cri-containerd-620f0b4d524a6594ffafcd85ef679ac6839ed74d52d567640ffcbfad7ca79d9b.scope/memory.max
536870912
Here, the pod has memory.max set to 512 MiB (536870912 bytes). The pod can burst up to its memory limit, but it becomes a more likely OOM-kill candidate than pods that stay within their requests if the node comes under memory pressure.
Scenario 3: Memory Request == Limit
When the memory request and limit are equal, the pod still belongs to the Burstable QoS class, but it is guaranteed exactly the amount of memory specified and receives a lower oom_score_adj. It is not placed in the Guaranteed QoS class because that class requires both CPU and memory requests to be set and equal to their limits for every container in the pod.
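A resources block with the request equal to the limit, such as the sketch below (512Mi matches the memory.max value observed in the lab), produces this behavior:
    resources:
      requests:
        memory: "512Mi"    # request equals the limit
      limits:
        memory: "512Mi"    # no CPU request/limit is set, so the pod stays Burstable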
Output from the lab
root@master01:~# kubectl get pod -o wide test-pod-limits-eq
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-pod-limits-eq 1/1 Running 0 5s 10.52.34.27 worker01.pbandark.com <none> <none>
root@worker01:~# crictl ps | grep -i test-pod-limits-eq
7295c5bfba225 27a71e19c9562 41 seconds ago Running test-pod-limits-eq 0 e8531039aacf5 test-pod-limits-eq
root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6e8f024f_cd12_42b7_af3f_5f92e1783e0c.slice/cri-containerd-7295c5bfba2256594ffafcd85ef679ac6839ed74d52d567640ffcbfad7ca79d9b.scope/memory.max
536870912
Since both the request and limit are set to the same value (512 MiB), this pod is placed in the Burstable cgroup and assigned a lower oom_score_adj. Of the three pods shown here, it will be the last to face termination if the node encounters memory pressure.
Handling Out of Memory (OOM) Situations
When a container exceeds its memory limit, the kernel's OOM killer terminates its process and Kubernetes reports the container as OOMKilled. Separately, when the node itself comes under memory pressure, the oom_score_adj value assigned to each container's processes controls which ones are more likely to be killed. Pods in the BestEffort QoS class have a high oom_score_adj, making them the most likely victims, while Guaranteed pods have a low oom_score_adj, protecting them from OOM kills.
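To watch an OOMKill happen, you can run a workload that deliberately allocates more memory than its limit. The sketch below assumes a stress image such as polinux/stress; the pod name and sizes are illustrative:
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo             # hypothetical pod name
spec:
  containers:
  - name: oom-demo
    image: polinux/stress    # assumed image that provides the stress tool
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"      # limit the workload will exceed
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]   # tries to allocate ~250 MiB
The container is killed soon after it starts, and kubectl describe pod shows its last state as OOMKilled.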
Observing oom_score_adj Values
- BestEffort Pods:
High oom_score_adj, making them the first to be terminated.
"resources": {
"linux": {
[...]
"memoryLimitInBytes": "0",
"memorySwapLimitInBytes": "0",
"oomScoreAdj": "1000", <==
[...]
},
- Burstable Pods (Memory Request < Limit):
A medium oom_score_adj, derived from the pod's memory request relative to the node's capacity, placing them between BestEffort and Guaranteed pods in kill priority.
"linux": {
"resources": {
[...]
"memory_limit_in_bytes": 536870912,
"oom_score_adj": 999, <===
[...]
},
- Burstable Pods (Memory Request == Limit):
A lower oom_score_adj, so they are killed only after the higher-scored pods above when the node runs out of memory.
"linux": {
"resources": {
[...]
"memory_limit_in_bytes": 536870912,
"oom_score_adj": 997, <==
[...]
},
Conclusion
Memory requests and limits in Kubernetes offer a powerful way to ensure that your applications get the resources they need while safeguarding cluster stability. By setting memory requests and limits thoughtfully, and understanding how Kubernetes uses cgroups and OOM scores to manage resources, you can create a more resilient application environment. Hopefully, these insights into cgroups and the outputs from my lab setup help clarify how Kubernetes enforces these settings under the hood.
If you have any questions related to this topic, please add a comment or ping me on LinkedIn. Stay curious and keep exploring!