How Kubernetes Manages Pod Memory Using cgroups: Requests, Limits, and OOM Handling

[Image: k8s_memory_cgroup]

Introduction

When deploying applications in Kubernetes, understanding how memory requests and limits work is essential for optimizing resource usage and avoiding unwanted pod terminations. This blog walks you through how memory requests and limits function, not only within Kubernetes but also from the Linux perspective, using control groups (cgroups), the mechanism Kubernetes uses under the hood to enforce resource constraints.

What are Control Groups (cgroups)?

Control groups, or cgroups, are a Linux kernel feature that Kubernetes uses to manage and limit resources like CPU and memory for containers. By placing each pod and container in specific cgroups, Kubernetes enforces resource requests and limits, ensuring that containers don’t consume more than their allocated resources. These groups also allow us to inspect memory usage patterns and behavior under different conditions.
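
On a node running cgroup v2 with the systemd cgroup driver (the setup behind the lab outputs later in this post), this hierarchy can be browsed directly under /sys/fs/cgroup. The listing below is only a sketch; the exact slice names depend on the pods running on the node:

# Per-QoS-class slices created by the kubelet (names will vary per node)
ls /sys/fs/cgroup/kubepods.slice/
# kubepods-besteffort.slice  kubepods-burstable.slice  kubepods-podXXXX.slice  ...

# Memory limit and usage files exposed by cgroup v2 for the whole kubepods slice
cat /sys/fs/cgroup/kubepods.slice/memory.max
cat /sys/fs/cgroup/kubepods.slice/memory.current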

In this blog, we will explore memory requests and limits in Kubernetes, using cgroups to observe how these configurations behave internally. I will also include outputs from my lab environment to demonstrate how different configurations impact a pod's resource usage and stability.

Overview of Memory Requests and Limits in Kubernetes

Memory requests define the amount of memory reserved for a container; the scheduler uses them to decide where the pod can run. If a node doesn’t have this amount of allocatable memory available, the pod won’t be scheduled there.

Memory limits define the maximum amount of memory a container is allowed to use. If a container tries to exceed this limit, it is terminated in an Out of Memory (OOM) event, keeping the system safe from excessive resource usage.

The Linux cgroup system helps enforce these requests and limits at the kernel level, giving precise control over memory allocation and handling OOM situations for containers.
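
As a reference point for the scenarios below, this is roughly what the resources section of a pod spec looks like. The name, image, and values here are illustrative, not the exact manifests from my lab:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-demo              # illustrative name
spec:
  containers:
  - name: app
    image: nginx                 # any image works; nginx is just an example
    resources:
      requests:
        memory: "256Mi"          # used by the scheduler to pick a node
      limits:
        memory: "512Mi"          # written to the container's cgroup as memory.max
EOF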

Example Scenarios and Observations in cgroups

Scenario 1: No Memory Request or Limit

If no memory request or limit is specified for a pod, it falls into the BestEffort QoS class. BestEffort pods are the first to be terminated if the node experiences memory pressure since they have no guaranteed resources.

Output from the lab
root@master01:~# kubectl get pod -o wide test-pod
NAME       READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
test-pod   1/1     Running   0          20s   10.52.34.44   worker01.pbandark.com   <none>           <none>

root@worker01:~# crictl ps | grep -i test-pod
c600f2f0746ec       27a71e19c9562       40 seconds ago      Running             test-pod                     0                   994421c46791b       test-pod

root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod7a27dce8_e232_46a1_a0af_be03968ce0b3.slice/cri-containerd-c600f2f0746ec66addb428ac411e0b74dabe49b0cdfec998641cea0bf77f514e.scope/memory.max
max

Here, the memory.max file shows max, meaning there is no memory limit set at the cgroup level. Since no request or limit was defined, this pod has minimal priority and will be the first to be terminated if the system experiences memory pressure.
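
You can confirm this from both sides. The commands below are a sketch: the container ID is the one from the crictl output above, and <PID> is a placeholder for the container's main process ID reported by crictl inspect:

# On the control plane: the QoS class assigned by Kubernetes
kubectl get pod test-pod -o jsonpath='{.status.qosClass}{"\n"}'
# BestEffort

# On the worker: find the container's main PID, then check its OOM score adjustment
crictl inspect c600f2f0746ec | grep -i '"pid"'
cat /proc/<PID>/oom_score_adj      # replace <PID> with the value printed above
# 1000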

Scenario 2: Memory Request < Limit

When a memory request is defined and is less than the memory limit, the pod falls into the Burstable QoS class. Its requested memory is guaranteed, and it can burst above the request, up to its limit, if memory is available on the node.

Output from the lab
root@master01:~# kubectl get pod -o wide test-pod-limits
NAME              READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
test-pod-limits   1/1     Running   0          3s    10.52.34.37   worker01.pbandark.com   <none>           <none>

root@worker01:~# crictl ps | grep -i limits
620f0b4d524a6       27a71e19c9562       18 seconds ago      Running             test-pod-limits              0                   462ca1d0d8695       test-pod-limits

root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod0703b242_cc42_414f_bf76_6fee978ee13d.slice/cri-containerd-620f0b4d524a6594ffafcd85ef679ac6839ed74d52d567640ffcbfad7ca79d9b.scope/memory.max
536870912

Here, the pod has memory.max set to 512 MiB (536870912 bytes). The pod can burst up to its memory limit but will be deprioritized if other workloads need the memory.
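
A quick way to sanity-check that value and the resulting QoS class (the numfmt conversion assumes GNU coreutils is available on the node):

# Convert the cgroup value back to a human-readable size
numfmt --to=iec 536870912
# 512M

# Confirm the QoS class Kubernetes derived from request < limit
kubectl get pod test-pod-limits -o jsonpath='{.status.qosClass}{"\n"}'
# Burstable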

Scenario 3: Memory Request == Limit

When both the memory request and limit are equal, the pod still belongs to the Burstable QoS class, but it is guaranteed exactly the amount of memory specified and receives a lower oom_score_adj. It is not placed in the Guaranteed QoS class because that class requires both CPU and memory requests to be set and equal to their limits for every container in the pod.

Output from the lab
root@master01:~# kubectl get pod -o wide test-pod-limits-eq
NAME                 READY   STATUS    RESTARTS   AGE   IP            NODE               NOMINATED NODE   READINESS GATES
test-pod-limits-eq   1/1     Running   0          5s    10.52.34.27   worker01.pbandark.com   <none>           <none>

root@worker01:~# crictl ps | grep -i test-pod-limits-eq
7295c5bfba225       27a71e19c9562       41 seconds ago      Running             test-pod-limits-eq           0                   e8531039aacf5       test-pod-limits-eq

root@worker01:~# cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6e8f024f_cd12_42b7_af3f_5f92e1783e0c.slice/cri-containerd-7295c5bfba2256594ffafcd85ef679ac6839ed74d52d567640ffcbfad7ca79d9b.scope/memory.max
536870912

Since both the request and limit are set to the same value (512 MiB), this pod is placed in the Burstable cgroup and assigned a lower oom_score_adj. Among the pods shown here, it will be the last to face termination if the node encounters memory pressure.
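
For comparison, to land in the Guaranteed QoS class, every container in the pod would need CPU and memory requests equal to their limits. A minimal sketch with illustrative values:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-guaranteed      # illustrative name, not from the lab
spec:
  containers:
  - name: app
    image: nginx                 # placeholder image
    resources:
      requests:
        cpu: "500m"              # CPU request == CPU limit
        memory: "512Mi"          # memory request == memory limit
      limits:
        cpu: "500m"
        memory: "512Mi"
EOF

Such a pod's cgroup is created directly under kubepods.slice rather than under a QoS sub-slice, and its containers receive an oom_score_adj of -997.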

Handling Out of Memory (OOM) Situations

When a container exceeds its memory limit, the kernel’s OOM killer terminates it and Kubernetes records the container as OOMKilled. The oom_score_adj value set on each container’s processes controls which ones are more likely to be killed when the node is under memory pressure. Pods in the BestEffort QoS class have a high oom_score_adj, making them the most likely to be killed, while Guaranteed pods get a negative oom_score_adj (-997), protecting them from OOM kills.
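
If a container does get OOM killed, the event is visible both from the Kubernetes API and in the node's kernel log. The commands below are a sketch using one of the lab pod names; output will vary:

# The container's last termination reason as recorded by Kubernetes
kubectl get pod test-pod-limits -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
# OOMKilled

# Kernel log on the worker node where the kill happened
dmesg | grep -i "killed process"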

Observing oom_score_adj Values

  • BestEffort Pods:

High oom_score_adj, making them the first to be terminated.

    "resources": {
      "linux": {
        [...]
        "memoryLimitInBytes": "0",
        "memorySwapLimitInBytes": "0",
        "oomScoreAdj": "1000",  <==
        [...]
      },
  • Burstable Pods (Memory Request < Limit):

Intermediate oom_score_adj, derived from the container’s memory request relative to the node’s memory capacity: the larger the request, the lower the score (see the sketch after this list).

      "linux": {
        "resources": {
          [...]  
          "memory_limit_in_bytes": 536870912,
          "oom_score_adj": 999, <===
          [...]
        },
  • Burstable Pods (Memory Request == Limit):

Lower oom_score_adj than the other classes shown above, so the OOM killer prefers to terminate BestEffort and higher-scored Burstable containers first.

      "linux": {
        "resources": {
          [...]
          "memory_limit_in_bytes": 536870912,
          "oom_score_adj": 997,  <==
          [...]
        },
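
The Burstable values above come from a formula in the kubelet: roughly oom_score_adj = 1000 - (1000 * memory request) / node memory capacity, clamped so it stays strictly between the Guaranteed (-997) and BestEffort (1000) values. Here is a small sketch of that calculation with assumed numbers, not the lab node's actual capacity:

# Sketch of the kubelet's Burstable oom_score_adj calculation (integer arithmetic)
request=$((512 * 1024 * 1024))            # 512 MiB memory request
capacity=$((16 * 1024 * 1024 * 1024))     # assumed 16 GiB of node memory
echo $((1000 - (1000 * request) / capacity))
# 969  -- a larger request (relative to capacity) yields a lower, safer score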

Conclusion

Memory requests and limits in Kubernetes offer a powerful way to ensure that your applications get the resources they need while safeguarding cluster stability. By setting memory requests and limits thoughtfully, and understanding how Kubernetes uses cgroups and OOM scores to manage resources, you can create a more resilient application environment. Hopefully, these insights into cgroups and the outputs from my lab setup help clarify how Kubernetes enforces these settings under the hood.

If you have any query related to this topic, please add a comment or ping me on LinkedIn. Stay curious and keep exploring!
