cgroupv2: Memory controller, case study #1005

filbranden opened this issue Mar 9, 2019 · 6 comments
This is a feature request within the scope of #1002. I decided to open a separate issue so we can have a focused discussion on one specific controller, and the memory controller is a particularly interesting one for that purpose.

The point of this discussion is to figure out how to support the new features that are coming in cgroupv2 (including a closer look at the motivations to adopt these features) and what to do about the controls that are being removed in cgroupv2 (together with a look at why they are going away and what was wrong with them.)

So let's start looking at specific parts of the standard.

Memory Limits

The memory controller in cgroupv2 exposes a number of controls, of which the following are available to control memory limits (a mapping sketch follows the list):

  • memory.low: Best-effort memory protection. This is similar to cgroupv1's memory.soft_limit_in_bytes, but better. It's reasonable to map OCI's memory.reservation to this control, since it's close in meaning.
  • memory.high: Memory usage throttle limit. Once memory usage goes above this limit, the container will start getting memory pressure (even if the host itself is not under memory pressure.) In practical terms, when processes try to allocate memory, the kernel will first go into reclaim and try to free memory (from this container) before giving them memory. But note that going over the high limit never invokes the OOM killer, which means under extreme conditions the limit may be breached, but will not cause OOMs. This limit has no counterpart in cgroupv1.
  • memory.max: Memory usage hard limit. Once this limit is reached, if reclaim can't shrink the memory usage of this cgroup, a container OOM will happen. OCI's memory.limit would typically map to this control, since it's the one with the closest semantics (including OOMing if necessary.)
  • memory.min: Hard memory protection. This is a fairly new control and it sets a limit under which pages from the container will not be reclaimed, even if the host itself is under memory pressure. This is useful for guaranteed reservations, for some workloads that should not be affected (or have reduced impact) in an oversubscribed host.
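
As a rough illustration of the mapping described above, here is a minimal sketch (assumed paths and values, not runc's actual implementation) of how a runtime could write these limits on a cgroupv2 host:

    // Minimal sketch: applying memory limits by writing cgroupv2 interface files.
    // The cgroup path and the byte values are placeholders; the mapping of OCI
    // fields to files follows the proposal in this issue.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strconv"
    )

    func writeMemFile(cgroupDir, file string, bytes int64) error {
        path := filepath.Join(cgroupDir, file)
        return os.WriteFile(path, []byte(strconv.FormatInt(bytes, 10)), 0o644)
    }

    func main() {
        cg := "/sys/fs/cgroup/mycontainer" // hypothetical cgroup directory

        limits := map[string]int64{
            "memory.low":  256 << 20, // OCI memory.reservation (best-effort protection)
            "memory.high": 768 << 20, // proposed new field (throttle, never OOMs)
            "memory.max":  1 << 30,   // OCI memory.limit (hard limit, may OOM)
        }
        for file, val := range limits {
            if err := writeMemFile(cg, file, val); err != nil {
                fmt.Fprintln(os.Stderr, "failed to set", file+":", err)
            }
        }
    }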

High Limit

The hard limit in cgroupv1 (and the current OCI spec) is undesirable in that it allows OOMs to happen. As a consequence, some container managers avoid using it. For example, Kubernetes in many cases will not set a hard limit; instead it will monitor memory usage and decide to evict containers to prevent host OOMs.

That's not great, since container managers would like to control memory pressure on containers and tell the kernel to try to shrink them; it's just the side effect of OOMing that is undesirable in most cases (the container manager would still prefer to make that kind of decision itself.)

The new memory.high control is much more useful than cgroupv1's memory.limit_in_bytes (which is actually still available as memory.max in cgroupv2.) There would be a great advantage to exposing it as part of the OCI spec.

Reservation (Low Limit)

While cgroupv2 offers a new tunable memory.min (for a hard guarantee), it seems memory.low is pretty close to the behavior that container managers want to control.

So it seems that cgroupv2 (through memory.low) and cgroupv1 (through memory.soft_limit_in_bytes) are offering the same control here. But that's really not the case, since cgroupv1's memory.soft_limit_in_bytes has a number of issues that have been fixed in cgroupv2. The biggest issue with memory.soft_limit_in_bytes is that it's not really hierarchical (it doesn't "borrow" from a reservation of the parent cgroup), which makes subtree delegation not really viable (since a manager for a subtree could request an arbitrarily high soft limit if it wanted to.) memory.low fixes that. See the documented issues with the cgroupv1 memory controller for more details.

OOMs

The memory controller in cgroupv2 also exposes the following control for container OOMs:

  • memory.oom.group: Determines whether the cgroup should be treated as an indivisible workload by the OOM killer. If set, all tasks belonging to the cgroup are killed together or not at all. This can be used to avoid partial kills to guarantee workload integrity.

This is a great new control that's unavailable in cgroupv1. When OOMs happen, killing a whole container (eviction) is often what the container manager wants, so having such a setting in the cgroup subsystem is quite useful.
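
To make this concrete, enabling the behavior is a one-byte write to the cgroup's memory.oom.group file; a minimal sketch (the cgroup path is a placeholder):

    // Sketch: mark the whole cgroup as a single OOM unit by writing "1"
    // to memory.oom.group (hypothetical cgroup path).
    package main

    import "os"

    func main() {
        path := "/sys/fs/cgroup/mycontainer/memory.oom.group"
        if err := os.WriteFile(path, []byte("1"), 0o644); err != nil {
            panic(err)
        }
    }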

Disabling OOMs

cgroupv1 has memory.oom_control, which allows disabling OOMs for a container and is no longer present in cgroupv2. This will be addressed in the next section, about removed controls.

Removed Controls

The following controls, which are encoded in the OCI, are no longer available in cgroupv2:

  • memory.kmem.limit_in_bytes and memory.kmem.tcp.limit_in_bytes: Hard limits for kernel memory and TCP buffer memory, respectively. These can be used to set a separate limit for kernel memory usage only.
  • memory.swappiness: Swappiness parameter of vmscan. Similar to the global vm.swappiness sysctl, but per container.
  • memory.oom_control: Disable OOMs. This control has some other uses in notifications, but writing it is used to disable OOMs for a specific cgroup.

I asked @htejun about these and here is some rationale on why they were removed (and ought to have been removed.) Below are some comments on each of those.

Kernel Memory Limits (kmem and kmem.tcp)

Accounting and limiting kernel memory separately has a number of issues, mostly arising from siloing different memory usages into separate buckets.

Tejun: "In cgroup2, all significant memory usages are accounted and controlled together by default. There’s no distinction between user, kernel or network memories. Memories are memories and they’re handled the same way."

I guess another way to look at this is to consider the history of the memory controller in cgroupv1. At first only user memory accounting was available, because it was the easiest to measure and was quite useful. The kernel memory limits required deeper changes and were introduced later. For quite a few kernel versions, kernel memory accounting was buggy and/or problematic (it was even possible to leave accounting disabled by never setting a kmem limit, not even an arbitrary one.) That's all behind us at this point, so we should just look at a single memory limit per container and not worry about whether that memory is used in userspace or in the kernel.

The two OCI configurations memory.kernel and memory.kernelTCP set those limits. I would propose that a reasonable way to handle these on a cgroupv2 system is to simply ignore them. Since the other counters already include kernel memory, a limit for that memory usage is implicitly set, so only the new unified controls make sense here.
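
A hedged sketch of what "simply ignore them" could look like in practice (the type here only loosely mirrors the OCI memory resource and is an assumption, not the actual spec type); whether to warn instead of silently ignoring is discussed in the comments below:

    // Sketch: on a cgroupv2 host, skip the kernel-memory fields and warn.
    // LinuxMemory loosely mirrors the OCI memory resource for illustration only.
    package main

    import "log"

    type LinuxMemory struct {
        Limit     *int64 // OCI memory.limit
        Kernel    *int64 // OCI memory.kernel
        KernelTCP *int64 // OCI memory.kernelTCP
    }

    func applyOnCgroupV2(m *LinuxMemory) {
        if m.Kernel != nil || m.KernelTCP != nil {
            // Kernel memory is already part of the unified accounting, so there
            // is nothing separate to set; note that the fields were ignored.
            log.Println("warning: kernel memory limits are ignored on cgroupv2")
        }
        // ... m.Limit and friends would be applied via memory.max etc.
    }

    func main() {
        kmem := int64(64 << 20)
        applyOnCgroupV2(&LinuxMemory{Kernel: &kmem})
    }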

Swappiness

Tejun: "It’s not very clear what swappiness encodes. A lot of it is compared to file-backed IOs, how [un]favorable IOs for anonymous memory are considering their inherently higher randomness. As such, it’s more a function of the underlying hardware than workloads. Also, the implementation wasn’t quite right either – iirc, the behavior would differ depending on who’s reclaiming."

I guess the only setting that makes sense here is disabling swap altogether, and that can be better achieved by setting memory.swap.max.

The current OCI specification includes a memory.swappiness setting for this control. For backwards compatibility, we can look into translating it (or, more specifically, reconciling it with memory.swap) for the corner case of disabling swap for a specific container through this setting.
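
One possible translation for that corner case (an assumption, not an agreed-upon mapping): treat swappiness=0 as "no swap for this container" and express it through memory.swap.max:

    // Sketch: reconcile a legacy memory.swappiness=0 request on cgroupv2 by
    // disabling swap via memory.swap.max (path and policy are assumptions).
    package main

    import "os"

    func main() {
        swappiness := uint64(0) // value taken from the OCI config
        if swappiness == 0 {
            path := "/sys/fs/cgroup/mycontainer/memory.swap.max"
            if err := os.WriteFile(path, []byte("0"), 0o644); err != nil {
                panic(err)
            }
        }
        // Non-zero swappiness values have no cgroupv2 equivalent and would have
        // to be ignored (or warned about), as with the other removed controls.
    }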

Disabling OOMs

Tejun: "In cgroup1, userspace could block kernel oom actions. This puts the victims in a completely new blocked state and can and often did lead to deadlock conditions as it was extending the dependency chain for the forward progress guarantee out to userspace."

Tejun: "In cgroup2, userspace can’t block kernel from making forward progress. Instead, kernel provides metrics to measure resource pressures (PSI) allowing userspace agent to detect and remediate memory contention issues way before kernel OOM condition is reached. Instead of blocking cgroup which is running out of memory completely, the kernel slows them down which is reflected in PSI so that userspace can handle the situation. This way, userspace has way earlier warnings and kernel isn’t blocked on guaranteeing forward progress."

So, clearly, the corner cases of this setting are pretty problematic...

There is a separate way to control OOM behavior in the kernel: oom_score_adj, which can be set to -1000 to disable the OOM killer for a specific process. That has slightly different behavior from the cgroupv1 option (furthermore, it's set per process and not per cgroup), but it could potentially be leveraged for the same purpose.

OCI includes a memory.disableOOMKiller setting to disable the OOM killer. Perhaps translating that setting into using oom_score_adj instead could be a path towards deprecating this option.
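
For illustration, such a translation would boil down to a write into procfs for the container's processes (a sketch only; the pid is a placeholder, and the per-process caveats are discussed in the comments below):

    // Sketch: translate memory.disableOOMKiller by writing -1000 to a process's
    // oom_score_adj (12345 is a hypothetical container init pid).
    package main

    import (
        "fmt"
        "os"
    )

    func disableOOMKiller(pid int) error {
        path := fmt.Sprintf("/proc/%d/oom_score_adj", pid)
        return os.WriteFile(path, []byte("-1000"), 0o644)
    }

    func main() {
        if err := disableOOMKiller(12345); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }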

Final Thoughts

The memory controller is a very interesting case study for the migration to cgroupv2, perhaps because of how useful the new v2-only controls are. In particular, memory.high would be immediately useful to container managers such as Kubernetes.

There is also the problem of the settings that were removed (in the sense that these were exposed in the OCI and we need to figure out how to best deprecate or adapt them.) Hopefully this issue makes a good case for why these controls are gone, why their removal is justified (they were not well enough defined, not properly hierarchical, or mainly there for historic reasons), and why moving forward with deprecating the corresponding OCI fields is a good idea.

Let's use this issue for a discussion on how to best adopt cgroupv2 (I believe a decision on how to handle the memory controller will quite possibly end up applying to other controllers too.)

I talked to @vbatts about bringing this up on an OCI call, planning to do so at the upcoming meeting on Wednesday, March 13th.

Thanks!
Filipe

@cyphar commented Mar 9, 2019

A key problem that we should look into is how we are going to deal with the "my container userspace expects cgroupv1 but it got cgroupv2" case -- the obvious example being Java. I don't know if newer Java versions support cgroupv2, but it is quite an important consideration.

I believe that LXC has a way of emulating cgroupv1 controllers on a cgroupv2 host (something we should look at even though we don't plan to bundle a FUSE filesystem -- because it will help inform the mappings between the two even more).

I will admit that I am slightly nervous about ignoring so many settings, because that's quite antithetical to what we've done in OCI in the past. Mapping settings makes some sense, but completely ignoring them when they've been explicitly set by a user is a completely different thing.

@thaJeztah

Mapping settings makes some sense, but completely ignoring them when they've been explicitly set by a user is a completely different thing.

At least it should produce a warning that the option was discarded.

Could it be an error-condition instead?

@thaJeztah

Also wondering how oom_score_adj should be mapped if it's per-process instead of cgroup; if a container has multiple processes, some of those could get OOM-killed, instead of the container as a whole, correct? (i.e., the container would no longer be handled as an atomic unit)

@vrothberg

Also wondering how oom_score_adj should be mapped if it's per-process instead of cgroup; if a container has multiple processes, some of those could get OOM-killed, instead of the container as a whole, correct? (i.e., the container would no longer be handled as an atomic unit)

I had the very same question in mind, as the kernel docs are not very explicit about this case. I did some tests and child processes seem to inherit the value. If the value is set correctly on pid 1 (or inherited), it should affect the entire container.

@vrothberg

Thanks, @filbranden, for the great summary!

@cyphar commented Mar 11, 2019

I did some tests and child processes seem to inherit the value.

Yes, child processes inherit oom_score_adj. However, processes can always increase their own oom_score_adj, while they wouldn't be able to modify cgroup limits. Now, being able to make yourself easier to OOM isn't a bad thing, but it means it's no longer a limit you could meaningfully control with a container manager for already-running containers (remember, we have runc update).
