
log.json of a container may grow until it bursts the tmpfs of /run, if a k8s user configures an exec liveness probe with a non-existent executable file name #8972

Open
abel-von opened this issue Aug 16, 2023 · 10 comments · May be fixed by #10430
Labels
area/cri Container Runtime Interface (CRI) kind/bug

Comments

@abel-von
Contributor

abel-von commented Aug 16, 2023

Description

The runc exec command writes its error log to /run/containerd/io.containerd.runtime.v2.task/k8s.io/<cid>/log.json.
When a k8s user configures an exec liveness probe with an executable file that does not exist in the container, the exec command runs periodically according to the liveness probe config and fails every time. kubelet treats this failure as a misconfigured liveness probe and only sends an event to the k8s master rather than restarting the container, so log.json grows without bound if the user ignores the liveness probe failures because everything else appears to be working properly.
As a result, the k8s node can become unhealthy once the tmpfs of /run fills up.

Steps to reproduce the issue

  1. Use kubectl to create a pod whose liveness probe points at a non-existent executable path (see the example manifest below).
  2. Check the size of /run/containerd/io.containerd.runtime.v2.task/k8s.io/<cid>/log.json.
  3. The file grows steadily over time.
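For illustration, a minimal pod manifest that triggers the behavior (the pod name, image, and /no/such/binary path are arbitrary examples, not taken from the report):

apiVersion: v1
kind: Pod
metadata:
  name: liveness-burst        # arbitrary example name
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "infinity"]
      livenessProbe:
        exec:
          command: ["/no/such/binary"]  # does not exist in the image
        periodSeconds: 5                # each failed probe appends to log.json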

Describe the results you received and expected

expected: a misconfigured container should not break the whole node.

received: the tmpfs of /run fills up and the node breaks.

What version of containerd are you using?

v1.6.14

Any other relevant information

runc version: 1.1.3

Show configuration if it is related to CRI plugin.

config.toml is not related to this issue.

@fuweid fuweid added the area/cri Container Runtime Interface (CRI) label Aug 16, 2023
@fuweid
Member

fuweid commented Aug 16, 2023

kubectl create a pod with liveness probe of a non exist executable path

It sounds like DDoS :( kubelet should have throttling for this.
Anyway, I think we should use a new log.json for each runc create/start/delete/exec call, and then the shim deletes it after the invocation.
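A minimal sketch of that idea, assuming the shim shells out to runc via os/exec (the real shim goes through the go-runc package; runc's global --log and --log-format flags are real, but runRuncWithFreshLog is a hypothetical helper):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runRuncWithFreshLog points runc's --log at a per-invocation temp file and
// removes it once the call returns, so repeated failing exec probes cannot
// accumulate entries in a long-lived log.json on the /run tmpfs.
func runRuncWithFreshLog(dir string, args ...string) error {
	f, err := os.CreateTemp(dir, "log-*.json")
	if err != nil {
		return err
	}
	f.Close()
	defer os.Remove(f.Name()) // delete the per-invocation log afterwards

	cmd := exec.Command("runc", append([]string{
		"--log", f.Name(),
		"--log-format", "json",
	}, args...)...)
	if err := cmd.Run(); err != nil {
		// The log file still exists at this point, so it can be read to
		// build a useful error message before the deferred Remove runs.
		return fmt.Errorf("runc %v: %w", args, err)
	}
	return nil
}

func main() {
	if err := runRuncWithFreshLog(os.TempDir(), "list"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}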

@abel-von
Contributor Author

abel-von commented Aug 17, 2023

It sounds like DDoS :( kubelet should have throttling for this.

Actually, the liveness probe runs every few seconds according to the config in pod.yaml, so it is not that frequent; the problem is the accumulation of log entries when a pod runs for a long time.

I think we should use new log.json for each runc-create/start/delete/exec call.

Good idea! Shall we remove the log.json file after each runc create/start/delete/exec call, or should we modify runc to open the log file with O_TRUNC rather than O_APPEND?
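For reference, the difference between the two flags in Go terms (a sketch, not runc's actual code):

package main

import (
	"log"
	"os"
)

func openLog(path string, truncate bool) (*os.File, error) {
	flags := os.O_CREATE | os.O_WRONLY
	if truncate {
		// O_TRUNC: each invocation starts from an empty file, so only the
		// latest invocation's entries survive and the size stays bounded.
		flags |= os.O_TRUNC
	} else {
		// O_APPEND: every invocation appends, so the file grows forever.
		flags |= os.O_APPEND
	}
	return os.OpenFile(path, flags, 0o600)
}

func main() {
	f, err := openLog("log.json", true)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	f.WriteString(`{"level":"error","msg":"exec failed"}` + "\n")
}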

@fuweid
Member

fuweid commented Aug 17, 2023

do we modify runc to open the log file with the mode of O_TRUNC rather than O_APPEND?

We can file an issue to seek the runc maintainers' comments. (I don't have background knowledge about this part :( )

@fm-matias-sosa

Hi all, I'm facing this same issue at the moment: the /run directory gets filled up by the log.json file and pods start crashing. Right now a quick workaround is to restart containerd, but of course that's not ideal. Do you think there's a less reactive temporary solution for this?

@ning1875

same issue, maybe we can keep just the last record or errMsg in log.json so that its size stays bounded

@so2bin

so2bin commented Jan 20, 2024

same question, a lot of logs like these, at a 1s interval:

{"level":"info","msg":"Running with config:\n{\n  \"AcceptEnvvarUnprivileged\": true,\n  \"NVIDIAContainerCLIConfig\": {\n    \"Root\": \"\"\n  },\n  \"NVIDIACTKConfig\": {\n    \"Path\": \"nvidia-ctk\"\n  },\n  \"NVIDIAContainerRuntimeConfig\": {\n    \"DebugFilePath\": \"/dev/null\",\n    \"LogLevel\": \"info\",\n    \"Runtimes\": [\n      \"docker-runc\",\n      \"runc\"\n    ],\n    \"Mode\": \"auto\",\n    \"Modes\": {\n      \"CSV\": {\n        \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n      },\n      \"CDI\": {\n        \"SpecDirs\": null,\n        \"DefaultKind\": \"nvidia.com/gpu\",\n        \"AnnotationPrefixes\": [\n          \"cdi.k8s.io/\"\n        ]\n      }\n    }\n  },\n  \"NVIDIAContainerRuntimeHookConfig\": {\n    \"Path\": \"/usr/bin/nvidia-container-runtime-hook\",\n    \"SkipModeDetection\": false\n  }\n}","time":"2024-01-20T16:05:43+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2024-01-20T16:05:43+08:00"}
{"level":"info","msg":"Running with config:\n{\n  \"AcceptEnvvarUnprivileged\": true,\n  \"NVIDIAContainerCLIConfig\": {\n    \"Root\": \"\"\n  },\n  \"NVIDIACTKConfig\": {\n    \"Path\": \"nvidia-ctk\"\n  },\n  \"NVIDIAContainerRuntimeConfig\": {\n    \"DebugFilePath\": \"/dev/null\",\n    \"LogLevel\": \"info\",\n    \"Runtimes\": [\n      \"docker-runc\",\n      \"runc\"\n    ],\n    \"Mode\": \"auto\",\n    \"Modes\": {\n      \"CSV\": {\n        \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n      },\n      \"CDI\": {\n        \"SpecDirs\": null,\n        \"DefaultKind\": \"nvidia.com/gpu\",\n        \"AnnotationPrefixes\": [\n          \"cdi.k8s.io/\"\n        ]\n      }\n    }\n  },\n  \"NVIDIAContainerRuntimeHookConfig\": {\n    \"Path\": \"/usr/bin/nvidia-container-runtime-hook\",\n    \"SkipModeDetection\": false\n  }\n}","time":"2024-01-20T16:05:44+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2024-01-20T16:05:44+08:00"}
{"level":"info","msg":"Running with config:\n{\n  \"AcceptEnvvarUnprivileged\": true,\n  \"NVIDIAContainerCLIConfig\": {\n    \"Root\": \"\"\n  },\n  \"NVIDIACTKConfig\": {\n    \"Path\": \"nvidia-ctk\"\n  },\n  \"NVIDIAContainerRuntimeConfig\": {\n    \"DebugFilePath\": \"/dev/null\",\n    \"LogLevel\": \"info\",\n    \"Runtimes\": [\n      \"docker-runc\",\n      \"runc\"\n    ],\n    \"Mode\": \"auto\",\n    \"Modes\": {\n      \"CSV\": {\n        \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n      },\n      \"CDI\": {\n        \"SpecDirs\": null,\n        \"DefaultKind\": \"nvidia.com/gpu\",\n        \"AnnotationPrefixes\": [\n          \"cdi.k8s.io/\"\n        ]\n      }\n    }\n  },\n  \"NVIDIAContainerRuntimeHookConfig\": {\n    \"Path\": \"/usr/bin/nvidia-container-runtime-hook\",\n    \"SkipModeDetection\": false\n  }\n}","time":"2024-01-20T16:05:45+08:00"}
{"level":"info","msg":"Using low-level runtime /usr/bin/runc","time":"2024-01-20T16:05:45+08:00"}
{"level":"info","msg":"Running with config:\n{\n  \"AcceptEnvvarUnprivileged\": true,\n  \"NVIDIAContainerCLIConfig\": {\n    \"Root\": \"\"\n  },\n  \"NVIDIACTKConfig\": {\n    \"Path\": \"nvidia-ctk\"\n  },\n  \"NVIDIAContainerRuntimeConfig\": {\n    \"DebugFilePath\": \"/dev/null\",\n    \"LogLevel\": \"info\",\n    \"Runtimes\": [\n      \"docker-runc\",\n      \"runc\"\n    ],\n    \"Mode\": \"auto\",\n    \"Modes\": {\n      \"CSV\": {\n        \"MountSpecPath\": \"/etc/nvidia-container-runtime/host-files-for-container.d\"\n      },\n      \"CDI\": {\n        \"SpecDirs\": null,\n        \"DefaultKind\": \"nvidia.com/gpu\",\n        \"AnnotationPrefixes\": [\n          \"cdi.k8s.io/\"\n        ]\n      }\n    }\n  },\n  \"NVIDIAContainerRuntimeHookConfig\": {\n    \"Path\": \"/usr/bin/nvidia-container-runtime-hook\",\n    \"SkipModeDetection\": false\n  }\n}","time":"2024-01-20T16:05:46+08:00"}

@Archesky

Got the same issue. Every 10 seconds (as the liveness probe is configured) it prints messages to log.json. It's a calico pod, and it has a liveness probe configured like this:

livenessProbe:
  exec:
    command:
      - /bin/calico/calico-node
      - '-felix-live'
      - '-bird-live'

Though the probe itself seems fine, it fills the /run folder pretty rapidly. The only solution I found so far is to truncate the file daily with a cronjob.

the log.json looks just like @so2bin's, alternating the same "Using low-level runtime /usr/bin/runc" and "Running with config: …" entries.

@weistonedawei

@Archesky and @so2bin Looks like you are using the nvidia-container-toolkit. The toolkit provides its own runtime (hooks) in the /usr/local/nvidia directory; the NVIDIA runtime eventually calls runc, the low-level runtime.

The informational messages in log.json come from NVIDIA's runtime, which should (IMHO) be set to DEBUG level in the code. See the source code starting on line 75. This is a relatively new addition that adds more text to log.json and fills /run N times faster.

The other part of the livenessProbe message comes from this part of the code. It would probably be better at DEBUG level too.

Workarounds?

  1. Set log-level to "error":

     # On each host
     cd /usr/local/nvidia/toolkit/.config/nvidia-container-runtime
     # edit config.toml and change log-level
     diff config.toml config.toml.orig

     14c14
     <   log-level = "error"
     ---
     >   log-level = "info"

  2. Or, don't use the exec type of livenessProbe.


@elezar
Contributor

elezar commented Jun 24, 2024

I have created NVIDIA/nvidia-container-toolkit#560 to reduce the logging verbosity in the NVIDIA Container Runtime. This was also pointed out in NVIDIA/nvidia-container-toolkit#511.

@ningmingxiao
Contributor

We can let runc log into a FIFO: #10430
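A rough sketch of that approach, assuming the shim creates the FIFO and drains it continuously (this only illustrates the idea, not the actual code in #10430; drainLogFIFO is a hypothetical helper):

package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"syscall"
)

// drainLogFIFO creates a FIFO at path and keeps reading runc's JSON log lines
// from it. A FIFO lives in the kernel's pipe buffer, so consumed entries never
// accumulate on the /run tmpfs the way an ever-growing log.json does.
func drainLogFIFO(path string) error {
	if err := syscall.Mkfifo(path, 0o600); err != nil && !errors.Is(err, os.ErrExist) {
		return err
	}
	// Opening the read end blocks until runc opens the write end (via --log).
	f, err := os.OpenFile(path, os.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// The shim could parse each JSON line here and retain only the most
		// recent error for reporting, instead of persisting every line.
		fmt.Fprintln(os.Stderr, scanner.Text())
	}
	return scanner.Err()
}

func main() {
	if err := drainLogFIFO("/tmp/runc-log.fifo"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}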
