author: Disconnect3d <dominik.b.czarnota@gmail.com> 2023-05-29 00:19:31 +0200
committer: GitHub <noreply@github.com> 2023-05-29 00:19:31 +0200
commit: f7265e0690118c900e09adb3939019a4fe46c3f4
tree: cc4c538ea508e42e6400e047621facde06408de6
parent: 603ba857e9f50061f64263ea595c83161f852fe8
cgroup2.cc: improve note about using Docker
Improve the error log message shown when nsjail fails to write to the `/sys/fs/cgroup/cgroup.subtree_control` file while setting up the cgroupv2 configuration. The previous message looked like this:

```
[E][2023-05-28T21:52:56+0000][8807] writeBufToFile():105 Couldn't write '7' bytes to file '/sys/fs/cgroup/cgroup.subtree_control' (fd='4'): Device or resource busy
[E][2023-05-28T21:52:56+0000][8807] enableCgroupSubtree():95 Could not apply '+memory' to cgroup.subtree_control in '/sys/fs/cgroup'. If you are running in Docker, nsjail MUST be the root process to use cgroups.
[E][2023-05-28T21:52:56+0000][8807] main():354 Couldn't setup parent cgroup (cgroupv2)
```

This could be confusing because nsjail may already have been running as real root with full capabilities, e.g., when the user ran the container with the `--privileged --user 0:0` flags.

In such a case, the issue is that Docker enters new pid, uts, network, ipc, mount and cgroup namespaces (but not user or time namespaces, fwiw). I believe that if you do so after the cgroupv2 filesystem is mounted, the root of its filesystem hierarchy will render only a subtree or, more generally, a limited view of the cgroup. This can be seen below.

On the host, we can see the cgroup sub-hierarchies, and `cgroup.subtree_control` lists the controllers properly:

```
# ls /sys/fs/cgroup/
cgroup.controllers      cgroup.threads         dev-mqueue.mount  memory.numa_stat               system.slice
cgroup.max.depth        cpu.pressure           init.scope        memory.pressure                user.slice
cgroup.max.descendants  cpuset.cpus.effective  io.cost.model     memory.stat
cgroup.procs            cpuset.mems.effective  io.cost.qos       sys-fs-fuse-connections.mount
cgroup.stat             cpu.stat               io.pressure       sys-kernel-config.mount
cgroup.subtree_control  dev-hugepages.mount    io.stat           sys-kernel-debug.mount
# cat /sys/fs/cgroup/cgroup.subtree_control
cpuset cpu io memory hugetlb pids rdma
```

However, even in a privileged container, we do not see the same view, and `cgroup.subtree_control` is empty:

```
# sudo docker run --rm -it --privileged nsjail ls /sys/fs/cgroup
cgroup.controllers      cpuset.cpus               memory.events.local
cgroup.events           cpuset.cpus.effective     memory.high
cgroup.freeze           cpuset.cpus.partition     memory.low
cgroup.kill             cpuset.mems               memory.max
cgroup.max.depth        cpuset.mems.effective     memory.min
cgroup.max.descendants  hugetlb.2MB.current       memory.numa_stat
cgroup.procs            hugetlb.2MB.events        memory.oom.group
cgroup.stat             hugetlb.2MB.events.local  memory.pressure
cgroup.subtree_control  hugetlb.2MB.max           memory.stat
cgroup.threads          hugetlb.2MB.rsvd.current  memory.swap.current
cgroup.type             hugetlb.2MB.rsvd.max      memory.swap.events
cpu.idle                io.latency                memory.swap.high
cpu.max                 io.max                    memory.swap.max
cpu.max.burst           io.pressure               pids.current
cpu.pressure            io.stat                   pids.events
cpu.stat                io.weight                 pids.max
cpu.weight              memory.current            rdma.current
cpu.weight.nice         memory.events             rdma.max
# sudo docker run --rm -it --privileged nsjail cat /sys/fs/cgroup/cgroup.subtree_control
#
```

Of course, the namespaces themselves can be compared like this:

```
// HOST
# ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 May 28 22:17 .
dr-xr-xr-x 9 root root 0 May 28 22:17 ..
lrwxrwxrwx 1 root root 0 May 28 22:17 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 root root 0 May 28 22:17 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 root root 0 May 28 22:17 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 root root 0 May 28 22:17 net -> 'net:[4026531840]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 May 28 22:17 uts -> 'uts:[4026531838]'

// CONTAINER
# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns
total 0
dr-x--x--x 2 user user 0 May 28 22:17 .
dr-xr-xr-x 9 user user 0 May 28 22:17 ..
lrwxrwxrwx 1 user user 0 May 28 22:17 cgroup -> 'cgroup:[4026532381]'
lrwxrwxrwx 1 user user 0 May 28 22:17 ipc -> 'ipc:[4026532317]'
lrwxrwxrwx 1 user user 0 May 28 22:17 mnt -> 'mnt:[4026532315]'
lrwxrwxrwx 1 user user 0 May 28 22:17 net -> 'net:[4026532319]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 pid_for_children -> 'pid:[4026532318]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 user user 0 May 28 22:17 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 May 28 22:17 uts -> 'uts:[4026532316]'
```

Anyway, passing `--cgroupns=host` solves this problem: the container then shares the host's cgroup namespace, as the matching inode shows:

```
# ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 root root 0 May 28 22:18 cgroup -> cgroup:[4026531835]
# sudo docker run --rm -it --cgroupns=host --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026531835]'
# sudo docker run --rm -it --privileged nsjail ls -la /proc/self/ns | grep cgroup
lrwxrwxrwx 1 user user 0 May 28 22:19 cgroup -> 'cgroup:[4026532381]'
```
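The namespace comparison in the commit message can also be done from code. The following is only a minimal sketch and is not part of this commit or of nsjail: the helper names `parseNsInode`, `readCgroupNsLink` and `inInitialCgroupNs` are hypothetical, and the constant assumes that the initial cgroup namespace always has the inode shown in the host output (4026531835), which holds on current Linux kernels as far as I know but should be treated as an assumption.

```cpp
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <unistd.h>

// Inode of the initial (host) cgroup namespace, taken from the host output
// in the commit message. Assumption: this value is fixed by the kernel.
static const unsigned long long kInitCgroupNsInode = 4026531835ULL;

// Extracts the inode from a namespace link target such as
// "cgroup:[4026531835]". Returns 0 if the text does not have that shape.
unsigned long long parseNsInode(const std::string &link) {
	const std::string::size_type open = link.find(":[");
	if (open == std::string::npos) {
		return 0;
	}
	const std::string::size_type close = link.find(']', open);
	if (close == std::string::npos) {
		return 0;
	}
	const std::string digits = link.substr(open + 2, close - open - 2);
	if (digits.empty() || digits.find_first_not_of("0123456789") != std::string::npos) {
		return 0;
	}
	return std::strtoull(digits.c_str(), nullptr, 10);
}

// Reads the target of /proc/self/ns/cgroup; returns "" on failure.
std::string readCgroupNsLink() {
	char buf[64];
	const ssize_t len = readlink("/proc/self/ns/cgroup", buf, sizeof(buf) - 1);
	if (len < 0) {
		return "";
	}
	return std::string(buf, static_cast<size_t>(len));
}

// True if the calling process appears to be in the initial cgroup namespace.
bool inInitialCgroupNs() {
	return parseNsInode(readCgroupNsLink()) == kInitCgroupNsInode;
}
```

A program could call `inInitialCgroupNs()` before touching `cgroup.subtree_control` and, when it returns false, point the user at `--cgroupns=host` directly instead of printing a general hint.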
-rw-r--r-- cgroup2.cc 7
1 files changed, 5 insertions, 2 deletions
diff --git a/cgroup2.cc b/cgroup2.cc
index 4d11c41..215e177 100644
--- a/cgroup2.cc
+++ b/cgroup2.cc
@@ -93,8 +93,11 @@ static bool enableCgroupSubtree(nsjconf_t *nsjconf, const std::string &controlle
}
}
LOG_E(
- "Could not apply '%s' to cgroup.subtree_control in '%s'. If you are running in Docker, "
- "nsjail MUST be the root process to use cgroups.",
+ "Could not apply '%s' to cgroup.subtree_control in '%s'. nsjail MUST be run from root "
+ "and the cgroup mount path must refer to the root/host cgroup to use cgroupv2. If you "
+ "use Docker, you may need to run the container with --cgroupns=host so that nsjail can"
+ " access the host/root cgroupv2 hierarchy. An alternative is mounting (or remounting) "
+ "the cgroupv2 filesystem but using the flag is just simpler.",
val.c_str(), cgroup_path.c_str());
return false;
}
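The difference between the host and container views described in the commit message comes down to which controllers are listed in `cgroup.subtree_control`. As an illustration only, a check for that could look like the sketch below. It is not the code in `cgroup2.cc`: `hasController`, `readFirstLine` and `controllerEnabledForChildren` are hypothetical helpers, and the sketch assumes the file holds a single space-separated line, as in the `cat` output above.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `name` appears as a whole token in a space-separated
// controller list such as "cpuset cpu io memory hugetlb pids rdma".
bool hasController(const std::string &list, const std::string &name) {
	std::istringstream tokens(list);
	std::string token;
	while (tokens >> token) {
		if (token == name) {
			return true;
		}
	}
	return false;
}

// Reads the first line of a file; returns "" if it cannot be opened or is empty.
std::string readFirstLine(const std::string &path) {
	std::ifstream in(path.c_str());
	std::string line;
	if (in) {
		std::getline(in, line);
	}
	return line;
}

// True if `name` is already enabled for child cgroups under `cgroupMount`
// (for example "/sys/fs/cgroup").
bool controllerEnabledForChildren(const std::string &cgroupMount, const std::string &name) {
	return hasController(readFirstLine(cgroupMount + "/cgroup.subtree_control"), name);
}
```

Matching whole tokens rather than substrings matters here, because `cpu` would otherwise match `cpuset`. With the outputs from the commit message, the host list contains `memory` while the empty container file contains nothing.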