Discussion:
Need advice about fixing PROC mount failures in a DIY Linux container
Lew Pitcher
2023-01-07 01:27:28 UTC
Hi, all

I've come late to the party, and have just started learning
about the ins and outs of Linux containers. To get a better
understanding of the subject, I decided to learn about the
underlying technologies by building my own container software.

I've modelled my DIY container on Brian Swetland's mkbox
container[1], and have a demonstration program that works
on my development system (a 64bit AMD Ryzen 5 3400G with
Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied).
[1] https://github.com/swetland/mkbox


However, when I run either Brian's mkbox or my demo program
on my "production" system (another 64bit AMD Ryzen 5 3400G
with Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied), the
container breaks while trying to mount the proc filesystem
to the new (isolated) root fs.

Specifically, I get an "Operation not permitted" error when
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.

This failure affects both my DIY container and Brian's mkbox
container.

With my DIY container, I've checked the capabilities given
to the container process, and they are identical and complete
on both systems. On both systems, I run the container process
(mine and Brian's) from the same unprivileged UID/GID.

I have to conclude that there's a difference in the two
environments that causes this problem, but I don't know what
that difference is. Both systems use the same type of CPU, the
same amount of memory, the same 64-bit addressing mode,
the same kernel, and the same distribution (with the same
essential utilities).

There /are/ differences in the two systems:
On the development system, my user is a member of a
number of groups that it is not a member of on the
"production" system. I run a root pulseaudio (I have my
reasons) on the development system that I do not on
the "production" system. Et cetera.

Can anyone suggest an environmental factor or set of
factors that might cause this behaviour?

For reference, I include a copy of a minimal implementation
of my DIY container that illustrates the problem, along with
captures of both a successful run on my development system
and an unsuccessful run on my production system.

========== demo.c ==========
/*
** demonstrate selective problem with Slackware Linux 14.2
** user namespace creation (Kernel 4.4.301)
*/

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sched.h>
#include <string.h>
#include <errno.h>

/* pivot_root() prototype not supplied by headers */
extern int pivot_root(const char *new_root, const char *put_old);

void Die(int line); /* generate error message and exit process */
#define DIE() Die(__LINE__)

int main(void)
{
  char *fauxRoot = "./.fauxroot", /* will be our new root filesystem */
       *oldRoot = ".oldroot", /* where pivot_root puts old root fs */
       *oldProc = ".oldproc", /* where we temp relocate /proc to */
       *newProc = "proc"; /* where we mount /proc to */
  pid_t init_pid;

  umask(0);

  rmdir(fauxRoot); if (mkdir(fauxRoot,0777)) DIE();

  if (unshare(CLONE_NEWUSER|CLONE_NEWNS|CLONE_NEWPID)) DIE();

  if (mount("none","/",NULL,MS_REC|MS_PRIVATE,NULL)) DIE();
  if (mount(fauxRoot,fauxRoot,NULL,MS_BIND|MS_NOSUID,NULL)) DIE();
  if (chdir(fauxRoot)) DIE();

  rmdir(oldRoot); if (mkdir(oldRoot,0751)) DIE();
  rmdir(oldProc); if (mkdir(oldProc,0755)) DIE();
  rmdir(newProc); if (mkdir(newProc,0755)) DIE();

  if (mount("/proc",oldProc,NULL,MS_BIND|MS_REC,NULL)) DIE();

  /* set new uid, gid */
  {
    FILE *map;

    if ((map = fopen("/proc/self/uid_map","w")) == NULL) DIE();
    fprintf(map,"0 %lu 1\n",(unsigned long)getuid());
    fclose(map);

    if ((map = fopen("/proc/self/setgroups","w")) == NULL) DIE();
    fwrite("deny",4,1,map);
    fclose(map);

    if ((map = fopen("/proc/self/gid_map","w")) == NULL) DIE();
    fprintf(map,"0 %lu 1\n",(unsigned long)getgid());
    fclose(map);
  }

  if (pivot_root(".",oldRoot)) DIE();
  if (umount2(oldRoot,MNT_DETACH)) DIE();
  if (rmdir(oldRoot)) DIE();

  switch (init_pid = fork())
  {
    case -1:
      DIE();
      break;

    case 0:
      if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
      if (umount2(oldProc,MNT_DETACH)) DIE();
      if (rmdir(oldProc)) DIE();
      printf("INIT: my pid is %lu\n",(unsigned long)getpid());
      break;

    default:
      printf("PARENT: INIT pid is %lu\n",(unsigned long)init_pid);
      wait(NULL);
      break;
  }

  return EXIT_SUCCESS;
}

void Die(int line)
{
  fprintf(stderr,"Error encountered at line %d: %s\n",line,strerror(errno));
  exit(EXIT_FAILURE);
}

========== successful execution on development system ==========
Script started on Fri 06 Jan 2023 08:20:12 PM EST
20:20 $ uname -a
Linux wordsworth 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
20:20 $ cat /etc/slackware-version
Slackware 14.2
20:20 $ rm demo
20:20 $ rm -rf .fauxroot
20:20 $ cc -o demo demo.c
20:20 $ ./demo
PARENT: INIT pid is 558
INIT: my pid is 1
20:20 $ ls -laR .fauxroot
.fauxroot:
total 12
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:20 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 proc

.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 .
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 ..
20:21 $ exit
exit

Script done on Fri 06 Jan 2023 08:21:02 PM EST


========== unsuccessful execution on production system ==========
Script started on Fri Jan 6 20:21:11 2023
~/code/namespaces $ uname -a
Linux merlin 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
~/code/namespaces $ cat /etc/slackware-version
Slackware 14.2
~/code/namespaces $ rm demo
~/code/namespaces $ rm -rf .fauxroot
~/code/namespaces $ cc -o demo demo.c
~/code/namespaces $ ./demo
PARENT: INIT pid is 1651
Error encountered at line 77: Operation not permitted
~/code/namespaces $ nl -ba demo.c | grep ' 77'
77 if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
~/code/namespaces $ ls -laR .fauxroot
.fauxroot:
total 16
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:21 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .oldproc
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 proc

.fauxroot/.oldproc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..

.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..
~/code/namespaces $ exit
exit

Script done on Fri Jan 6 20:22:50 2023
--
Lew Pitcher
"In Skills, We Trust"
Lew Pitcher
2023-01-07 02:12:43 UTC
Post by Lew Pitcher
Hi, all
I've come late to the party, and have just started learning
about the ins and outs of Linux containers. To get a better
understanding of the subject, I decided to learn about the
underlying technologies by building my own container software.
I've modelled my DIY container on Brian Swetland's mkbox
container[1], and have a demonstration program that works
on my development system (a 64bit AMD Ryzen 5 3400G with
Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied).
[1] https://github.com/swetland/mkbox
However, when I run either Brian's mkbox or my demo program
on my "production" system (another 64bit AMD Ryzen 5 3400G
with Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied), the
container breaks while trying to mount the proc filesystem
to the new (isolated) root fs.
Specifically, I get an "Operation not permitted" error when
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.
This failure affects both my DIY container and Brian's mkbox
container.
With my DIY container, I've checked the capabilities given
to the container process, and they are identical and complete
on both systems. On both systems, I run the container process
(mine and Brian's) from the same unprivileged UID/GID.
I have to conclude that there's a difference in the two
environments that causes this problem, but I don't know what
that difference is. Both systems use the same type of CPU, the
same amount of memory, the same 64-bit addressing mode,
the same kernel, and the same distribution (with the same
essential utilities).
On the development system, my user is a member of a
number of groups that it is not a member of on the
"production" system. I run a root pulseaudio (I have my
reasons) on the development system that I do not on
the "production" system. Et cetera.
Can anyone suggest an environmental factor or set of
factors that might cause this behaviour?
[snip]


Well, I can answer my own question, now. But the answer
leads to more questions.

The reason I get "Operation not permitted" on the
container /proc mount on my "production" system is that
I also run an nfs server on my "production" system (and
do not run one on my development system), and this nfs
server maintains two mountpoints within the /proc
filesystem.

Apparently, the attempt to mount /proc within my container
was blocked by the existence of these two mount points
(/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
rpc and nfs servers, and umounted these two mounts, I could
successfully run my demo container.

/Now/ the question is: how do I get my container /proc mount
to ignore or bypass these two nfsd mounts?
--
Lew Pitcher
"In Skills, We Trust"
Jasen Betts
2023-01-07 07:06:37 UTC
Post by Lew Pitcher
Post by Lew Pitcher
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.
Well, I can answer my own question, now. But the answer
leads to more questions.
The reason I get "Operation not permitted" on the
container /proc mount on my "production" system is that
I also run an nfs server on my "production" system (and
do not run one on my development system), and this nfs
server maintains two mountpoints within the /proc
filesystem.
Apparently, the attempt to mount /proc within my container
was blocked by the existence of these two mount points
(/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
rpc and nfs servers, and umounted these two mounts, I could
successfully run my demo container.
/Now/ the question is: how do I get my container /proc mount
to ignore or bypass these two nfsd mounts?
What's the difference between mount() and /bin/mount?
--
Jasen.
pǝsɹǝʌǝɹ sʇɥƃᴉɹ ll∀
Joseph Rosevear
2023-02-18 01:17:58 UTC
Post by Jasen Betts
Post by Lew Pitcher
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.
Well, I can answer my own question, now. But the answer leads to more
questions.
The reason I get "Operation not permitted" on the container /proc mount
on my "production" system is that I also run an nfs server on my
"production" system (and do not run one on my development system), and
this nfs server maintains two mountpoints within the /proc filesystem.
Apparently, the attempt to mount /proc within my container was blocked
by the existence of these two mount points (/proc/fs/nfs and
/proc/fs/nfsd), as when I shut down my rpc and nfs servers, and
umounted these two mounts, I could successfully run my demo container.
/Now/ the question is: how do I get my container /proc mount to ignore
or bypass these two nfsd mounts?
What's the difference between mount() and /bin/mount?
I'm going to try *again* to reply (having trouble):

Because I write bash code, your question looks to me like you are asking
for the difference between a shell function called mount and the mount
executable at /bin/mount.

For example you could write a function definition, source it, and then
use it to perform the real (or modified) /bin/mount. Here is such a
function definition with no modification:

mount() {

    local device
    local point

    device="$1"
    point="$2"

    # quote the arguments so paths containing spaces survive
    /bin/mount "$device" "$point"

    return
}

Notice the "mount()" syntax in the function definition. That is what
prompted me to respond as I did.

-Joe
Henrik Carlqvist
2023-02-18 10:26:21 UTC
Post by Joseph Rosevear
Post by Jasen Betts
What's the difference between mount() and /bin/mount?
Because I write bash code, your question looks to me like you are asking
for the difference between a shell function called mount and the mount
executable at /bin/mount.
For example you could write a function definition, source it, and then
use it to perform the real (or modified) /bin/mount. Here is such a
Actually, it is exactly as you say that the mount command described by:

man 8 mount

is the executable /bin/mount

But, the mount call described by the man page:

man 2 mount

is not some bash function you are supposed to write yourself but the C
API system call to the Linux kernel. Lew in his original post
(crossposted to 4 different newsgroups) was writing a C program called
demo.c. From such a C program it is best to use the C API and call
mount(), but it would also be possible to call system("/bin/mount ...");

regards Henrik

John-Paul Stewart
2023-01-07 16:41:34 UTC
[Followups set to comp.os.linux.misc since I don't read any of the other
groups]
Post by Lew Pitcher
The reason I get "Operation not permitted" on the
container /proc mount on my "production" system is that
I also run an nfs server on my "production" system (and
do not run one on my development system), and this nfs
server maintains two mountpoints within the /proc
filesystem.
Apparently, the attempt to mount /proc within my container
was blocked by the existence of these two mount points
(/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
rpc and nfs servers, and umounted these two mounts, I could
successfully run my demo container.
/Now/ the question is: how do I get my container /proc mount
to ignore or bypass these two nfsd mounts?
In your OP you showed that you've got MS_REC in the mountflags field,
which will cause a recursive mount; i.e., you've explicitly asked for
the inclusion of the NFS-related subtrees. Have you tried without that
flag? MS_BIND would seem a more appropriate choice instead, IMHO, since
it doesn't do the recursion. Then, by default, the subtrees will be
excluded.

See also the section on "Changing the propagation type of an existing
mount" in the mount(2) man page for other ways to prevent the NFS
subtrees from being processed recursively. That might be relevant if
you want to recurse into other parts of the /proc tree, just not the two
directories you've named.
Rainer Weikusat
2023-01-09 19:27:13 UTC
Lew Pitcher <***@digitalfreehold.ca> writes:

[...]
Post by Lew Pitcher
Well, I can answer my own question, now. But the answer
leads to more questions.
The reason I get "Operation not permitted" on the
container /proc mount on my "production" system is that
I also run an nfs server on my "production" system (and
do not run one on my development system), and this nfs
server maintains two mountpoints within the /proc
filesystem.
Apparently, the attempt to mount /proc within my container
was blocked by the existence of these two mount points
(/proc/fs/nfs and /proc/fs/nfsd), as when I shut down my
rpc and nfs servers, and umounted these two mounts, I could
successfully run my demo container.
/Now/ the question is: how do I get my container /proc mount
to ignore or bypass these two nfsd mounts?
Instead of doing a bind mount of a proc filesystem already mounted
somewhere, you could mount a new instance of it. The command for this
would be

mount -t proc proc <mount point>

You'll generally also want to mount sysfs, BTW.