Lew Pitcher
2023-01-07 01:27:28 UTC
Hi, all
I've come late to the party, and have just started learning
about the ins and outs of Linux containers. To get a better
understanding of the subject, I decided to learn about the
underlying technologies by building my own container software.
I've modelled my DIY container on Brian Swetland's mkbox
container[1], and have a demonstration program that works
on my development system (a 64-bit AMD Ryzen 5 3400G with
Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied).
[1] https://github.com/swetland/mkbox
However, when I run either Brian's mkbox or my demo program
on my "production" system (another 64-bit AMD Ryzen 5 3400G
with Radeon Vega Graphics, running Slackware Linux 14.2 with
the 4.4.301 kernel and all available patches applied), the
container breaks while trying to mount the proc filesystem
to the new (isolated) root fs.
Specifically, I get an "Operation not permitted" error when
I try to
mount("proc","proc","proc",MS_REC,NULL)
/but/ ONLY ON THIS ONE SYSTEM.
This failure affects both my DIY container and Brian's mkbox
container.
With my DIY container, I've checked the capabilities given
to the container process, and they are identical and complete
on both systems. On both systems, I run the container process
(mine and Brian's) from the same unprivileged UID/GID.
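For anyone who wants to repeat the capability comparison, here is a minimal sketch of one way to do it, assuming the check is made by reading the Cap* lines of /proc/self/status from inside the container process. The names parse_cap_line() and dump_caps() are hypothetical helpers made up for this illustration; they are not part of demo.c or mkbox.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of /proc/<pid>/status. If it is a capability line
 * (CapInh, CapPrm, CapEff, CapBnd, CapAmb), copy the field name into
 * name[], store the hex mask in *mask and return 1; otherwise return 0. */
int parse_cap_line(const char *line, char name[8], unsigned long long *mask)
{
    char field[8];
    unsigned long long value;

    assert(line != NULL && name != NULL && mask != NULL);
    if (strncmp(line, "Cap", 3) != 0)
        return 0;
    if (sscanf(line, "%6[A-Za-z]: %llx", field, &value) != 2)
        return 0;
    strcpy(name, field);
    *mask = value;
    return 1;
}

/* Print every capability mask of the calling process to out.
 * Returns the number of masks printed, or -1 if the status file
 * cannot be opened. */
int dump_caps(FILE *out)
{
    char line[256], name[8];
    unsigned long long mask;
    int count = 0;
    FILE *status = fopen("/proc/self/status", "r");

    if (status == NULL)
        return -1;
    while (fgets(line, sizeof line, status) != NULL) {
        if (parse_cap_line(line, name, &mask)) {
            fprintf(out, "%s %016llx\n", name, mask);
            count++;
        }
    }
    fclose(status);
    return count;
}
```

Calling dump_caps(stderr) just before the failing mount() on each system gives two short listings that can be compared with diff.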
I have to conclude that there's a difference in the two
environments that causes this problem, but I don't know what
that difference is. Both systems use the same type of CPU, the
same amount of memory, the same 64-bit addressing mode,
the same kernel, and the same distribution (with the same
essential utilities).
There /are/ differences in the two systems:
on the development system, my user is a member of a
number of groups that it is not a member of on the
"production" system. I run a root pulseaudio (I have my
reasons) on the development system that I do not on
the "production" system. Et cetera.
Can anyone suggest an environmental factor or set of
factors that might cause this behaviour?
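One piece of environment data that could be collected on both systems for comparison is the set of mounts at or below /proc, since anything stacked on top of the existing proc mount is a candidate difference. This is an assumption about where to look, not a diagnosis. The sketch below reads /proc/self/mountinfo and prints the matching lines; is_under_proc(), mountinfo_mount_point() and list_proc_mounts() are hypothetical names made up for this illustration, and mount points containing escaped characters (such as \040 for a space) are not decoded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if path is /proc itself or lies below it, else 0. */
int is_under_proc(const char *path)
{
    assert(path != NULL);
    if (strncmp(path, "/proc", 5) != 0)
        return 0;
    return path[5] == '\0' || path[5] == '/';
}

/* Copy the mount point (the 5th field of a mountinfo line) into mnt[],
 * which must hold at least 4096 bytes. Returns 1 on success, else 0. */
int mountinfo_mount_point(const char *line, char *mnt)
{
    assert(line != NULL && mnt != NULL);
    return sscanf(line, "%*d %*d %*s %*s %4095s", mnt) == 1;
}

/* Print every mountinfo line whose mount point is at or below /proc.
 * Returns the number of lines printed, or -1 if mountinfo cannot be
 * opened. */
int list_proc_mounts(FILE *out)
{
    char line[8192], mnt[4096];
    int count = 0;
    FILE *info = fopen("/proc/self/mountinfo", "r");

    if (info == NULL)
        return -1;
    while (fgets(line, sizeof line, info) != NULL) {
        if (mountinfo_mount_point(line, mnt) && is_under_proc(mnt)) {
            fputs(line, out);
            count++;
        }
    }
    fclose(info);
    return count;
}
```

Running list_proc_mounts(stdout) as the same unprivileged user on both systems, before starting the container, yields output that can be compared line by line.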
For reference, I include a copy of a minimal implementation
of my DIY container that illustrates the problem, along with
captures of both a successful run on my development system
and an unsuccessful run on my production system.
========== demo.c ==========
/*
** demonstrate selective problem with Slackware Linux 14.2
** user namespace creation (Kernel 4.4.301)
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sched.h>
#include <string.h>
#include <errno.h>
/* pivot_root() prototype not supplied by headers */
extern int pivot_root(const char *new_root, const char *put_old);
void Die(int line); /* generate error message and exit process */
#define DIE() Die(__LINE__)
int main(void)
{
    char *fauxRoot = "./.fauxroot", /* will be our new root filesystem */
         *oldRoot = ".oldroot", /* where pivot_root puts old root fs */
         *oldProc = ".oldproc", /* where we temp relocate /proc to */
         *newProc = "proc"; /* where we mount /proc to */
    pid_t init_pid;
    umask(0);
    rmdir(fauxRoot); if (mkdir(fauxRoot,0777)) DIE();
    if (unshare(CLONE_NEWUSER|CLONE_NEWNS|CLONE_NEWPID)) DIE();
    if (mount("none","/",NULL,MS_REC|MS_PRIVATE,NULL)) DIE();
    if (mount(fauxRoot,fauxRoot,NULL,MS_BIND|MS_NOSUID,NULL)) DIE();
    if (chdir(fauxRoot)) DIE();
    rmdir(oldRoot); if (mkdir(oldRoot,0751)) DIE();
    rmdir(oldProc); if (mkdir(oldProc,0755)) DIE();
    rmdir(newProc); if (mkdir(newProc,0755)) DIE();
    if (mount("/proc",oldProc,NULL,MS_BIND|MS_REC,NULL)) DIE();
    /* set new uid, gid */
    {
        FILE *map;
        if ((map = fopen("/proc/self/uid_map","w")) == NULL) DIE();
        fprintf(map,"0 %lu 1\n",(unsigned long)getuid());
        fclose(map);
        if ((map = fopen("/proc/self/setgroups","w")) == NULL) DIE();
        fwrite("deny",4,1,map);
        fclose(map);
        if ((map = fopen("/proc/self/gid_map","w")) == NULL) DIE();
        fprintf(map,"0 %lu 1\n",(unsigned long)getgid());
        fclose(map);
    }
    if (pivot_root(".",oldRoot)) DIE();
    if (umount2(oldRoot,MNT_DETACH)) DIE();
    if (rmdir(oldRoot)) DIE();
    switch (init_pid = fork())
    {
        case -1:
            DIE();
            break;
        case 0:
            if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
            if (umount2(oldProc,MNT_DETACH)) DIE();
            if (rmdir(oldProc)) DIE();
            printf("INIT: my pid is %lu\n",(unsigned long)getpid());
            break;
        default:
            printf("PARENT: INIT pid is %lu\n",(unsigned long)init_pid);
            wait(NULL);
            break;
    }
    return EXIT_SUCCESS;
}
void Die(int line)
{
    fprintf(stderr,"Error encountered at line %d: %s\n",line,strerror(errno));
    exit(EXIT_FAILURE);
}
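A side note on the uid_map, setgroups and gid_map writes in demo.c: they go through buffered stdio, so the kernel only sees the data at fclose(), and the return value of fclose() is not checked. A rejected map would therefore pass silently. The sketch below is one way to make such a failure visible, using a single write(2) per file. The helper name write_once() is hypothetical and made up for this illustration, and nothing here is claimed to be the cause of the EPERM on mount().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write text to an existing file with a single write(2) call, so that
 * a write rejected by the kernel is reported to the caller.
 * Returns 0 on success, or -1 with errno set on failure. */
int write_once(const char *path, const char *text)
{
    size_t len;
    ssize_t done;
    int fd, saved;

    assert(path != NULL && text != NULL);
    len = strlen(text);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    done = write(fd, text, len);
    saved = errno;              /* close() may overwrite errno */
    if (close(fd) != 0 && done == (ssize_t)len)
        return -1;              /* write succeeded, close failed */
    if (done != (ssize_t)len) {
        /* report the write error, or EIO for a short write */
        errno = (done < 0) ? saved : EIO;
        return -1;
    }
    return 0;
}
```

In demo.c this could be used as, for example, if (write_once("/proc/self/setgroups","deny")) DIE(); so that a failed write is reported with the line number of the call.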
========== successful execution on development system ==========
Script started on Fri 06 Jan 2023 08:20:12 PM EST
20:20 $ uname -a
Linux wordsworth 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
20:20 $ cat /etc/slackware-version
Slackware 14.2
20:20 $ rm demo
20:20 $ rm -rf .fauxroot
20:20 $ cc -o demo demo.c
20:20 $ ./demo
PARENT: INIT pid is 558
INIT: my pid is 1
20:20 $ ls -laR .fauxroot
.fauxroot:
total 12
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:20 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 proc
.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:20 .
drwxrwxrwx 3 lpitcher users 4096 Jan 6 20:20 ..
20:21 $ exit
exit
Script done on Fri 06 Jan 2023 08:21:02 PM EST
========== unsuccessful execution on production system ==========
Script started on Fri Jan 6 20:21:11 2023
~/code/namespaces $ uname -a
Linux merlin 4.4.301 #1 SMP Mon Jan 31 20:27:28 CST 2022 x86_64 AMD Ryzen 5 3400G with Radeon Vega Graphics AuthenticAMD GNU/Linux
~/code/namespaces $ cat /etc/slackware-version
Slackware 14.2
~/code/namespaces $ rm demo
~/code/namespaces $ rm -rf .fauxroot
~/code/namespaces $ cc -o demo demo.c
~/code/namespaces $ ./demo
PARENT: INIT pid is 1651
Error encountered at line 77: Operation not permitted
~/code/namespaces $ nl -ba demo.c | grep ' 77'
77 if (mount("/proc",newProc,"proc",MS_REC,NULL)) DIE();
~/code/namespaces $ ls -laR .fauxroot
.fauxroot:
total 16
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 .
drwxr-xr-x 6 lpitcher users 4096 Jan 6 20:21 ..
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .oldproc
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 proc
.fauxroot/.oldproc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..
.fauxroot/proc:
total 8
drwxr-xr-x 2 lpitcher users 4096 Jan 6 20:21 .
drwxrwxrwx 4 lpitcher users 4096 Jan 6 20:21 ..
~/code/namespaces $ exit
exit
Script done on Fri Jan 6 20:22:50 2023
--
Lew Pitcher
"In Skills, We Trust"