Containers are a current virtualization and application delivery trend, and one I have been working with. If you search Google for them, you can find tons of how-tos, tech guides and other information. As with anything in IT, containers come in several flavors: the main players are Docker, LXC/LXD (on which Docker was once based), CoreOS rkt, OpenVZ, etc. If you have a look at Google Trends, you'll notice that the undoubted winner of the hype is Docker, and "the others" try to fight against it.
But since there are several alternatives, I wanted to learn about the underlying technology, and it turns out that all of them are based on the same set of kernel features: mainly Linux namespaces and cgroups. The most important differences lie in the utilities each product provides to automate the procedures (the image repository, container management and the other parts of its ecosystem).
Disclaimer: This is not a research blog, so I am not going in depth on when namespaces were introduced in the kernel, which namespaces exist, how they work, what copy-on-write is, what cgroups are, etc. The purpose of this post is simply "fun with containers" 🙂
In the end, the "hard work" (i.e. the execution of a containerized environment) is done by the Linux kernel. And so this time I learned…
How to run Docker containers using common Linux tools (without Docker).
We start from a scenario in which we have one container running in Docker, and we want to run it using standard Linux tools. We will mainly act as a regular user that has permission to run Docker containers (in my case, on Ubuntu, my user calfonso is in the 'docker' group), to see that we can run containers in user space.
To run a contained environment with its own namespaces using standard Linux tools, you can follow this procedure:
calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar
calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/
calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot $PWD/rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp
At this point you need to set up the network devices (from outside the container) and deal with the cgroups (if you need to).
First of all, we prepare a folder for our tests (handcontainer) and then dump the filesystem of the container:
calfonso:~$ mkdir handcontainer
calfonso:~$ cd handcontainer
calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar
If we inspect the tar file produced, we'll see that it contains the whole filesystem of the container.
Let's extract it into a new folder (called rootfs):
calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/
This action will raise some errors, because only the root user can run mknod, which is needed to recreate the entries in the /dev folder; this is fine for us because we are not dealing with devices.
If we check the contents of rootfs, the whole filesystem is there, and we can chroot into it to verify that we can use it (more or less) as if it were the actual system.
The chroot technique is well known and was enough in the early days, but it provides no isolation, as the next commands show:
/ # ip link
/ # mount -t proc proc /proc && ps -ef
/ # hostname
With these commands we can manipulate the network of the host, interact with the processes of the host, or change the hostname.
This is because chroot only changes the root filesystem for the current session; it takes no other isolation measures.
Some words on namespaces
Part of the "magic" of containers lies in namespaces (you can read more about them in this link). A namespace gives a process its own particular view of "the things" in a certain area. The namespaces currently available in Linux are the following:
- Mounts namespace: mount points.
- PID namespace: process number.
- IPC namespace: Inter Process Communication resources.
- UTS namespace: hostname and domain name.
- Network namespace: network resources.
- User namespace: User and Group ID numbers.
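The namespaces a process belongs to can easily be inspected from userspace: they appear as symbolic links under /proc/&lt;pid&gt;/ns, and two processes share a namespace exactly when the corresponding links point to the same inode. A quick look (no special privileges needed):

```shell
# List the namespaces of the current shell; each link encodes the
# namespace type and its inode number, e.g. "uts:[4026531838]".
ls -l /proc/$$/ns

# The inode number is the namespace identity: equal inodes = same namespace.
readlink /proc/$$/ns/uts
readlink /proc/1/ns/uts 2>/dev/null || echo "(not allowed to inspect PID 1)"
```

Comparing the links of two shells started normally will show identical inodes: by default every process lives in the same (root) namespaces.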
Namespaces are handled by the Linux kernel, and every process already belongs to one namespace of each kind (i.e. the root namespace). So changing the namespaces of one particular process does not introduce additional complexity for the processes.
Creating particular namespaces for particular processes means that a process will have its own view of the resources in those namespaces. As an example, if a process is started with its own PID namespace, its PID number inside the namespace will be 1 and its children will get the subsequent PID numbers. Or if a process is started with its own NET namespace, it will have its own stack of network devices.
A parent namespace is able to manipulate its nested namespaces… It is a "hard" sentence, but what it means is that the root namespace is always able to manipulate the resources in the nested namespaces. So the root user of a host keeps the whole vision of all the namespaces.
Now that we know about namespaces, we want to use them 😉
We can think of a container as one process (e.g. a /bin/bash shell) that has its particular root filesystem, its particular network, its particular hostname, its particular PIDs and users, etc. And this can be achieved by creating all these namespaces and spawning the /bin/bash process inside them.
The Linux kernel provides the system calls clone, setns and unshare, which make it easy to manipulate the namespaces of processes. Common Linux distributions also provide the commands unshare and nsenter, which allow manipulating the namespaces of processes and applications from the command line.
If we get back to the main host, we can use the command unshare to create a process with its own namespaces:
calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash
It seems that nothing happened, except that we are now "root", but if we start using commands that manipulate features of the host, we'll see what really happened.
If we echo the PID of the current process ($$), we can see that it is 1 (the main process); the current user has UID and GID 0 (i.e. root); we have no network devices; and we can manipulate the hostname…
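The same observation can be reproduced non-interactively. This one-liner (a sketch that assumes unprivileged user namespaces are enabled, as they are on most current distributions) starts a command in fresh user and PID namespaces and prints the PID and UID as seen from inside:

```shell
# Inside the new namespaces the shell is PID 1 and is mapped to UID 0,
# even though outside it still runs as our regular user.
unshare --user --map-root-user --pid --fork \
    sh -c 'echo "pid=$$ uid=$(id -u)"' \
  || echo "unprivileged user namespaces are not available on this system"
```

When user namespaces are available, this prints "pid=1 uid=0".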
If we check the processes in the host from another terminal, we'll see that even though we are shown as 'root' inside, outside the namespace our process still runs under the credentials of our regular user.
This is the magic of the PID namespace: a single process has different PID numbers depending on the namespace from which it is observed.
Back in our "unshared" environment, if we try to list the processes that are currently running, we'll still get the view of the processes of the host.
This is because of how Linux works: processes are exposed through the /proc mount point and, in our environment, we still see the host's existing mount points. But since we have our own mnt namespace, we can mount our own /proc filesystem.
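Inside the unshared environment the fix is simply to mount a fresh procfs on top of /proc; because we have our own mount namespace, the host does not see it. The following one-liner sketches the whole effect from outside (again assuming unprivileged user namespaces):

```shell
# Create user+mount+PID namespaces, remount /proc there and list processes:
# ps now only sees the PID namespace's own processes (sh and ps itself).
unshare --user --map-root-user --mount --pid --fork \
    sh -c 'mount -t proc none /proc && ps -ef' \
  || echo "unprivileged user namespaces are not available on this system"
```

(util-linux's unshare even has a --mount-proc flag that performs this remount for you.)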
From outside the container, we can create a network device and move it into the namespace.
And if we get back to our "unshared" environment, we'll see that we have a new network device.
The network setup is incomplete and we will have access to nowhere (the peer of our eth0 is not connected to any network). A full setup falls out of the scope of this post, but the main idea is that you will need to connect the peer to some bridge, set an IP address for the eth0 inside the unshared environment, set up NAT in the host, etc.
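As a hedged sketch of that host-side setup (the device name veth-host, the 10.0.3.0/24 subnet and the CONTAINER_PID variable, which should hold the PID of the unshared shell, are all made up for illustration), the steps could look like this; the function only runs when we are root and CONTAINER_PID is set:

```shell
# Sketch: wire the unshared net namespace to the host with a veth pair + NAT.
# All names and addresses are illustrative; adapt them to your system.
setup_net() {
    ip link add veth-host type veth peer name eth0   # create the veth pair
    ip link set eth0 netns "$CONTAINER_PID"          # move one end inside
    ip addr add 10.0.3.1/24 dev veth-host            # host-side address
    ip link set veth-host up
    # inside the namespace you would then run:
    #   ip addr add 10.0.3.2/24 dev eth0 && ip link set eth0 up
    # and NAT on the host so the namespace can reach the outside world:
    iptables -t nat -A POSTROUTING -s 10.0.3.0/24 -j MASQUERADE
    sysctl -w net.ipv4.ip_forward=1
}

# Only attempt it when we really can (root) and know the target namespace.
if [ "$(id -u)" -eq 0 ] && [ -n "${CONTAINER_PID:-}" ]; then
    setup_net
else
    echo "skipping: needs root and CONTAINER_PID set to the unshared shell PID"
fi
```

Instead of a plain veth peer you could attach the host end to a bridge shared by several namespaces; the idea is the same.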
Obtaining the filesystem of the container
Now that we are in an “isolated” environment, we want to have the filesystem, utilities, etc. from the container that we started. And this can be done with our old friend “chroot” and some mounts:
root:handcontainer# chroot rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp
Using chroot, the filesystem changes and we can use all the new mount points, commands, etc. in that filesystem. So now we have the vision of being inside an isolated environment with an isolated filesystem.
Now we have finished setting up a “hand made container” from an existing Docker container.
Apart from the "contained environment", Docker containers are also managed inside cgroups. Cgroups make it possible to account for and limit the resources that processes are able to use (i.e. CPU, I/O, memory and devices), which is interesting to better control what resources processes are allowed to use (and how).
It is possible to explore the cgroups under the path /sys/fs/cgroup. In that folder you will find the different cgroups managed in the system. Dealing with cgroups is a bit "obscure" (creating subfolders, adding PIDs to files, etc.) and will be left for an eventual future post.
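Even without manipulating them, cgroups are easy to look at: the hierarchy is just a filesystem, and every process records its membership in /proc/&lt;pid&gt;/cgroup. For example:

```shell
# The controllers (or, with cgroup v2, the unified tree) live here:
ls /sys/fs/cgroup

# And this file shows which cgroup(s) the current shell belongs to:
cat /proc/self/cgroup
```

On a host running Docker you will typically spot a docker subfolder (or docker-scoped entries) containing one cgroup per container.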
Another feature that Docker offers is the layered filesystem. Docker uses layered filesystems basically to share a common base filesystem and track only the modifications. So there is a set of common layers shared by different containers (which will not be modified), and each container has its own layer that makes its filesystem unique from the others.
In our case we used a simple flat filesystem for the container, which we used as the root filesystem of our contained environment. Dealing with layered filesystems will be a new post 😉
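Just to give an idea of the mechanism, here is a hedged sketch using the kernel's overlay filesystem (one of the storage drivers Docker can use); the directory names are made up, and the mount itself needs root and overlay support, so the sketch skips it otherwise:

```shell
# Sketch: a read-only lower layer plus a writable upper layer, merged.
mkdir -p lower upper work merged
echo "from the shared layer" > lower/base.txt

if [ "$(id -u)" -eq 0 ] && \
   mount -t overlay overlay \
       -o lowerdir=lower,upperdir=upper,workdir=work merged 2>/dev/null; then
    echo "only in this container" > merged/mine.txt
    ls merged     # the union of both layers: base.txt and mine.txt
    ls upper      # only the modification (mine.txt) is stored here
    umount merged
else
    echo "skipping the overlay mount (needs root and overlay support)"
fi
```

This is exactly the trick that lets many containers share one base image while each keeps only its own changes.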
Well, in this post we tried to understand how containers work, and we saw that they are a relatively simple feature offered by the kernel. But it takes a lot of steps to get a properly configured container (remember that we left cgroups out of this post).
We did these steps just because we could… just to better understand containers.
My advice is to use the existing technologies to get well-built containers (e.g. Docker).
As in other posts, I wrote this just to organize my own concepts into a very simple step-by-step procedure. But you can find a lot of resources about containers using your favourite search engine. The most useful resources that I found are: