How to deal with the Union File Systems that Docker uses (OverlayFS and AUFS)

Containers are a modern application delivery mechanism (and a very interesting one for software reproducibility). As I commented in my previous post, the undoubted winner of the hype is Docker. Most container implementations (e.g. Docker) rely on features of the Linux kernel: namespaces, cgroups and other technologies, plus a chroot into the filesystem of the system to virtualize.

Using lightweight virtualization increases the density of the virtualized units, but many containers may share the same base system (e.g. the plain OS with some utilities installed) and only modify a few files (e.g. installing one application or updating the configuration files). That is why Docker and others use Union File Systems to implement the filesystems of the virtualized units.

The recent publication of the NIST “Application Container Security Guide” suggests that “An image should only include the executables and libraries required by the app itself; all other OS functionality is provided by the OS kernel within the underlying host OS. Images often use techniques like layering and copy-on-write (in which shared master images are read only and changes are recorded to separate files) to minimize their size on disk and improve operational efficiency”. This strengthens the case for Union File Systems in containers, which Docker has used for as long as I can remember (AUFS, OverlayFS, Overlay2, etc.).

The conclusion is that Union File Systems are actually used in Docker, and I need to understand how to deal with them if anything fails. So this time I learned…

How to deal with the Union File Systems that Docker uses (AUFS, OverlayFS and Overlay2)


First, it is important to know how AUFS works. It is intuitively simple and I think that it is well explained in the Docker documentation. The next image is taken from the Docker documentation (copied here in case the URL changes):


The idea is to have a set of layers, each consisting of a directory tree, which are combined to show a single tree that is the result of the ordered combination of the different directory trees. The order is important, because if one file is present in more than one directory tree, you will only see the version in the “upper layer”.

There are several read-only layers, and a working layer that gathers the modifications made in the resulting filesystem: if some files are modified (or added), the new version will appear in the working layer; if any file is deleted, some metadata is added to the working layer to instruct the kernel to hide the file in the resulting filesystem (we’ll see some practical examples).

Working with AUFS

We will prepare a test example to see how AUFS works and how easy it is to understand and to work with.

We’ll have 2 base folders (named layer1 and layer2) and a working folder (named upperlayer), plus a folder named mountedfs that will hold the combined filesystem. The next commands will create the starting scenario:

root:root# cd /tmp
root:tmp# mkdir aufs-test
root:tmp# cd aufs-test/
root:aufs-test# mkdir layer1 layer2 upperlayer mountedfs
root:aufs-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:aufs-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:aufs-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:aufs-test# echo "content for file3.txt in layer2" > layer2/file3.txt

Both layer1 and layer2 have a file with the same name (file1.txt) with different content, and there are different files in each layer. The result is shown in the next figure:


Now we’ll mount the filesystem using the basic syntax:

root:aufs-test# mount -t aufs -o br:upperlayer:layer1:layer2 none mountedfs

The result is that folder mountedfs contains the union of the files that we have created:


The whole syntax and the options are explained in the manpage of aufs (i.e. man aufs), but we’ll be using the basic options.

The key for us is the option br, in which we set the branches (i.e. layers) that will be unioned in the resulting filesystem. They have precedence from left to right. That means that if one file exists in two layers, the version shown in the AUFS filesystem will be the version of the leftmost layer.

In the next figure we can see the contents of the files in the mounted AUFS folder:


In our case, file1.txt contains “content for file1.txt in layer1”, as expected because of the order of the layers.

Now if we create a new file (file4.txt) with the content “new content for file4.txt”, it will be created in the folder upperlayer:


If we delete the file “file1.txt”, it will be kept in each of the layers (i.e. folders layer1 and layer2). But it will be marked as deleted in folder upperlayer by including some control files (although these files will not be shown in the resulting mounted filesystem).


The key for the AUFS driver is the set of files named .wh.*. In this case, we can see that the deletion is reflected in the upperlayer folder by creating the file .wh.file1.txt. That file instructs AUFS to hide the file in the resulting mount point. If we create the file again, it will appear again and the control file for the deletion will be removed.


Of course, the content of file1.txt in layer1 and layer2 folders is kept.
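The lookup rule described above (leftmost branch wins, and a .wh.<name> file hides the name in the branches to its right) can be imitated with plain directories. This is only a sketch of the rule, not of how the AUFS driver is implemented; the function name union_lookup is made up for the illustration, and it needs neither root nor the aufs module:

```shell
# Emulate the AUFS lookup rule: print the branch copy that the union would
# expose for a file name, or fail if the name is hidden or missing.
# Usage: union_lookup NAME BRANCH...   (branches ordered left to right)
union_lookup() (
    name=$1
    shift
    for branch in "$@"; do
        # A whiteout in a branch hides the file in every branch to its right.
        if [ -e "$branch/.wh.$name" ]; then
            return 1
        fi
        if [ -e "$branch/$name" ]; then
            printf '%s\n' "$branch/$name"
            return 0
        fi
    done
    return 1
)

# Demo in a throwaway directory that mirrors the example layout.
demo=$(mktemp -d)
mkdir "$demo/upperlayer" "$demo/layer1" "$demo/layer2"
echo "content for file1.txt in layer1" > "$demo/layer1/file1.txt"
echo "content for file1.txt in layer2" > "$demo/layer2/file1.txt"
echo "content for file3.txt in layer2" > "$demo/layer2/file3.txt"

cd "$demo"
union_lookup file1.txt upperlayer layer1 layer2   # layer1 wins over layer2
union_lookup file3.txt upperlayer layer1 layer2   # only present in layer2

# Simulate "rm mountedfs/file1.txt": AUFS would add a whiteout in upperlayer.
touch upperlayer/.wh.file1.txt
union_lookup file1.txt upperlayer layer1 layer2 || echo "file1.txt is hidden"
```

Removing upperlayer/.wh.file1.txt makes the copy in layer1 visible again, which matches the behaviour of the real mount when the file is re-created.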

Docker and AUFS

Docker makes use of AUFS in Ubuntu (although it is being replaced by overlay2). We’ll explore a bit by running a container and searching for its filesystem…


We can see that our container ID is d5afc60dbfd7. If we list the mounts, by simply typing the command “mount”, we’ll see that we have a new AUFS mount point:


Well… we are smart and we know that they are related… but how? We need to check the folder /var/lib/docker/image/aufs/layerdb/mounts/ and there we will find a folder named after the ID of our container (d5afc60dbfd7…), with several files in it:


The mount-id file contains an ID that corresponds to a folder in /var/lib/docker/aufs/mnt/, which holds the unioned filesystem that is the root filesystem for container d5afc60dbfd7. That folder corresponds to the mount point that we saw when we inspected the mount points before.


In the folder /var/lib/docker/aufs/layers we can inspect the information about the layers. In particular, we can see the content of the file that corresponds to the ID of our mount point:


That content corresponds to the layers that have been used to create the mount point at /var/lib/docker/aufs/mnt/. The directory trees of these layers are stored in folders with the corresponding names in /var/lib/docker/aufs/diff. In the case of our container, if we create a new file, we can see that it appears in the working layer.
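The chain of lookups described above (container ID, then mount-id, then the layers file) can be wrapped in a small helper. This is a sketch under assumptions: the function name aufs_layers_of is invented here, the data directory is assumed to be the default /var/lib/docker, and the read/write layer is assumed to live in aufs/diff under the mount ID. Reading these folders normally requires root.

```shell
# Print the unioned root filesystem and the layer folders of a container
# that uses the aufs storage driver.
# DOCKER_ROOT is an assumption (the default data directory); override it
# if your daemon stores its data elsewhere.
DOCKER_ROOT=${DOCKER_ROOT:-/var/lib/docker}

aufs_layers_of() (
    container_id=$1   # full container ID, as in "docker ps --no-trunc"
    mount_id_file="$DOCKER_ROOT/image/aufs/layerdb/mounts/$container_id/mount-id"
    [ -r "$mount_id_file" ] || { echo "no AUFS mount for $container_id" >&2; return 1; }
    mount_id=$(cat "$mount_id_file")
    echo "root filesystem: $DOCKER_ROOT/aufs/mnt/$mount_id"
    echo "read/write layer: $DOCKER_ROOT/aufs/diff/$mount_id"
    # Each line of the layers file names one lower layer under aufs/diff.
    while IFS= read -r layer; do
        echo "read-only layer: $DOCKER_ROOT/aufs/diff/$layer"
    done < "$DOCKER_ROOT/aufs/layers/$mount_id"
)
```

It would be called as root with the full ID of the container, e.g. aufs_layers_of "$(docker inspect -f '{{.Id}}' d5afc60dbfd7)".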


OverlayFS and Overlay2

While AUFS is only supported in some distributions (Debian, Gentoo, etc.), OverlayFS is included in the mainline Linux kernel.

The schema of OverlayFS is shown in the next image (obtained from Docker docs):


The underlying idea is the same as in AUFS, and the concepts are almost the same: layers that are combined to build a unioned filesystem. The lowerdir entries play the role of the read-only layers in AUFS, while the upperdir corresponds to the read/write layer.

OverlayFS needs an extra folder named workdir, which is used to support atomic operations on the filesystem (files are prepared there before being moved into the upperdir). Deletions are not recorded with .wh.* hidden files as in AUFS; OverlayFS marks them in the upperdir with whiteout entries (character devices with device number 0/0).

The main difference between OverlayFS and Overlay2 is that OverlayFS only supported merging 1 single read-only layer with 1 read/write layer (although overlaying could be nested by overlaying overlaid layers), whereas Overlay2 supports up to 128 lower layers.

Working with OverlayFS

We will prepare an equivalent test example to see how OverlayFS works and how easy it is to understand and to work with.

root:root# cd /tmp
root:tmp# mkdir overlay-test
root:tmp# cd overlay-test/
root:overlay-test# mkdir layer1 layer2 upperlayer workdir mountedfs
root:overlay-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:overlay-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:overlay-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:overlay-test# echo "content for file3.txt in layer2" > layer2/file3.txt

With respect to the AUFS example, we had to include an extra folder (workdir) that is needed for OverlayFS to work.


Now we’ll mount the filesystem using the basic syntax:

root:overlay-test# mount -t overlay -o lowerdir=layer1:layer2,upperdir=upperlayer,workdir=workdir overlay mountedfs

The result is that folder mountedfs contains the union of the files that we have created:


As expected, we have the view of the union of the 2 existing layers, and the contents of these files are the expected ones. The lowerdir folders are interpreted from left to right in terms of precedence. So if one file exists in different lowerdirs, the unioned filesystem will show the file in the leftmost lowerdir.

Now if we create a new file, the new contents will be created in the upperdir folder, while the contents in the other folders will be kept.
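When there are many lower layers, writing the option string by hand gets error-prone. A small sketch of a helper that builds it (the name overlay_opts is made up for this example; it only prints text and mounts nothing):

```shell
# Build the -o option string for "mount -t overlay".
# Usage: overlay_opts UPPERDIR WORKDIR LOWERDIR...   (leftmost lowerdir wins)
overlay_opts() (
    upper=$1
    work=$2
    shift 2
    lower=
    for dir in "$@"; do
        # Join the lower layers with ":" without a leading separator.
        lower="${lower:+$lower:}$dir"
    done
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$lower" "$upper" "$work"
)

opts=$(overlay_opts upperlayer workdir layer1 layer2)
echo "mount -t overlay -o $opts overlay mountedfs"
```

The echoed line is the same mount command used in the example; running it for real still requires root.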


What can I do with this?

Apart from understanding how to better debug products such as Docker, you will be able to start containers in the Docker way, using layered filesystems and the tools shown in a previous post.


How to run Docker containers using common Linux tools (without Docker)

Containers are a current virtualization and application delivery trend, and I am working on it. If you search Google about them, you can find tons of how-tos, information, tech guides, etc. As with anything in IT, there are several flavors of containers. In this case, the players are Docker, LXC/LXD (on which Docker was once based), CoreOS rkt, OpenVZ, etc. If you have a look at Google Trends, you’ll notice that undoubtedly the winner of the hype is Docker and “the others” try to fight against it.


But as there are several alternatives, I wanted to learn about the underlying technology, and it seems that all of them are simply based on a set of kernel features: mainly the Linux namespaces and the cgroups. The most important differences are the utilities that they provide to automate the procedures (the repository of images, container management and other parts of the ecosystem of a particular product).

Disclaimer: This is not a research blog, and so I am not going in depth on when namespaces were introduced in the kernel, which namespaces exist, how they work, what is copy on write, what are cgroups, etc. The purpose of this post is simply “fun with containers” 🙂

In the end, the “hard work” (i.e. the execution of a containerized environment) is done by the Linux kernel. And so this time I learned…

How to run Docker containers using common Linux tools (without Docker).

We start from a scenario in which we have one container running in Docker, and we want to run it using standard Linux tools. We will mainly act as a common user that has permissions to run Docker containers (i.e. in the case of Ubuntu, my user calfonso is in group ‘docker’), to see that we can run containers in the user space.



To run a contained environment with its own namespaces using standard Linux tools, you can follow this procedure:

calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar
calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/
calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot $PWD/rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

At this point you need to set up the network devices (from outside the container) and deal with the cgroups (if you need to).

First, we prepare a folder for our tests (handcontainer) and then dump the filesystem of the container:

calfonso:~$ mkdir handcontainer
calfonso:~$ cd handcontainer
calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar

If we check the tar file produced, we’ll see that it is the whole filesystem of the container


Let’s extract it in a new folder (called rootfs)

calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/

This action will raise some errors, because only the root user can create device nodes (mknod) and they are needed for the /dev folder, but that is fine for us because we are not dealing with devices.


If we check the contents of rootfs, the filesystem is there and we can chroot to that filesystem to verify that we can use it (more or less) as if it was the actual system.


The chroot technique is well known and it was enough in the early days, but we have no isolation in this system. It is exposed if we use the next commands:

/ # ip link
/ # mount -t proc proc /proc && ps -ef
/ # hostname

In these cases, we can manipulate the network of the host, interact with the processes of the host or manipulate the hostname.

This is because using chroot only changes the root filesystem for the current session, but it takes no other action.

Some words on namespaces

One of the “magic” ingredients of containers is namespaces (you can read more on this in this link). Namespaces give a process its own view of “the things” in several areas. The namespaces used in this post are the following (newer kernels add others, such as the cgroup namespace):

  • Mounts namespace: mount points.
  • PID namespace: process IDs.
  • IPC namespace: Inter Process Communication resources.
  • UTS namespace: hostname and domain name.
  • Network namespace: network resources.
  • User namespace: User and Group ID numbers.

Namespaces are handled in the Linux kernel, and every process already belongs to one namespace of each kind (by default, the root namespace). So changing the namespaces of one particular process does not introduce additional complexity for the processes.

Creating particular namespaces for particular processes means that one process will have its own view of the resources in that namespace. As an example, if one process is started with its own PID namespace, the PID of that process will be 1 and its children will get the following PID numbers. Or if one process is started with its own NET namespace, it will have its own stack of network devices.

The parent namespace of one namespace is able to manipulate the nested namespace… It is a “hard” sentence, but what this means is that the root namespace is always able to manipulate the resources in the nested namespaces. So the root of one host has the whole vision of the namespaces.
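The namespaces that a process belongs to are visible as symbolic links under /proc/PID/ns, which gives a quick way to check what has been said so far. A minimal sketch (the function name show_namespaces is made up for this example):

```shell
# List the namespaces of a process by reading the symlinks in /proc/PID/ns.
# Two processes share a namespace when the numbers in brackets are equal.
show_namespaces() (
    pid=${1:-self}
    for ns in /proc/"$pid"/ns/*; do
        printf '%-20s %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
)

show_namespaces        # namespaces of the current shell
```

Running it in a normal shell and again inside a shell started with unshare should show different numbers for the namespaces that were unshared.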

Using namespaces

Now that we know about namespaces, we want to use them 😉

We can think of a container as one process (e.g. a /bin/bash shell) that has its particular root filesystem, its particular network, its particular hostname, its particular PIDs and users, etc. And this can be achieved by creating all these namespaces and spawning the /bin/bash processes inside of them.

The Linux kernel includes the system calls clone, setns and unshare, which make it easy to manipulate the namespaces of processes. The common Linux distributions also provide the commands unshare and nsenter, which do the same for processes and applications from the command line.

If we get back to the main host, we can use the command unshare to create a process with its own namespaces:

calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

It seems that nothing happened, except that we are “root”, but if we start using commands that manipulate the features in the host, we’ll see what happened.

If we echo the PID of the current process ($$) we can see that it is 1 (the main process), the current user has UID and GID 0 (it is root), we do not have any network device, we can manipulate the hostname…

If we check the processes in the host, in another terminal, we’ll see that even though we are shown as ‘root’ inside, outside the namespace our process runs under the credentials of our regular user:


This is the magic of the PID namespace, which makes one process have different PID numbers depending on the namespace from which it is observed.

Back in our “unshared” environment, if we try to show the processes that are currently running, we’ll get the vision of the processes in the host:

This is because of how Linux works: tools such as ps read the process list from the /proc mount point and, in our environment, we still have access to the existing mount points. But as we have our own mount namespace, we can mount our own /proc filesystem:


From outside the container, we will be able to create a network device and put it into the namespace:


And if we get back to our “unshared” environment, we’ll see that we have a new network device:


The network setup is incomplete, and we cannot reach anything yet (the peer of our eth0 is not connected to any network). This falls out of the scope of this post, but the main idea is that you will need to connect the peer to some bridge, set an IP address for the eth0 inside the unshared environment, set up NAT in the host, etc.
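The host-side steps described above can be sketched with the ip command from iproute2. The interface names veth-host and veth-cont and the PID 12345 are placeholders chosen for this example, and the script defaults to a dry run that only prints the commands, since executing them requires root on the host:

```shell
# Print (DRY_RUN=1, the default here) or execute the commands that create a
# veth pair and move one end into the network namespace of a process.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

attach_veth() {
    target_pid=$1   # PID, as seen from the host, of the unshared shell
    run ip link add veth-host type veth peer name veth-cont
    # Moving the peer into the namespace and renaming it to eth0 in one step.
    run ip link set veth-cont netns "$target_pid" name eth0
    run ip link set veth-host up
}

attach_veth 12345   # 12345 is a placeholder PID
```

With DRY_RUN=0 and root privileges, the unshared environment would see a new eth0 device whose peer (veth-host) stays in the host.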

Obtaining the filesystem of the container

Now that we are in an “isolated” environment, we want to have the filesystem, utilities, etc. from the container that we started. And this can be done with our old friend “chroot” and some mounts:

root:handcontainer# chroot rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

Using chroot, the filesystem changes and we can use all the new mount points, commands, etc. in that filesystem. So now we have the vision of being inside an isolated environment with an isolated filesystem.

Now we have finished setting up a “hand made container” from an existing Docker container.

Further work

Apart from the “contained environment”, Docker containers are also managed inside cgroups. Cgroups make it possible to account for and to limit the resources that the processes are able to use (e.g. CPU, I/O, memory and devices), which is interesting to better control the resources that the processes will be allowed to use (and how).

It is possible to explore the cgroups in the path /sys/fs/cgroup. In that folder you will find the different cgroups that are managed in the system. Dealing with cgroups is a bit “obscure” (creating subfolders, adding PIDs to files, etc.), and will be left for a future post.
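Without creating or modifying anything, it is already possible to see which cgroups a process belongs to, because the kernel reports it in /proc/PID/cgroup. A small read-only sketch (the function name cgroups_of is made up here):

```shell
# Show the cgroup membership of a process, as reported by the kernel.
# With cgroup v2 there is a single line starting with "0::"; with cgroup v1
# there is one line per hierarchy, in the form ID:controllers:path.
cgroups_of() (
    pid=${1:-self}
    cat "/proc/$pid/cgroup"
)

cgroups_of          # cgroups of the current shell
```

The paths printed are relative to the cgroup mount point, so they can be appended to /sys/fs/cgroup (plus the controller name on cgroup v1) to find the corresponding folder.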

Another feature that Docker offers is the layered filesystem. The layered filesystem is used in Docker basically to have a common filesystem and only track the modifications. So there is a set of common layers for different containers (that will not be modified) and each of the containers will have a layer that makes its filesystem unique from the others.

In our case, we used a simple flat filesystem for the container, which we used as the root filesystem for our contained environment. Dealing with layered filesystems will be a new post 😉

And now…

Well, in this post we tried to understand how the containers work and see that it is a relatively simple feature that is offered by the kernel. But it involves a lot of steps to have a properly configured container (remember that we left the cgroups out of this post).

We did these steps just because we could… just to better understand containers.

My advice is to use the existing technologies to be able to use well built containers (e.g. Docker).

Further reading

As in other posts, I wrote this just to arrange my concepts following a very simple step-by-step procedure. But you can find a lot of resources about containers using your favourite search engine. The most useful resources that I found are:

How to (securely) contain users using Docker containers

Docker containers have proved to be very useful to deliver applications. They make it possible to pack all the libraries and dependencies needed by an application, and to run it in any system. One of the main drawbacks argued by Docker competitors is that the Docker daemon runs as root, which may introduce security threats.

I have searched for the security problems of Docker (e.g. sysdig, blackhat conference, CVEs, etc.) and I could only find privilege escalation by running privileged containers (--privileged), files that are written using root permissions, using the communication socket, using block devices, poisoned images, etc. But all of these problems are related to letting the users start their own containers.

So I think that Docker can be used by sysadmins to provide a different or contained environment to the users. E.g. having a CentOS 7 front-end, but letting some users run an Ubuntu 16.04 environment. This is why this time I learned…

How to (securely) contain users using Docker containers


You can find the results of these tests in this repo:

The repository contains DoSH (which stands for Docker SHell), a development that uses Docker containers to run the shell of the users in your Linux system. It is an in-progress project that aims to provide a configurable and secure mechanism so that, when a user logs in to a Linux system, a customized (or standard) container is created for him. This makes it possible to limit the resources that the user is able to use, the applications, etc., but also to provide a custom Linux flavour for each user or group of users (i.e. users that have CentOS 7 will coexist with users that have Ubuntu 16.04 in the same server).

The Docker SHell

In a multi-user system it would be nice to offer a feature like providing different flavours of Linux, depending on the user. Or even including a “jailed” system for some specific users.

This could be achieved in a very easy way. You just need to create a script like the next one

root@onefront00:~# cat > /bin/dosh <<\EOF
#!/bin/sh
docker run --rm -it alpine ash
EOF
root@onefront00:~# chmod +x /bin/dosh
root@onefront00:~# echo "/bin/dosh" >> /etc/shells

And now you can change the shell of one user in /etc/passwd


And you simply have to allow myuser to run docker containers (e.g. in Ubuntu, by adding the user to the “docker” group).

Now, when “myuser” logs in the system, he will be inside a container with the Alpine flavour:


This is a simple solution that enables the user to have a specific linux distribution… but also your specific linux environment with special applications, libraries, etc.

But the user does not have access to his home folder nor to other files that would give him the appearance of being in the real system. So we could just map his home folder (and other folders that we want to have inside the container; e.g. /tmp). A modified version of /bin/dosh is the next one:

username="$(whoami)"
docker run --rm -v /home/$username:/home/$username -v /tmp:/tmp -it alpine ash

But if we log in as myuser, the result is that the user that logs in is… root. And the things that he does are done as root.


We want to run the container as the user and not as root. An updated version of the script is the next:

username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

If myuser now logs in, the container has the permissions of this user


We can double-check it by checking the running processes of the container

The problem now is that the name of the user (and the groups) are not properly resolved inside the container.


This is because the /etc/passwd and the /etc/group files are included in the container, and they do not know about the users or groups in the system. As we want to resemble the system in the container, we can share a readonly copy of /etc/passwd and /etc/group by modifying the /bin/dosh script:

username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

And now the container has the permissions of the user and the username is resolved. So the user can access the resources in the filesystem in the same conditions as if he was accessing the hosting system.


Now we should add the mappings for the folders to which the user has to have permissions to access (e.g. scratch, /opt, etc.).
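As the list of folders grows, the -v flags can be generated instead of typed one by one. A sketch under assumptions: the helper name volume_flags is made up, each folder is mapped to the same path inside the container, /scratch and /opt are example folders, and the final command is only echoed so it can be reviewed before being put in the script:

```shell
# Print one "-v DIR:DIR" pair per folder, keeping the same path inside
# the container. Paths containing spaces are not supported by this sketch.
volume_flags() {
    for dir in "$@"; do
        printf '%s %s:%s ' -v "$dir" "$dir"
    done
}

username=${username:-myuser}   # placeholder user name for the demo
flags=$(volume_flags "/home/$username" /tmp /scratch /opt)

# $flags is left unquoted on purpose so that it splits into separate words.
echo docker run --rm $flags -w "/home/$username" -it alpine ash
```

Adding :ro after a path in the helper would make that mapping read-only, as done for /etc/passwd and /etc/group.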

Using this script as-is, the user will have different environment for each of the different sessions that he starts. That means that the processes will not be shared between different sessions.

But we can create a more elaborate script to start containers using different Docker images depending on the user or on the group to which the user belongs. Or even to create pseudo-persistent containers that start when the user logs in and stop when the user leaves (to allow multiple ttys for the same environment).

An example of this kind of script will be the next one:


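username="$(whoami)"
uid="$(id -u "$username")"
gid="$(id -g "$username")"

# Illustrative defaults and per-user overrides (assumed values; adapt them
# to your own users, images and shells).
CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
    myuser)
        CONTAINERIMAGE="ubuntu:16.04"
        CMD="/bin/bash"
        ;;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
    # The container does not exist yet: create it detached
    docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null || exit 1
elif [ "$RUNNING" = "false" ]; then
    # The container exists but is stopped: start it again
    docker start "$CONTAINERNAME" > /dev/null || exit 1
fi
docker exec -it "$CONTAINERNAME" "$CMD"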

Using this script we start the user containers on demand and their processes are kept between log-ins. Moreover, the log-in will fail in case that the container fails to start.

In the event that the system is powered off, the container will be powered off although its contents are kept for future log-ins (the container will be restarted from the stop state).

The development of Docker SHell continues in this repository:

Security concerns

The main problem of Docker related to security is that the daemon is running as root. So if I am able to run containers, I am able to run something like this:

$ docker run --privileged alpine ash -c 'echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger'

And the host will be powered off as a regular user. Or simply…

$ docker run --rm -v /etc:/etc -it alpine ash
/ # adduser mynewroot -G root
/ # exit

And once you exit the container, you will have a new root user in the physical host.

This happens because the user inside the container is “root” that has UID=0, and it is root because the Docker daemon is root with UID=0.

We could change this behaviour by shifting the user namespace with the flag --userns-remap and the subuids, so that root inside the containers is mapped to an unprivileged UID in the host, but this will also limit the features of Docker for the sysadmin. The first consequence is that the sysadmin will not be able to run Docker containers as root (nor privileged containers). If this is acceptable for your system, this will probably be the best solution for you as it limits the possible security threats.

If you are not experienced with the configuration of Docker or you simply do not want (or do not know how) to use --userns-remap, you can still use DoSH.

Executing the docker commands by non-root users

The actual problem is that the user needs to be allowed to use Docker to spawn the DoSH container, and you do not want to allow the user to run arbitrary docker commands.

We can consider that the usage of Docker is secure if the containers are run under the credentials of regular users, and the devices and other critical resources that are attached to the container are used under these credentials. So users can be allowed to run Docker containers if they are forced to include the flag -u <uid>:<gid> and the rest of the command line is controlled.

The solution is as easy as installing sudo (which is shipped in the default distribution of Ubuntu and is also a standard package in almost any distribution) and allowing users to run, through sudo, only one specific command that executes the docker commands, without allowing them to modify that command.

Once sudo is installed, we can create the file /etc/sudoers.d/dosh

root@onefront00:~# cat > /etc/sudoers.d/dosh <<\EOF
> ALL ALL=NOPASSWD: /bin/shell2docker
> EOF
root@onefront00:~# chmod 440 /etc/sudoers.d/dosh

Now we must move the previous /bin/dosh script to /bin/shell2docker and then we can create the script /bin/dosh with the following content:

root@onefront00:~# mv /bin/dosh /bin/shell2docker
root@onefront00:~# cat > /bin/dosh <<\EOF
#!/bin/sh
sudo /bin/shell2docker
EOF
root@onefront00:~# chmod +x /bin/dosh

And finally, we will remove the user's ability to run Docker containers directly (e.g. in Ubuntu, by removing him from the “docker” group).

If you try to log in as the user, you will notice that now we have the problem that the user that runs the script is “root” and then the container will be run as “root”. But we can modify the script to detect whether it has been run through sudo or as a regular user, and then pick the appropriate username. The updated script will be the next:

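# When called through sudo, SUDO_USER holds the name of the calling user
if [ -n "$SUDO_USER" ]; then username="$SUDO_USER"; else username="$(whoami)"; fi
uid="$(id -u "$username")"
gid="$(id -g "$username")"

# Illustrative defaults and per-user overrides (assumed values; adapt them
# to your own users, images and shells).
CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
    myuser)
        CONTAINERIMAGE="ubuntu:16.04"
        CMD="/bin/bash"
        ;;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
    # The container does not exist yet: create it detached
    docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null || exit 1
elif [ "$RUNNING" = "false" ]; then
    # The container exists but is stopped: start it again
    docker start "$CONTAINERNAME" > /dev/null || exit 1
fi
docker exec -it "$CONTAINERNAME" "$CMD"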

Now any user can execute the command that creates the Docker container as root (using sudo), but the user cannot run arbitrary Docker commands. So all the security is again in the hands of the sysadmin, who must create “secure” containers.

This is an in-progress work that will continue in this repository:





How to create a simple Docker Swarm cluster

I have an old computer cluster whose nodes have no virtualization extensions, so I am trying to use it to run Docker containers. But I do not want to choose by hand which internal node runs each container. So I am using Docker Swarm as if it were a single Docker host: I call the main node to execute the containers, and the swarm decides on which host each container will run. So this time…

I learned how to create a simple Docker Swarm cluster with a single front-end and multiple internal nodes

The official documentation of Docker includes this post that describes how to do it, but although it is very easy, I prefer to describe my specific use case.


  • 1 master node with a public IP and a private IP
  • 3 internal nodes, each with a private IP only

I want to call the master node from another computer to create a container, and leave the master to choose which internal node hosts it.

Preparing the master node

First of all, I will install Docker

$ curl -sSL | sh

Next we need to install consul, a key-value store that acts as the discovery backend. It will run as a container in the front-end (and the internal nodes will use it to synchronize with the master).

$ docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap

Finally I will launch the swarm master

$ docker run -d -p 4000:4000 swarm manage -H :4000 --advertise consul://

(*) remember that consul is installed in the front-end, but you could detach it and install in another node if you want (need) to.

Installing the internal nodes

Again, we should install Docker and export docker through the IP

$ curl -sSL | sh

Once it is running, we need to expose the Docker API through the IP address of the node. The easy way to test it is to launch the daemon using the following options:

$ docker daemon -H tcp:// -H unix:///var/run/docker.sock

Now you should be able to issue commands such as

$ docker -H :2375 info

or even from other hosts

$ docker -H info

The underlying aim is that with swarm you are able to expose the local docker daemon to be used remotely in the swarm.

To make the changes persistent, you should set the parameters in the docker configuration file /etc/default/docker:

DOCKER_OPTS="-H tcp:// -H unix:///var/run/docker.sock"

It seems that Docker version 1.11 has a bug and does not properly use that file (at least in Ubuntu 16.04). So you can modify the file /lib/systemd/system/docker.service and set a new command line to launch the Docker daemon.

ExecStart=/usr/bin/docker daemon -H tcp:// -H unix:///var/run/docker.sock -H fd://

Finally, we have to launch the swarm agent on each node

  • On node
docker run --restart=always -d swarm join --advertise= consul://
  • On node
docker run --restart=always -d swarm join --advertise= consul://
  • On node
docker run --restart=always -d swarm join --advertise= consul://

Next steps: communicating the containers

If you launch new containers as usual (i.e. docker run -it containerimage bash), you will get containers with overlapping IPs. This is because you are using the default network scheme in the individual docker servers.

If you want to have a common network, you need to create an overlay network that spans across the different docker daemons.

But to be able to do that, you need to change the way that the Docker daemons are started. You need a system to coordinate the network, and it can be the same consul that we are already using.

So you have to append the next flags to the command line that starts docker:

 --cluster-advertise eth1:2376 --cluster-store consul://

You can add the parameters to the docker configuration file /etc/default/docker. In the case of the internal nodes, the result will be the next (according to our previous modifications):

DOCKER_OPTS="-H tcp:// -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://"

As stated before, Docker version 1.11 has a bug and does not properly use that file. In the meantime you can modify the file /lib/systemd/system/docker.service and set a new command line to launch the Docker daemon.

ExecStart=/usr/bin/docker daemon -H tcp:// -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://

(*) We are using eth1 because it is the device in which our internal IP address is. You should use the device to which the 10.100.0.x address is assigned.

Now you must restart the docker daemons of ALL the nodes in the swarm.

Once they have been restarted, you can create a new network for the swarm:

$ docker -H network create swarm-network

And then you can use it for the creation of the containers:

$ docker -H run -it --net=swarm-network ubuntu:latest bash

Now the IPs will be given in a coordinated way, and the containers will have several IPs (the IP in the swarm and its IP in the local docker server).

Some more words on this

This post was written in May 2016. Both Docker and Swarm are evolving, so this post may be outdated soon.

Some things that bother me on this installation…

  • While using the overlay network, if you expose one port using the flag -p, the port is exposed in the IP from the internal docker host. I think that you should be able to express in which IP you want to expose the port or use the IP from the main server.
    • I solve this issue by using a development made by me, IPFloater: once I create the container, I get the internal IP in which the port is exposed and I create a redirection in IPFloater, to be able to access the container through a specific IP.
  • Consul fails A LOT. If I leave the swarm running for hours (e.g. 8 hours) consul will probably fail. If I run a command like this: “docker run --rm=true swarm list consul://”, it reports a failure. Then I have to delete the container and create a new one.