How to (securely) contain users using Docker containers

Docker containers have proved to be very useful for delivering applications. They make it possible to pack all the libraries and dependencies needed by an application and to run it on any system. One of the drawbacks most frequently argued by Docker competitors is that the Docker daemon runs as root, which may introduce security threats.

I have searched for the security problems of Docker (e.g. Sysdig analyses, Black Hat conference talks, CVEs, etc.) and I could only find privilege escalation by running privileged containers (--privileged), files written with root permissions, abuse of the communication socket, abuse of block devices, poisoned images, etc. But all of these problems are related to letting users start their own containers.

So I think that Docker can be used by sysadmins to provide a different or contained environment to the users, e.g. having a CentOS 7 front-end but letting some users run an Ubuntu 16.04 environment. This is why this time I learned…

How to (securely) contain users using Docker containers

TL;DR

You can find the results of these tests in this repository: https://github.com/grycap/dosh

The repository contains DoSH (which stands for Docker SHell), a development that uses Docker containers to run the shell of the users in your Linux system. It is a work-in-progress project that aims at providing a configurable and secure mechanism so that, when a user logs into a Linux system, a customized (or standard) container is created for him. This makes it possible to limit the resources that the user is able to use, the applications available, etc., but also to provide a custom Linux flavour for each user or group of users (i.e. users on CentOS 7 and users on Ubuntu 16.04 will coexist in the same server).

The Docker SHell

In a multi-user system it would be nice to offer a feature like providing different flavours of Linux depending on the user, or even a “jailed” system for some specific users.

This can be achieved in a very easy way. You just need to create a script like the following one:

root@onefront00:~# cat > /bin/dosh <<\EOF
docker run --rm -it alpine ash
EOF
root@onefront00:~# chmod +x /bin/dosh 
root@onefront00:~# echo "/bin/dosh" >> /etc/shells

And now you can change the shell of one user in /etc/passwd:

myuser:x:9870:9870::/home/myuser:/bin/dosh

And you simply have to allow myuser to run docker containers (e.g. in Ubuntu, by adding the user to the “docker” group).
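For instance, in Ubuntu this can be done with a command like the following (assuming the “docker” group created by the Docker package):

root@onefront00:~# usermod -aG docker myuser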

Now, when “myuser” logs into the system, he will be inside a container running the Alpine flavour:

[Image: alpine-dockershell-1]

This is a simple solution that enables the user to have a specific Linux distribution… but also your specific Linux environment with special applications, libraries, etc.

But the user has no access to his home folder nor to other files that would be interesting to give him the appearance of being in the real system. So we could just map his home folder (and other folders that we want to have inside the container, e.g. /tmp). A modified version of /bin/dosh would be the following:

#!/bin/bash
username="$(whoami)"
docker run --rm -v /home/$username:/home/$username -v /tmp:/tmp -it alpine ash

But if we log in as myuser, the result is that the user inside the container is… root. And everything that he does is done as root.

[Image: alpine-dockershell-2]

We need to run the container as the user and not as root. An updated version of the script is the following:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

If myuser now logs in, the container has the permissions of this user:

[Image: alpine-dockershell-3]

We can double-check it by inspecting the running processes of the container.
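For instance, from another terminal on the host, something like this (just a sketch, assuming the DoSH container is the most recently started one) should show the shell process running under the user's UID instead of root:

root@onefront00:~# docker top $(docker ps -lq)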

The problem now is that the user name (and the group names) are not properly resolved inside the container.

[Image: alpine-dockershell-6]

This is because the /etc/passwd and /etc/group files inside the container are the ones from the image, and they do not know anything about the users or groups of the host system. As we want the container to resemble the host system, we can share a read-only copy of /etc/passwd and /etc/group by modifying the /bin/dosh script:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

And now the container has the permissions of the user and the username is resolved. So the user can access the resources in the filesystem under the same conditions as if he were accessing the host system.

[Image: alpine-dockershell-7]

Now we should add mappings for the other folders that the user needs to access (e.g. scratch, /opt, etc.).
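For instance (the exact folders and mount options are illustrative assumptions that depend on your system), the volume list could be extended like this:

# /scratch and the read-only /opt mount are just examples
docker run --rm -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -v /scratch:/scratch -v /opt:/opt:ro -w /home/$username -it alpine ash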

Using this script as-is, the user will have a different environment for each of the sessions that he starts. That means that the processes will not be shared between different sessions.

But we can create a more elaborate script that starts containers using different Docker images depending on the user or on the group to which the user belongs, or even create pseudo-persistent containers that start when the user logs in and stop when the user leaves (to allow multiple TTYs for the same environment).

An example of this kind of script would be the following:

#!/bin/bash

username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
 myuser)
 CONTAINERIMAGE="ubuntu:16.04"
 CMD="/bin/bash";;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
 docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
else
 if [ "$RUNNING" == "false" ]; then
 docker start "$CONTAINERNAME" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
 fi
fi
docker exec -it "$CONTAINERNAME" "$CMD"

Using this script we start the user containers on demand and their processes are kept between log-ins. Moreover, the log-in will fail in case the container fails to start.

In the event that the system is powered off, the container will be stopped, although its contents are kept for future log-ins (the container will be restarted from its stopped state).

The development of Docker SHell continues in this repository: https://github.com/grycap/dosh

Security concerns

The main problem of Docker related to security is that the daemon runs as root. So if I am able to run containers, I am able to run something like this:

$ docker run --privileged alpine ash -c 'echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger'

And the host will be powered off, even though I am a regular user. Or simply…

$ docker run --rm -v /etc:/etc -it alpine ash
/ # adduser mynewroot -G root
...
/ # exit

And once you exit the container, you will have a new root user in the physical host.

This happens because the user inside the container is “root”, with UID=0, and it maps to the real root of the host because the Docker daemon runs as root with UID=0.

We could change this behaviour by shifting the user namespace with the flag --userns-remap and the subuids, so that root inside the containers does not map to UID=0 on the host, but this will also limit the features of Docker for the sysadmin. The first consequence is that the sysadmin will not be able to run Docker containers as root (nor privileged containers). If this is acceptable for your system, this will probably be the best solution for you, as it limits the possible security threats.
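For reference, a minimal sketch of enabling it (assuming you let Docker create the default “dockremap” user and its subordinate ID ranges in /etc/subuid and /etc/subgid) could be:

# assumes the default "dockremap" remapping user
cat > /etc/docker/daemon.json <<\EOF
{
  "userns-remap": "default"
}
EOF
systemctl restart docker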

If you are not experienced with the configuration of Docker or you simply do not want (or do not know how) to use --userns-remap, you can still use DoSH.

Executing the docker commands by non-root users

The actual problem is that the user needs to be allowed to use Docker to spawn the DoSH container, but you do not want to allow the user to run arbitrary docker commands.

We can consider that the usage of Docker is secure if the containers are run under the credentials of regular users, and the devices and other critical resources that are attached to the container are used under these credentials. So users can be allowed to run Docker containers if they are forced to include the flag -u <uid>:<gid> and the rest of the command line is controlled.

The solution is as easy as installing sudo (which is shipped in the default distribution of Ubuntu and is a standard package in almost any distribution) and allowing users to run, via sudo, only a specific command that executes the docker commands, without allowing these users to modify that command.

Once sudo is installed, we can create the file /etc/sudoers.d/dosh:

root@onefront00:~# cat > /etc/sudoers.d/dosh <<\EOF
> ALL ALL=NOPASSWD: /bin/shell2docker
> EOF
root@onefront00:~# chmod 440 /etc/sudoers.d/dosh

Now we must move the previous /bin/dosh script to /bin/shell2docker and then we can create the script /bin/dosh with the following content:

root@onefront00:~# mv /bin/dosh /bin/shell2docker
root@onefront00:~# cat > /bin/dosh <<\EOF
#!/bin/bash
sudo /bin/shell2docker
EOF
root@onefront00:~# chmod +x /bin/dosh

And finally, we will remove the ability to run docker containers from the user (e.g. in Ubuntu, by removing him from the “docker” group).
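In Ubuntu, this could be done with a command like the following (alternatively, deluser myuser docker does the same):

root@onefront00:~# gpasswd -d myuser docker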

If you try to log in as the user, you will notice that now we have the problem that the user that runs the script is “root”, and then the container will be run as “root”. But we can modify the script to detect whether it has been run through sudo or as a regular user, and then grab the appropriate username. The updated script would be the following:

#!/bin/bash
if [ $SUDO_USER ]; then username=$SUDO_USER; else username="$(whoami)"; fi
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
 myuser)
 CONTAINERIMAGE="ubuntu:16.04"
 CMD="/bin/bash";;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
 docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
else
 if [ "$RUNNING" == "false" ]; then
 docker start "$CONTAINERNAME" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
 fi
fi
docker exec -it "$CONTAINERNAME" "$CMD"

Now any user can execute the command that creates the Docker container as root (using sudo), but the user cannot run arbitrary Docker commands. So all the security relies again on the side of the sysadmin, who must create “secure” containers.

This is an in-progress work that will continue in this repository: https://github.com/grycap/dosh

How to install a cluster with NIS and NFS in Ubuntu 16.04

I am used to creating computing clusters. A cluster consists of a set of computers that work together to solve one task. In a cluster you usually have an interface to access the cluster, a network that interconnects the nodes, and a set of tools to manage the cluster. The interface to access the cluster is usually a node named the front-end, to which the users can SSH. The other nodes are usually named the working nodes (WN). Another common component is a shared filesystem to ease simple communication between the WN.

A very common set-up is to install a NIS server in the front-end so that the users can access the WN (i.e. using SSH) with the same credentials as in the front-end. NIS is still useful because it is very simple and it integrates very well with NFS, which is commonly used to share a filesystem.

It was easy to install all of this, but it is also a bit tricky (especially NIS), and so this time I had to re-learn…

How to install a cluster with NIS and NFS in Ubuntu 16.04

We start from 3 nodes that have a fresh installation of Ubuntu 16.04. These nodes are in the network 10.0.0.0/24. Their names are hpcmd00 (10.0.0.35), hpcmd01 (10.0.0.36) and hpcmd02 (10.0.0.37). In this example, hpcmd00 will be the front-end node and the others will act as working nodes.

Preparation

First of all, we update Ubuntu on all the nodes:

root@hpcmd00:~# apt-get update && apt-get -y dist-upgrade

Installing and configuring NIS

Install NIS in the Server

Now that the system is up to date, we install the NIS server on hpcmd00. It is very simple:

root@hpcmd00:~# apt-get install -y rpcbind nis

During the installation, we will be asked for the name of the domain (as in the next picture):

[Image: nis]

We have selected the name hpcmd.nis for our domain. It will be stored in the file /etc/defaultdomain. Anyway, we can change the name of the domain at any time by executing the following command:

root@hpcmd00:~# dpkg-reconfigure nis

And we will be prompted again for the name of the domain.

Now we need to adjust some parameters of the NIS server, which consists in editing the files /etc/default/nis and /etc/ypserv.securenets. In the first file we have to set the variable NISSERVER to the value “master”. In the second file (ypserv.securenets) we set which IP addresses are allowed to access the NIS service. In our case, we allow all the nodes in the subnet 10.0.0.0/24.

root@hpcmd00:~# sed -i 's/NISSERVER=.*$/NISSERVER=master/' /etc/default/nis
root@hpcmd00:~# sed 's/^\(0.0.0.0[\t ].*\)$/#\1/' -i /etc/ypserv.securenets
root@hpcmd00:~# echo "255.255.255.0 10.0.0.0" >> /etc/ypserv.securenets

Now we include the name of the server in the /etc/hosts file, so that the server is able to resolve its own IP address, and then we initialize the NIS service. As we have only one master server, we just include its name and let the initialization proceed.

root@hpcmd00:~# echo "10.0.0.35 hpcmd00" >> /etc/hosts
root@hpcmd00:~# /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. hpcmd00 is in the list of NIS server hosts. Please continue to add
the names for the other hosts, one per line. When you are done with the
list, type a <control D>.
 next host to add: hpcmd00
 next host to add: 
The current list of NIS servers looks like this:
hpcmd00
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/hpcmd.nis/ypservers...
Running /var/yp/Makefile...
make[1]: Entering directory '/var/yp/hpcmd.nis'
Updating passwd.byname...
...
Updating shadow.byname...
make[1]: Leaving directory '/var/yp/hpcmd.nis'

hpcmd00 has been set up as a NIS master server.

Now you can run ypinit -s hpcmd00 on all slave server.

Finally, we export the users of our system by issuing the following command:

root@hpcmd00:~# make -C /var/yp/

Take into account that every time you create a new user in the front-end, you need to export the users by issuing the make -C /var/yp command. So it is advisable to create a cron task that runs that command, to make sure that the users are exported.

root@hpcmd00:~# cat > /etc/cron.hourly/ypexport <<\EOT
#!/bin/sh
make -C /var/yp
EOT
root@hpcmd00:~# chmod +x /etc/cron.hourly/ypexport

The users in NIS

When issuing the make… command, you are exporting the users that have an identifier (UID) of 1000 or above. If you want to change this, you can adjust the parameters in the file /var/yp/Makefile.

In particular, you can change the variables MINUID and MINGID to match your needs.

In the default configuration, the users with UID 1000 and above are exported because 1000 is the UID of the first regular user created in the system.
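For instance, if your users started at UID 500, a quick (hypothetical) adjustment and re-export would be:

# hypothetical: export users/groups starting at 500 instead of 1000
root@hpcmd00:~# sed -i 's/^MINUID=.*/MINUID=500/;s/^MINGID=.*/MINGID=500/' /var/yp/Makefile
root@hpcmd00:~# make -C /var/yp/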

Install the NIS clients

Now that we have installed the NIS server, we can proceed to install the NIS clients. In this example we are installing hpcmd01, but the procedure is the same for all the nodes.

First install NIS using the next command:

root@hpcmd01:~# apt-get install -y rpcbind nis

As happened in the server, you will be prompted for the name of the domain. In our case it is hpcmd.nis, because that is the name we set in the server. Then we point the client to the NIS server and enable NIS lookups:

root@hpcmd01:~# echo "domain hpcmd.nis server hpcmd00" >> /etc/yp.conf 
root@hpcmd01:~# sed -i 's/compat$/compat nis/g;s/dns$/dns nis/g' /etc/nsswitch.conf 
root@hpcmd01:~# systemctl restart nis

Fix the rpcbind bug in Ubuntu 16.04

At this point the NIS services (both in the server and the clients) are ready to be used, but… WARNING: the rpcbind package needed by NIS has a bug in Ubuntu, and when you reboot any of your systems, rpcbind is dead and so NIS will not work. You can check it by issuing the following command:

root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: inactive (dead)

Here you can see that it is inactive. If you start it by hand, it will run properly:

root@hpcmd00:~# systemctl start rpcbind
root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: active (running) since Fri 2017-05-12 12:57:00 CEST; 1s ago
 Main PID: 1212 (rpcbind)
 Tasks: 1
 Memory: 684.0K
 CPU: 8ms
 CGroup: /system.slice/rpcbind.service
 └─1212 /sbin/rpcbind -f -w
May 12 12:57:00 hpcmd00 systemd[1]: Starting RPC bind portmap service...
May 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/rpcbind.xdr: failed
May 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/portmap.xdr: failed
May 12 12:57:00 hpcmd00 systemd[1]: Started RPC bind portmap service.

There are some patches, and it seems that it will be solved in newer versions. But for now, we include a very simple workaround that consists in adding the following lines to the file /etc/rc.local, just before the “exit 0” line:

systemctl restart rpcbind
systemctl restart nis

Now, if you reboot your system, the rpcbind service will be properly running.

WARNING: this needs to be done on all the nodes.
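A quick way to apply the workaround on every node (assuming the default /etc/rc.local that ends with an “exit 0” line) would be:

# assumes /etc/rc.local still contains the default "exit 0" line
root@hpcmd01:~# sed -i '/^exit 0/i systemctl restart rpcbind' /etc/rc.local
root@hpcmd01:~# sed -i '/^exit 0/i systemctl restart nis' /etc/rc.local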

Installing and configuring NFS

We are configuring NFS in a very straightforward way. If you need more security or other features, you should dig into the NFS configuration options to adapt it to your deployment.

In particular, we are sharing the /home folder of hpcmd00 to make it available to the WN. This way, the users will have their files available on each node. I followed the instructions at this blog post.

Sharing /home at front-end

In order to install NFS in the server, you just need to issue the following command:

root@hpcmd00:~# apt-get install -y nfs-kernel-server

And to share the /home folder, you just need to add a line to the /etc/exports file:

root@hpcmd00:~# cat >> /etc/exports << \EOF
/home hpcmd*(rw,sync,no_root_squash,no_subtree_check)
EOF

There are a lot of options to share a folder using NFS, but we are just using some of them that are common for a /home folder. Take into account that you can restrict the hosts to which you share the folder using their names (as in our case: hpcmd*) or using IP addresses. Note that you can use wildcards such as “*”.
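For instance (just an illustration, not part of this setup), an equivalent export restricted to the cluster subnet instead of host names would be:

/home 10.0.0.0/24(rw,sync,no_root_squash,no_subtree_check)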

Finally you need to restart the NFS daemon, and you will be able to verify that the exports are ready.

root@hpcmd00:~# service nfs-kernel-server restart
root@hpcmd00:~# showmount -e localhost
Export list for localhost:
/home hpcmd*

Mount the /home folder in the WN

In order to be able to use NFS endpoints, you just need to run the next command on each node:

root@hpcmd01:~# apt-get install -y nfs-common

Now you will be able to list the folders shared by the server:

root@hpcmd01:~# showmount -e hpcmd00
Export list for hpcmd00:
/home hpcmd*

At this moment it is possible to mount the /home folder by just issuing a command like:

root@hpcmd01:~# mount -t nfs hpcmd00:/home /home

But we’d prefer to add a line to the /etc/fstab file. Using this approach, the mount will be available at boot time. In order to do it, we’ll add the proper line:

root@hpcmd01:~# cat >> /etc/fstab << \EOT
hpcmd00:/home /home nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
EOT

Now you can also issue the following command to start using your share without the need to reboot:

root@hpcmd01:~# mount /home/

Verification

At the hpcmd00 node you can create a user, and verify that the home folder has been created:

root@hpcmd00:~# adduser testuser
Adding user `testuser' ...
Adding new group `testuser' (1002) ...
Adding new user `testuser' (1002) with group `testuser' ...
...
Is the information correct? [Y/n] Y
root@hpcmd00:~# ls -l /home/
total 4
drwxr-xr-x 2 testuser testuser 4096 May 15 10:06 testuser

If you SSH to the internal nodes, it will fail (the user is not available), because the user has not been exported yet:

root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Permission denied, please try again.

But the home folder for that user is already available in these nodes (because the folder is shared using NFS).

Once we export the users at hpcmd00, the user will be available in the domain and we will be able to SSH to the WN as that user:

root@hpcmd00:~# make -C /var/yp/
make: Entering directory '/var/yp'
make[1]: Entering directory '/var/yp/hpcmd.nis'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating netid.byname...
make[1]: Leaving directory '/var/yp/hpcmd.nis'
make: Leaving directory '/var/yp'
root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-77-generic x86_64)

testuser@hpcmd01:~$ pwd
/home/testuser


How to avoid the automatic installation of the recommended packages in Ubuntu

I am used to installing Ubuntu servers, and I want them to use as little disk as possible. So I usually install packages adding the flag --no-install-recommends. But I needed to include the flag every time. So this time I learned…

How to avoid the automatic installation of the recommended packages in Ubuntu

This is a very simple trick that I found in this post, but I want to keep it here so that it is easy for me to find.

You need to include some settings in a file in /etc/apt/apt.conf.d/. In order to isolate these settings, I will create a new file:

$ cat > /etc/apt/apt.conf.d/99_disablerecommends <<\EOF
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
EOF

And from now on, when you issue apt-get install commands, the recommended packages will not be installed.
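You can verify that the settings are picked up with apt-config, which should report the values we just set:

$ apt-config dump | grep -i recommends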

IMPORTANT: now that you have already installed all those recommended packages, you can get rid of them by just issuing a command like the following one:

$ apt-get autoremove --purge

How to connect complex networking infrastructures with Open vSwitch and LXC containers

Some days ago, I learned How to create an overlay network using Open vSwitch in order to connect LXC containers. In order to extend the features of the set-up that I made there, I wanted to introduce some services (a DHCP server, a router, etc.) to create a more complex infrastructure. And so this time I learned…

How to connect complex networking infrastructures with Open vSwitch and LXC containers

My setup is based on the previous one, to introduce common services for networked environments. In particular, I am going to create a router and a DHCP server. So I will have two nodes that will host LXC containers and they will have the following features:

  • Any container in any node will get an IP address from the single DHCP server.
  • Any container will have access to the internet through the single router.
  • The containers will be able to connect to each other using their private IP addresses.

We had the set-up shown in the next figure:

[Image: ovs]

And now we want to get to the following set-up:

[Image: ovs2]

Well… we are not doing anything new, because we have worked with this before in How to create a multi-LXC infrastructure using custom NAT and DHCP server. But we can see this post as an integration post.

Update of the previous setup

On each of the nodes we have to create the bridge br-cont0 and the containers that we want. Moreover, we have to create the virtual switch ovsbr0 and connect it to the other node.

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-br ovsbr0
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

Warning: we are not starting the containers yet, because we want them to get their IP address from our DHCP server.

Preparing a bridge to the outside world (NAT bridge)

We need a bridge that will act as the gateway to the external world for the router of our LAN. This is because we only have two known IP addresses (the one of ovsnode01 and the one of ovsnode02). So we’ll provide access to the Internet through one of them (according to the figure, it will be ovsnode01).

So we will create the bridge and will give it a local IP address:

ovsnode01:~# brctl addbr br-out
ovsnode01:~# ip addr add dev br-out 10.0.1.1/24

And now we will provide access to the containers that connect to that bridge through NAT. So let’s create the following script and execute it:

ovsnode01:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth0
IFACE_LAN=br-out
NETWORK_LAN=10.0.1.0/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
ovsnode01:~# chmod +x enable_nat.sh
ovsnode01:~# ./enable_nat.sh

And that’s all. Now ovsnode01 will act as a router for IP addresses in the range 10.0.1.0/24.
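You can verify that IP forwarding and the NAT rule are in place with something like:

ovsnode01:~# cat /proc/sys/net/ipv4/ip_forward
ovsnode01:~# iptables -t nat -L POSTROUTING -n -v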

DHCP server

Creating a DHCP server is as easy as creating a new container, installing dnsmasq and configuring it.

ovsnode01:~# cat > ./nat-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-out 
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f nat-network.tmpl -n dhcpserver -t ubuntu
ovsnode01:~# lxc-start -dn dhcpserver
ovsnode01:~# lxc-attach -n dhcpserver -- bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip addr add 10.0.1.2/24 dev eth0
route add default gw 10.0.1.1'
ovsnode01:~# lxc-attach -n dhcpserver

WARNING: we created the container attached to br-out because we want it to have Internet access, to be able to install dnsmasq. Moreover, we needed to give it an IP address and set the nameserver to the one from Google. Once the dhcpserver is configured, we’ll change the configuration to attach it to br-cont0, because the dhcpserver only needs access to the internal network.

Now we have to install dnsmasq:

apt-get update
apt-get install -y dnsmasq

Now we’ll configure the static network interface (172.16.1.202), by modifying the file /etc/network/interfaces:

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 172.16.1.202
netmask 255.255.255.0
EOF

And finally, we’ll configure dnsmasq:

cat > /etc/dnsmasq.conf << EOF
interface=eth0
except-interface=lo
listen-address=172.16.1.202
bind-interfaces
dhcp-range=172.16.1.1,172.16.1.200,1h
dhcp-option=26,1400
dhcp-option=option:router,172.16.1.201
EOF

In this configuration we have created our range of IP addresses (from 172.16.1.1 to 172.16.1.200). We have stated that our router will have the IP address 172.16.1.201, and one important thing: we have set the MTU to 1400 (remember that when using OVS we had to set the MTU to a lower size).

Now we are ready to connect the container to br-cont0. In order to do it, we have to modify the file /var/lib/lxc/dhcpserver/config. In particular, we have to change the value of the attribute lxc.network.link from br-out to br-cont0. Once I modified it, my network configuration in that file is as follows:

# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br-cont0
lxc.network.hwaddr = 00:16:3e:9f:ae:3f

Finally we can restart our container:

ovsnode01:~# lxc-stop -n dhcpserver 
ovsnode01:~# lxc-start -dn dhcpserver

And we can check that our server gets the proper IP address:

root@ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 
dhcpserver RUNNING 0 - 172.16.1.202 -

We could also check that it is connected to the bridge:

ovsnode01:~# ip addr
...
83: vethGUV3HB: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-cont0 state UP group default qlen 1000
 link/ether fe:86:05:f6:f4:55 brd ff:ff:ff:ff:ff:ff
ovsnode01:~# brctl show br-cont0
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe3b968e0937 no vethGUV3HB
ovsnode01:~# lxc-attach -n dhcpserver -- ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
82: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
 link/ether 00:16:3e:9f:ae:3f brd ff:ff:ff:ff:ff:ff
ovsnode01:~# ethtool -S vethGUV3HB
NIC statistics:
 peer_ifindex: 82

Do not worry if you do not understand all of this… it is quite an advanced topic for this post. The important thing is that the bridge br-cont0 has the device vethGUV3HB, whose interface index is 83 and whose peer interface is index 82, which is, in fact, the eth0 device inside the container.

Installing the router

Now that we have our dhcpserver ready, we are going to create a container that will act as a router for our network. It is very easy (in fact, we have already created a router). And… this fact raises a question: why are we creating another router?

We create a new router because it has to have an IP address inside the private network and another interface in the network to which we want to provide access from the internal network.

Once we have this issue clear, let’s create the router, which has an IP in the internal network bridge (br-cont0):

ovsnode01:~# cat > ./router-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
lxc.network.type = veth
lxc.network.link = br-out
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -t ubuntu -f router-network.tmpl -n router

WARNING: I don’t know why, but for some reason sometimes lxc 2.0.3 fails in Ubuntu 14.04 when starting containers if they are created using two NICs.

Now we can start the container and start to work with it:

ovsnode01:~# lxc-start -dn router
ovsnode01:~# lxc-attach -n router

Now we simply have to configure the IP addresses of the router (eth0 is the interface in the internal network, bridged to br-cont0, and eth1 is bridged to br-out):

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 172.16.1.201
netmask 255.255.255.0

auto eth1
iface eth1 inet static
address 10.0.1.2
netmask 255.255.255.0
gateway 10.0.1.1
EOF

And finally we configure NAT in the router by using a script which is similar to the previous one:

router:~# apt-get install -y iptables
router:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth1
IFACE_LAN=eth0
NETWORK_LAN=172.16.1.201/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
router:~# chmod +x enable_nat.sh
router:~# ./enable_nat.sh

Now we have our router ready to be used.

Starting the containers

Now we can simply start the containers that we created before, and we can check that they get an IP address by DHCP:

ovsnode01:~# lxc-start -n node01c01
ovsnode01:~# lxc-start -n node01c02
ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
dhcpserver RUNNING 0 - 172.16.1.202 -
node01c01 RUNNING 0 - 172.16.1.39 -
node01c02 RUNNING 0 - 172.16.1.90 -
router RUNNING 0 - 10.0.1.2, 172.16.1.201 -

And we can also check all the hops in our network, to verify that it is properly configured:

ovsnode01:~# lxc-attach -n node01c01 -- apt-get install traceroute
(...)
ovsnode01:~# lxc-attach -n node01c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.085 ms 0.040 ms 0.041 ms
 2 10.0.1.1 0.079 ms 0.144 ms 0.067 ms
 3 10.10.2.201 0.423 ms 0.517 ms 0.514 ms
...
12 216.58.210.131 8.247 ms 8.096 ms 8.195 ms

Now we can go to the other host and create the bridges, the virtual switch and the containers, as we did in the previous post.

WARNING: as a reminder, I leave this snippet of code here:

ovsnode02:~# brctl addbr br-cont0
ovsnode02:~# ip link set dev br-cont0 up
ovsnode02:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node02c01 -t ubuntu
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node02c02 -t ubuntu
ovsnode02:~# apt-get install openvswitch-switch
ovsnode02:~# ovs-vsctl add-br ovsbr0
ovsnode02:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode02:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

And finally, we can start the containers and check that they get IP addresses from the DHCP server, and that they have connectivity to the internet using the routers that we have created:

ovsnode02:~# lxc-start -n node02c01
ovsnode02:~# lxc-start -n node02c02
ovsnode02:~# lxc-ls -f
NAME STATE IPV4 IPV6 AUTOSTART
-------------------------------------------------
node02c01 RUNNING 172.16.1.50 - NO
node02c02 RUNNING 172.16.1.133 - NO
ovsnode02:~# lxc-attach -n node02c01 -- apt-get install traceroute
(...)
ovsnode02:~# lxc-attach -n node02c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.904 ms 0.722 ms 0.679 ms
 2 10.0.1.1 0.853 ms 0.759 ms 0.918 ms
 3 10.10.2.201 1.774 ms 1.496 ms 1.603 ms
...
12 216.58.210.131 8.849 ms 8.773 ms 9.062 ms

What is next?

Well, you’d probably want to persist the settings. Maybe you can set the iptables rules (i.e. the enable_nat.sh script) as a start script in /etc/init.d.
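For instance, a quick (and admittedly crude) sketch, assuming the script was left at /root/enable_nat.sh, would be to call it from /etc/rc.local before the “exit 0” line:

# assumes the script was saved as /root/enable_nat.sh
ovsnode01:~# sed -i '/^exit 0/i /root/enable_nat.sh' /etc/rc.local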

As further work, you can try VLAN tagging in OVS and so on, to duplicate the networks using the same components, but isolating the different networks.

You can also try to include new services (e.g. a private DNS server, a reverse NAT, etc.).

How to Recover Partitions from LVM Volume

Yesterday I had a problem with a disk… while trying to increase the size of an LVM volume, I lost the disk. What I did was: add the new disk to the LVM volume, mirror the volume and remove the original disk. But the problem is that I added the disk back again and things started to go wrong. The volume did not boot, etc.

The original system was a Scientific Linux 6.7 and it had different partitions: one /boot ext4 partition and an LVM volume in which we had 2 logical volumes: lv-root and lv-swap.

At the end of the LVM problem I had the original volume with bad LVM metadata that did not allow me to use the original information. Luckily, I had not written any other information to the disks… so the information had to still be there.

Once the disaster was there…

I learned how to recover the partitions from an LVM volume.

I had to recover the partitions and create a new disk with them.

Recovering partitions

After some failed tests, I got to the situation in which I had /dev/sda with a single GPT partition on it. I remembered TestDisk, which is a tool that helps with forensics, and I started to play with it.

[Image]

The first thing that I did was to try to figure out what I could do with my partition. So I started my system with an Ubuntu 14.04 desktop LiveCD (with /dev/sda attached), downloaded TestDisk and tried:

$ ./testdisk_static /dev/sda

Then I used the GPT option and analysed the disk. On the disk I found two partitions: my boot partition and an LVM partition. I wrote the partition table and got back to the initial page, where I entered the advanced options to dump the partition (Image Creation).

[Image]

Then I had the boot partition dumped, and it could be used as a raw partition image, as if it had been dumped with dd.

Then I exited from the app and started it again, because the LVM partition was now accessible as /dev/sda2. Now I tried:

$ ./testdisk_static /dev/sda2

Now I selected the disk and chose the Intel partition option. TestDisk found the two partitions: the Linux one and the Linux Swap one.

[Image]

And now I dumped the Linux partition.

Disclaimer

This is my specific organization for the disk, but the key is that TestDisk helped me to figure out where the partitions were and to get their raw images.

Creating the new disk

Now I have the partition images: image.dd (for the boot partition) and sda1.dd, and I have to create a new disk. So I booted the Ubuntu desktop again, with a new disk (/dev/sdb).

The first thing is to get the size of the partitions and we will use the fdisk utility on the dumped files:

$ fdisk -l image.dd
$ fdisk -l sda1.dd

Using these commands I get the size in sectors of each image. Using those sizes I can create the partitions in /dev/sdb. In my case, I will create one partition for boot and another for the filesystem. My options were the following (please pay attention to the images to check that the sector counts of the original partitions match the sizes of the NEW partitions and so on).

$ fdisk /dev/sdb
n
<enter>
<enter>
<enter>
+1024000
n
<enter>
<enter>
<enter>
+15810560
w

[Image]

The key is to pay attention to the size (in sectors) of the partitions obtained with the fdisk -l command issued before. If you have any doubt, please check the images.

And now you are ready to dump the images to the disk:

$ dd if=image.dd of=/dev/sdb1
...
$ dd if=sda1.dd of=/dev/sdb2
...

Check this cool tip!

The dd process takes a long time. If you want to see the progress, you can open another command line and issue the command:

$ killall -USR1 dd

The dd command will print the amount of data that it has dumped in its console. If you want to see the progress periodically, you can issue a command like this one:

$ watch -n 10 killall -USR1 dd

This will make dd output the amount dumped every 10 seconds.
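Note that, if your dd comes from a recent GNU coreutils (8.24 or later), you can also ask it to report progress by itself:

$ dd if=sda1.dd of=/dev/sdb2 status=progress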

More on this

Once I had the partitions dumped, I used gparted to resize the second partition (as it was almost full). My disk was far bigger than the original, but if you only want to get the data or you have free space, this probably won’t be useful for you (so I am skipping it).

How to set the hostname from DHCP in Ubuntu 14.04

I have a lot of virtual servers, and I like preparing a “golden image” and instantiating it many times. One of the steps is to set the hostname of the host from my private DHCP server. It usually works, but sometimes it fails and I didn’t know why. So I got tired of such indeterminism and this time…

I learned how to set the hostname from DHCP in Ubuntu 14.04

Well, I have a dnsmasq server that acts both as a DNS server and as a DHCP server. I investigated how a host gets its hostname from DHCP, and it seems that it happens when the DHCP server sends it by default (the hostname option).

I debugged the DHCP messages using this command from another server:

# tcpdump -i eth1 -vvv port 67 or port 68

And I saw that my DHCP server was properly sending the hostname (dnsmasq sends it by default), so I realized that the problem was in the dhclient script. I googled a lot and I found some clever scripts that got the name from a nameserver and were started as hooks from the dhclient script. But if the DHCP protocol sets the hostname, why do I have to create another script to set the hostname?

Finally I found this blog entry and I realized that it was a bug in the dhclient script: if there exists an old DHCP entry in /var/lib/dhcp/dhclient.eth0.leases (or eth1, etc.), it does not set the hostname.

At this point you have two options:

  • The easy one: include a line in /etc/rc.local that removes that file (see the example below).
  • The clever one (suggested in the blog entry): include a hook script that unsets the hostname:
echo unset old_host_name >/etc/dhcp/dhclient-enter-hooks.d/unset_old_hostname
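For the easy option, the line to add to /etc/rc.local (before the “exit 0” line) could simply be something like this, assuming the leases are stored under /var/lib/dhcp:

rm -f /var/lib/dhcp/dhclient.*.leases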

How to create a simple Docker Swarm cluster

I have an old computer cluster whose nodes do not have any virtualization extensions, so I’m trying to use it to run Docker containers. But I do not want to choose on which of the internal nodes each container has to run. So I am using Docker Swarm, and I will use the cluster as a single Docker host, by calling the main node to execute the containers and letting the swarm decide the host in which the container will be run. So this time…

I learned how to create a simple Docker Swarm cluster with a single front-end and multiple internal nodes

The official documentation of Docker includes this post that describes how to do it, but although it is very easy, I prefer to describe my specific use case.

Scenario

  • 1 Master node with the public IP 111.22.33.44 and the private IP 10.100.0.1.
  • 3 Nodes with the private IPs 10.100.0.2, 10.100.0.3 and 10.100.0.4

I want to ask the master node to create a container from another computer (e.g. 111.22.33.55), and let the master choose the internal node in which the container is hosted.

Preparing the master node

First of all, I will install Docker:

$ curl -sSL https://get.docker.com/ | sh

Now we need to install Consul, which is a backend for key-value storage. It will run as a container in the front-end (and it will be used by the internal nodes to synchronize with the master):

$ docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap

Finally I will launch the Swarm master:

$ docker run -d -p 4000:4000 swarm manage -H :4000 --advertise 10.100.0.1:4000 consul://10.100.0.1:8500

(*) Remember that Consul is installed in the front-end, but you could detach it and install it on another node if you want (or need) to.

Installing the internal nodes

Again, we should install Docker and expose it through the node's IP:

$ curl -sSL https://get.docker.com/ | sh

Once it is running, you need to expose the Docker API through the IP address of the node. The easy way to test it is to launch the daemon using the following options:

$ docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock

Now you should be able to issue commands such as:

$ docker -H :2375 info

or even from other hosts:

$ docker -H 10.100.0.2:2375 info

The underlying idea is that, with Swarm, you expose the local Docker daemon so that it can be used remotely by the swarm.

To make the changes persistent, you should set the parameters in the docker configuration file /etc/default/docker:

DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"

It seems that Docker version 1.11 has a bug and does not properly use that file (at least in Ubuntu 16.04). So you can modify the file /lib/systemd/system/docker.service and set a new command line to launch the Docker daemon:

ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock -H fd://

Finally, we have to launch the Swarm join on each node:

  • On node 10.100.0.2
docker run --restart=always -d swarm join --advertise=10.100.0.2:2375 consul://10.100.0.1:8500
  • On node 10.100.0.3
docker run --restart=always -d swarm join --advertise=10.100.0.3:2375 consul://10.100.0.1:8500
  • On node 10.100.0.4
docker run --restart=always -d swarm join --advertise=10.100.0.4:2375 consul://10.100.0.1:8500
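At this point you can check, from the master (or from any host that can reach it), that the nodes have properly joined the swarm; for instance:

$ docker -H 10.100.0.1:4000 info
$ docker run --rm swarm list consul://10.100.0.1:8500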

Next steps: communicating the containers

If you launch new containers as usual (i.e. docker run -it containerimage bash), you will get containers with overlapping IPs. This is because you are using the default network scheme in the individual docker servers.

If you want to have a common network, you need to create an overlay network that spans across the different docker daemons.

But in order to do it, you need to change the way that the Docker daemons are started. You need a system to coordinate the network, and it can be the same Consul that we are already using.

So you have to append the next flags to the command line that starts docker:

 --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500

You can add the parameters to the Docker configuration file /etc/default/docker. In the case of the internal nodes, the result will be the following (according to our previous modifications):

DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500"

As stated before, Docker version 1.11 has a bug and does not properly use that file. In the meantime, you can modify the file /lib/systemd/system/docker.service and set a new command line to launch the Docker daemon:

ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500

(*) We are using eth1 because it is the device that holds our internal IP address. You should use the device to which the 10.100.0.x address is assigned.

Now you must restart the docker daemons of ALL the nodes in the swarm.

Once they have been restarted, you can create a new network for the swarm:

$ docker -H 10.100.0.1:4000 network create swarm-network

And then you can use it for the creation of the containers:

$ docker -H 10.100.0.1:4000 run -it --net=swarm-network ubuntu:latest bash

Now the IPs will be given in a coordinated way, and the containers will have several IPs (the IP in the swarm network and the IP in the local Docker server).

Some more words on this

This post was written in May 2016. Both Docker and Swarm are evolving and this post may soon be outdated.

Some things that bother me about this installation…

  • While using the overlay network, if you expose one port using the flag -p, the port is exposed on the IP of the internal Docker host. I think that you should be able to state on which IP you want to expose the port, or use the IP of the main server.
    • I solve this issue by using a development made by me, IPFloater: once I create the container, I get the internal IP on which the port is exposed and I create a redirection in IPFloater, to be able to access the container through a specific IP.
  • Consul fails A LOT. If I leave the swarm running for hours (e.g. 8 hours), Consul will probably fail. If I run a command like “docker run --rm=true swarm list consul://10.100.0.1:8500”, it states that it has failed. Then I have to delete the container and create a new one.