How to dynamically create on-demand services to respond to incoming TCP connections

Some time ago I had the problem of dynamically starting virtual machines when an incoming connection was received on a port. The exact problem was to have a VM that was powered off, start it whenever an incoming SSH connection was received, and then forward the network traffic to that VM to serve the SSH request. This way, I could have a server in a cloud provider (e.g. Amazon) and not spend money while I was not using it.

This problem has been named “the sleeping beauty”, because of the tale. It was like having a sleeping virtual infrastructure (i.e. the sleeping beauty) that would be awakened when an incoming connection (i.e. the kiss) was received from the user (i.e. the prince).

Now I have figured out how to solve that problem, and that is why this time I learned

How to dynamically create on-demand services to respond to incoming TCP connections

The way to solve it is very straightforward, as it is fully based on the socat application.

socat is “a relay for bidirectional data transfer between two independent data channels”. It can be used to forward the traffic received on a port to another IP:PORT pair.

A simple example is:

$ socat tcp-listen:10000 tcp:localhost:22 &

And now we can SSH to localhost in the following way:

$ ssh localhost -p 10000

The interesting thing is that socat is able to exec one command upon receiving a connection (using the EXEC or SYSTEM address types as the destination of the relay). But the most important thing is that socat will establish the communication using stdin and stdout.

So it is possible to do this fun thing:

$ socat tcp-listen:10000 SYSTEM:'echo "hello world"' &
[1] 11136
$ wget -q -O- http://localhost:10000
hello world
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'echo "hello world"'

Now that we know that the communication is established using stdin and stdout, we can somehow abuse socat and try this even funnier thing:

$ socat tcp-listen:10000 SYSTEM:'echo "$(date)" >> /tmp/sshtrack.log; socat - "TCP:localhost:22"' &
[1] 27421
$ ssh localhost -p 10000 cat /tmp/sshtrack.log
Wed Feb 27 14:36:45 CET 2019
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'echo "$(date)" >> /tmp/sshtrack.log; socat - "TCP:localhost:22"'

The effect is that we can execute commands and redirect the connection to an arbitrary IP:PORT.

Now, it is easy to figure out how to dynamically spawn servers to serve the incoming TCP requests. An example that spawns a one-shot web server on port 8080 to serve requests arriving at port 10000 is the following:

$ socat tcp-listen:10000 SYSTEM:'(echo "hello world" | nc -l -p 8080 -q 1 > /dev/null &) ; socat - "TCP:localhost:8080"' &
[1] 31586
$ wget -q -O- http://localhost:10000
hello world
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'(echo "hello world" | nc -l -p 8080 -q 1 > /dev/null &) ; socat - "TCP:localhost:8080"'

And now you can customize your scripts to create the effective servers on demand.

The sleeping beauty application

I have used these proofs of concept to create the sleeping-beauty application. It is open source, and you can get it on GitHub.

The sleeping beauty is a system that helps to implement serverless infrastructures: you have the servers asleep (or not even created), and they are awakened (or created) as they are needed. Later, they go back to sleep (or they are disposed of).

In the sleeping-beauty, you can configure services that listen on a port, and the commands that socat should use to start, check the status of, or stop the effective services. Moreover, it implements an idle-detection mechanism that is able to check whether the effective service is idle and, if it has been idle for a period of time, stop it to save resources.

Example: in the use case described above, the command used to start the service will contact Amazon AWS and start a VM. The command to stop the service will contact Amazon AWS to stop the VM. And the command to check whether the service is idle or not will SSH into the VM and execute the command ‘who’.
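
Just to give an idea, here is a minimal sketch of what those three commands could look like using the AWS CLI (the instance ID, user and hostname are placeholders of mine, not part of the actual sleeping-beauty configuration):

# start the service: boot the VM in AWS (the instance ID is a placeholder)
$ aws ec2 start-instances --instance-ids i-0123456789abcdef0
# check whether the service is idle: list the logged-in users on the VM
$ ssh ubuntu@my-vm.example.com who
# stop the service: power the VM off again to stop spending money
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0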

How to install OpenStack Rocky – part 1

This is the first post of a series in which I am describing the installation of an OpenStack site using the latest distribution at the time of writing: Rocky.

My project is very ambitious, because I have 2 virtualization nodes (each with a different GPU), 10GbE, a lot of memory and disk, and I want to offer the GPUs to the VMs. The front-end is a 24-core server, with 32 GB of RAM and 6 TB of disk, with 4 network ports (2x10GbE+2x1GbE), that will also act as the block device server.

We’ll be using Ubuntu 18.04 LTS for all the nodes, and I’ll try to follow the official documentation. But I will try to be very straightforward in the configuration… I want to make it work, and I will try to explain how things work instead of tuning the configuration.

How to install OpenStack Rocky – part 1

My setup for the OpenStack installation is the next one:

horsemen

In the figure I have annotated the most relevant data to identify the servers: the IP addresses for each interface, which is the volume server and the virtualization nodes that will share their GPUs.

In the end, the server horsemen will host the following services: keystone, glance, cinder, neutron and horizon. On the other side, fh01 and fh02 will host the compute and neutron-related services.

In each of the servers we need a network interface (eno1, enp1s0f1 and enp1f0f1) which is intended for administration purposes (i.e. the network 192.168.1.0/24). That network has a gateway (192.168.1.220) that enables access to the internet via NAT. From now on, we’ll call these interfaces the “private interfaces“.

We need an additional interface that is connected to the provider network (i.e. to the internet). That network will hold the publicly routable IP addresses. In my case, I have the network 158.42.1.0/24, which is publicly routable. It is a flat network with its own network services (e.g. gateway, nameservers, DHCP servers, etc.). From now on, we’ll call these interfaces the “public interfaces“.

One note on the “eno4” interface in horsemen: I am using this interface for accessing horizon. In case you do not have a spare interface, you can use interface aliasing or provide the IP address in the ifupdown “up” script.

An extra note on the “eno2” interface in horsemen: it is an extra interface in the node. It will be left unused during this installation, but it will later be configured to work in bond mode with “eno1”.

IMPORTANT: In the whole series of posts, I am using the passwords as they appear: RABBIT_PASS, NOVA_PASS, NOVADB_PASS, etc. You should change them according to a secure password policy, but they are set as-is to ease understanding the installation. Anyway, most of them will be fine if you have an isolated network and the services listen only on the management network (e.g. mysql will only be configured to listen on the management interface).

Some words on the Openstack network (concepts)

The basic installation of OpenStack considers two networks: the provider network and the management network. The provider network means “the network that is attached to the provider”, i.e. the network where the VMs can have publicly routable IP addresses. On the other side, the management network is a private network that is (probably) isolated from the provider one. The computers in that network have private IP addresses (e.g. 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12).

The basic deployment of OpenStack considers that the controller node does not need to have a routable IP address. Instead, it can be accessed by the admin through the management network. That is why the “eno3” interface does not have an IP address.

In the OpenStack architecture, horizon is a separate piece, so horizon is the one that will need a routable IP address. As I want to install horizon also in the controller, I need a routable IP address, and that is why I put a publicly routable IP address on “eno4” (158.42.1.1).

In my case, I had a spare network interface (eno4), but if you do not have one, you can create a bridge, add your “interface connected to the provider network” (i.e. “eno3”) to that bridge, and then add a publicly routable IP address to the bridge.

IMPORTANT: this is not part of my installation. Although it may be part of your installation.

brctl addbr br-public
brctl addif br-public eno3
ip link set dev br-public up
ip addr add 158.42.1.1/16 dev br-public

Configuring the network

One of the first things that we need to set up is to configure the network for the different servers.

Ubuntu 18.04 has moved to netplan but, at the time of writing this text, I have not found any mechanism to bring an interface UP without providing an IP address for it using netplan. Moreover, when trying to use ifupdown, netplan is not totally disabled and interferes with options such as dns-nameservers for the static addresses. In the end I needed to install ifupdown and make a mix of configuration using both netplan and ifupdown.

It is very important to disable IPv6 on all of the servers because, if you do not, you will probably face a problem when using the public IP addresses. You can read more in this link.

To disable IPv6, we need to execute the following lines in all the servers (as root):

# sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.default.disable_ipv6=1
# cat >> /etc/default/grub << EOT
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1"
GRUB_CMDLINE_LINUX="ipv6.disable=1"
EOT
# update-grub

We’ll disable IPv6 for the current session, and persist it by disabling at boot time. If you have customized your grub, you should check the options that we are setting.

Configuring the network in “horsemen”

You need to install ifupdown to be able to bring up an interface without an IP address but connected to the internet, as needed by the neutron-related services:

# apt update && apt install -y ifupdown

Edit the file /etc/network/interfaces and adjust it with a content like the next one:

auto eno3
iface eno3 inet manual
up ip link set dev $IFACE up
down ip link set dev $IFACE down

Now edit the file /etc/netplan/50-cloud-init.yaml to set the private IP address:

network:
  ethernets:
    eno4:
      dhcp4: true
    eno1:
      addresses:
        - 192.168.1.240/24
      gateway4: 192.168.1.221
      nameservers:
        addresses: [ 192.168.1.220, 8.8.8.8 ]
  version: 2

When you save these settings, you can issue the next commands:

# netplan generate
# netplan apply

Now we’ll edit the file /etc/hosts, and will add the addresses of each server. My file is the next one:

127.0.0.1       localhost.localdomain   localhost
192.168.1.240   horsemen controller
192.168.1.241   fh01
192.168.1.242   fh02

I have removed the entry 127.0.1.1 because I read that it may interfere. And I also removed all the IPv6 entries because I disabled IPv6.

Configuring the network in “fh01” and “fh02”

Here is the short version of the configuration of fh01:

# apt install -y ifupdown
# cat >> /etc/network/interfaces <<\EOT
auto enp1s0f0
iface enp1s0f0 inet manual
up ip link set dev $IFACE up
down ip link set dev $IFACE down
EOT

Here is my file /etc/netplan/50-cloud-init.yaml for fh01:

network:
  ethernets:
    enp1s0f1:
      addresses:
        - 192.168.1.241/24
      gateway4: 192.168.1.221
      nameservers:
        addresses: [ 192.168.1.220, 8.8.8.8 ]
  version: 2

Here is the file /etc/hosts for fh01:

127.0.0.1 localhost.localdomain localhost
192.168.1.240 horsemen controller
192.168.1.241 fh01
192.168.1.242 fh02

You can export this configuration to fh02 by adjusting the IP address in the /etc/netplan/50-cloud-init.yaml file.
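
For reference, this is roughly what it would look like on fh02, assuming that the interface name is the same as in fh01 (I use a heredoc here just to keep the post short; you can edit the file by hand as before):

# cat > /etc/netplan/50-cloud-init.yaml <<\EOT
network:
  ethernets:
    enp1s0f1:
      addresses:
        - 192.168.1.242/24
      gateway4: 192.168.1.221
      nameservers:
        addresses: [ 192.168.1.220, 8.8.8.8 ]
  version: 2
EOT
# netplan generate
# netplan apply

The /etc/network/interfaces and /etc/hosts files are the same as in fh01.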

Reboot and test

Now it is a good moment to reboot your systems and test that the network is properly configured. If it is not, please make sure it is working before continuing; otherwise the next steps will make no sense.

From each of the hosts you should be able to ping the outside world and ping the other hosts. These are the tests from horsemen, but you should be able to repeat them from each of the servers.

root@horsemen# ping -c 2 www.google.es
PING www.google.es (172.217.17.3) 56(84) bytes of data.
64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=1 ttl=54 time=7.26 ms
64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=2 ttl=54 time=7.26 ms

--- www.google.es ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 7.262/7.264/7.266/0.002 ms
root@horsemen# ping -c 2 fh01
PING fh01 (192.168.1.241) 56(84) bytes of data.
64 bytes from fh01 (192.168.1.241): icmp_seq=1 ttl=64 time=0.180 ms
64 bytes from fh01 (192.168.1.241): icmp_seq=2 ttl=64 time=0.113 ms

--- fh01 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 0.113/0.146/0.180/0.035 ms
root@horsemen# ping -c 2 fh02
PING fh02 (192.168.1.242) 56(84) bytes of data.
64 bytes from fh02 (192.168.1.242): icmp_seq=1 ttl=64 time=0.223 ms
64 bytes from fh02 (192.168.1.242): icmp_seq=2 ttl=64 time=0.188 ms

--- fh02 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1027ms
rtt min/avg/max/mdev = 0.188/0.205/0.223/0.022 ms

Prerequisites for OpenStack on the server (horsemen)

Remember: for simplicity, I will use obvious passwords like SERVICE_PASS or SERVICEDB_PASS (e.g. RABBIT_PASS). You should change these passwords, although most of them will be fine if you have an isolated network and the services listen only on the management network.

First of all, we are installing the prerequisites. We will start with the NTP server, which will keep the time synchronized between the controller (horsemen) and the virtualization servers (fh01 and fh02). We’ll install chrony (recommended in the OpenStack documentation) and allow any computer in our private network to connect to this new NTP server:

# apt install -y chrony
# cat >> /etc/chrony/chrony.conf << EOT
allow 192.168.1.0/24
EOT
# service chrony restart

Now we are installing and configuring the database server (we’ll use mariadb as it is used in the basic installation):

# apt install mariadb-server python-pymysql
# cat > /etc/mysql/mariadb.conf.d/99-openstack.cnf << EOT
[mysqld]
bind-address = 192.168.1.240

default-storage-engine = innodb
innodb_file_per_table = on
max_connections = 4096
collation-server = utf8_general_ci
character-set-server = utf8
EOT
# service mysql restart

Now we are installing rabbitmq, that will be used to orchestrate message interchange between services (please change RABBIT_PASS).

# apt install rabbitmq-server
# rabbitmqctl add_user openstack "RABBIT_PASS"
# rabbitmqctl set_permissions openstack ".*" ".*" ".*"

At this moment, we have to install memcached and configure it to listen in the management interface:

# apt install memcached

# echo "-l 192.168.1.240" >> /etc/memcached.conf

# service memcached restart

Finally, we need to install etcd and configure it to be accessible by OpenStack:

# apt install etcd
# cat >> /etc/default/etcd << EOT
ETCD_NAME="controller"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-01"
ETCD_INITIAL_CLUSTER="controller=http://192.168.1.240:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.240:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.240:2379"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.240:2379"
EOT
# systemctl enable etcd
# systemctl start etcd

Now we are ready to proceed with the installation of the OpenStack Rocky packages… (continue to part 2)

How to create compressed and/or encrypted bash scripts

The immediate use of bash scripting is to automate tasks, but the command line offers plenty of tools that are powerful and easy enough to use, so there is no need to turn to other languages such as Python, Perl, etc., which are harder to learn, just to implement workflows that glue other tools together.

When you try to make these scripts usable by third parties, their complexity increases as the amount of code that performs checks, verifications, etc. grows. In the end, you have created scripts that are reasonably big. Moreover, when you want to transfer these applications to others, you may want to avoid their redistribution.

So this time I explored…

How to create compressed and/or encrypted bash scripts

TL;DR: An extended version of the code that gzips and encrypts a script, so that it only runs when a password (or license) is provided, is in the repository named LSC – License Shell Code.

Bash scripting is a powerful language to develop applications that manage servers, but also to automate processes in Linux, to execute batch tasks, or for other tasks that can be executed in the Linux command line, e.g. implementing workflows in scientific applications (processing the output of one application to prepare it as input for another).

But if your script grows a lot, maybe you want to compress it. Moreover, compressing the code somehow obfuscates it, so it reduces the chance of your code being re-used in other applications without permission (at least for common end-users that do not master bash scripting).

Compressing the script and running it

Having a bash script in a file named example, compressing it is very simple. The next commands create the example script and generate a gzipped version:

$ cat > example <<EOT
#!/bin/bash
echo "hello world"
EOT
$ cat > compressedscript <<EOT
#!/bin/bash
eval "\$(echo "$(cat example | gzip | base64)" | base64 -d | gunzip)"
EOT

Now the compressed script looks like this one:

#!/bin/bash
eval "$(echo "H4sIAOMSmVsAA1NW1E/KzNNPSizO4EpNzshXUMpIzcnJVyjPL8pJUeICAFeChBYfAAAA" | base64 -d | gunzip)"

And executing it is as simple as chmodding it and running it:

$ chmod +x compressedscript
$ ./compressedscript
hello world

As you can see, the process is very simple. We just embedded the gzipped source code (*) in a new script and took advantage of the eval built-in.

(*) We needed to base64-encode the gzipped file to be able to embed it in a plain text file.

Encrypting the script

Now that we know that the eval approach works for our purposes, we can generalize our proof of concept to embed more complex things. For example, we can encrypt our script…

To encrypt things, we can use openssl as shown in this post. Briefly, we can encrypt a file with a command like the next one:

$ openssl enc -aes-256-cbc -salt -in file.txt -out file.txt.enc

And decrypt an encrypted file with a command like the next one:

$ openssl enc -aes-256-cbc -d -in file.txt.enc -out file.txt

We can apply encryption to our case, in addition to gzipping our script. The resulting code is very similar to the previous case, but adding the encryption step:

$ cat > compressedscript <<EOT
#!/bin/bash
read -p "Please provide a password: " PASSWORD
eval "\$(echo "$(cat example | gzip | openssl enc -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "mypasswd" | base64 | tr -d '\n')" | base64 -d | openssl enc -d -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "\$PASSWORD" | gunzip)"
EOT

The resulting content for the new compressedscript file is the next one:

#!/bin/bash
read -p "Please provide a password: " PASSWORD
eval "$(echo "U2FsdGVkX18gnaQ3jFPBfhalu0/riWaRirtWHcFWgFqGFuzf3s98T/Y65Km2oe4jqGaAXlDCBX6+oWwWgqMIOBfG/O3P7qgmLTRpkzvShwc=" | base64 -d | openssl enc -d -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "$PASSWORD" | gunzip)"

If we run that script, it will prompt for a password to decode the script (in our case it is “mypasswd”), and then it will be run:

$ ./compressedscript
Please provide a password: mypasswd
hello world
$ ./compressedscript
Please provide a password: otherpasswd
bad decrypt
140040196646552:error:06065064:digital envelope routines:EVP_DecryptFinal_ex:bad decrypt:evp_enc.c:529:

gzip: stdin: not in gzip format

As shown in the output above, if a different password is provided, the code will not be run.

Further reading

An extended version of the code of this post is included in LSC – License Shell Code. LSC is an application that generates shell applications (from existing ones) that need a license code to be run.

NOTES: Providing your users with a LICENSE code gives them a more professional distribution and a more tailored solution. Requiring the code also implies knowing what they are doing, and so it also acts as a first barrier (both from the point of view of knowledge and from the point of view of ethics).

How to deal with the Union File Systems that Docker uses (OverlayFS and AUFS)

Containers are a modern application delivery mechanism (very interesting for software reproducibility). As I commented in my previous post, the undoubted winner of the hype is Docker. Most container implementations (e.g. Docker) are supported by the Linux kernel, by using namespaces, cgroups, other technologies… and chroots to the filesystem of the system to virtualize.

Using lightweight virtualization increases the density of the virtualized units, but many containers may share the same base system (e.g. the plain OS with some utilities installed) and only modify a few files (e.g. installing one application or updating the configuration files). And that is why Docker and others use Union File Systems to implement the filesystems of the virtualized units.

The recent publication of the NIST “Application Container Security Guide” suggests that “An image should only include the executables and libraries required by the app itself; all other OS functionality is provided by the OS kernel within the underlying host OS. Images often use techniques like layering and copy-on-write (in which shared master images are read only and changes are recorded to separate files) to minimize their size on disk and improve operational efficiency“. This strengthens the usage of Union File Systems for containers, as Docker has done for as long as I can remember (AUFS, OverlayFS, OverlayFS2, etc.).

The conclusion is that Union File Systems are actually used in Docker, and I need to understand how to deal with them if anything fails. So this time I learned…

How to deal with the Union File Systems that Docker uses (AUFS, OverlayFS and Overlay2)

AUFS

First of all, it is important to know how AUFS works. It is intuitively simple and I think that it is well explained in the Docker documentation. The next image is from the documentation of Docker (just in case the URL changes):

aufs_layers

The idea is to have a set of layers, which consist of different directory trees, and they are combined to show a single one which is the result of the ordered combination of the different directory trees. The order is important, because if one file is present in more than one directory tree, you will only see the version in the “upper layer”.

There are different readonly layers, and a working layer that gathers the modifications to the resulting filesystem: if some files are modified (or added), the new version will appear in the working layer; if any file is deleted, some metadata is added to the working layer to instruct the kernel to hide the file in the resulting filesystem (we’ll see some practical examples).

Working with AUFS

We are preparing a test example to see how AUFS works and to show that it is very easy to understand and to work with.

We’ll have 2 base folders (named layer1 and layer2), a working folder (named upperlayer), and a folder named mountedfs that will hold the combined filesystem. The next commands will create the starting scenario:

root:root# cd /tmp
root:tmp# mkdir aufs-test
root:tmp# cd aufs-test/
root:aufs-test# mkdir layer1 layer2 upperlayer mountedfs
root:aufs-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:aufs-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:aufs-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:aufs-test# echo "content for file3.txt in layer2" > layer2/file3.txt

Both layer1 and layer2 have a file with the same name (file1.txt) with different content, and there are different files in each layer. The result is shown in the next figure:

aufs-r1

Now we’ll mount the filesystem using the basic syntax:

root:aufs-test# mount -t aufs -o br:upperlayer:layer1:layer2 none mountedfs

The result is that folder mountedfs contains the union of the files that we have created:

aufs-r2.png

The whole syntax and options is explained in the manpage of aufs (i.e. man aufs), but we’ll be using the basic options.

The key for us is the option br, in which we set the branches (i.e. layers) that will be unioned in the resulting filesystem. They have precedence from left to right. That means that if one file exists in two layers, the version shown in the AUFS filesystem will be the version of the leftmost layer.

In the next figure we can see the contents of the files in the mounted AUFS folder:

aufs-t3

In our case, file1.txt contains “content for file1.txt in layer1”, as expected because of the order of the layers.

Now if we create a new file (file4.txt) with the content “new content for file4.txt”, it will be created in the folder upperlayer:

aufs-r3

If we delete the file “file1.txt”, it will be kept in each of the layers (i.e. folders layer1 and layer2). But it will be marked as deleted in folder upperlayer by including some control files (although these files will not be shown in the resulting mounted filesystem).

aufs-r4.png

The key for the AUFS driver are the files named .wh.*. In this case, we can see that the deletion is reflected in the upperlayer folder by creating the file .wh.file1.txt. That file instructs AUFS to hide the file in the resulting mount point. If we create the file again, the file will appear again and the control file for the deletion will be removed.

aufs-r5.png

Of course, the content of file1.txt in layer1 and layer2 folders is kept.
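
If you prefer to reproduce the sequence of the screenshots from the command line, it boils down to something like the following (the comments state what should be expected, according to the explanation above):

root:aufs-test# echo "new content for file4.txt" > mountedfs/file4.txt
root:aufs-test# ls upperlayer/                 # file4.txt should appear here, and not in layer1 or layer2
root:aufs-test# rm mountedfs/file1.txt
root:aufs-test# ls -a upperlayer/              # a whiteout file named .wh.file1.txt should appear
root:aufs-test# cat layer1/file1.txt layer2/file1.txt   # the original contents are still kept in the layers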

Docker and AUFS

Docker makes use of AUFS in Ubuntu (although it is being replaced by overlay2). We’ll explore a bit by running a container and searching for its filesystem…

aufs-d1.png

We can see that our container ID is d5afc60dbfd7. If we see the mounts, by simply typing the command “mount” we’ll see that we have a new AUFS mounted point:

aufs-d2.png

Well… we are smart and we know that they are related… but how? We need to check folder /var/lib/docker/image/aufs/layerdb/mounts/ and there we will find a folder named as the ID for our container (d5afc60dbfd7…). And several files in it:

aufs-d3.png

And the mount-id file contains an ID that corresponds to a folder in /var/lib/docker/aufs/mnt/, which holds the unioned filesystem that is the root filesystem for container d5afc60dbfd7. Such a folder corresponds to the mount point exposed when we inspected the mount points before.

aufs-d4

In the folder /var/lib/docker/aufs/layers we can inspect the information about the layers. In particular, we can see the content of the file that corresponds to the ID of our mount point:

aufs-d5.png

Such content corresponds to the layers that have been used to create the mount point at /var/lib/docker/aufs/mnt/. The directory trees that correspond to these layers are included in folders with the corresponding names in the folder /var/lib/docker/aufs/diff. In the case of our container, if we create a new file, we can see that it appears in the working layer.

aufs-d6.png
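
If you want to repeat this inspection on your own machine (as root, and only if your Docker installation is using the aufs storage driver), the steps shown in the screenshots roughly correspond to commands like the next ones (d5afc60dbfd7 is the short container ID from my example; use your own):

# get the full ID of the container
CID=$(docker inspect -f "{{.Id}}" d5afc60dbfd7)
# the mount-id that names the unioned filesystem of the container
MOUNTID=$(cat /var/lib/docker/image/aufs/layerdb/mounts/$CID/mount-id)
# the root filesystem of the container, as exposed in the mount point
ls /var/lib/docker/aufs/mnt/$MOUNTID
# the layers that are combined to build that mount point
cat /var/lib/docker/aufs/layers/$MOUNTID
# the directory trees of the individual layers (including the working layer)
ls /var/lib/docker/aufs/diff/ | head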

OverlayFS and Overlay2

While AUFS is only supported in some distributions (debian, gentoo, etc.), OverlayFS is included in the Linux Kernel.

The schema of OverlayFS is shown in the next image (obtained from Docker docs):

overlay_constructs.jpg

The underlying idea is the same as in AUFS, and the concepts are almost the same: layers that combine together to build a unioned filesystem. The lowerdir folders correspond to the readonly layers in AUFS, while the upperdir corresponds to the read/write layer.

OverlayFS needs an extra folder named workdir that takes the place of the .wh.* hidden files in AUFS, but it is also used to support atomic operations on the filesystem.

The main difference between OverlayFS and Overlay2 is that OverlayFS only supported merging 1 single readonly layer with 1 read/write layer (although overlaying could be nested by overlaying already overlaid layers). Overlay2 now supports up to 128 lower layers.

Working with OverlayFS

We are preparing an equivalent test example to see how OverlayFS works and to show that it is also very easy to understand and to work with.

root:root# cd /tmp
root:tmp# mkdir overlay-test
root:tmp# cd overlay-test/
root:overlay-test# mkdir layer1 layer2 upperlayer workdir mountedfs
root:overlay-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:overlay-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:overlay-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:overlay-test# echo "content for file3.txt in layer2" > layer2/file3.txt

With respect to the AUFS example, we had to include an extra folder (workdir) that is needed for OverlayFS to work.

overlay-1.png

Now we’ll mount the filesystem using the basic syntax:

root:overlay-test# mount -t overlay -o lowerdir=layer1:layer2,upperdir=upperlayer,workdir=workdir overlay mountedfs

The result is that folder mountedfs contains the union of the files that we have created:

overlay-2.png

As expected, we get the union of the 2 existing layers, and the contents of these files are as expected. The lowerdir folders are interpreted from left to right for the purposes of precedence. So if one file exists in different lowerdirs, the unioned filesystem will show the file in the leftmost lowerdir.

Now if we create a new file, the new contents will be created in the upperdir folder, while the contents in the other folders will be kept.

overlay-3
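
As before, this can be checked from the command line with something like:

root:overlay-test# echo "new content for file4.txt" > mountedfs/file4.txt
root:overlay-test# ls upperlayer/              # the new file should show up here
root:overlay-test# ls layer1/ layer2/          # the lower layers remain untouched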

What can I do with this?

Apart from understanding how to better debug products such as Docker, you will be able to start containers the Docker way using layered filesystems, using the tools shown in a previous post.

How to run Docker containers using common Linux tools (without Docker)

Containers are a current virtualization and application delivery trend, and I am working on them. If you search Google for them, you can find tons of how-tos, information, tech guides, etc. As with anything in IT, there are flavors of containers. In this case, the players are Docker, LXC/LXD (on which Docker was once based), CoreOS rkt, OpenVZ, etc. If you have a look at Google Trends, you’ll notice that undoubtedly the winner of the hype is Docker and “the others” try to fight against it.

trends_containers

But as there are several alternatives, I wanted to learn about the underlying technology, and it seems that all of them are simply based on a set of kernel features: mainly the Linux namespaces and the cgroups. The most important differences are the utilities that they provide to automate the procedures (the repository of images, container management and other parts of the ecosystem of a particular product).

Disclaimer: This is not a research blog, and so I am not going in depth on when namespaces were introduced in the kernel, which namespaces exist, how they work, what is copy on write, what are cgroups, etc. The purpose of this post is simply “fun with containers” 🙂

At the end, the “hard work” (i.e. the execution of a containerized environment) is made by the Linux kernel. And so this time I learned…

How to run Docker containers using common Linux tools (without Docker).

We start from a scenario in which we have one container running in Docker, and we want to run it using standard Linux tools. We will mainly act as a common user that has permissions to run Docker containers (i.e. in the case of Ubuntu, my user calfonso is in group ‘docker’), to see that we can run containers in the user space.

1

TL;DR

To run a contained environment with its own namespaces, using standard Linux tools you can follow the next procedure:

calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar
calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/
calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot $PWD/rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

At this point you need to set up the network devices (from outside the container) and deal with the cgroups (if you need to).

In first place, we are preparing a folder for our tests (handcontainer) and then we will dump the filesystem of the container:

calfonso:~$ mkdir handcontainer
calfonso:~$ cd handcontainer
calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar

If we check the tar file produced, we’ll see that it is the whole filesystem of the container

2

Let’s extract it in a new folder (called rootfs)

calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/

This action will raise an error, because only the root user can use the mknod application and it is needed for the /dev folder, but it will be fine for us because we are not dealing with devices.

3

If we check the contents of rootfs, the filesystem is there and we can chroot to that filesystem to verify that we can use it (more or less) as if it was the actual system.

4

The chroot technique is well known and it was enough in the early days, but we have no isolation in this system. This is exposed if we use the next commands:

/ # ip link
/ # mount -t proc proc /proc && ps -ef
/ # hostname

In these cases, we can manipulate the network of the host, interact with the processes of the host or manipulate the hostname.

This is because using chroot only changes the root filesystem for the current session, but it takes no other action.

Some words on namespaces

One of the “magic” pieces of containers is the namespaces (you can read more on this in this link). Namespaces make one process have a particular vision of “the things” in several areas. The namespaces that are currently available in Linux are the following:

  • Mounts namespace: mount points.
  • PID namespace: process number.
  • IPC namespace: Inter Process Communication resources.
  • UTS namespace: hostname and domain name.
  • Network namespace: network resources.
  • User namespace: User and Group ID numbers.

Namespaces are handled in the Linux kernel, and any process is already in one namespace (i.e. the root namespace). So changing the namespaces of one particular process does not introduce additional complexity for the processes.

Creating particular namespaces for particular processes means that one process will have its particular vision of the resources in that namespace. As an example, if one process is started with its own PID namespace, the PID number of the process will be 1 and its children will have the next PID numbers. Or if one process is started with its own NET namespace, it will have a particular stack of network devices.

The parent namespace of one namespace is able to manipulate the nested namespace… It is a “hard” sentence, but what this means is that the root namespace is always able to manipulate the resources in the nested namespaces. So the root of one host has the whole vision of the namespaces.

Using namespaces

Now that we know about namespaces, we want to use them 😉

We can think of a container as one process (e.g. a /bin/bash shell) that has its particular root filesystem, its particular network, its particular hostname, its particular PIDs and users, etc. And this can be achieved by creating all these namespaces and spawning the /bin/bash processes inside of them.

The Linux kernel includes the calls clone, setns and unshare that make it possible to easily manipulate the namespaces of processes. But the common Linux distributions also provide the commands unshare and nsenter that make it possible to manipulate the namespaces of processes and applications from the command line.

If we get back to the main host, we can use the command unshare to create a process with its own namespaces:

calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

It seems that nothing happened, except that we are “root”, but if we start using commands that manipulate the features in the host, we’ll see what happened.

8

If we echo the PID of the current process ($$) we can see that it is 1 (the main process), the current user has UID and GID 0 (he is root), we do not have any network device, we can manipulate the hostname…

If we check the processes in the host, in another terminal, we’ll see that even though we are shown as ‘root’ inside, outside the process is executed under the credentials of our regular user:

9.png

This is the magic of the PID namespace, which makes one process have different PID numbers depending on the namespace.

Back in our “unshared” environment, if we try to show the processes that are currently running, we’ll get the vision of the processes in the host:

10.png

This is because of how Linux works: the processes are exposed as file descriptors in the /proc mount point and, in our environment, we still have access to the existing mount points. But as we have our own mnt namespace, we can mount our own /proc filesystem:

11

From outside the container, we will be able to create a network device and put it into the namespace:

12

And if we get back to our “unshared” environment, we’ll see that we have a new network device:

13.png

The network setup is incomplete, and we will have access to nowhere (the peer of our eth0 is not connected to any network). This falls out of the scope of this post, but the main idea is that you will need to connect the peer to some bridge, set an IP address for the eth0 inside the unshared environment, set up NAT in the host, etc.
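
Just to sketch the idea (this is not a complete recipe, and the names, addresses and the way of locating the PID are assumptions of mine), the commands involved would be something like the following, run as root in the host:

# PID of the unshared process, as seen from the host (adjust to your own case)
CPID=$(pgrep -f "unshare --mount --uts" | head -n 1)
# create a veth pair and move one end into the namespaces of that process
ip link add veth-host type veth peer name eth0
ip link set eth0 netns $CPID
ip link set veth-host up
# then, attach veth-host to a bridge (or NAT it) in the host and, inside the
# unshared environment, assign an address and a route to eth0, e.g.:
#   ip addr add 10.200.0.2/24 dev eth0
#   ip link set eth0 up
#   ip route add default via 10.200.0.1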

Obtaining the filesystem of the container

Now that we are in an “isolated” environment, we want to have the filesystem, utilities, etc. from the container that we started. And this can be done with our old friend “chroot” and some mounts:

root:handcontainer# chroot rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

Using chroot, the filesystem changes and we can use all the new mount points, commands, etc. in that filesystem. So now we have the vision of being inside an isolated environment with an isolated filesystem.

Now we have finished setting up a “hand made container” from an existing Docker container.
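
As a side note, the nsenter command mentioned before can be used from another terminal (as root) to get an additional shell inside the same namespaces. A minimal sketch, assuming that $CPID holds the PID of the unshared process as seen from the host (the exact flags may need adjusting, e.g. adding --root to use the root directory of the target process):

# enter the mount, uts, ipc, net and pid namespaces of the target process
nsenter --target $CPID --mount --uts --ipc --net --pid /bin/sh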

Further work

Apart from the “contained environment”, Docker containers are also managed inside cgroups. Cgroups make it possible to account for and to limit the resources that the processes are able to use (i.e. CPU, I/O, memory and devices), and that will be interesting to better control the resources that the processes will be allowed to use (and how).

It is possible to explore the cgroups in the path /sys/fs/cgroup. In that folder you will find the different cgroups that are managed in the system. Dealing with cgroups is a bit “obscure” (creating subfolders, adding PIDs to files, etc.), and will be left for another eventual post.
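
Just as a tiny taste of what this looks like, a minimal cgroup-v1 style sketch to limit the memory of our process could be something like the next one (paths and values are only an example and may differ in your distribution):

# create a new cgroup under the memory controller
mkdir /sys/fs/cgroup/memory/handcontainer
# limit the memory available to the processes of the cgroup to 64 MB
echo $((64*1024*1024)) > /sys/fs/cgroup/memory/handcontainer/memory.limit_in_bytes
# add our unshared process (PID as seen from the host) to the cgroup
echo $CPID > /sys/fs/cgroup/memory/handcontainer/cgroup.procs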

Another feature that Docker offers is layered filesystems. The layered filesystem is used in Docker basically to have a common filesystem and only track the modifications. So there is a set of common layers for different containers (that will not be modified) and each of the containers will have a layer that makes its filesystem unique from the others.

In our case, we used a simple flat filesystem for the container, that we used as root filesystem for our contained environment. Dealing with layered filesystem will be a new post 😉

And now…

Well, in this post we tried to understand how the containers work and see that it is a relatively simple feature that is offered by the kernel. But it involves a lot of steps to have a properly configured container (remember that we left the cgroups out of this post).

We did these steps just because we could… just to better understand containers.

My advice is to use the existing technologies to be able to use well-built containers (e.g. Docker).

Further reading

As in other posts, I wrote this just to arrange my concepts following a very simple step-by-step procedure. But you can find a lot of resources about containers using your favourite search engine. The most useful resources that I found are:

How to (securely) contain users using Docker containers

Docker containers have proved to be very useful to deliver applications. They make it possible to pack all the libraries and dependencies needed by an application and to run it in any system. One of the main drawbacks argued by Docker competitors is that the Docker daemon runs as root and it may introduce security threats.

I have searched for the security problems of Docker (e.g. sysdig, Black Hat conference talks, CVEs, etc.) and I could only find privilege escalation by running privileged containers (--privileged), files that are written using root permissions, using the communication socket, using block devices, poisoned images, etc. But all of these problems are related to letting the users start their own containers.

So I think that Docker can be used by sysadmins to provide a different or contained environment to the users, e.g. having a CentOS 7 front-end but letting some users run an Ubuntu 16.04 environment. This is why this time I learned…

How to (securely) contain users using Docker containers

TL;DR

You can find the results of this tests in this repo: https://github.com/grycap/dosh

The repository contains DoSH (which stands for Docker SHell), which is a development to use Docker containers to run the shell of the users in your Linux system. It is an in-progress project that aims at providing a configurable and secure mechanism so that, when a user logs in to a Linux system, a customized (or standard) container is created for him. This makes it possible to limit the resources that the user is able to use, the applications, etc., but also to provide a custom Linux flavour for each user or group of users (i.e. users that have CentOS 7 will coexist with users that have Ubuntu 16.04 on the same server).

The Docker SHell

In a multi-user system it would be nice to offer a feature like providing different flavours of Linux, depending on the user. Or even including a “jailed” system for some specific users.

This could be achieved in a very easy way. You just need to create a script like the next one

root@onefront00:~# cat > /bin/dosh <<\EOF
docker run --rm -it alpine ash
EOF
root@onefront00:~# chmod +x /bin/dosh 
root@onefront00:~# echo "/bin/dosh" >> /etc/shells

And now you can change the shell of one user in /etc/passwd

myuser:x:9870:9870::/home/myuser:/bin/dosh

And you simply have to allow myuser to run Docker containers (e.g. in Ubuntu, by adding the user to the “docker” group).
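
For the record, in Ubuntu that is typically something like:

root@onefront00:~# usermod -aG docker myuser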

Now we have that when “myuser” logs in the system, he will be inside a container with the Alpine flavour:

alpine-dockershell-1

This is a simple solution that enables the user to have a specific linux distribution… but also your specific linux environment with special applications, libraries, etc.

But the user does not have access to his home folder nor to other files that would be interesting to give him the appearance of being in the real system. So we could just map his home folder (and other folders that we wanted to have inside the container; e.g. /tmp). A modified version of /bin/dosh would be the next one:

#!/bin/bash
username="$(whoami)"
docker run --rm -v /home/$username:/home/$username -v /tmp:/tmp -it alpine ash

But if we log in as myuser, the result is that the user that logs in is… root. And the things that he does are done as root.

alpine-dockershell-2

We need to run the container as the user and not as root. An updated version of the script is the next:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

If myuser now logs in, the container has the permissions of this user

alpine-dockershell-3

We can double-check it by checking the running processes of the container

The problem now is that the name of the user (and the groups) are not properly resolved inside the container.

alpine-dockershell-6

This is because the /etc/passwd and the /etc/group files are included in the container, and they do not know about the users or groups in the system. As we want to resemble the system in the container, we can share a readonly copy of /etc/passwd and /etc/group by modifying the /bin/dosh script:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

And now the container has the permissions of the user and the username is resolved. So the user can access the resources in the filesystem in the same conditions that if he was accessing the hosting system.

alpine-dockershell-7

Now we should add the mappings for the folders to which the user has to have permissions to access (e.g. scratch, /opt, etc.).

Using this script as-is, the user will have a different environment for each of the sessions that he starts. That means that the processes will not be shared between different sessions.

But we can create a more elaborate script to start containers using different Docker images depending on the user or on the group to which the user belongs. Or even to create pseudo-persistent containers that start when the user logs in and stop when the user leaves (to allow multiple ttys for the same environment).

An example of this kind of script will be the next one:

#!/bin/bash

username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
 myuser)
 CONTAINERIMAGE="ubuntu:16.04"
 CMD="/bin/bash";;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
 docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
else
 if [ "$RUNNING" == "false" ]; then
 docker start "$CONTAINERNAME" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
 fi
fi
docker exec -it "$CONTAINERNAME" "$CMD"

Using this script we start the user containers on demand and their processes are kept between log-ins. Moreover, the log-in will fail in case that the container fails to start.

In the event that the system is powered off, the container will be powered off although its contents are kept for future log-ins (the container will be restarted from the stop state).

The development of Docker SHell continues in this repository: https://github.com/grycap/dosh

Security concerns

The main problem of Docker related to security is that the daemon is running as root. So if I am able to run containers, I am able to run something like this:

$ docker run --privileged alpine ash -c 'echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger'

And the host will be powered off by a regular user. Or simply…

$ docker run --rm -v /etc:/etc -it alpine ash
/ # adduser mynewroot -G root
...
/ # exit

And once you exit the container, you will have a new root user in the physical host.

This happens because the user inside the container is “root” that has UID=0, and it is root because the Docker daemon is root with UID=0.

We could change this behaviour by shifting the user namespace with the flag --userns-remap and the subuids, so that the root user inside the containers does not map to UID=0 on the host, but this will also limit the features of Docker for the sysadmin. The first consequence is that the sysadmin will not be able to run Docker containers as root (nor privileged containers). If this is acceptable for your system, this will probably be the best solution for you, as it limits the possible security threats.
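
For reference, and as a rough sketch (check the Docker documentation for the details about /etc/subuid and /etc/subgid), enabling the remapping is a matter of telling the daemon about it:

# cat > /etc/docker/daemon.json <<\EOF
{
  "userns-remap": "default"
}
EOF
# systemctl restart docker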

If you are not experienced with the configuration of Docker or you simply do not want (or do not know how) to use --userns-remap, you can still use DoSH.

On linux capabilities

If you add the flag --cap-drop=all (or a selective --cap-drop) to the command line that runs the Docker container, you can get an even more secure container that will never get some Linux capabilities (e.g. to mount a device). You can learn more on capabilities in the Linux manpage, but we can easily verify the capabilities…

We will run a process inside the container, using the flag --cap-drop=all:

$ docker run --rm --cap-drop=all -u 1001:1001 -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/myuser:/home/myuser -v /tmp:/tmp -w /home/myuser -it alpine sleep 10000

Now we can check the capabilities of such process

capabilities-zero.png

We should check the effective capabilities (CapEff) but also the bounding set of the capabilities (CapBnd), which determines which capabilities the process could acquire (e.g. using sudo or executing a suid application). We can see that both capability sets are zero, and that means that the process cannot get any capability.
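
If you want to reproduce the check by yourself, the capability sets of a process can be read from /proc and decoded with capsh (shipped in the libcap2-bin package in Ubuntu); something like the following (the way of locating the PID is just an example):

$ grep Cap /proc/$(pgrep -f "sleep 10000" | head -n 1)/status
$ capsh --decode=0000003fffffffff    # decode any non-zero value that you find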

Take into account that using --cap-drop=all will make commands such as ping stop working, because ping is an application that needs specific capabilities (in the case of ping, it needs cap_net_raw, and this is why it has suid permissions).

capabilities-capdrop

Dropping capabilities when spawning the container makes the sleep command inside a container even more secure than the regular one. You can check it by simply repeating the same procedure but without using containers. In that case, if you inspect the capabilities, you will find the next thing:

capabilities-all.png

The effective capabilities are zero, but the CapBnd field shows that the user could escalate up to any of the capabilities through a buggy application.

Executing the docker commands by non-root users

The actual problem is that the user needs to be allowed to use Docker to spawn the DoSH container, and you do not want to allow the user to run arbitrary docker commands.

We can consider that the usage of Docker is secure if the containers are run under the credentials of regular users, and the devices and other critical resources that are attached to the container are used under these credentials. So users can be allowed to run Docker containers if they are forced to include the flag -u <uid>:<gid> and the rest of the command line is controlled.

The solution is as easy as installing sudo (which is shipped in the default distribution of Ubuntu and is also a standard package in almost any distribution) and allowing users to run as sudo only a specific command that executes the docker commands, without allowing these users to modify that command.

Once installed sudo, we can create the file /etc/sudoers.d/dosh

root@onefront00:~# cat > /etc/sudoers.d/dosh <<\EOF
> ALL ALL=NOPASSWD: /bin/shell2docker
> EOF
root@onefront00:~# chmod 440 /etc/sudoers.d/dosh

Now we must move the previous /bin/dosh script to /bin/shell2docker and then we can create the script /bin/dosh with the following content:

root@onefront00:~# mv /bin/dosh /bin/shell2docker
root@onefront00:~# cat > /bin/dosh <<\EOF
#!/bin/bash
sudo /bin/shell2docker
EOF
root@onefront00:~# chmod +x /bin/dosh

And finally, we will remove the ability to run docker containers to the user (e.g. in Ubuntu, by removing him from the “docker” group).

If you try to log in as the user, you will notice that now we have the problem that the user that runs the script is “root”, and then the container will be run as “root”. But we can modify the script to detect whether the script has been run with sudo or as a regular user, and then pick the appropriate username. The updated script will be the next:

#!/bin/bash
if [ $SUDO_USER ]; then username=$SUDO_USER; else username="$(whoami)"; fi
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
 myuser)
 CONTAINERIMAGE="ubuntu:16.04"
 CMD="/bin/bash";;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
 docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
else
 if [ "$RUNNING" == "false" ]; then
 docker start "$CONTAINERNAME" > /dev/null
 if [ $? -ne 0 ]; then
 exit 1
 fi
 fi
fi
docker exec -it "$CONTAINERNAME" "$CMD"

Now any user can execute the command that creates the Docker container as root (using sudo), but the user cannot run arbitrary Docker commands. So all the security is now again on the side of the sysadmin, who must create “secure” containers.

This is an in-progress work that will continue in this repository: https://github.com/grycap/dosh


How to install a cluster with NIS and NFS in Ubuntu 16.04

I am used to creating computing clusters. A cluster consists of a set of computers that work together to solve one task. In a cluster you usually have an interface to access the cluster, a network that interconnects the nodes, and a set of tools to manage the cluster. The interface to access the cluster usually is a node named the front-end, to which the users can SSH. The other nodes are usually named the working nodes (WN). Another common component is a shared filesystem to ease simple communication between the WN.

A very common set-up is to install a NIS server in the front-end so that the users can access the WN (i.e. using SSH) with the same credentials as in the front-end. NIS is still useful because it is very simple and it integrates very well with NFS, which is commonly used to share a filesystem.

It was easy to install all of this, but it is also a bit tricky (especially NIS), and so this time I had to re-learn…

How to install a cluster with NIS and NFS in Ubuntu 16.04

We start from 3 nodes that have a fresh installation of Ubuntu 16.04. These nodes are in the network 10.0.0.0/24. Their names are hpcmd00 (10.0.0.35), hpcmd01 (10.0.0.36) and hpcmd02 (10.0.0.37). In this example, hpcmd00 will be the front-end node and the others will act as the working nodes.

Preparation

First of all we are updating ubuntu in all the nodes:

root@hpcmd00:~# apt-get update && apt-get -y dist-upgrade

Installing and configuring NIS

Install NIS in the Server

Now that the system is up to date, we are installing the NIS server in hpcmd00. It is very simple:

root@hpcmd00:~# apt-get install -y rpcbind nis

During the installation, we will be asked for the name of the domain (as in the next picture):

nis

We have selected the name hpcmd.nis for our domain. It will be kept in the file /etc/defaultdomain. Anyway, we can change the name of the domain at any time by executing the next command:

root@hpcmd00:~# dpkg-reconfigure nis

And we will be prompted again for the name of the domain.

Now we need to adjust some parameters of the NIS server, which consists of editing the files /etc/default/nis and /etc/ypserv.securenets. In the first case we have to set the variable NISSERVER to the value “master”. In the second file (ypserv.securenets) we set which IP addresses are allowed to access the NIS service. In our case, we are allowing all the nodes in the subnet 10.0.0.0/24.

root@hpcmd00:~# sed -i 's/NISSERVER=.*$/NISSERVER=master/' /etc/default/nis
root@hpcmd00:~# sed 's/^\(0.0.0.0[\t ].*\)$/#\1/' -i /etc/ypserv.securenets
root@hpcmd00:~# echo "255.255.255.0 10.0.0.0" >> /etc/ypserv.securenets

Now we are including the name of the server in the /etc/hosts file, so that the server is able to resolve its own IP address, and then we are initializing the NIS service. As we have only one master server, we include its name and let the initialization proceed.

root@hpcmd00:~# echo "10.0.0.35 hpcmd00" >> /etc/hosts
root@hpcmd00:~# /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. hpcmd00 is in the list of NIS server hosts. Please continue to add
the names for the other hosts, one per line. When you are done with the
list, type a <control D>.
 next host to add: hpcmd00
 next host to add: 
The current list of NIS servers looks like this:
hpcmd00
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/hpcmd.nis/ypservers...
Running /var/yp/Makefile...
make[1]: Entering directory '/var/yp/hpcmd.nis'
Updating passwd.byname...
...
Updating shadow.byname...
make[1]: Leaving directory '/var/yp/hpcmd.nis'

hpcmd00 has been set up as a NIS master server.

Now you can run ypinit -s hpcmd00 on all slave server.

Finally, we export the users of our system by issuing the next command:

root@hpcmd00:~# make -C /var/yp/

Take into account that every time you create a new user in the front-end, you need to export the users again by issuing the make -C /var/yp command. So it is advisable to create a cron task that runs that command, to make sure that the users are exported.

root@hpcmd00:~# cat > /etc/cron.hourly/ypexport <<\EOT
#!/bin/sh
make -C /var/yp
EOT
root@hpcmd00:~# chmod +x /etc/cron.hourly/ypexport

The users in NIS

When issuing the make… command, you are exporting the users that have an identifier of 1000 and above. If you want to change this, you can adjust the parameters in the file /var/yp/Makefile.

In particular, you can change the variables MINUID and MINGID to match your needs.

In the default configuration, the users with id 1000 and above are exported because 1000 is the id of the first regular user created in the system.
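For instance, if your regular users started at UID/GID 2000, a minimal sketch to adjust it (assuming the stock variable names in /var/yp/Makefile) would be:

root@hpcmd00:~# sed -i 's/^MINUID=.*/MINUID=2000/;s/^MINGID=.*/MINGID=2000/' /var/yp/Makefile
root@hpcmd00:~# make -C /var/yp/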

Install the NIS clients

Now that we have installed the NIS server, we can proceed to install the NIS clients. In this example we are installing hpcmd01, but the procedure is the same for all the working nodes.

First install NIS using the next command:

root@hpcmd01:~# apt-get install -y rpcbind nis

As happened with the server, you will be prompted for the name of the domain. In our case it is hpcmd.nis, because that is the name we set in the server. Then we point the client to the server, add NIS to the name service switch, and restart the service:

root@hpcmd01:~# echo "domain hpcmd.nis server hpcmd00" >> /etc/yp.conf 
root@hpcmd01:~# sed -i 's/compat$/compat nis/g;s/dns$/dns nis/g' /etc/nsswitch.conf 
root@hpcmd01:~# systemctl restart nis
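To check that the client is actually bound to the NIS server, you can query it with the standard NIS tools (just a quick sanity check):

root@hpcmd01:~# ypwhich        # should print the name of the NIS server (hpcmd00)
root@hpcmd01:~# ypcat passwd   # should list the users exported from the server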

Fix the rpcbind bug in Ubuntu 16.04

At this point the NIS services (both in the server and the clients) are ready to be used, but… WARNING: the rpcbind package needed by NIS has a bug in Ubuntu, and when you reboot any of your systems, rpc will be dead and so the NIS server will not work. You can check it by issuing the next command:

root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: inactive (dead)

Here you can see that it is inactive. However, if you start it by hand, it will run properly:

root@hpcmd00:~# systemctl start rpcbind
root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: active (running) since vie 2017-05-12 12:57:00 CEST; 1s ago
 Main PID: 1212 (rpcbind)
 Tasks: 1
 Memory: 684.0K
 CPU: 8ms
 CGroup: /system.slice/rpcbind.service
 └─1212 /sbin/rpcbind -f -w
may 12 12:57:00 hpcmd00 systemd[1]: Starting RPC bind portmap service...
may 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/rpcbind.xdr: failed
may 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/portmap.xdr: failed
may 12 12:57:00 hpcmd00 systemd[1]: Started RPC bind portmap service.

There are some patches, and it seems that it will be solved in newer versions. But for now, we include a very simple workaround that consists of adding the next lines to the file /etc/rc.local, just before the “exit 0” line:

systemctl restart rpcbind
systemctl restart nis

Now, if you reboot your system, the rpcbind service will be running properly.

WARNING: this needs to be done in all the nodes.
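If you have root SSH access from the front-end to the working nodes, a quick sketch to apply the same change everywhere (assuming GNU sed and that each /etc/rc.local still ends with the default “exit 0” line) is:

root@hpcmd00:~# for n in hpcmd01 hpcmd02; do ssh $n 'sed -i "/^exit 0/i systemctl restart rpcbind" /etc/rc.local; sed -i "/^exit 0/i systemctl restart nis" /etc/rc.local'; done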

Installing and configuring NFS

We are configuring NFS in a very straightforward way. If you need more security or other features, you should dig into the NFS configuration options to adapt it to your deployment.

In particular, we are sharing the /home folder of hpcmd00 so that it is available to the WN. That way, the users will have their files available at each node. I followed the instructions at this blog post.

Sharing /home at front-end

In order to install NFS in the server, you just need to issue the next command

root@hpcmd00:~# apt-get install -y nfs-kernel-server

And to share the /home folder, you just need to add a line to the /etc/exports file

root@hpcmd00:~# cat >> /etc/exports << \EOF
/home hpcmd*(rw,sync,no_root_squash,no_subtree_check)
EOF

There are a lot of options to share a folder using NFS, but we are just using some of them that are common for a /home folder. Take into account that you can restrict the hosts to which the folder is shared, either using their names (that is our case: hpcmd*) or using IP addresses. Note that you can use wildcards such as “*”.
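For example, if you preferred to restrict the export to the cluster subnet instead of using hostnames, a line like the following (an alternative sketch, not the one used in this setup) would do:

/home 10.0.0.0/24(rw,sync,no_root_squash,no_subtree_check)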

Finally you need to restart the NFS daemon, and you will be able to verify that the exports are ready.

root@hpcmd00:~# service nfs-kernel-server restart
root@hpcmd00:~# showmount -e localhost
Export list for localhost:
/home hpcmd*

Mount the /home folder in the WN

In order to be able to mount the NFS exports, you just need to run the next command on each node:

root@hpcmd01:~# apt-get install -y nfs-common

Now you will be able to list the folders shared at the server

root@hpcmd01:~# showmount -e hpcmd00
Export list for hpcmd00:
/home hpcmd*

At this moment it is possible to mount the /home folder just by issuing a command like:

root@hpcmd01:~# mount -t nfs hpcmd00:/home /home

But we’d prefer to add a line to the /etc/fstab file. Using this approach, the mount will be available at boot time. In order to do so, we’ll add the proper line:

root@hpcmd01:~# cat >> /etc/fstab << \EOT
hpcmd00:/home /home nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
EOT

Now you can also issue the following command to start using your share without the need to reboot:

root@hpcmd01:~# mount /home/

Verification

At the hpcmd00 node you can create a user, and verify that the home folder has been created:

root@hpcmd00:~# adduser testuser
Adding user `testuser' ...
Adding new group `testuser' (1002) ...
Adding new user `testuser' (1002) with group `testuser' ...
...
Is the information correct? [Y/n] Y
root@hpcmd00:~# ls -l /home/
total 4
drwxr-xr-x 2 testuser testuser 4096 may 15 10:06 testuser

If you try to SSH to the working nodes as that user, it will fail (the user will not be available), because the user has not been exported yet:

root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Permission denied, please try again.

But the home folder for that user is already available in these nodes (because the folder is shared using NFS).

Once we export the users at hpcmd00, the user will be available in the domain and we will be able to SSH to the WN using that user:

root@hpcmd00:~# make -C /var/yp/
make: Entering directory '/var/yp'
make[1]: Entering directory '/var/yp/hpcmd.nis'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating netid.byname...
make[1]: Leaving directory '/var/yp/hpcmd.nis'
make: Leaving directory '/var/yp'
root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-77-generic x86_64)

testuser@hpcmd01:~$ pwd
/home/testuser

 

How to create overlay networks using Linux Bridges and VXLANs

Some time ago, I learned How to create an overlay network using Open vSwitch in order to connect LXC containers. Digging into the topic of overlay networks, I saw that Linux bridges had included VXLAN capabilities, and I also saw how some people were using them to create overlay networks in a LAN. As an example, this is how it is done in the OpenStack Linux bridges plugin. So I decided to work on this topic by hand (in order to better understand how OpenStack works) and learned…

How to create overlay networks using Linux Bridges and VXLANs

Well, I do not know when OpenStack started to build overlay networks using Linux bridges, but as I started to search for information on how to do it by hand, I realized that it is a widespread topic. As an example, I found this post from VMware that is a must-read if we want to better understand what is going on here.

Scenario

I have a single LAN, and I want to have several overlay networks with multiple VMs in each of them. I want one set of VMs to be able to communicate between them, but I don’t want the other set of VMs to even know about the first set: I want to isolate the networks of multiple tenants.

The next figure shows what I want to do:

overlay-1

The left-hand part of the image shows what will actually happen, and the right-hand side shows the vision that the users in the hosts will have. The hosts whose names end in “1” will see that they are alone in their LAN, and the hosts whose names end in “2” will see that they are alone in their LAN.

Set up

We will create a “poor man's setup” in which we will have two VMs that simulate the hosts, and we will use LXC containers that will act as “guests”.

The next figure shows what we are creating

overlay-2

node01 and node02 are the hosts that will host the containers. Each of them has a physical interface named ens3, with IPs 10.0.0.28 and 10.0.0.31. On each of them we will create a bridge named br-vxlan-<ID> to which we will be able to bridge our containers. And these containers will have an interface (eth0) with an IP address in the range 192.168.1.0/24.

To isolate the networks, we are using VXLANs with different VXLAN Network Identifiers (VNI). In our case, 10 and 20.

Starting point

We have 2 hosts that can ping one to each other (node01 and node02).

root@node01:~# ping -c 2 node02
PING node02 (10.0.0.31) 56(84) bytes of data.
64 bytes from node02 (10.0.0.31): icmp_seq=1 ttl=64 time=1.17 ms
64 bytes from node02 (10.0.0.31): icmp_seq=2 ttl=64 time=0.806 ms

and

root@node02:~# ping -c 2 node01
PING node01 (10.0.0.28) 56(84) bytes of data.
64 bytes from node01 (10.0.0.28): icmp_seq=1 ttl=64 time=0.740 ms
64 bytes from node01 (10.0.0.28): icmp_seq=2 ttl=64 time=0.774 ms

In each of them I will make sure that I have installed the package iproute2 (i.e. the command “ip”).

In order to verify that everything works properly, on each node we will install the latest version of LXD according to this (in my case, I have LXD version 2.8). The one shipped with Ubuntu 16.04.1 is 2.0 and will not be useful for us, because we need it to be able to manage networks.

Anyway, I will offer an alternative for non-Ubuntu users that consists of creating an extra interface bridged to the br-vxlan bridge.

Let’s begin

The VXLAN implementation of the Linux bridges works by encapsulating traffic in multicast UDP messages that are distributed using IGMP.

In order for the TCP/IP traffic to be encapsulated through these interfaces, we will create a bridge and attach the vxlan interface to that bridge. In the end, a bridge works like a network hub and forwards the traffic to the ports that are connected to it. So the traffic that appears in the bridge will be encapsulated into the UDP multicast messages.

For the creation of the first VXLAN (with VNI 10) we will need to issue the next commands (in each of the nodes)

ip link add vxlan10 type vxlan id 10 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan10 type bridge
ip link set vxlan10 master br-vxlan10
ip link set vxlan10 up
ip link set br-vxlan10 up

In these lines…

  1. First we create a vxlan port with VNI 10 that will use the device ens3 to multicast the UDP traffic using group 239.1.1.1 (using dstport 0 makes use of the default port).
  2. Then we will create a bridge named br-vxlan10 to which we will bridge the previously created vxlan port.
  3. Finally we will set both ports up.

Now that we have the first VXLAN, we will proceed with the second:

ip link add vxlan20 type vxlan id 20 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan20 type bridge
ip link set vxlan20 master br-vxlan20
ip link set vxlan20 up
ip link set br-vxlan20 up

Both VXLANs will be created in both nodes node01 and node02.

Tests and verification

At this point, we have the VXLANs ready to be used, and the traffic of each port that we bridge to the br-vxlan10 or br-vxlan20 will be multicasted using UDP to the network. As we have several nodes in the LAN, we will have VXLANs that span across multiple nodes.

In practice, the br-vxlan10 bridges of all the nodes will be in the same LAN (every port included in that bridge on any of the nodes will be in the same LAN). The same occurs for br-vxlan20.

And the traffic of br-vxlan10 will not be visible in br-vxlan20 and vice-versa.
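Before bridging anything to them, you can already inspect the VXLAN ports to verify their parameters (VNI, multicast group and underlying device). A couple of read-only commands (just a sketch for inspection) are:

root@node01:~# ip -d link show vxlan10   # detailed info: VNI 10, group 239.1.1.1, device ens3
root@node01:~# bridge link show          # lists the ports attached to each bridge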

Verify using LXD containers

This is the test that will be more simple to understand as it is conceptually what we want. The only difference is that we will create containers instead of VMs.

In order to verify that it works, we will create the containers lhs1 (in node01) and rhs1 (in node02), which will be attached to br-vxlan10. In node01 we will execute the following commands:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 lhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec lhs1 -- ip addr add 192.168.1.1/24 dev eth0

What we are doing is the next:

  1. Creating a LXC profile, to ensure that it does not have any network interface.
  2. Making the profile use the bridge that we created for the VXLAN.
  3. Creating a container that uses the profile (and so, it will be attached to the VXLAN).
  4. Assigning the IP address 192.168.1.1 to the container.

In node02, we will create other container (rhs1) with IP 192.168.1.2:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 rhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec rhs1 -- ip addr add 192.168.1.2/24 dev eth0

And now we have one container in each node, and each feels as if it were in a LAN with the other container.

In order to verify it, we will use a simple server that echoes the information sent. So in node01, inside lhs1, we will start netcat listening on port 9999:

root@node01:~# lxc exec lhs1 -- nc -l -p 9999

And in node02, in rhs1 we will start netcat connected to the lhs1 IP and port (192.168.1.1:9999):

root@node02:~# lxc exec rhs1 -- nc 192.168.1.1 9999

Anything that we write in one node will show up in the other one, as shown in the image:

lxc-over

Now we can create the other containers and see what happens.

In node01 we will create the container lhs2, connected to vxlan20 and with the same IP address as lhs1 (i.e. 192.168.1.1):

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 lhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec lhs2 -- ip addr add 192.168.1.1/24 dev eth0

At this point, if we try to ping the IP address 192.168.1.2 (which is assigned to rhs1), it should not work, as it is in the other VXLAN:

root@node01:~# lxc exec lhs2 -- ping -c 2 192.168.1.2
PING 192.168.1.2 (192.168.1.2): 56 data bytes

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

Finally, in node02, we will create the container rhs2, attached to vxlan20 and with the IP address 192.168.1.2:

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 rhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec rhs2 -- ip addr add 192.168.1.2/24 dev eth0

And now we can verify that each pair of containers can communicate between them, while the traffic of the other pair does not arrive. The next figure shows that it works!

test02

Now you could have fun capturing the traffic in the hosts and get things like this:

traffic.png

If you ping a host in vxlan20 and dump the traffic on ens3, you will get the top-left traffic (the traffic of “instance 20”, i.e. VNI 20), but there is no traffic in br-vxlan10.

I suggest having fun with Wireshark to look in depth at what is happening (watch the UDP traffic, how it is encapsulated using the VXLAN protocol, etc.).
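For example, to watch the encapsulated traffic on the physical interface you can capture the UDP packets sent to the multicast group we configured (a minimal sketch; adjust the interface name if yours is not ens3):

root@node01:~# tcpdump -i ens3 -nn udp and host 239.1.1.1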

Verify using other devices

If you cannot manage to use VMs or LXD containers, you can create a veth device and assign the IP addresses to it. Then ping through that interface to generate traffic.

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up

And we will create the other interface too:

ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up

Now we will assign the IP 192.168.1.1 to eth10 and 192.168.2.1 to eth20 (note that they are in different ranges; the reason is explained below), and will try to ping from one to the other:

# ip addr add 192.168.1.1/24 dev eth10
# ip addr add 192.168.2.1/24 dev eth20
# ping 192.168.2.1 -c 2 -I eth10
PING 192.168.2.1 (192.168.2.1) from 192.168.1.1 eth10: 56(84) bytes of data.
From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
From 192.168.1.1 icmp_seq=2 Destination Host Unreachable

--- 192.168.2.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
pipe 2

Here we see that it does not work.

I had to set IP addresses in different ranges. Otherwise the interfaces do not work properly.

Now, in node02, we will create the interfaces and set IP addresses to them (192.168.1.2 to eth10 and 192.168.2.2 to eth20).

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up
ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up
ip addr add 192.168.1.2/24 dev eth10
ip addr add 192.168.2.2/24 dev eth20

And now we can try to ping to the interfaces in the corresponding VXLAN.

root@node01:~# ping 192.168.1.2 -c 2 -I eth10
PING 192.168.1.2 (192.168.1.2) from 192.168.1.1 eth10: 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=10.1 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=4.53 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 4.539/7.364/10.189/2.825 ms
root@node01:~# ping 192.168.2.2 -c 2 -I eth10
PING 192.168.2.2 (192.168.2.2) from 192.168.1.1 eth10: 56(84) bytes of data.

--- 192.168.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

If we inspect what is happening using tcpdump, we’ll see that the traffic arrives at one interface and not at the other, as shown in the next figure:

dump.png

What we got here…

In the end, we have achieved a situation in which we have multiple isolated LANs over a single LAN. The traffic in one LAN is not seen in the other LANs.

This makes it possible to create multi-tenant networks for Cloud datacenters.

Troubleshooting

During the tests I created a bridge in which the traffic was not forwarded from one port to the others. I tried to debug what was happening, whether it was affected by ebtables, iptables, etc. and at first I found no reason.

I was able to solve it by following the advice in this post. In fact, I did not trust it and rebooted, and while some of the settings were still set to 1, it worked from then on.

$ cd /proc/sys/net/bridge
$ ls
 bridge-nf-call-arptables bridge-nf-call-iptables
 bridge-nf-call-ip6tables bridge-nf-filter-vlan-tagged
$ for f in bridge-nf-*; do echo 0 > $f; done

The machine in which I was doing the tests is not usually powered off, so maybe it had been on for at least 2 months. Maybe some previous tests led to that problem.

I have faced this problem again, and I was not comfortable with a solution “based on faith”. So I searched a bit more and I found this post. Now I know what these files in /proc/sys/net/bridge mean, and now I know that the problem was about iptables. These files control whether the bridged traffic should go through iptables/arptables… before being forwarded to the ports of the bridge. So if you write a zero into these files, you will not have any iptables-related problem.
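If you prefer to keep those checks disabled across reboots, a minimal sketch (assuming the module that provides the net.bridge.* keys is loaded at boot) is to persist the values with sysctl:

# cat >> /etc/sysctl.conf <<\EOT
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-arptables = 0
EOT
# sysctl -p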

If you find that the traffic is not forwarded to the ports, you should double-check iptables and so on. In my case the “problem” was that forwarding was prevented by default. An easy check is to inspect the filter table of iptables:

# iptables -t filter -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
...

In my case, the filter dropped the traffic. If I want the traffic to be forwarded, I must explicitly accept it by adding a rule such as

# iptables -t filter -A FORWARD -i br-vxlan20 -j ACCEPT

 

How to avoid the automatic installation of the recommended packages in Ubuntu

I am used to installing Ubuntu servers, and I want them to use as little disk as possible. So I usually install packages adding the flag --no-install-recommends. But I needed to include the flag each time. So this time I learned…

How to avoid the automatic installation of the recommended packages in Ubuntu

This is a very simple trick that I found in this post, but I want to keep it simple for me to find it.

You just need to include some settings in a file in /etc/apt/apt.conf.d/. In order to isolate these settings, I will create a new file:

$ cat > /etc/apt/apt.conf.d/99_disablerecommends <<\EOF
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
EOF

And from now on, when you issue apt-get install commands, the recommended packages will not be installed.
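You can verify that the settings are being picked up by inspecting apt's effective configuration; for example:

$ apt-config dump | grep Recommends   # APT::Install-Recommends should now be "false"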

IMPORTANT: if you have already installed all those recommended packages, you can get rid of them just by issuing a command like the next one:

$ apt-get autoremove --purge