How to install a cluster with NIS and NFS in Ubuntu 16.04

I am used to create computing clusters. A cluster consists of a set of computers that work together to solve one task. In a cluster you usually have an interface to access to the cluster, a network that interconnect the nodes and a set of tools to manage the cluster. The interface to access to the cluster usually is a node named front-end to which the users can SSH. The other nodes are usually named the working-nodes. Another common component is a shared filesystem to ease simple communication between the WN.

A very common set-up is to install a NIS server in the front-end so that the users can access to the WN (i.e. using SSH), getting the same credentials than in the front-end. NIS is still useful because is very simple and it integrates very well with NFS, that is commonly used to share a file system.

It was easy to install all of this, but it is also a bit tricky (in special, NIS), and so this time I had to re-learn…

How to install a cluster with NIS and NFS in Ubuntu 16.04

We start from 3 nodes that have a fresh installation of Ubuntu 16.04. These nodes are in the network 10.0.0.1/24. Their names are hpcmd00 (10.0.0.35), hpcmd01 (10.0.0.36) and hpcmd02 (10.0.0.37). In this example, hpcmd00 will be the front-end node and the others will act as the working nodes.

Preparation

First of all we are updating ubuntu in all the nodes:

root@hpcmd00:~# apt-get update && apt-get -y dist-upgrade

Installing and configuring NIS

Install NIS in the Server

Now that the system is up to date, we are installing the NIS server in hpcmd00. It is very simple:

root@hpcmd00:~# apt-get install -y rpcbind nis

During the installation, we will be asked for the name of the domain (as in the next picture):

nisWe have selected the name hpcmd.nis for our domain. It will be kept in the file /etc/defaultdomain. Anyway we can change the name of the domain at any time by executing the next command:

root@hpcmd00:~# dpkg-reconfigure nis

And we will be prompted again for the name of the domain.

Now we need to adjust some parameters of the NIS server, that consist in editing the files /etc/default/nis and /etc/ypserv.securenets. In the first case we have to set the variable NISSERVER to the value “master”. In the second file (ypserv.securents) we are setting which IP addresses are allowed to access to the NIS service. In our case, we are allowing all the nodes in the subnet 10.0.0.0/24.

root@hpcmd00:~# sed -i 's/NISSERVER=.*$/NISSERVER=master/' /etc/default/nis
root@hpcmd00:~# sed 's/^\(0.0.0.0[\t ].*\)$/#\1/' -i /etc/ypserv.securenets
root@hpcmd00:~# echo "255.255.255.0 10.0.0.0" >> /etc/ypserv.securenets

Now we are including the name of the server in the /etc/hosts file, so that the server is able to solve its IP address, and then we are initializing the NIS service. As we have only one master server, we are including its name and let the initialization to proceed.

root@hpcmd00:~# echo "10.0.0.35 hpcmd00" >> /etc/hosts
root@hpcmd00:~# /usr/lib/yp/ypinit -m
At this point, we have to construct a list of the hosts which will run NIS
servers. hpcmd00 is in the list of NIS server hosts. Please continue to add
the names for the other hosts, one per line. When you are done with the
list, type a <control D>.
 next host to add: hpcmd00
 next host to add: 
The current list of NIS servers looks like this:
hpcmd00
Is this correct? [y/n: y] y
We need a few minutes to build the databases...
Building /var/yp/hpcmd.nis/ypservers...
Running /var/yp/Makefile...
make[1]: se entra en el directorio '/var/yp/hpcmd.nis'
Updating passwd.byname...
...
Updating shadow.byname...
make[1]: se sale del directorio '/var/yp/hpcmd.nis'

hpcmd00 has been set up as a NIS master server.

Now you can run ypinit -s hpcmd00 on all slave server.

Finally we are exporting the users of our system by issuing the next command:

root@hpcmd00:~# make -C /var/yp/

Take into account that everytime that you create a new user in the front-end, you need to export the users by issuing the make -C /var/yp command. So it is advisable to create a cron task that runs that command, to make it sure that the users are exported.

root@hpcmd00:~# cat > /etc/cron.hourly/ypexport <<\EOT
#!/bin/sh
make -C /var/yp
EOT
root@hpcmd00:~# chmod +x /etc/cron.hourly/ypexport

The users in NIS

When issuing the command make…, you are exporting the users that have an identifier of 1000 and above. If you want to change it, you can adjust the parameters in the file /var/yp/Makefile.

In particular, you can change the variables MINUID and MINGID to match your needs.

In the default configuration, the users with id 1000 and above are exported because the user 1000 is the first user that is created in the system.

Install the NIS clients

Now that we have installed the NIS server, we can proceed to install the NIS clients. In this example we are installing hpcmd01, but it will be the same procedure for all the nodes.

First install NIS using the next command:

root@hpcmd01:~# apt-get install -y rpcbind nis

As it occurred in the server, you will be prompted for the name of the domain. In our case, it is hpcmd.nis because we set that name in the server.

root@hpcmd01:~# echo "domain hpcmd.nis server hpcmd00" >> /etc/yp.conf 
root@hpcmd01:~# sed -i 's/compat$/compat nis/g;s/dns$/dns nis/g' /etc/nsswitch.conf 
root@hpcmd01:~# systemctl restart nis

Fix the rpcbind bug in Ubuntu 16.04

At this time the NIS services (both in server and clients) are ready to be used, but… WARNING because the rpcbind package needed by NIS has a bug in Ubuntu and as you reboot any of your system, rpc is dead and so the NIS server will not work. You can check it by issuing the next command:

root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: inactive (dead)

Here you can see that it is inactive. And if you start it by hand, it will be properly running:

root@hpcmd00:~# systemctl start rpcbind
root@hpcmd00:~# systemctl status rpcbind
● rpcbind.service - RPC bind portmap service
 Loaded: loaded (/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
 Drop-In: /run/systemd/generator/rpcbind.service.d
 └─50-rpcbind-$portmap.conf
 Active: active (running) since vie 2017-05-12 12:57:00 CEST; 1s ago
 Main PID: 1212 (rpcbind)
 Tasks: 1
 Memory: 684.0K
 CPU: 8ms
 CGroup: /system.slice/rpcbind.service
 └─1212 /sbin/rpcbind -f -w
may 12 12:57:00 hpcmd00 systemd[1]: Starting RPC bind portmap service...
may 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/rpcbind.xdr: failed
may 12 12:57:00 hpcmd00 rpcbind[1212]: rpcbind: xdr_/run/rpcbind/portmap.xdr: failed
may 12 12:57:00 hpcmd00 systemd[1]: Started RPC bind portmap service.

There are some patches, and it seems that it will be solved in the new versions. But for now, we are including a very simple workaround that consists in adding the next lines to the file /etc/rc.local, just before the “exit 0” line:

systemctl restart rpcbind
systemctl restart nis

Now if you reboot your system, it will be properly running the rpcbind service.

WARNING: this needs to be done in all the nodes.

Installing and configuring NFS

We are configuring NFS in a very straightforward way. If you need more security or other features, you should deep into NFS configuration options to adapt it to your deployment.

In particular, we are sharing the /home folder in hpcmd00 to be available for the WN. Then, the users will have their files available at each node. I followed the instructions at this blog post.

Sharing /home at front-end

In order to install NFS in the server, you just need to issue the next command

root@hpcmd00:~# apt-get install -y nfs-kernel-server

And to share the /home folder, you just need to add a line to the /etc/exports file

root@hpcmd00:~# cat >> /etc/exports << \EOF
/home hpcmd*(rw,sync,no_root_squash,no_subtree_check)
EOF

There are a lot of options to share a folder using NFS, but we are just using some of them that are common for a /home folder. Take into account that you can restrict the hosts to which you can share the folder using their names (that is our case: hpcmdXXXX) or using IP addresses. It is noticeable that you can use wildcards such as “*”.

Finally you need to restart the NFS daemon, and you will be able to verify that the exports are ready.

root@hpcmd00:~# service nfs-kernel-server restart
root@hpcmd00:~# showmount -e localhost
Export list for localhost:
/home hpcmd*

Mount the /home folder in the WN

In order to be able to use NFS endpoints, you just need to run the next command on each node:

root@hpcmd01:~# apt-get install -y nfs-common

Now you will be able to list the folders shared at the server

root@hpcmd01:~# showmount -e hpcmd00
Export list for hpcmd00:
/home hpcmd*

At this moment it is possible to mount the /home folder just issuing a command like

root@hpcmd01:~# mount -t nfs hpcmd00:/home /home

But we’d prefer to add a line to the /etc/fstab file. Using this approach, the mount will be available at boot time. In order to make it, we’ll add the proper line:

root@hpcmd01:~# cat >> /etc/fstab << \EOT
hpcmd00:/home /home nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
EOT

Now you can also issue the following command to start using your share without the need of rebooting:

root@hpcmd01:~# mount /home/

Verification

At the hpcmd00 node you can create a user, and verify that the home folder has been created:

root@hpcmd00:~# adduser testuser
Añadiendo el usuario `testuser' ...
Añadiendo el nuevo grupo `testuser' (1002) ...
Añadiendo el nuevo usuario `testuser' (1002) con grupo `testuser' ...
...
¿Es correcta la información? [S/n] S
root@hpcmd00:~# ls -l /home/
total 4
drwxr-xr-x 2 testuser testuser 4096 may 15 10:06 testuser

If you ssh to the internal nodes, it will fail (the user will not be available), because the user has not been exported:

root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Permission denied, please try again.

But the home folder for that user is already available in these nodes (because the folder is shared using NFS).

Once we export the users at hpcmd00 the user will be available in the domain and we will be able to ssh to the WN using that user:

root@hpcmd00:~# make -C /var/yp/
make: se entra en el directorio '/var/yp'
make[1]: se entra en el directorio '/var/yp/hpcmd.nis'
Updating passwd.byname...
Updating passwd.byuid...
Updating group.byname...
Updating group.bygid...
Updating netid.byname...
make[1]: se sale del directorio '/var/yp/hpcmd.nis'
make: se sale del directorio '/var/yp'
root@hpcmd00:~# ssh testuser@hpcmd01
testuser@hpcmd01's password: 
Welcome to Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-77-generic x86_64)

testuser@hpcmd01:~$ pwd
/home/testuser

 

How to create overlay networks using Linux Bridges and VXLANs

Some time ago, I learned How to create a overlay network using Open vSwitch in order to connect LXC containers. Digging in the topic of overlay networks, I saw that linux bridges had included VXLAN capabilities, and also saw how some people were using it to create overlay networks in a LAN. As an example, it is made in this way in the OpenStack linux bridges plugin. So I decided to work by hand in this topic (in order to better understand how OpenStack was working) and learned…

How to create overlay networks using Linux Bridges and VXLANs

Well, I do not know when OpenStack started to use overlay networks using Linux Bridges, but as I started to search for information on how to do it by hand, I realized that it is a widespread topic. As an example, I found this post from VMWare that is a must if we want to better understand what is going on here.

Scenario

I have a single LAN, and I want to have several overlay networks with multiple VMs in each of them. I want that one set of VMs can communicate between them, but I don’t want that other set of VMs even know about the first set: I want to isolate networks from multiple tenants.

The next figure shows what I wan to do:

overlay-1

The left hand part of the image shows that will happen, and the right hand side of the image shows what the users in the hosts will have the vision that happen. The hosts which end in “1” will see that they are they are alone in their LAN, and the hosts which end in “2” will see that they are alone in their LAN.

Set up

We will create a “poor man setup” in which we will have two VMs that are simulating the hosts, and we will use LXC containers that will act as “guests”.

The next figure shows what we are creating

overlay-2

node01 and node02 are the hosts that will host the containers. Each of them have a physical interface named ens3, with IPs 10.0.0.28 and 10.0.0.31. We will create on each of them a bridge named br-vxlan-<ID> to which we should be able to bridge our containers. And these containers will have a interface (eth0) with an IP addresses in the range of 192.168.1.1/24.

To isolate the networks, we are using VXLANS with different VXLAN Network Identifier (VNI). In our case, 10 and 20.

Starting point

We have 2 hosts that can ping one to each other (node01 and node02).

root@node01:~# ping -c 2 node02
PING node02 (10.0.0.31) 56(84) bytes of data.
64 bytes from node02 (10.0.0.31): icmp_seq=1 ttl=64 time=1.17 ms
64 bytes from node02 (10.0.0.31): icmp_seq=2 ttl=64 time=0.806 ms

and

root@node02:~# ping -c 2 node01
PING node01 (10.0.0.28) 56(84) bytes of data.
64 bytes from node01 (10.0.0.28): icmp_seq=1 ttl=64 time=0.740 ms
64 bytes from node01 (10.0.0.28): icmp_seq=2 ttl=64 time=0.774 ms

In each of them I will make sure that I have installed the package iproute2 (i.e. the command “ip”).

In order to verify that everything is properly working, in each node, we will install the latest version of lxd according to this (in my case, I have lxd version 2.8). The one shipped with ubuntu 16.04.1 is 2.0 and will not be useful for us because we want that it is able to manage networks.

Anyway, I will offer an alternative for non-ubuntu users that will consist in creating a extra interface that will be bridged to the br-vxlan interface.

Let’s begin

The implementation of vxlan for the linux bridges works encapsulating traffic in multicast UDP messages that are distributed using IGMP.

In order to enable that the TCP/IP traffic is encapsulated through these interfaces, we will create a bridge and will attach the vxlan interface to that bridge. At the end, a bridge works like a network hub and forwards the traffic to the ports that are connected to it. So the traffic that appears in the bridge will be encapsulated into the UDP multicast messages.

For the creation of the first VXLAN (with VNI 10) we will need to issue the next commands (in each of the nodes)

ip link add vxlan10 type vxlan id 10 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan10 type bridge
ip link set vxlan10 master br-vxlan10
ip link set vxlan10 up
ip link set br-vxlan10 up

In these lines…

  1. First we create a vxlan port with VNI 10 that will use the device ens3 to multicast the UDP traffic using group 239.1.1.1 (using dstport 0 makes use of the default port).
  2. Then we will create a bridge named br-vxlan10 to which we will bridge the previously created vxlan port.
  3. Finally we will set both ports up.

Now that we have the first VXLAN, we will proceed with the second:

ip link add vxlan20 type vxlan id 20 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan20 type bridge
ip link set vxlan20 master br-vxlan20
ip link set vxlan20 up
ip link set br-vxlan20 up

Both VXLANs will be created in both nodes node01 and node02.

Tests and verification

At this point, we have the VXLANs ready to be used, and the traffic of each port that we bridge to the br-vxlan10 or br-vxlan20 will be multicasted using UDP to the network. As we have several nodes in the LAN, we will have VXLANs that span across multiple nodes.

In practise, the bridges br-vxlan10 of each node will be in LAN (each port included in such bridge of each node will be in the same LAN). The same occurs for br-vxlan20.

And the traffic of br-vxlan10 will not be visible in br-vxlan20 and vice-versa.

Verify using LXD containers

This is the test that will be more simple to understand as it is conceptually what we want. The only difference is that we will create containers instead of VMs.

In order to verify that it works, we will create the containers lhs1 (in node01) and and rhs1 (in node02) that will be attached to the br-vxlan10. In node01 we will execute the following commands:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 lhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec lhs1 ip addr add 192.168.1.1/24 dev eth0

What we are doing is the next:

  1. Creating a LXC profile, to ensure that it has not any network interface.
  2. Making that the profile uses the bridge that we created for the VXLAN.
  3. Creating a container that uses the profile (and so, will be attached to the VXLAN).
  4. Assigning the IP address 192.168.1.1 to the container.

In node02, we will create other container (rhs1) with IP 192.168.1.2:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 rhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec rhs1 ip addr add 192.168.1.2/24 dev eth0

And now, we have one container in each node that feels like if it was in a LAN with the other container.

In order to verify it, we will use a simple server that echoes the information sent. So in node01, in lhs1 we will start netcat listening in port 9999:

root@node01:~# lxc exec lhs1 -- nc -l -p 9999

And in node02, in rhs1 we will start netcat connected to the lhs1 IP and port (192.168.1.1:9999):

root@node02:~# lxc exec rhs1 -- nc 192.168.1.1 9999

Anything that we write in this node will get output in the other one, as shown in the image:

lxc-over

Now we can create the other containers and see what happens.

In node01 we will create the container lhs2 connected to vxlan20 and the same IP address than lhs1 (i.e. 192.168.1.1):

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 lhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec lhs2 ip addr add 192.168.1.1/24 dev eth0

At this point, if we try to ping to IP address 192.168.1.2 (which is assigned to rhs1), it should not work, as it is in the other VXLAN:

root@node01:~# lxc exec lhs2 -- ping -c 2 192.168.1.2
PING 192.168.1.2 (192.168.1.2): 56 data bytes

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

Finally, in node02, we will create the container rhs2, attached to vxlan20, and the IP address 192.168.1.2:

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 rhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec rhs2 -- ip addr add 192.168.1.2/24 dev eth0

And now we can verify that each pair of nodes can communicate between them and the other traffic will not arrive. The next figure shows that it works!test02

Now you could have fun capturing the traffic in the hosts and get things like this:

traffic.png

You ping a host in vxlan20 and if you dump the traffic from ens3 you will get the top left traffic (the traffic in “instance 20”, i.e. VNI 20), but there is no traffic in br-vxlan10.

I suggest to have fun with wireshark to look in depth at what is happening (watch the UDP traffic, how it is translated using the VXLAN protocol, etc.).

Verify using other devices

If you cannot manage to use VMs or LXD containers, you can create a veth device and assign the IP addresses to it. Then ping through that interface to generate traffic.

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up

And we will create other interface too

ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up

Now we will set the IP 192.168.1.1 to eth10 and 192.168.1.2 to eth20, and will try to ping from one to the other:

# ip addr add 192.168.1.1/24 dev eth10
# ip addr add 192.168.2.1/24 dev eth20
# ping 192.168.2.1 -c 2 -I eth10
PING 192.168.2.1 (192.168.2.1) from 192.168.1.1 eth10: 56(84) bytes of data.
From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
From 192.168.1.1 icmp_seq=2 Destination Host Unreachable

--- 192.168.2.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
pipe 2

Here we see that it does not work.

I had to set IP addresses in different ranges. Otherwise the interfaces do not work properly.

Now, in node02, we will create the interfaces and set IP addresses to them (192.168.1.2 to eth10 and 192.168.2.2 to eth20).

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up
ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up
ip addr add 192.168.1.2/24 dev eth10
ip addr add 192.168.2.2/24 dev eth20

And now we can try to ping to the interfaces in the corresponding VXLAN.

root@node01:~# ping 192.168.1.2 -c 2 -I eth10
PING 192.168.1.2 (192.168.1.2) from 192.168.1.1 eth10: 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=10.1 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=4.53 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 4.539/7.364/10.189/2.825 ms
root@node01:~# ping 192.168.2.2 -c 2 -I eth10
PING 192.168.2.2 (192.168.2.2) from 192.168.1.1 eth10: 56(84) bytes of data.

--- 192.168.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

If we inspect what is happening using tcpdump, we’ll see that the traffic arrives to one interface and not to the other, as it is shown in the next figure:

dump.png

What we got here…

At the end, we have achived to a situation in which we have multiple isolated LANs over a single LAN. The traffic in one LAN is not seen in the other LANs.

This enables to create multi-tenant networks for Cloud datacenters.

Troubleshooting

For some unknown reason, during the tests I created a bridge in which the traffic was not forwarded from one port to the others. I tried to debug what was happening, whether it was affected by ebtables, iptables, etc. and found no reason.

I was able to solve it by following the advice in this post. In fact, I did not trusted on it and rebooted, and while some of the settings were set to 1, it worked from then on.

$ cd /proc/sys/net/bridge
$ ls
 bridge-nf-call-arptables bridge-nf-call-iptables
 bridge-nf-call-ip6tables bridge-nf-filter-vlan-tagged
$ for f in bridge-nf-*; do echo 0 > $f; done

The machine in which I was doing the tests is not usually powered off, so maybe it was on for at least 2 months. Maybe some previous tests drove me to that problem.

 

How to avoid the automatic installation of the recommended packages in Ubuntu

I am used to install ubuntu servers, and I want them to use the less disk possible. So I usually install packages adding the flag –no-install-recommends. But I needed to include the flag each time. So this time I learned…

How to avoid the automatic installation of the recommended packages in Ubuntu

This is a very simple trick that I found in this post, but I want to keep it simple for me to find it.

It is needed to include some settings in one file in /etc/apt/apt.conf.d/. In order to isolate these settings, I will create a new file:

$ cat > /etc/apt/apt.conf.d/99_disablerecommends <\EOF
APT::Install-Recommends "false";
APT::AutoRemove::RecommendsImportant "false";
APT::AutoRemove::SuggestsImportant "false";
EOF

And from now on, when you issue apt-get install commands, the recommended packages will not be installed.

IMPORTANT: now that you have installed all those recommended packages, you can get rid of them just issuing a command like the next one:

$ apt-get autoremove --purge

How to automatically SSH to a non-default port and other cool things of SSH

In the last months I have been working with MPI. In particular, with OpenMPI. The basis of the work of OpenMPI consist in SSH-ing without password to the nodes that are part of a parallel calculation.

The problem is that in my case I was not using the default SSH port because I was trying to use OpenMPI with Docker and the SSH server is mapped to other port (in a later post I will write about this). And so this time I learned…

How to automatically SSH to a non-default port and other cool things of SSH

Yes, I know that I can SSH to a non default port using a syntax like this one:

$ ssh me@my.server -p 4000

But the problem is that in some cases I cannot change the commandline. As an example, when using OpenMPI, it is not possible to modify the port to which the master will SSH the slaves. It will just ssh the slaves.

At the end, SSH has the possibility of creating a ssh-config file that enables to change the port to which the SSH client will try to connect, and also will enable to configure some other cool things. I found a very straightforward explanation about the SSH config file in this post.

So making that a ssh me@my.server will connecto to the port 4000 without including it in the commandline will consist in creating a file named $HOME/.ssh/config with the following content:

Hostname my.server
Port 4000

I am used to use the ssh config file, but I did not know about its potential. In particular, I didn’t know about the possibility of changing the default port. Digging a bit on it, the ssh-config is powerful, as it enables to avoid the annoying messages about the host keys in a internal network, to assign aliases to hosts or even to enable the port forwarding.

An example of config file that I use to SSH the internal nodes of my clusters is the next one:

Host node*
StrictHostKeyChecking no
UserKnownHostsFile /dev/null

And well… now I know that I can change the port of the SSH server (perhaps to avoid attacks from users).

How to deal with operations and parameters in bash script like a master

This is a simple post, based in a previous post in which I learned how to work with parameters in bash, to have option (e.g. -h, –help, etc.), combined options (e.g. -cf that means the same than –cool –function), etc. That post consisted in pre-processing the commandline to get the proper flags and dealing with them.

Now I want to implement bash scripts that work in the form of

$ ./my_script operation --flag-op

And this is why I extended the previous post to learn

How to deal with operations and parameters in bash script like a master.

In this case I want to have scripts that accept the following syntax:

$ ./my_script operation -p <parameter> -ob --file conf.file

And even sintax like the next one:

$ ./my_script --global-option operation -p <parameter> -ob --file conf.file

Using the preprocessing introduced in the previous post, this is very straightforward to implement, as we only need to intercept the name of the operations and continue with the options.

As a plus, in my solution I will delegate each operation to deal with its specific options. My solution is the next one:

function get() {
 while (( $# > 0 )); do
  case "$1" in
   --all|-a) ALL=1;;
   *) usage && exit 1;;
  esac
  shift
 done
 # implement_the_operation
}

n=0
while (( $# > 0 )); do
 if [ "${1:0:1}" == "-" -a "${1:1:1}" != "-" ]; then
  for f in $(echo "${1:1}" | sed 's/\(.\)/-\1 /g' ); do
   ARR[$n]="$f"
   n=$(($n+1))
  done
 else
  ARR[$n]="$1"
  n=$(($n+1))
 fi
 shift
done

n=0
COMMAND=
while [ $n -lt ${#ARR[@]} -a "$COMMAND" == "" ]; do
 PARAM="${ARR[$n]}"
 case "$PARAM" in
  get) COMMAND="$PARAM";;
  --help | -h) usage && exit 0;;
  *) usage && exit 1;;
 esac
 n=$(($n+1))
done

if [ "$COMMAND" != "" ]; then
 $COMMAND "${ARR[@]:$n}"
else
 echo "no command issued" && usage && exit 1
fi

In this script I only accept the operation (COMMAND) “get”. And it is self-serviced using a function with the same name. In order to implement more operations, it is as easy as including it in the detection and creating a function with the same name.

How to connect complex networking infrastructures with Open vSwitch and LXC containers

Some days ago, I learned How to create a overlay network using Open vSwitch in order to connect LXC containers. In order to extend the features of the set-up that I did there, I wanted to introduce some services: a DHCP server, a router, etc. to create a more complex infrastructure. And so this time I learned…

How to connect complex networking infrastructures with Open vSwitch and LXC containers

My setup is based on the previous one, to introduce common services for networked environments. In particular, I am going to create a router and a DHCP server. So I will have two nodes that will host LXC containers and they will have the following features:

  • Any container in any node will get an IP address from the single DHCP server.
  • Any container will have access to the internet through the single router.
  • The containers will be able to connect between them using their private IP addresses.

We had the set-up in the next figure:

ovs

And now we want to get to the following set-up:

ovs2

Well… we are not making anything new, because we have worked with this before in How to create a multi-LXC infrastructure using custom NAT and DHCP server. But we can see this post as an integration post.

Update of the previous setup

On each of the nodes we have to create the bridge br-cont0 and the containers that we want. Moreover, we have to create the virtual swithc ovsbr0 and to connect it to the other node.

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-br ovsbr0
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

Warning: we are not starting the containers, because we want them to get the IP address from our dhcp server.

Preparing a bridge to the outern world (NAT bridge)

We need a bridge that will act as a router to the external world for the router in our LAN. This is because we only have two known IP addresses (the one for ovsnode01 and the one for ovsnode02). So we’ll provide access to the Internet through one of them (according to the figure, it will be ovsnode01).

So we will create the bridge and will give it a local IP address:

ovsnode01:~# brctl addbr br-out
ovsnode01:~# ip addr add dev br-out 10.0.1.1/24

And now we will provide access to the containers that connect to that bridge through NAT. So let’s create the following script and execute it:

ovsnode01:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth0
IFACE_LAN=br-out
NETWORK_LAN=10.0.1.0/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
ovsnode01:~# chmod +x enable_nat.sh
ovsnode01:~# ./enable_nat.sh

And that’s all. Now ovsnode01 will act as a router for IP addresses in the range 10.0.1.0/24.

DHCP server

Creating a DHCP server is as easy as creating a new container, installing dnsmasq and configuring it.

ovsnode01:~# cat > ./nat-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-out 
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f nat-network.tmpl -n dhcpserver -t ubuntu
ovsnode01:~# lxc-start -dn dhcpserver
ovsnode01:~# lxc-attach -n dhcpserver -- bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip addr add 10.0.1.2/24 dev eth0
route add default gw 10.0.1.1'
ovsnode01:~# lxc-attach -n dhcpserver

WARNING: we created the container attached to br-out, because we want it to have internet access to be able to install dnsmasq. Moreover we needed to give it an IP address and set the nameserver to the one from google. Once the dhcpserver is configured, we’ll change the configuration to attach to br-cont0, because the dhcpserver only needs to access to the internal network.

Now we have to install dnsmasq:

apt-get update
apt-get install -y dnsmasq

Now we’ll configure the static network interface (172.16.0.202), by modifying the file /etc/network/interfaces

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 172.16.1.202
netmask 255.255.255.0
EOF

And finally, we’ll configure dnsmasq

cat > /etc/dnsmasq.conf << EOF
interface=eth0
except-interface=lo
listen-address=172.16.1.202
bind-interfaces
dhcp-range=172.16.1.1,172.16.1.200,1h
dhcp-option=26,1400
dhcp-option=option:router,172.16.1.201
EOF

In this configuration we have created our range of IP addresses (from 172.16.1.1 to 172.16.1.200). We have stated that our router will have the IP address 172.16.1.201 and one important thing: we have set the MTU to 1400 (remember that when using OVS we had to set the MTU to a lower size).

Now we are ready to connect the container to br-cont0. In order to make it, we have to modify the file /var/lib/lxc/dhcpserver/config. In particular, we have to change the value of the attribute lxc.network.link from br-out to br-cont0. Once I modified it, my network configuration in that file is as follows:

# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br-cont0
lxc.network.hwaddr = 00:16:3e:9f:ae:3f

Finally we can reboot our container

ovsnode01:~# lxc-stop -n dhcpserver 
ovsnode01:~# lxc-start -dn dhcpserver

And we can check that our server gets the proper IP address:

root@ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 
dhcpserver RUNNING 0 - 172.16.1.202 -

We could also check that it is connected to the bridge:

ovsnode01:~# ip addr
...
83: vethGUV3HB: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-cont0 state UP group default qlen 1000
 link/ether fe:86:05:f6:f4:55 brd ff:ff:ff:ff:ff:ff
ovsnode01:~# brctl show br-cont0
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe3b968e0937 no vethGUV3HB
ovsnode01:~# lxc-attach -n dhcpserver -- ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
82: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
 link/ether 00:16:3e:9f:ae:3f brd ff:ff:ff:ff:ff:ff
ovsnode01:~# ethtool -S vethGUV3HB
NIC statistics:
 peer_ifindex: 82

No matter if you do not understand this… it is a very advanced issue for this post. The important thing is that bridge br-cont0 has the device vethGUV3HB, whose number is 83 and its peer interface is the 82 that, in fact, is the eth0 device from inside the container.

Installing the router

Now that we have our dhcpserver ready, we are going to create a container that will act as a router for our network. It is very easy (in fact, we have already created a router). And… this fact arises a question: why are we creating another router?

We create a new router because it has to have an IP address inside the private network and other interface in the network to which we want to provide acess from the internal network.

Once we have this issue clear, let’s create the router, which as an IP in the bridge in the internal network (br-cont0):

ovsnode01:~# cat > ./router-network.tmpl << EOF
 lxc.network.type = veth
 lxc.network.link = br-cont0
 lxc.network.flags = up
 lxc.network.hwaddr = 00:16:3e:xx:xx:xx
 lxc.network.type = veth
lxc.network.link = br-out
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
 ovsnode01:~# lxc-create  -t ubuntu -f router-network.tmpl -n router

WARNING: I don’t know why, but for some reason sometimes lxc 2.0.3 fails in Ubuntu 14.04 when starting containers if they are created using two NICs.

Now we can start the container and start to work with it:

ovsnode01:~# lxc-start -dn router
ovsnode01:~# lxc-attach -n router

Now we simply have to configure the IP addresses for the router (eth0 is the interface in the internal network, bridged to br-cont0, and eth1 is bridged to br-out)

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 172.16.1.201
netmask 255.255.255.0

auto eth1
iface eth1 inet static
address 10.0.1.2
netmask 255.255.255.0
gateway 10.0.1.1
EOF

And finally create the router by using a script which is similar to the previous one:

router:~# apt-get install -y iptables
router:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth1
IFACE_LAN=eth0
NETWORK_LAN=172.16.1.201/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
router:~# chmod +x enable_nat.sh
router:~# ./enable_nat.sh

Now we have our router ready to be used.

Starting the containers

Now we can simply start the containers that we created before, and we can check that they get an IP address by DHCP:

ovsnode01:~# lxc-start -n node01c01
ovsnode01:~# lxc-start -n node01c02
ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
dhcpserver RUNNING 0 - 172.16.1.202 -
node01c01 RUNNING 0 - 172.16.1.39 -
node01c02 RUNNING 0 - 172.16.1.90 -
router RUNNING 0 - 10.0.1.2, 172.16.1.201 -

And also we can check all the hops in our network, to check that it is properly configured:

ovsnode01:~# lxc-attach -n node01c01 -- apt-get install traceroute
(...)
ovsnode01:~# lxc-attach -n node01c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.085 ms 0.040 ms 0.041 ms
 2 10.0.1.1 0.079 ms 0.144 ms 0.067 ms
 3 10.10.2.201 0.423 ms 0.517 ms 0.514 ms
...
12 216.58.210.131 8.247 ms 8.096 ms 8.195 ms

Now we can go to the other host and create the bridges, the virtual switch and the containers, as we did in the previous post.

WARNING: Just to remember, I leave this snip of code here:

ovsnode02:~# brctl addbr br-cont0
ovsnode02:~# ip link set dev br-cont0 up
ovsnode02:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode02:~# apt-get install openvswitch-switch
ovsnode02:~# ovs-vsctl add-br ovsbr0
ovsnode02:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode02:~# ovs-vsctl add-port ovsbr0 vxlan0 — set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

And finally, we can start the containers and check that they get IP addresses from the DHCP server, and that they have connectivity to the internet using the routers that we have created:

ovsnode02:~# lxc-start -n node02c01
ovsnode02:~# lxc-start -n node02c02
ovsnode02:~# lxc-ls -f
NAME STATE IPV4 IPV6 AUTOSTART
-------------------------------------------------
node02c01 RUNNING 172.16.1.50 - NO
node02c02 RUNNING 172.16.1.133 - NO
ovsnode02:~# lxc-attach -n node02c01 -- apt-get install traceroute
(...)
ovsnode02:~# lxc-attach -n node02c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.904 ms 0.722 ms 0.679 ms
 2 10.0.1.1 0.853 ms 0.759 ms 0.918 ms
 3 10.10.2.201 1.774 ms 1.496 ms 1.603 ms
...
12 216.58.210.131 8.849 ms 8.773 ms 9.062 ms

What is next?

Well, you’d probably want to persist the settings. Maybe you can set the iptables rules (aka the enable_nat.sh script) as a start script in /etc/init.d

As a further work, you can try VLAN tagging in OVS and so on, to duplicate the networks using the same components, but isolating the different networks.

You can also try to include new services (e.g. a private DNS server, a reverse NAT, etc.).

How to create a overlay network using Open vSwitch in order to connect LXC containers.

Open vSwitch (OVS) is a virtual switch implementation that can be used as a tool for Software Defined Network (SDN). The concepts managed by OVS are the same as the concepts managed by hardware switches (ports, routes, etc.). The most important features (for me) of OVS are

  • It enables to have a virtual switch inside a host in which to connect virtual machines, containers, etc.
  • It enables connect switches in different hosts using network tunnels (gre or vxlan).
  • It is possible to program the switch using OpenFlow.

In order to start working with OVS, this time…

I learned how to create a overlay network using Open vSwitch in order to connect LXC containers.

Well, first of all, I have to say that I have used LXC instead of VM because they are lightweight and very straightforward to use in a Ubuntu distribution. But if you understand this, you should be able to follow this how-to and use VMs instead of containers.

What we want (Our set-up)

What we are creating is shown in the next figure:

ovs

We have two nodes (ovsnodeXX) and several containers deployed on them (nodeXXcYY). The result is that all the containers can connect between them, but the traffic is not seen the LAN 10.10.2.x appart from the connection bewteen the OVS switches (because it is tunnelled).

Starting point

  • I have 2 ubuntu-based nodes (ovsnode01 and ovsnode02), with 1 network interface each, and they are able to ping one to each other:
root@ovsnode01:~# ping ovsnode01 -c 3
PING ovsnode01 (10.10.2.21) 56(84) bytes of data.
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=3 ttl=64 time=0.033 ms

--- ovsnode01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.022/0.027/0.033/0.007 ms
root@ovsnode01:~# ping ovsnode02 -c 3
PING ovsnode02 (10.10.2.22) 56(84) bytes of data.
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=1 ttl=64 time=1.45 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=2 ttl=64 time=0.683 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=3 ttl=64 time=0.756 ms

--- ovsnode02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.683/0.963/1.451/0.347 ms
  • I am able to create lxc containers, by issuing commands such as:
lxc-create -n node1 -t ubuntu
  • I am able to create linux bridges (i.e. I have installed the bridge-utils package).

Spoiler (or quick setup)

If you just want the solution, here you have (later I will explain all the steps). On each node you can follow the next steps in ovsnode01:

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01:~# lxc-start -dn node01c01
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01:~# lxc-start -dn node01c02
ovsnode01:~# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

You will need to follow the same steps for ovsnode02, taking into account the names of the containers and the IP addresses.

Preparing the newtork

First we are going to create a bridge (br-cont0) to which the containers are being bridged

We are not using lxcbr0 because it may have other services such as dnsmasq that we should disable before. Moreover, creating our bridge will be more interesting to understand what we are doing

We will issue this command on both ovsnode01 and ovsnode02 nodes.

brctl addbr br-cont0

As this bridge has no IP, it is down (you can see it using ip command). Now we are going to set it up (also in both nodes):

ip link set dev br-cont0 up

Creating the containers

Now we need a template to associate the network to the containers. So we have to create the file container-template.lxc:

cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF

In this file we are saying that containers should be automatically bridged to bridge br-cont0 (remember that we created it before) and they will hace an interface with a hardware address that will follow the template. We could also modify the file /etc/lxc/default.conf file instead of creating a new one.

Now, we can create the containers on node ovsnode01:

ovsnode01# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01# lxc-start -dn node01c01
ovsnode01# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01# lxc-start -dn node01c02

And also in ovsnode02:

ovsnode02# lxc-create-f ./container-template.lxc -n node02c01 -t ubuntu
ovsnode02# lxc-start -dn node02c01
ovsnode02# lxc-create-f ./container-template.lxc -n node02c02 -t ubuntu
ovsnode02# lxc-start -dn node02c02

If you followed my steps, the containers will not have any IP address. This is because you do not have any dhcp server and the containers do not have static IP addresses. And that is what we are going to do now.

We are setting the IP addresses 192.168.1.11 and 192.168.1.21 to node01c01 and node0201 respetively. In order to do so, we have to issue these two commands (each of them in the corresponding node):

ovsnode01# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0

You need to configure the containers in ovsnode02 too:

ovsnode02# lxc-attach -n node02c01 -- ip addr add 192.168.1.21/24 dev eth0
ovsnode02# lxc-attach -n node02c02 -- ip addr add 192.168.1.22/24 dev eth0

Making connections

At this point, you should be able to connect between the containers in the same node: between node01c01 and node01c02, and between node02c01 and node02c02. But not between containers in different nodes.

ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.11 -c3
PING 192.168.1.11 (192.168.1.11) 56(84) bytes of data.
64 bytes from 192.168.1.11: icmp_seq=1 ttl=64 time=0.136 ms
64 bytes from 192.168.1.11: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 192.168.1.11: icmp_seq=3 ttl=64 time=0.055 ms
ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.21 -c3
PING 192.168.1.21 (192.168.1.21) 56(84) bytes of data.
From 192.168.1.11 icmp_seq=1 Destination Host Unreachable
From 192.168.1.11 icmp_seq=2 Destination Host Unreachable
From 192.168.1.11 icmp_seq=3 Destination Host Unreachable

This is because br-cont0 is somehow a classic “network hub”, in which all the traffic can be listened by all the devices in it. So the containers in the same bridge are in the same LAN (in the same cable, indeed). But there is no connection between the hubs, and we will make it using Open vSwitch.

In the case of Ubuntu, installing OVS is as easy as issuing the next command:

apt-get install openvswitch-switch

Simple, isn’t it? But now we have to prepare the network and we are going to create a virtual switch (on both ovsnode01 and ovsnode02):

ovs-vsctl add-br ovsbr0

Open vSwitch works like a physical switch, with ports that can be connected and so on… And we are going to connect our hub to our switch (i.e. our bridge to the virtual switch):

ovs-vsctl add-port ovsbr0 br-cont0

We’ll make it in both ovsnode01 and ovsnode02

Finally, we’ll connect the ovs switches between them, using a vxlan tunnel:

ovsnode01# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22
ovsnode02# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

We’ll make each of the two commands above on the corresponding node. Take care that the remote IP addresses are set to the other node 😉

We can check the final configuration of the nodes (let’s show only ovsnode01, but the other is very similar):

ovsnode01# brctl show
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe502f26ea2d no veth3BUL7S
                              vethYLRPM2

ovsnode01# ovs-vsctl show
2096d83a-c7b9-47a8-8fff-d38c6d5ab04d
 Bridge "ovsbr0"
     Port "ovsbr0"
         Interface "ovsbr0"
             type: internal
     Port "vxlan0"
         Interface "vxlan0"
             type: vxlan
             options: {remote_ip="10.10.2.22"}
 ovs_version: "2.0.2"

Warning

Using this set up as is, you will get ping working, but probably no other traffic. This is because the traffic is encapsulated in a transport network. Did you know about MTU?

If we check the eth0 interface from one container we’ll get this:

# lxc-attach -n node01c01 -- ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:3e:52:42:2f 
 inet addr:192.168.1.11 Bcast:0.0.0.0 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Pay attention to the MTU value (it is 1500). And if we check the MTU of eth0 from the node, we’ll get this:

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 60:00:00:00:20:15 
 inet addr:10.10.2.21 Bcast:10.10.2.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Summarizing, MTU is the size of the message for ethernet, which usually is 1500. But we are sending messages into messages and if we try to use it as is, we are trying to send things that have a size of “1500 + some overhead” in a room of “1500” (we are conciously omiting the units). And “1500 + some overhead” is bigger than “1500” and that is why it will not work.

We have to change the MTU of the containers to a lower size. It is as simple as:

ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400

ovsnode02:~# lxc-attach -n node02c01 -- ifconfig eth0 mtu 1400
ovsnode02:~# lxc-attach -n node02c02 -- ifconfig eth0 mtu 1400

This method is not persistent, and will be lost in case of rebooting the container. In order to persist it, you can set it in the DHCP server (in case that you are using it), or in the network device set up. In the case of ubuntu it is as simple as adding a line with ‘mtu 1400’ to the proper device in /etc/network/interfaces. As an example for container node01c01:

auto eth0
iface eth0 inet static
address 192.168.1.11
netmask 255.255.255.0
mtu 1400

Some words on Virtual Machines

If you have some expertise on Virtual Machines and Linux (I suppose that if you are following this how-to, this is your case), you should be able to make all the set-up for your VMs. When you create your VM, you simply need to bridge the interfaces of the VM to the bridge that we have created (br-cont0) and that’s all.

And now, what?

Well, now we have what we wanted to have. And now you can play with it. I suggest to create a DHCP server (to not to have to set the MTU and the IP addresses), a router, a DNS server, etc.

And as an advanced option, you can play with traffic tagging, in order to create multiple overlay networks and isolating them from each other.