How to create overlay networks using Linux Bridges and VXLANs

Some time ago, I learned How to create a overlay network using Open vSwitch in order to connect LXC containers. Digging in the topic of overlay networks, I saw that linux bridges had included VXLAN capabilities, and also saw how some people were using it to create overlay networks in a LAN. As an example, it is made in this way in the OpenStack linux bridges plugin. So I decided to work by hand in this topic (in order to better understand how OpenStack was working) and learned…

How to create overlay networks using Linux Bridges and VXLANs

Well, I do not know when OpenStack started to use overlay networks using Linux Bridges, but as I started to search for information on how to do it by hand, I realized that it is a widespread topic. As an example, I found this post from VMWare that is a must if we want to better understand what is going on here.

Scenario

I have a single LAN, and I want to have several overlay networks with multiple VMs in each of them. I want that one set of VMs can communicate between them, but I don’t want that other set of VMs even know about the first set: I want to isolate networks from multiple tenants.

The next figure shows what I wan to do:

overlay-1

The left hand part of the image shows that will happen, and the right hand side of the image shows what the users in the hosts will have the vision that happen. The hosts which end in “1” will see that they are they are alone in their LAN, and the hosts which end in “2” will see that they are alone in their LAN.

Set up

We will create a “poor man setup” in which we will have two VMs that are simulating the hosts, and we will use LXC containers that will act as “guests”.

The next figure shows what we are creating

overlay-2

node01 and node02 are the hosts that will host the containers. Each of them have a physical interface named ens3, with IPs 10.0.0.28 and 10.0.0.31. We will create on each of them a bridge named br-vxlan-<ID> to which we should be able to bridge our containers. And these containers will have a interface (eth0) with an IP addresses in the range of 192.168.1.1/24.

To isolate the networks, we are using VXLANS with different VXLAN Network Identifier (VNI). In our case, 10 and 20.

Starting point

We have 2 hosts that can ping one to each other (node01 and node02).

root@node01:~# ping -c 2 node02
PING node02 (10.0.0.31) 56(84) bytes of data.
64 bytes from node02 (10.0.0.31): icmp_seq=1 ttl=64 time=1.17 ms
64 bytes from node02 (10.0.0.31): icmp_seq=2 ttl=64 time=0.806 ms

and

root@node02:~# ping -c 2 node01
PING node01 (10.0.0.28) 56(84) bytes of data.
64 bytes from node01 (10.0.0.28): icmp_seq=1 ttl=64 time=0.740 ms
64 bytes from node01 (10.0.0.28): icmp_seq=2 ttl=64 time=0.774 ms

In each of them I will make sure that I have installed the package iproute2 (i.e. the command “ip”).

In order to verify that everything is properly working, in each node, we will install the latest version of lxd according to this (in my case, I have lxd version 2.8). The one shipped with ubuntu 16.04.1 is 2.0 and will not be useful for us because we want that it is able to manage networks.

Anyway, I will offer an alternative for non-ubuntu users that will consist in creating a extra interface that will be bridged to the br-vxlan interface.

Let’s begin

The implementation of vxlan for the linux bridges works encapsulating traffic in multicast UDP messages that are distributed using IGMP.

In order to enable that the TCP/IP traffic is encapsulated through these interfaces, we will create a bridge and will attach the vxlan interface to that bridge. At the end, a bridge works like a network hub and forwards the traffic to the ports that are connected to it. So the traffic that appears in the bridge will be encapsulated into the UDP multicast messages.

For the creation of the first VXLAN (with VNI 10) we will need to issue the next commands (in each of the nodes)

ip link add vxlan10 type vxlan id 10 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan10 type bridge
ip link set vxlan10 master br-vxlan10
ip link set vxlan10 up
ip link set br-vxlan10 up

In these lines…

  1. First we create a vxlan port with VNI 10 that will use the device ens3 to multicast the UDP traffic using group 239.1.1.1 (using dstport 0 makes use of the default port).
  2. Then we will create a bridge named br-vxlan10 to which we will bridge the previously created vxlan port.
  3. Finally we will set both ports up.

Now that we have the first VXLAN, we will proceed with the second:

ip link add vxlan20 type vxlan id 20 group 239.1.1.1 dstport 0 dev ens3
ip link add br-vxlan20 type bridge
ip link set vxlan20 master br-vxlan20
ip link set vxlan20 up
ip link set br-vxlan20 up

Both VXLANs will be created in both nodes node01 and node02.

Tests and verification

At this point, we have the VXLANs ready to be used, and the traffic of each port that we bridge to the br-vxlan10 or br-vxlan20 will be multicasted using UDP to the network. As we have several nodes in the LAN, we will have VXLANs that span across multiple nodes.

In practise, the bridges br-vxlan10 of each node will be in LAN (each port included in such bridge of each node will be in the same LAN). The same occurs for br-vxlan20.

And the traffic of br-vxlan10 will not be visible in br-vxlan20 and vice-versa.

Verify using LXD containers

This is the test that will be more simple to understand as it is conceptually what we want. The only difference is that we will create containers instead of VMs.

In order to verify that it works, we will create the containers lhs1 (in node01) and and rhs1 (in node02) that will be attached to the br-vxlan10. In node01 we will execute the following commands:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 lhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec lhs1 ip addr add 192.168.1.1/24 dev eth0

What we are doing is the next:

  1. Creating a LXC profile, to ensure that it has not any network interface.
  2. Making that the profile uses the bridge that we created for the VXLAN.
  3. Creating a container that uses the profile (and so, will be attached to the VXLAN).
  4. Assigning the IP address 192.168.1.1 to the container.

In node02, we will create other container (rhs1) with IP 192.168.1.2:

lxc profile create vxlan10
lxc network attach-profile br-vxlan10 vxlan10
lxc launch images:alpine/3.4 rhs1 -p vxlan10
sleep 10 # to wait for the container to be up and ready
lxc exec rhs1 ip addr add 192.168.1.2/24 dev eth0

And now, we have one container in each node that feels like if it was in a LAN with the other container.

In order to verify it, we will use a simple server that echoes the information sent. So in node01, in lhs1 we will start netcat listening in port 9999:

root@node01:~# lxc exec lhs1 -- nc -l -p 9999

And in node02, in rhs1 we will start netcat connected to the lhs1 IP and port (192.168.1.1:9999):

root@node02:~# lxc exec rhs1 -- nc 192.168.1.1 9999

Anything that we write in this node will get output in the other one, as shown in the image:

lxc-over

Now we can create the other containers and see what happens.

In node01 we will create the container lhs2 connected to vxlan20 and the same IP address than lhs1 (i.e. 192.168.1.1):

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 lhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec lhs2 ip addr add 192.168.1.1/24 dev eth0

At this point, if we try to ping to IP address 192.168.1.2 (which is assigned to rhs1), it should not work, as it is in the other VXLAN:

root@node01:~# lxc exec lhs2 -- ping -c 2 192.168.1.2
PING 192.168.1.2 (192.168.1.2): 56 data bytes

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss

Finally, in node02, we will create the container rhs2, attached to vxlan20, and the IP address 192.168.1.2:

lxc profile create vxlan20
lxc network attach-profile br-vxlan20 vxlan20
lxc launch images:alpine/3.4 rhs2 -p vxlan20
sleep 10 # to wait for the container to be up and ready
lxc exec rhs2 -- ip addr add 192.168.1.2/24 dev eth0

And now we can verify that each pair of nodes can communicate between them and the other traffic will not arrive. The next figure shows that it works!test02

Now you could have fun capturing the traffic in the hosts and get things like this:

traffic.png

You ping a host in vxlan20 and if you dump the traffic from ens3 you will get the top left traffic (the traffic in “instance 20”, i.e. VNI 20), but there is no traffic in br-vxlan10.

I suggest to have fun with wireshark to look in depth at what is happening (watch the UDP traffic, how it is translated using the VXLAN protocol, etc.).

Verify using other devices

If you cannot manage to use VMs or LXD containers, you can create a veth device and assign the IP addresses to it. Then ping through that interface to generate traffic.

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up

And we will create other interface too

ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up

Now we will set the IP 192.168.1.1 to eth10 and 192.168.1.2 to eth20, and will try to ping from one to the other:

# ip addr add 192.168.1.1/24 dev eth10
# ip addr add 192.168.2.1/24 dev eth20
# ping 192.168.2.1 -c 2 -I eth10
PING 192.168.2.1 (192.168.2.1) from 192.168.1.1 eth10: 56(84) bytes of data.
From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
From 192.168.1.1 icmp_seq=2 Destination Host Unreachable

--- 192.168.2.1 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1000ms
pipe 2

Here we see that it does not work.

I had to set IP addresses in different ranges. Otherwise the interfaces do not work properly.

Now, in node02, we will create the interfaces and set IP addresses to them (192.168.1.2 to eth10 and 192.168.2.2 to eth20).

ip link add eth10 type veth peer name eth10p
ip link set eth10p master br-vxlan10
ip link set eth10 up
ip link set eth10p up
ip link add eth20 type veth peer name eth20p
ip link set eth20p master br-vxlan20
ip link set eth20 up
ip link set eth20p up
ip addr add 192.168.1.2/24 dev eth10
ip addr add 192.168.2.2/24 dev eth20

And now we can try to ping to the interfaces in the corresponding VXLAN.

root@node01:~# ping 192.168.1.2 -c 2 -I eth10
PING 192.168.1.2 (192.168.1.2) from 192.168.1.1 eth10: 56(84) bytes of data.
64 bytes from 192.168.1.2: icmp_seq=1 ttl=64 time=10.1 ms
64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=4.53 ms

--- 192.168.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 4.539/7.364/10.189/2.825 ms
root@node01:~# ping 192.168.2.2 -c 2 -I eth10
PING 192.168.2.2 (192.168.2.2) from 192.168.1.1 eth10: 56(84) bytes of data.

--- 192.168.2.2 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

If we inspect what is happening using tcpdump, we’ll see that the traffic arrives to one interface and not to the other, as it is shown in the next figure:

dump.png

What we got here…

At the end, we have achived to a situation in which we have multiple isolated LANs over a single LAN. The traffic in one LAN is not seen in the other LANs.

This enables to create multi-tenant networks for Cloud datacenters.

Troubleshooting

During the tests I created a bridge in which the traffic was not forwarded from one port to the others. I tried to debug what was happening, whether it was affected by ebtables, iptables, etc. and at first I found no reason.

I was able to solve it by following the advice in this post. In fact, I did not trusted on it and rebooted, and while some of the settings were set to 1, it worked from then on.

$ cd /proc/sys/net/bridge
$ ls
 bridge-nf-call-arptables bridge-nf-call-iptables
 bridge-nf-call-ip6tables bridge-nf-filter-vlan-tagged
$ for f in bridge-nf-*; do echo 0 > $f; done

The machine in which I was doing the tests is not usually powered off, so maybe it was on for at least 2 months. Maybe some previous tests drove me to that problem.

I have faced this problem again and I was not comfortable with a solution “based on the faith”. So I searched a bit more and I found this post. Now I know what these files in /proc/sys/net/bridge mean and now I know that the problem was about iptables. The problem is that the files bridge-nf-call-iptables, etc. mean if the rules should go through iptables/arptables… before forwarding them to the ports in the bridge. So if you set a zero to these files, you will not have any iptables related problem.

If you find that the traffic is not forwarded to the ports, you should double-check the iptables and so on. In my case the “problem” was that forwarding was prevented by default. A “easy to check” solution is to check the filter table of iptables:

# iptables -t filter -S
-P INPUT ACCEPT
-P FORWARD DROP
-P OUTPUT ACCEPT
...

In my case, the filter dropped the traffic. If I want the traffic to be forwarded, I must explicitly accept it by adding a rule such as

# iptables -t FILTER -A FORWARD -i br-vxlan20 -j ACCEPT

 

How to connect complex networking infrastructures with Open vSwitch and LXC containers

Some days ago, I learned How to create a overlay network using Open vSwitch in order to connect LXC containers. In order to extend the features of the set-up that I did there, I wanted to introduce some services: a DHCP server, a router, etc. to create a more complex infrastructure. And so this time I learned…

How to connect complex networking infrastructures with Open vSwitch and LXC containers

My setup is based on the previous one, to introduce common services for networked environments. In particular, I am going to create a router and a DHCP server. So I will have two nodes that will host LXC containers and they will have the following features:

  • Any container in any node will get an IP address from the single DHCP server.
  • Any container will have access to the internet through the single router.
  • The containers will be able to connect between them using their private IP addresses.

We had the set-up in the next figure:

ovs

And now we want to get to the following set-up:

ovs2

Well… we are not making anything new, because we have worked with this before in How to create a multi-LXC infrastructure using custom NAT and DHCP server. But we can see this post as an integration post.

Update of the previous setup

On each of the nodes we have to create the bridge br-cont0 and the containers that we want. Moreover, we have to create the virtual swithc ovsbr0 and to connect it to the other node.

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-br ovsbr0
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

Warning: we are not starting the containers, because we want them to get the IP address from our dhcp server.

Preparing a bridge to the outern world (NAT bridge)

We need a bridge that will act as a router to the external world for the router in our LAN. This is because we only have two known IP addresses (the one for ovsnode01 and the one for ovsnode02). So we’ll provide access to the Internet through one of them (according to the figure, it will be ovsnode01).

So we will create the bridge and will give it a local IP address:

ovsnode01:~# brctl addbr br-out
ovsnode01:~# ip addr add dev br-out 10.0.1.1/24

And now we will provide access to the containers that connect to that bridge through NAT. So let’s create the following script and execute it:

ovsnode01:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth0
IFACE_LAN=br-out
NETWORK_LAN=10.0.1.0/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
ovsnode01:~# chmod +x enable_nat.sh
ovsnode01:~# ./enable_nat.sh

And that’s all. Now ovsnode01 will act as a router for IP addresses in the range 10.0.1.0/24.

DHCP server

Creating a DHCP server is as easy as creating a new container, installing dnsmasq and configuring it.

ovsnode01:~# cat > ./nat-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-out 
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f nat-network.tmpl -n dhcpserver -t ubuntu
ovsnode01:~# lxc-start -dn dhcpserver
ovsnode01:~# lxc-attach -n dhcpserver -- bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip addr add 10.0.1.2/24 dev eth0
route add default gw 10.0.1.1'
ovsnode01:~# lxc-attach -n dhcpserver

WARNING: we created the container attached to br-out, because we want it to have internet access to be able to install dnsmasq. Moreover we needed to give it an IP address and set the nameserver to the one from google. Once the dhcpserver is configured, we’ll change the configuration to attach to br-cont0, because the dhcpserver only needs to access to the internal network.

Now we have to install dnsmasq:

apt-get update
apt-get install -y dnsmasq

Now we’ll configure the static network interface (172.16.0.202), by modifying the file /etc/network/interfaces

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 172.16.1.202
netmask 255.255.255.0
EOF

And finally, we’ll configure dnsmasq

cat > /etc/dnsmasq.conf << EOF
interface=eth0
except-interface=lo
listen-address=172.16.1.202
bind-interfaces
dhcp-range=172.16.1.1,172.16.1.200,1h
dhcp-option=26,1400
dhcp-option=option:router,172.16.1.201
EOF

In this configuration we have created our range of IP addresses (from 172.16.1.1 to 172.16.1.200). We have stated that our router will have the IP address 172.16.1.201 and one important thing: we have set the MTU to 1400 (remember that when using OVS we had to set the MTU to a lower size).

Now we are ready to connect the container to br-cont0. In order to make it, we have to modify the file /var/lib/lxc/dhcpserver/config. In particular, we have to change the value of the attribute lxc.network.link from br-out to br-cont0. Once I modified it, my network configuration in that file is as follows:

# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br-cont0
lxc.network.hwaddr = 00:16:3e:9f:ae:3f

Finally we can reboot our container

ovsnode01:~# lxc-stop -n dhcpserver 
ovsnode01:~# lxc-start -dn dhcpserver

And we can check that our server gets the proper IP address:

root@ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 
dhcpserver RUNNING 0 - 172.16.1.202 -

We could also check that it is connected to the bridge:

ovsnode01:~# ip addr
...
83: vethGUV3HB: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-cont0 state UP group default qlen 1000
 link/ether fe:86:05:f6:f4:55 brd ff:ff:ff:ff:ff:ff
ovsnode01:~# brctl show br-cont0
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe3b968e0937 no vethGUV3HB
ovsnode01:~# lxc-attach -n dhcpserver -- ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
82: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
 link/ether 00:16:3e:9f:ae:3f brd ff:ff:ff:ff:ff:ff
ovsnode01:~# ethtool -S vethGUV3HB
NIC statistics:
 peer_ifindex: 82

No matter if you do not understand this… it is a very advanced issue for this post. The important thing is that bridge br-cont0 has the device vethGUV3HB, whose number is 83 and its peer interface is the 82 that, in fact, is the eth0 device from inside the container.

Installing the router

Now that we have our dhcpserver ready, we are going to create a container that will act as a router for our network. It is very easy (in fact, we have already created a router). And… this fact arises a question: why are we creating another router?

We create a new router because it has to have an IP address inside the private network and other interface in the network to which we want to provide acess from the internal network.

Once we have this issue clear, let’s create the router, which as an IP in the bridge in the internal network (br-cont0):

ovsnode01:~# cat > ./router-network.tmpl << EOF
 lxc.network.type = veth
 lxc.network.link = br-cont0
 lxc.network.flags = up
 lxc.network.hwaddr = 00:16:3e:xx:xx:xx
 lxc.network.type = veth
lxc.network.link = br-out
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
 ovsnode01:~# lxc-create  -t ubuntu -f router-network.tmpl -n router

WARNING: I don’t know why, but for some reason sometimes lxc 2.0.3 fails in Ubuntu 14.04 when starting containers if they are created using two NICs.

Now we can start the container and start to work with it:

ovsnode01:~# lxc-start -dn router
ovsnode01:~# lxc-attach -n router

Now we simply have to configure the IP addresses for the router (eth0 is the interface in the internal network, bridged to br-cont0, and eth1 is bridged to br-out)

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 172.16.1.201
netmask 255.255.255.0

auto eth1
iface eth1 inet static
address 10.0.1.2
netmask 255.255.255.0
gateway 10.0.1.1
EOF

And finally create the router by using a script which is similar to the previous one:

router:~# apt-get install -y iptables
router:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth1
IFACE_LAN=eth0
NETWORK_LAN=172.16.1.201/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
router:~# chmod +x enable_nat.sh
router:~# ./enable_nat.sh

Now we have our router ready to be used.

Starting the containers

Now we can simply start the containers that we created before, and we can check that they get an IP address by DHCP:

ovsnode01:~# lxc-start -n node01c01
ovsnode01:~# lxc-start -n node01c02
ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
dhcpserver RUNNING 0 - 172.16.1.202 -
node01c01 RUNNING 0 - 172.16.1.39 -
node01c02 RUNNING 0 - 172.16.1.90 -
router RUNNING 0 - 10.0.1.2, 172.16.1.201 -

And also we can check all the hops in our network, to check that it is properly configured:

ovsnode01:~# lxc-attach -n node01c01 -- apt-get install traceroute
(...)
ovsnode01:~# lxc-attach -n node01c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.085 ms 0.040 ms 0.041 ms
 2 10.0.1.1 0.079 ms 0.144 ms 0.067 ms
 3 10.10.2.201 0.423 ms 0.517 ms 0.514 ms
...
12 216.58.210.131 8.247 ms 8.096 ms 8.195 ms

Now we can go to the other host and create the bridges, the virtual switch and the containers, as we did in the previous post.

WARNING: Just to remember, I leave this snip of code here:

ovsnode02:~# brctl addbr br-cont0
ovsnode02:~# ip link set dev br-cont0 up
ovsnode02:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode02:~# apt-get install openvswitch-switch
ovsnode02:~# ovs-vsctl add-br ovsbr0
ovsnode02:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode02:~# ovs-vsctl add-port ovsbr0 vxlan0 — set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

And finally, we can start the containers and check that they get IP addresses from the DHCP server, and that they have connectivity to the internet using the routers that we have created:

ovsnode02:~# lxc-start -n node02c01
ovsnode02:~# lxc-start -n node02c02
ovsnode02:~# lxc-ls -f
NAME STATE IPV4 IPV6 AUTOSTART
-------------------------------------------------
node02c01 RUNNING 172.16.1.50 - NO
node02c02 RUNNING 172.16.1.133 - NO
ovsnode02:~# lxc-attach -n node02c01 -- apt-get install traceroute
(...)
ovsnode02:~# lxc-attach -n node02c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.904 ms 0.722 ms 0.679 ms
 2 10.0.1.1 0.853 ms 0.759 ms 0.918 ms
 3 10.10.2.201 1.774 ms 1.496 ms 1.603 ms
...
12 216.58.210.131 8.849 ms 8.773 ms 9.062 ms

What is next?

Well, you’d probably want to persist the settings. Maybe you can set the iptables rules (aka the enable_nat.sh script) as a start script in /etc/init.d

As a further work, you can try VLAN tagging in OVS and so on, to duplicate the networks using the same components, but isolating the different networks.

You can also try to include new services (e.g. a private DNS server, a reverse NAT, etc.).

How to create a overlay network using Open vSwitch in order to connect LXC containers.

Open vSwitch (OVS) is a virtual switch implementation that can be used as a tool for Software Defined Network (SDN). The concepts managed by OVS are the same as the concepts managed by hardware switches (ports, routes, etc.). The most important features (for me) of OVS are

  • It enables to have a virtual switch inside a host in which to connect virtual machines, containers, etc.
  • It enables connect switches in different hosts using network tunnels (gre or vxlan).
  • It is possible to program the switch using OpenFlow.

In order to start working with OVS, this time…

I learned how to create a overlay network using Open vSwitch in order to connect LXC containers.

Well, first of all, I have to say that I have used LXC instead of VM because they are lightweight and very straightforward to use in a Ubuntu distribution. But if you understand this, you should be able to follow this how-to and use VMs instead of containers.

What we want (Our set-up)

What we are creating is shown in the next figure:

ovs

We have two nodes (ovsnodeXX) and several containers deployed on them (nodeXXcYY). The result is that all the containers can connect between them, but the traffic is not seen the LAN 10.10.2.x appart from the connection bewteen the OVS switches (because it is tunnelled).

Starting point

  • I have 2 ubuntu-based nodes (ovsnode01 and ovsnode02), with 1 network interface each, and they are able to ping one to each other:
root@ovsnode01:~# ping ovsnode01 -c 3
PING ovsnode01 (10.10.2.21) 56(84) bytes of data.
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=3 ttl=64 time=0.033 ms

--- ovsnode01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.022/0.027/0.033/0.007 ms
root@ovsnode01:~# ping ovsnode02 -c 3
PING ovsnode02 (10.10.2.22) 56(84) bytes of data.
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=1 ttl=64 time=1.45 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=2 ttl=64 time=0.683 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=3 ttl=64 time=0.756 ms

--- ovsnode02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.683/0.963/1.451/0.347 ms
  • I am able to create lxc containers, by issuing commands such as:
lxc-create -n node1 -t ubuntu
  • I am able to create linux bridges (i.e. I have installed the bridge-utils package).

Spoiler (or quick setup)

If you just want the solution, here you have (later I will explain all the steps). On each node you can follow the next steps in ovsnode01:

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01:~# lxc-start -dn node01c01
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01:~# lxc-start -dn node01c02
ovsnode01:~# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

You will need to follow the same steps for ovsnode02, taking into account the names of the containers and the IP addresses.

Preparing the newtork

First we are going to create a bridge (br-cont0) to which the containers are being bridged

We are not using lxcbr0 because it may have other services such as dnsmasq that we should disable before. Moreover, creating our bridge will be more interesting to understand what we are doing

We will issue this command on both ovsnode01 and ovsnode02 nodes.

brctl addbr br-cont0

As this bridge has no IP, it is down (you can see it using ip command). Now we are going to set it up (also in both nodes):

ip link set dev br-cont0 up

Creating the containers

Now we need a template to associate the network to the containers. So we have to create the file container-template.lxc:

cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF

In this file we are saying that containers should be automatically bridged to bridge br-cont0 (remember that we created it before) and they will hace an interface with a hardware address that will follow the template. We could also modify the file /etc/lxc/default.conf file instead of creating a new one.

Now, we can create the containers on node ovsnode01:

ovsnode01# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01# lxc-start -dn node01c01
ovsnode01# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01# lxc-start -dn node01c02

And also in ovsnode02:

ovsnode02# lxc-create-f ./container-template.lxc -n node02c01 -t ubuntu
ovsnode02# lxc-start -dn node02c01
ovsnode02# lxc-create-f ./container-template.lxc -n node02c02 -t ubuntu
ovsnode02# lxc-start -dn node02c02

If you followed my steps, the containers will not have any IP address. This is because you do not have any dhcp server and the containers do not have static IP addresses. And that is what we are going to do now.

We are setting the IP addresses 192.168.1.11 and 192.168.1.21 to node01c01 and node0201 respetively. In order to do so, we have to issue these two commands (each of them in the corresponding node):

ovsnode01# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0

You need to configure the containers in ovsnode02 too:

ovsnode02# lxc-attach -n node02c01 -- ip addr add 192.168.1.21/24 dev eth0
ovsnode02# lxc-attach -n node02c02 -- ip addr add 192.168.1.22/24 dev eth0

Making connections

At this point, you should be able to connect between the containers in the same node: between node01c01 and node01c02, and between node02c01 and node02c02. But not between containers in different nodes.

ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.11 -c3
PING 192.168.1.11 (192.168.1.11) 56(84) bytes of data.
64 bytes from 192.168.1.11: icmp_seq=1 ttl=64 time=0.136 ms
64 bytes from 192.168.1.11: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 192.168.1.11: icmp_seq=3 ttl=64 time=0.055 ms
ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.21 -c3
PING 192.168.1.21 (192.168.1.21) 56(84) bytes of data.
From 192.168.1.11 icmp_seq=1 Destination Host Unreachable
From 192.168.1.11 icmp_seq=2 Destination Host Unreachable
From 192.168.1.11 icmp_seq=3 Destination Host Unreachable

This is because br-cont0 is somehow a classic “network hub”, in which all the traffic can be listened by all the devices in it. So the containers in the same bridge are in the same LAN (in the same cable, indeed). But there is no connection between the hubs, and we will make it using Open vSwitch.

In the case of Ubuntu, installing OVS is as easy as issuing the next command:

apt-get install openvswitch-switch

Simple, isn’t it? But now we have to prepare the network and we are going to create a virtual switch (on both ovsnode01 and ovsnode02):

ovs-vsctl add-br ovsbr0

Open vSwitch works like a physical switch, with ports that can be connected and so on… And we are going to connect our hub to our switch (i.e. our bridge to the virtual switch):

ovs-vsctl add-port ovsbr0 br-cont0

We’ll make it in both ovsnode01 and ovsnode02

Finally, we’ll connect the ovs switches between them, using a vxlan tunnel:

ovsnode01# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22
ovsnode02# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

We’ll make each of the two commands above on the corresponding node. Take care that the remote IP addresses are set to the other node 😉

We can check the final configuration of the nodes (let’s show only ovsnode01, but the other is very similar):

ovsnode01# brctl show
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe502f26ea2d no veth3BUL7S
                              vethYLRPM2

ovsnode01# ovs-vsctl show
2096d83a-c7b9-47a8-8fff-d38c6d5ab04d
 Bridge "ovsbr0"
     Port "ovsbr0"
         Interface "ovsbr0"
             type: internal
     Port "vxlan0"
         Interface "vxlan0"
             type: vxlan
             options: {remote_ip="10.10.2.22"}
 ovs_version: "2.0.2"

Warning

Using this set up as is, you will get ping working, but probably no other traffic. This is because the traffic is encapsulated in a transport network. Did you know about MTU?

If we check the eth0 interface from one container we’ll get this:

# lxc-attach -n node01c01 -- ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:3e:52:42:2f 
 inet addr:192.168.1.11 Bcast:0.0.0.0 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Pay attention to the MTU value (it is 1500). And if we check the MTU of eth0 from the node, we’ll get this:

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 60:00:00:00:20:15 
 inet addr:10.10.2.21 Bcast:10.10.2.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Summarizing, MTU is the size of the message for ethernet, which usually is 1500. But we are sending messages into messages and if we try to use it as is, we are trying to send things that have a size of “1500 + some overhead” in a room of “1500” (we are conciously omiting the units). And “1500 + some overhead” is bigger than “1500” and that is why it will not work.

We have to change the MTU of the containers to a lower size. It is as simple as:

ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400

ovsnode02:~# lxc-attach -n node02c01 -- ifconfig eth0 mtu 1400
ovsnode02:~# lxc-attach -n node02c02 -- ifconfig eth0 mtu 1400

This method is not persistent, and will be lost in case of rebooting the container. In order to persist it, you can set it in the DHCP server (in case that you are using it), or in the network device set up. In the case of ubuntu it is as simple as adding a line with ‘mtu 1400’ to the proper device in /etc/network/interfaces. As an example for container node01c01:

auto eth0
iface eth0 inet static
address 192.168.1.11
netmask 255.255.255.0
mtu 1400

Some words on Virtual Machines

If you have some expertise on Virtual Machines and Linux (I suppose that if you are following this how-to, this is your case), you should be able to make all the set-up for your VMs. When you create your VM, you simply need to bridge the interfaces of the VM to the bridge that we have created (br-cont0) and that’s all.

And now, what?

Well, now we have what we wanted to have. And now you can play with it. I suggest to create a DHCP server (to not to have to set the MTU and the IP addresses), a router, a DNS server, etc.

And as an advanced option, you can play with traffic tagging, in order to create multiple overlay networks and isolating them from each other.

 

 

How to set the hostname from DHCP in Ubuntu 14.04

I have a lot of virtual servers, and I like preparing a “golden image” and instantiate it many times. One of the steps is to set the hostname for the host, from my private DHCP server. It usually works, but sometimes it fails and I didn’t know why. So I got tired of such indetermination and this time…

I learned how to set the hostname from DHCP in Ubuntu 14.04

Well, so I have my dnsmasq server that acts both as a dns server and as a dhcp server. I investigated how a host gets its hostname from DHCP and it seems that it occurs when the DHCP server sends it by default (option hostname).

I debugged the DHCP messages using this command from other server

# tcpdump -i eth1 -vvv port 67 or port 68

And I saw that my dhcp server was properly sending the hostname (dnsmasq sends it by default), and I realized that the problem was in the dhclient script. I googled a lot and I found some clever scripts that got the name from a nameserver and were started as hooks from the dhcpclient script. But if the dhcp protocol sets the hostname, why do I have to create other script to set the hostname???.

Finally I got this blog entry and I realized that it was a bug in the dhclient script: if there exists an old dhcp entry in /var/lib/dhcp/dhclient.eth0.leases (or eth1, etc.), it does not set the hostname.

At this point you have two options:

  • The easy: in /etc/rc.local include a line that removes that file
  • The clever (suggested in the blog entry): include a hook script that unsets the hostname
echo unset old_host_name >/etc/dhcp/dhclient-enter-hooks.d/unset_old_hostname

How to configure a simple router with iptables in Ubuntu

If you have a server with two network cards, you can set a simple router that NATs a private range to a public one, by simply installing iptables and configuring it. This time…

I learned how to configure a simple router with iptables in Ubuntu

Scenario

  1. I have one server with two NICs: eth0 and eth1, and several servers that have at least one NIC (e.g. eth1).
  2. The eth0 NIC is connected to a public network, with IP 111.22.33.44
  3. I want that the other servers have access to the public network through the main server.

How to do it

I will make it using IPTables, so I need to install IPTables

$ apt-get install iptables

My main server has the IP address 10.0.0.1 in eth1. The other servers also have their IPs in the range of 10.0.0.x (either using static IPs or DHCP).

Now I will create some iptables rules in the server, by adding these lines to /etc/rc.local file just before the exit 0 line.

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 ! -d 10.0.0.0/24 -j MASQUERADE
iptables -A FORWARD -d 10.0.0.0/24 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s 10.0.0.0/24 -i eth1 -j ACCEPT

These rules mean that:

  1. I want to forward traffic
  2. The traffic that comes from the network 10.0.0.x and is not directed to the network gains access to the internet through NAT.
  3. We forward the traffic from the connections made from the internal network to the origin IP.
  4. We accept the traffic that cames from the internal network and goes to the internal network.

Easier to modify

Here it is a script that you can use to customize the NAT for your site:

ovsnode01:~# cat > enable_nat <<\EOF
#!/bin/bash
IFACE_WAN=eth0
IFACE_LAN=eth1
NETWORK_LAN=10.0.0.0/24

case "$1" in
start)
echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
exit 0;;
stop)
iptables -t nat -D POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -D FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -D FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
exit 0;;
esac
exit 1
EOF
ovsnode01:~# chmod +x enable_nat
ovsnode01:~# ./enable_nat start

Now you can use this script in the ubuntu startup. Just move it to /etc/init.d and issue the next command:

update-rc.d enable_nat defaults 99 00

How to disable IPv6 in Ubuntu 14.04

I am tired of trying to update Ubuntu and see that it is trying to use the IPv6 that hangs the installation (ocasionally forever). My solution still is to disable IPv6 until it is properly used, so this time…

I learned how to disable IPv6 in Ubuntu 14.04

There are a lot of resources on how to disable IPv6 in Ubuntu 14.04, and this is yet another one (but this is mine).

It is as easy as

$ sudo su -
$ cat >> /etc/sysctl.conf << EOT
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
EOT
$ sysctl -p

To check if it has worked, you can issue a command like

$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If it returns 1, it has been properly disabled. Otherwise try to reboot and check it again.