How to connect complex networking infrastructures with Open vSwitch and LXC containers

Some days ago, I learned How to create an overlay network using Open vSwitch in order to connect LXC containers. To extend the set-up that I built there, I wanted to introduce some services (a DHCP server, a router, etc.) and create a more complex infrastructure. And so this time I learned…

How to connect complex networking infrastructures with Open vSwitch and LXC containers

My setup is based on the previous one, extended to introduce common services for networked environments. In particular, I am going to create a router and a DHCP server. So I will have two nodes that host LXC containers, and the set-up will have the following features:

  • Any container in any node will get an IP address from the single DHCP server.
  • Any container will have access to the internet through the single router.
  • The containers will be able to connect between them using their private IP addresses.

We start from the set-up shown in the next figure:

[Figure: the set-up from the previous post]

And now we want to get to the following set-up:

[Figure: the target set-up, with a DHCP server and a router added]

Well… we are not doing anything new here, because we have worked with all of this before in How to create a multi-LXC infrastructure using custom NAT and DHCP server. But we can see this post as an integration exercise.

Update of the previous setup

On each of the nodes we have to create the bridge br-cont0 and the containers that we want. Moreover, we have to create the virtual switch ovsbr0 and connect it to the other node.

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c01 -t ubuntu
ovsnode01:~# lxc-create -f ./internal-network.tmpl -n node01c02 -t ubuntu
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-br ovsbr0
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

Warning: we are not starting the containers yet, because we want them to get their IP addresses from our DHCP server.

Preparing a bridge to the outer world (NAT bridge)

We need a bridge through which the router in our LAN will reach the external world. This is because we only have two known IP addresses (the one for ovsnode01 and the one for ovsnode02), so we’ll provide access to the Internet through one of them (according to the figure, it will be ovsnode01).

So we will create the bridge and will give it a local IP address:

ovsnode01:~# brctl addbr br-out
ovsnode01:~# ip addr add dev br-out 10.0.1.1/24

And now we will provide access to the containers that connect to that bridge through NAT. So let’s create the following script and execute it:

ovsnode01:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth0
IFACE_LAN=br-out
NETWORK_LAN=10.0.1.0/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
ovsnode01:~# chmod +x enable_nat.sh
ovsnode01:~# ./enable_nat.sh

And that’s all. Now ovsnode01 will act as a router for IP addresses in the range 10.0.1.0/24.
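As an optional check (not part of the original steps), you can verify that forwarding is enabled and that the MASQUERADE rule is in place:

# forwarding should report 1, and the nat table should show the MASQUERADE rule for 10.0.1.0/24
ovsnode01:~# sysctl net.ipv4.ip_forward
ovsnode01:~# iptables -t nat -L POSTROUTING -n -v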

DHCP server

Creating a DHCP server is as easy as creating a new container, installing dnsmasq and configuring it.

ovsnode01:~# cat > ./nat-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-out 
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f nat-network.tmpl -n dhcpserver -t ubuntu
ovsnode01:~# lxc-start -dn dhcpserver
ovsnode01:~# lxc-attach -n dhcpserver -- bash -c 'echo "nameserver 8.8.8.8" > /etc/resolv.conf
ip addr add 10.0.1.2/24 dev eth0
route add default gw 10.0.1.1'
ovsnode01:~# lxc-attach -n dhcpserver

WARNING: we created the container attached to br-out because we want it to have internet access to be able to install dnsmasq. Moreover, we needed to give it an IP address and set the nameserver to Google's. Once the dhcpserver is configured, we'll change its configuration to attach it to br-cont0, because the dhcpserver only needs access to the internal network.

Now we have to install dnsmasq:

apt-get update
apt-get install -y dnsmasq

Now we'll configure a static IP address (172.16.1.202) by modifying the file /etc/network/interfaces:

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet static
address 172.16.1.202
netmask 255.255.255.0
EOF

And finally, we’ll configure dnsmasq

cat > /etc/dnsmasq.conf << EOF
interface=eth0
except-interface=lo
listen-address=172.16.1.202
bind-interfaces
dhcp-range=172.16.1.1,172.16.1.200,1h
dhcp-option=26,1400
dhcp-option=option:router,172.16.1.201
EOF

In this configuration we have defined our range of IP addresses (from 172.16.1.1 to 172.16.1.200). We have stated that our router will have the IP address 172.16.1.201 and, one important thing, we have set the MTU to 1400 (remember that when using OVS we had to set the MTU to a lower size).
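If you want to make sure that dnsmasq will accept this configuration before moving the container to the internal bridge, a quick optional check is to run its syntax test (the service itself will be restarted anyway when we reboot the container in the next step):

# dnsmasq parses /etc/dnsmasq.conf and reports whether the syntax is OK
ovsnode01:~# lxc-attach -n dhcpserver -- dnsmasq --test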

Now we are ready to connect the container to br-cont0. In order to do so, we have to modify the file /var/lib/lxc/dhcpserver/config. In particular, we have to change the value of the attribute lxc.network.link from br-out to br-cont0. Once I modified it, my network configuration in that file is as follows:

# Network configuration
lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br-cont0
lxc.network.hwaddr = 00:16:3e:9f:ae:3f

Finally we can reboot our container

ovsnode01:~# lxc-stop -n dhcpserver 
ovsnode01:~# lxc-start -dn dhcpserver

And we can check that our server gets the proper IP address:

root@ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 
dhcpserver RUNNING 0 - 172.16.1.202 -

We could also check that it is connected to the bridge:

ovsnode01:~# ip addr
...
83: vethGUV3HB: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-cont0 state UP group default qlen 1000
 link/ether fe:86:05:f6:f4:55 brd ff:ff:ff:ff:ff:ff
ovsnode01:~# brctl show br-cont0
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe3b968e0937 no vethGUV3HB
ovsnode01:~# lxc-attach -n dhcpserver -- ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default 
 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
82: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
 link/ether 00:16:3e:9f:ae:3f brd ff:ff:ff:ff:ff:ff
ovsnode01:~# ethtool -S vethGUV3HB
NIC statistics:
 peer_ifindex: 82

Don't worry if you do not fully understand this output; it is a rather advanced detail for this post. The important thing is that bridge br-cont0 holds the device vethGUV3HB, whose index is 83, and its peer interface is number 82, which is in fact the eth0 device inside the container.

Installing the router

Now that we have our dhcpserver ready, we are going to create a container that will act as a router for our network. It is very easy (in fact, we have already created a router). And… this raises a question: why are we creating another router?

We create a new router because it has to have an IP address inside the private network and another interface in the network to which we want to provide access from the internal network.

Once this is clear, let's create the router, which has one interface in the internal network bridge (br-cont0) and another one in br-out:

ovsnode01:~# cat > ./router-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
lxc.network.type = veth
lxc.network.link = br-out
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -t ubuntu -f router-network.tmpl -n router

WARNING: I don't know why, but for some reason lxc 2.0.3 on Ubuntu 14.04 sometimes fails to start containers that are created with two NICs.

Now we can start the container and start to work with it:

ovsnode01:~# lxc-start -dn router
ovsnode01:~# lxc-attach -n router

Now we simply have to configure the IP addresses for the router (eth0 is the interface in the internal network, bridged to br-cont0, and eth1 is bridged to br-out)

cat > /etc/network/interfaces << EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 172.16.1.201
netmask 255.255.255.0

auto eth1
iface eth1 inet static
address 10.0.1.2
netmask 255.255.255.0
gateway 10.0.1.1
EOF

And finally enable NAT in the router, using a script similar to the previous one:

router:~# apt-get install -y iptables
router:~# cat > enable_nat.sh <<\EOF
#!/bin/bash
IFACE_WAN=eth1
IFACE_LAN=eth0
NETWORK_LAN=172.16.1.201/24

echo "1" > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o $IFACE_WAN -s $NETWORK_LAN ! -d $NETWORK_LAN -j MASQUERADE
iptables -A FORWARD -d $NETWORK_LAN -i $IFACE_WAN -o $IFACE_LAN -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -s $NETWORK_LAN -i $IFACE_LAN -j ACCEPT
EOF
router:~# chmod +x enable_nat.sh
router:~# ./enable_nat.sh

Now we have our router ready to be used.
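As an optional check (assuming the router's interfaces have already been brought up, e.g. by restarting the container), you can verify that the router itself reaches the outside world through ovsnode01:

# the router should reach the Internet via its default gateway 10.0.1.1
ovsnode01:~# lxc-attach -n router -- ping -c 3 8.8.8.8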

Starting the containers

Now we can simply start the containers that we created before, and we can check that they get an IP address by DHCP:

ovsnode01:~# lxc-start -n node01c01
ovsnode01:~# lxc-start -n node01c02
ovsnode01:~# lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6
dhcpserver RUNNING 0 - 172.16.1.202 -
node01c01 RUNNING 0 - 172.16.1.39 -
node01c02 RUNNING 0 - 172.16.1.90 -
router RUNNING 0 - 10.0.1.2, 172.16.1.201 -

We can also trace all the hops in our network, to check that it is properly configured:

ovsnode01:~# lxc-attach -n node01c01 -- apt-get install traceroute
(...)
ovsnode01:~# lxc-attach -n node01c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.085 ms 0.040 ms 0.041 ms
 2 10.0.1.1 0.079 ms 0.144 ms 0.067 ms
 3 10.10.2.201 0.423 ms 0.517 ms 0.514 ms
...
12 216.58.210.131 8.247 ms 8.096 ms 8.195 ms

Now we can go to the other host and create the bridges, the virtual switch and the containers, as we did in the previous post.

WARNING: As a reminder, I leave this snippet of code here:

ovsnode02:~# brctl addbr br-cont0
ovsnode02:~# ip link set dev br-cont0 up
ovsnode02:~# cat > ./internal-network.tmpl << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node02c01 -t ubuntu
ovsnode02:~# lxc-create -f ./internal-network.tmpl -n node02c02 -t ubuntu
ovsnode02:~# apt-get install openvswitch-switch
ovsnode02:~# ovs-vsctl add-br ovsbr0
ovsnode02:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode02:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

And finally, we can start the containers and check that they get IP addresses from the DHCP server, and that they have connectivity to the internet using the routers that we have created:

ovsnode02:~# lxc-start -n node02c01
ovsnode02:~# lxc-start -n node02c02
ovsnode02:~# lxc-ls -f
NAME STATE IPV4 IPV6 AUTOSTART
-------------------------------------------------
node02c01 RUNNING 172.16.1.50 - NO
node02c02 RUNNING 172.16.1.133 - NO
ovsnode02:~# lxc-attach -n node02c01 -- apt-get install traceroute
(...)
ovsnode02:~# lxc-attach -n node02c01 -- traceroute -n www.google.es
traceroute to www.google.es (216.58.210.131), 30 hops max, 60 byte packets
 1 172.16.1.201 0.904 ms 0.722 ms 0.679 ms
 2 10.0.1.1 0.853 ms 0.759 ms 0.918 ms
 3 10.10.2.201 1.774 ms 1.496 ms 1.603 ms
...
12 216.58.210.131 8.849 ms 8.773 ms 9.062 ms

What is next?

Well, you'd probably want to persist the settings. You could, for instance, set the iptables rules (i.e. the enable_nat.sh script) as a start-up script in /etc/init.d or call it from /etc/rc.local.
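For example, a minimal (untested) sketch for Ubuntu 14.04 would be to keep a copy of the script and call it from /etc/rc.local, assuming your rc.local still ends with the usual "exit 0" line:

# copy the script somewhere permanent and call it at boot, just before the final "exit 0"
ovsnode01:~# cp enable_nat.sh /usr/local/bin/enable_nat.sh
ovsnode01:~# sed -i '/^exit 0/i /usr/local/bin/enable_nat.sh' /etc/rc.local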

As further work, you can try VLAN tagging in OVS to duplicate the networks using the same components while keeping the different networks isolated from each other.
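Just as a hint of how that could look (a sketch, not something tested in this post), OVS lets you tag the ports that connect each internal bridge, so that bridges with different tags carry isolated networks even though they share ovsbr0 and the same VXLAN tunnel; the bridge name br-cont1 below is hypothetical:

# put the existing internal bridge on VLAN 100
ovsnode01:~# ovs-vsctl set port br-cont0 tag=100
# a second internal bridge (hypothetical br-cont1) would carry a separate, isolated network on VLAN 200
ovsnode01:~# brctl addbr br-cont1
ovsnode01:~# ip link set dev br-cont1 up
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont1 tag=200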

You can also try to include new services (e.g. a private DNS server, a reverse NAT, etc.).


How to create an overlay network using Open vSwitch in order to connect LXC containers.

Open vSwitch (OVS) is a virtual switch implementation that can be used as a tool for Software Defined Networking (SDN). The concepts managed by OVS are the same as those managed by hardware switches (ports, routes, etc.). The most important features (for me) of OVS are:

  • It enables having a virtual switch inside a host to which virtual machines, containers, etc. can be connected.
  • It enables connecting switches on different hosts using network tunnels (GRE or VXLAN).
  • It is possible to program the switch using OpenFlow.

In order to start working with OVS, this time…

I learned how to create an overlay network using Open vSwitch in order to connect LXC containers.

Well, first of all, I have to say that I have used LXC instead of VMs because containers are lightweight and very straightforward to use in an Ubuntu distribution. But if you understand the concepts, you should be able to follow this how-to using VMs instead of containers.

What we want (Our set-up)

What we are creating is shown in the next figure:

[Figure: the target set-up]

We have two nodes (ovsnodeXX) and several containers deployed on them (nodeXXcYY). The result is that all the containers can connect to each other, but their traffic is not seen on the LAN 10.10.2.x apart from the connection between the OVS switches (because it is tunnelled).

Starting point

  • I have 2 Ubuntu-based nodes (ovsnode01 and ovsnode02), with 1 network interface each, and they are able to ping each other:
root@ovsnode01:~# ping ovsnode01 -c 3
PING ovsnode01 (10.10.2.21) 56(84) bytes of data.
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=1 ttl=64 time=0.022 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from ovsnode01 (10.10.2.21): icmp_seq=3 ttl=64 time=0.033 ms

--- ovsnode01 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.022/0.027/0.033/0.007 ms
root@ovsnode01:~# ping ovsnode02 -c 3
PING ovsnode02 (10.10.2.22) 56(84) bytes of data.
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=1 ttl=64 time=1.45 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=2 ttl=64 time=0.683 ms
64 bytes from ovsnode02 (10.10.2.22): icmp_seq=3 ttl=64 time=0.756 ms

--- ovsnode02 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 0.683/0.963/1.451/0.347 ms
  • I am able to create lxc containers, by issuing commands such as:
lxc-create -n node1 -t ubuntu
  • I am able to create linux bridges (i.e. I have installed the bridge-utils package).

Spoiler (or quick setup)

If you just want the solution, here it is (later I will explain all the steps). These are the steps for ovsnode01:

ovsnode01:~# brctl addbr br-cont0
ovsnode01:~# ip link set dev br-cont0 up
ovsnode01:~# cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01:~# lxc-start -dn node01c01
ovsnode01:~# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01:~# lxc-start -dn node01c02
ovsnode01:~# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400
ovsnode01:~# apt-get install openvswitch-switch
ovsnode01:~# ovs-vsctl add-br ovsbr0
ovsnode01:~# ovs-vsctl add-port ovsbr0 br-cont0
ovsnode01:~# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22

You will need to follow the same steps on ovsnode02, adjusting the names of the containers, the IP addresses and the remote_ip of the tunnel.

Preparing the network

First we are going to create a bridge (br-cont0) to which the containers will be bridged.

We are not using lxcbr0 because it may have other services attached (such as dnsmasq) that we would have to disable first. Moreover, creating our own bridge makes it easier to understand what we are doing.

We will issue this command on both ovsnode01 and ovsnode02 nodes.

brctl addbr br-cont0

As this bridge has no IP address yet, it is down (you can check it using the ip command). Now we are going to bring it up (also on both nodes):

ip link set dev br-cont0 up

Creating the containers

Now we need a template to associate the network to the containers. So we have to create the file container-template.lxc:

cat > ./container-template.lxc << EOF
lxc.network.type = veth
lxc.network.link = br-cont0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:xx:xx:xx
EOF

In this file we are saying that containers should be automatically bridged to bridge br-cont0 (remember that we created it before) and that they will have an interface with a hardware address that follows the template. We could also modify the file /etc/lxc/default.conf instead of creating a new one.
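If you prefer the /etc/lxc/default.conf route, a hedged sketch would be to point its lxc.network.link entry at our bridge (back the file up first, since this affects every container created afterwards):

# make new containers attach to br-cont0 by default (LXC 1.x syntax on Ubuntu 14.04)
cp /etc/lxc/default.conf /etc/lxc/default.conf.bak
sed -i 's/^lxc.network.link.*/lxc.network.link = br-cont0/' /etc/lxc/default.conf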

Now, we can create the containers on node ovsnode01:

ovsnode01# lxc-create -f ./container-template.lxc -n node01c01 -t ubuntu
ovsnode01# lxc-start -dn node01c01
ovsnode01# lxc-create -f ./container-template.lxc -n node01c02 -t ubuntu
ovsnode01# lxc-start -dn node01c02

And also in ovsnode02:

ovsnode02# lxc-create -f ./container-template.lxc -n node02c01 -t ubuntu
ovsnode02# lxc-start -dn node02c01
ovsnode02# lxc-create -f ./container-template.lxc -n node02c02 -t ubuntu
ovsnode02# lxc-start -dn node02c02

If you followed my steps, the containers will not have any IP address. This is because you do not have any dhcp server and the containers do not have static IP addresses. And that is what we are going to do now.

We are assigning the IP addresses 192.168.1.11 and 192.168.1.12 to node01c01 and node01c02, respectively (and 192.168.1.21 and 192.168.1.22 to the containers in ovsnode02). In order to do so, we have to issue these commands, each of them in the corresponding node:

ovsnode01# lxc-attach -n node01c01 -- ip addr add 192.168.1.11/24 dev eth0
ovsnode01# lxc-attach -n node01c02 -- ip addr add 192.168.1.12/24 dev eth0

You need to configure the containers in ovsnode02 too:

ovsnode02# lxc-attach -n node02c01 -- ip addr add 192.168.1.21/24 dev eth0
ovsnode02# lxc-attach -n node02c02 -- ip addr add 192.168.1.22/24 dev eth0

Making connections

At this point, you should be able to connect between the containers in the same node: between node01c01 and node01c02, and between node02c01 and node02c02. But not between containers in different nodes.

ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.11 -c3
PING 192.168.1.11 (192.168.1.11) 56(84) bytes of data.
64 bytes from 192.168.1.11: icmp_seq=1 ttl=64 time=0.136 ms
64 bytes from 192.168.1.11: icmp_seq=2 ttl=64 time=0.051 ms
64 bytes from 192.168.1.11: icmp_seq=3 ttl=64 time=0.055 ms
ovsnode01# lxc-attach -n node01c01 -- ping 192.168.1.21 -c3
PING 192.168.1.21 (192.168.1.21) 56(84) bytes of data.
From 192.168.1.11 icmp_seq=1 Destination Host Unreachable
From 192.168.1.11 icmp_seq=2 Destination Host Unreachable
From 192.168.1.11 icmp_seq=3 Destination Host Unreachable

This is because br-cont0 behaves somewhat like a classic “network hub”, in which all the traffic can be seen by all the devices connected to it. So the containers in the same bridge are in the same LAN (on the same cable, indeed). But there is no connection between the hubs yet, and we will create it using Open vSwitch.

In the case of Ubuntu, installing OVS is as easy as issuing the next command:

apt-get install openvswitch-switch

Simple, isn’t it? But now we have to prepare the network and we are going to create a virtual switch (on both ovsnode01 and ovsnode02):

ovs-vsctl add-br ovsbr0

Open vSwitch works like a physical switch, with ports that can be connected and so on… And we are going to connect our hub to our switch (i.e. our bridge to the virtual switch):

ovs-vsctl add-port ovsbr0 br-cont0

We'll do this on both ovsnode01 and ovsnode02.

Finally, we’ll connect the ovs switches between them, using a vxlan tunnel:

ovsnode01# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.22
ovsnode02# ovs-vsctl add-port ovsbr0 vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=10.10.2.21

We'll run each of the two commands above on the corresponding node. Take care that the remote IP address is set to the other node 😉

We can check the final configuration of the nodes (let’s show only ovsnode01, but the other is very similar):

ovsnode01# brctl show
bridge name bridge id STP enabled interfaces
br-cont0 8000.fe502f26ea2d no veth3BUL7S
                              vethYLRPM2

ovsnode01# ovs-vsctl show
2096d83a-c7b9-47a8-8fff-d38c6d5ab04d
 Bridge "ovsbr0"
     Port "ovsbr0"
         Interface "ovsbr0"
             type: internal
     Port "vxlan0"
         Interface "vxlan0"
             type: vxlan
             options: {remote_ip="10.10.2.22"}
 ovs_version: "2.0.2"

Warning

Using this set up as is, you will get ping working, but probably no other traffic. This is because the traffic is encapsulated in a transport network. Did you know about MTU?

If we check the eth0 interface from one container we’ll get this:

# lxc-attach -n node01c01 -- ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:16:3e:52:42:2f 
 inet addr:192.168.1.11 Bcast:0.0.0.0 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Pay attention to the MTU value (it is 1500). And if we check the MTU of eth0 from the node, we’ll get this:

# ifconfig eth0
eth0 Link encap:Ethernet HWaddr 60:00:00:00:20:15 
 inet addr:10.10.2.21 Bcast:10.10.2.255 Mask:255.255.255.0
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 ...

Summarizing, the MTU is the maximum size of the message for Ethernet, which usually is 1500. But we are sending messages inside messages, so if we try to use it as is, we are trying to fit something of size “1500 + some overhead” into a room of size “1500” (we are consciously omitting the units). And “1500 + some overhead” is bigger than “1500”, and that is why it will not work.
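A hedged way to see the problem before fixing it: a normal ping (small payload) crosses the tunnel, but a full-size packet with the don't-fragment flag set should fail or hang while the container MTU is still 1500 (the exact limit depends on the VXLAN overhead, roughly 50 bytes):

# a small ping crosses the tunnel fine
ovsnode01:~# lxc-attach -n node01c01 -- ping -c 1 192.168.1.21
# 1472 bytes of payload + 28 bytes of headers = 1500 bytes, which no longer fits once encapsulated
ovsnode01:~# lxc-attach -n node01c01 -- ping -M do -s 1472 -c 1 192.168.1.21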

We have to change the MTU of the containers to a lower size. It is as simple as:

ovsnode01:~# lxc-attach -n node01c01 -- ifconfig eth0 mtu 1400
ovsnode01:~# lxc-attach -n node01c02 -- ifconfig eth0 mtu 1400

ovsnode02:~# lxc-attach -n node02c01 -- ifconfig eth0 mtu 1400
ovsnode02:~# lxc-attach -n node02c02 -- ifconfig eth0 mtu 1400

This method is not persistent, and the setting will be lost when the container reboots. In order to persist it, you can set it in the DHCP server (in case you are using one), or in the network device configuration. In the case of Ubuntu it is as simple as adding a line with ‘mtu 1400’ to the proper device in /etc/network/interfaces. As an example, for container node01c01:

auto eth0
iface eth0 inet static
address 192.168.1.11
netmask 255.255.255.0
mtu 1400
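If your containers get their configuration from a dnsmasq-based DHCP server instead (as in the companion post), the equivalent approach is to push the MTU with DHCP option 26 from dnsmasq.conf; a one-line sketch:

# in the DHCP server's /etc/dnsmasq.conf: option 26 sets the interface MTU of the clients
dhcp-option=26,1400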

Some words on Virtual Machines

If you have some expertise with Virtual Machines and Linux (I suppose that if you are following this how-to, this is your case), you should be able to do the whole set-up with VMs. When you create your VM, you simply need to bridge its interfaces to the bridge that we have created (br-cont0), and that's all.
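For instance, with libvirt the only network-related decision is the bridge name; a hypothetical sketch (the VM name, disk path and ISO are placeholders, not part of this set-up) could be:

# create a KVM guest whose NIC is bridged to br-cont0 (names and paths are just examples)
virt-install --name testvm01 --ram 1024 --vcpus 1 \
  --disk path=/var/lib/libvirt/images/testvm01.qcow2,size=8 \
  --cdrom /isos/ubuntu-14.04-server-amd64.iso \
  --network bridge=br-cont0 --graphics vnc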

And now, what?

Well, now we have what we wanted. And now you can play with it. I suggest creating a DHCP server (so you do not have to set the MTU and the IP addresses by hand), a router, a DNS server, etc.

And as an advanced option, you can play with traffic tagging, in order to create multiple overlay networks and isolate them from each other.

How to Recover Partitions from LVM Volume

Yesterday I had a problem with a disk… while trying to increase the size of an LVM volume, I lost the disk. What I did was: add the new disk to the LVM volume, mirror the volume and remove the original disk. But the problem is that I added the original disk back again and things started to go wrong. The volume did not boot, etc.

The original system was a Scientific Linux 6.7 and it had different partitions: one /boot ext4 partition and an LVM volume that contained 2 logical volumes: lv-root and lv-swap.

At the end of the LVM problem I had the original volume with bad LVM metadata that did not allow me to access the original information. Luckily I had not written any other data to the disks… so the information had to still be there.

Once the disaster was there…

I learned how to recover the partitions from a LVM volume.

I had to recover the partitions in order to create a new disk containing them.

Recovering partitions

After some failed tests, I got to the situation in which I had /dev/sda with a single GPT partition on it. I remembered TestDisk, which is a tool that helps with forensics, and I started to play with it.

[Screenshot 1]

The first thing I did was to try to figure out what I could do with my partition. So I booted the system with an Ubuntu 14.04 desktop LiveCD (with /dev/sda attached), downloaded TestDisk and tried:

$ ./testdisk_static /dev/sda

Then I chose the GPT option and analysed the disk. TestDisk found two partitions: my boot partition and an LVM partition. I wrote the partition table and went back to the initial page, where I entered the advanced options to dump the partition (Image Creation).

[Screenshot 2]

With that, I had the boot partition dumped to a file, which could be used like a raw partition image created with dd.

Then I exited the app and started it again, because the LVM partition was now accessible as /dev/sda2. Now I tried:

$ ./testdisk_static /dev/sda2

Now I selected the disk and chose the Intel partition option. TestDisk found the two partitions: the Linux one and the Linux Swap one.

[Screenshot 3]

And now I dumped the Linux partition.

Disclaimer

This is my specific organization for the disk, but the key is that TestDisk helped me to figure out where the partitions were and to get their raw images.

Creating the new disk

Now I have the partition images: image.dd (for the boot partition) and sda1.dd (for the LVM partition), and I have to create a new disk. So I booted the Ubuntu desktop again, with a new disk attached (/dev/sdb).

The first thing is to get the size of the partitions, using the fdisk utility on the dumped files:

$ fdisk -l image.dd
$ fdisk -l sda1.dd

Using these commands I get the size in sectors of each image. With those sizes I can create the partitions in /dev/sdb. In my case, I created one partition for /boot and another for the filesystem. My options were the following (please pay attention to the screenshots to check that the sector counts of the new partitions match the sizes of the dumped partitions).

$ fdisk /dev/sdb
n
<enter>
<enter>
<enter>
+1024000
n
<enter>
<enter>
<enter>
+15810560
w

[Screenshot 3]

The key is to pay attention to the size (in sectors) of the partitions obtained with the fdisk -l command issued before. If you have any doubt, please check the images.

And now you are ready to dump the images in the disk:

$ dd if=image.dd of=/dev/sdb1
...
$ dd if=sda1.dd of=/dev/sdb2
...

Check this cool tip!

The dd process takes a long time. If you want to see the progress, you can open another command line and issue the command:

$ killall -USR1 dd

The dd command will then print the amount of data it has copied so far in its console. If you want to see the progress periodically, you can issue a command like this one:

$ watch -n 10 killall -USR1 dd

This will make dd output the amount dumped every 10 seconds.

More on this

Once I had the partitions dumped, I used gparted to resize the second partition (as it was almost full). My new disk was far bigger than the original one, but if you only want to get the data or you have enough free space, this probably won't be useful for you (so I am skipping it).

How to compact a QCOW2 or a VMDK file

When you create a Virtual Machine (VM), you usually have the option of using a format that reserves the whole size of the disk (e.g. RAW), or a format that grows according to the space actually used in the disk (e.g. QCOW2 or VMDK).

The problem is that the space used by the disk file grows as files are written, but it does not shrink when they are deleted. So if you wrote a lot of files and deleted them once they were no longer needed, you'd probably have a lot of space reserved by the VMDK file that is not actually used. I wanted to reclaim that space, to move the VMs around using less space, and so this time…

I learned how to compact a VMDK file (the same method applies to QCOW2)

The method is, in fact, very easy… you simply have to re-encode the file using the same output format. If you have your original-disk.vmdk file, you simply have to issue a command like this one:

$ qemu-img convert -O vmdk original-disk.vmdk compressed-disk.vmdk

And that will do the magic (easy, isn't it?).

But if you want to compact it more, you can reclaim more space from the disk before re-encoding it. First I'll give the solution, and then I'll explain it:

If the VM is Linux-based, you can boot it, create a file full of zeros and, once the file has exhausted the free space of the disk, delete it:

$ dd if=/dev/zero of=/tmp/zerofile.raw
...
$ rm /tmp/zerofile.raw

If the VM is Windows-based, you can get the sdelete command from the Microsoft website, decompress it and execute the following command line:

c:\> sdelete -z c:

Now you can power off the VM and issue the qemu-img command. You'll get a file that corresponds to only the used space in the disk:

$ qemu-img convert -O vmdk original-disk.vmdk compressed-disk.vmdk

Explanation

(Disclaimer: Please take into account that this is a simple and conceptual explanation)

If you know how disks are managed, you probably know that when a file is deleted, it is not actually removed from the disk. Instead, the space that it was using is marked as “ready to be used in case it is needed”. So if a new file is created on the disk, it may (or may not) end up using that physical space.

That is the trick that file recovery applications rely on: they try to find those “ready to be used” sectors. And that is why the “low-level format” exists: to “zero” the disk and prevent files from being recovered.

When you created the /tmp/zerofile.raw file, you started writing zeros to the disk. When the physically empty space was exhausted, the disk started using the “ready to be used” sectors, so the zero-file overwrote them with zeros, and those zeros were written into the VMDK file.

The good thing here is that when qemu-img creates the new disk file (from any format… in our case, VMDK), it does not write those zero blocks to the output file, and that is how the storage space is reclaimed.

How to set the hostname from DHCP in Ubuntu 14.04

I have a lot of virtual servers, and I like to prepare a “golden image” and instantiate it many times. One of the steps is to set the hostname of the host from my private DHCP server. It usually works, but sometimes it fails and I didn't know why. So I got tired of such indeterminism and this time…

I learned how to set the hostname from DHCP in Ubuntu 14.04

Well, I have my dnsmasq server that acts both as a DNS server and as a DHCP server. I investigated how a host gets its hostname from DHCP, and it happens when the DHCP server sends the hostname option.

I debugged the DHCP messages using this command from another server:

# tcpdump -i eth1 -vvv port 67 or port 68

And I saw that my DHCP server was properly sending the hostname (dnsmasq sends it by default), so I realized that the problem was in the dhclient script. I googled a lot and found some clever scripts that got the name from a nameserver and were started as hooks from the dhclient script. But if the DHCP protocol already sets the hostname, why do I have to create another script to set it?

Finally I found this blog entry and realized that it is a bug in the dhclient script: if an old DHCP lease exists in /var/lib/dhcp/dhclient.eth0.leases (or eth1, etc.), it does not set the hostname.

At this point you have two options:

  • The easy one: include a line in /etc/rc.local that removes that file.
  • The clever one (suggested in the blog entry): include a hook script that unsets the old hostname:
echo unset old_host_name >/etc/dhcp/dhclient-enter-hooks.d/unset_old_hostname

How to deal with parameters in bash scripts like a pro

I often develop bash scripts, and I usually have a problem with flags and parameters. I like to accept parameters like a pro: long flags (e.g. --flag) and short flags (e.g. -f), but I also want to allow combinations of several flags (e.g. -fc). And so this time…

I learned how to deal with parameters in bash scripts like a pro

This time I have started to use bash arrays, which are like C or Python arrays, but in bash 😉 I could explain my script little by little, but I'm including the final version here (this is an extract from one of my developments, ec4docker):

CREATE=
TERMINATE=
CONFIG_FILE=
n=0
while [ $# -gt 0 ]; do
    if [ "${1:0:1}" == "-" -a "${1:1:1}" != "-" ]; then
        for f in $(echo "${1:1}" | sed 's/\(.\)/-\1 /g' ); do
            ARR[$n]="$f"
            n=$(($n+1))
        done
    else
        ARR[$n]="$1"
        n=$(($n+1))
    fi
    shift
done
n=0
while [ $n -lt ${#ARR[@]} ]; do
    case "${ARR[$n]}" in
        --create | -c)          CREATE=True;;
        --terminate | -t)       TERMINATE=True;;
        --yes | -y)             ASSUME_YES=True;;
        --config-file | -f)     n=$(($n+1))
                                [ $n -ge ${#ARR[@]} ] && usage && exit 1
                                CONFIG_FILE="${ARR[$n]}";;
        --help | -h)            usage && exit 0;;
        *)                      usage && exit 1;;
    esac
    n=$(($n+1))
done

In this way, you allow issuing commands like:

$ ./myapp -ctyf config.conf

But you can also mix parameter styles:

$ ./myapp --create -ty --config-file myapp.conf

Technical details

I like the solution, but I also like the technical details (because I am a code-freak). So I share some technical issues here:

  • The first “while” simply parses the command line to expand the combined parameters. In fact, it searches for expressions like ‘-fct’ and splits them into a set of expressions ‘-f’, ‘-c’, ‘-t’. So if you do not want to split parameters in this way, you can substitute the first “while” with:
ARR=( "$@" )
  • The second “while” is needed because we want to allow flags that take a value (e.g. -f <config file>). Any time a flag expects a value, we need to check whether there are enough parameters left and, if not, raise an error. If you do not need any flag with extra values, you could substitute the second “while” with something like:
for ARRVAL in "${ARR[@]}"; do
  case "$ARRVAL" in

How to create a simple Docker Swarm cluster

I have an old computer cluster whose nodes do not have any virtualization extensions. So I'm trying to use it to run Docker containers. But I do not want to choose in which of the internal nodes each container has to run. So I am using Docker Swarm, and I will use the cluster as a single Docker host: I call the main node to execute the containers, and the swarm decides the host in which each container will be run. So this time…

I learned how to create a simple Docker Swarm cluster with a single front-end and multiple internal nodes

The official Docker documentation includes a post that describes how to do it but, even though it is very easy, I prefer to describe my specific use case.

Scenario

  • 1 Master node with the public IP 111.22.33.44 and the private IP 10.100.0.1.
  • 3 Nodes with the private IPs 10.100.0.2, 10.100.0.3 and 10.100.0.4

I want to ask the master node to create a container from another computer (e.g. 111.22.33.55), and let the master choose in which internal node the container is hosted.

Preparing the master node

First of all, I will install Docker

$ curl -sSL https://get.docker.com/ | sh

Now we need to install consul, which is a key-value storage backend. It will run as a container in the front-end (and it will be used by the internal nodes to synchronize with the master):

$ docker run -d -p 8500:8500 --name=consul progrium/consul -server -bootstrap

Finally I will launch the swarm master

$ docker run -d -p 4000:4000 swarm manage -H :4000 --advertise 10.100.0.1:4000 consul://10.100.0.1:8500

(*) remember that consul is installed in the front-end, but you could detach it and install it in another node if you want (or need) to.

Installing the internal nodes

Again, we should install Docker and expose the docker daemon through the node's IP address:

$ curl -sSL https://get.docker.com/ | sh

And once it is running, we need to expose the docker API through the IP address of the node. The easy way to test it is to launch the daemon using the following options:

$ docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock

Now you should be able to issue command line options such as

$ docker -H :2375 info

or even from other hosts

$ docker -H 10.100.0.2:2375 info

The underlying idea is that, to use swarm, you expose the local docker daemon so that it can be used remotely by the swarm.

To make the changes persistent, you should set the parameters in the docker configuration file /etc/default/docker:

DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"

It seems that docker version 1.11 has a bug and does not properly use that file (at least in Ubuntu 16.04). As a workaround, you can modify the file /lib/systemd/system/docker.service and set the new command line used to launch the docker daemon:

ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock -H fd://

Finally, we have to launch the swarm agent on each node (a quick check of the resulting cluster is sketched right after the list):

  • On node 10.100.0.2
docker run --restart=always -d swarm join --advertise=10.100.0.2:2375 consul://10.100.0.1:8500
  • On node 10.100.0.3
docker run --restart=always -d swarm join --advertise=10.100.0.3:2375 consul://10.100.0.1:8500
  • On node 10.100.0.4
docker run --restart=always -d swarm join --advertise=10.100.0.4:2375 consul://10.100.0.1:8500
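Once the three agents have registered, a quick optional check is to ask for the manager's view of the cluster; the output should list the three internal nodes:

# query the swarm manager (from the master or from any host that reaches it)
$ docker -H 10.100.0.1:4000 info
# or list the registered nodes directly from the discovery backend
$ docker run --rm swarm list consul://10.100.0.1:8500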

Next steps: communicating the containers

If you launch new containers as usual (i.e. docker run -it containerimage bash), you will get containers with overlapping IPs. This is because you are using the default network scheme in the individual docker servers.

If you want to have a common network, you need to create an overlay network that spans across the different docker daemons.

But in order to do so, you need to change the way the docker daemons are started. You need a system to coordinate the network, and it can be the same consul instance that we are already using.

So you have to append the next flags to the command line that starts docker:

 --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500

You can add the parameters to the docker configuration file /etc/default/docker. In the case of the internal nodes, the result will be the following (according to our previous modifications):

DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500"

As stated before, docker version 1.11 has a bug and does not properly use that file. In the meantime you can modify the file /lib/systemd/system/docker.service and set the new command line used to launch the docker daemon:

ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock --cluster-advertise eth1:2376 --cluster-store consul://10.100.0.1:8500

(*) We are using eth1 because it is the device that holds our internal IP address. You should use the device to which the 10.100.0.x address is assigned.

Now you must restart the docker daemons of ALL the nodes in the swarm.

Once they have been restarted, you can create a new network for the swarm:

$ docker -H 10.100.0.1:4000 network create swarm-network

And then you can use it for the creation of the containers:

$ docker -H 10.100.0.1:4000 run -it --net=swarm-network ubuntu:latest bash

Now the IPs will be assigned in a coordinated way, and each container will have several IPs (its IP in the swarm network and its IP in the local docker server).
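To check which addresses have been handed out on the overlay (an optional step), you can inspect the network from the manager:

# show the overlay network and the containers attached to it, with their swarm-scope IPs
$ docker -H 10.100.0.1:4000 network inspect swarm-network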

Some more words on this

This post was written in May 2016. Both docker and swarm are evolving quickly, so it may soon be outdated.

Some things that bother me about this installation…

  • While using the overlay network, if you expose a port using the -p flag, the port is exposed on the IP of the internal docker host. I think that you should be able to state on which IP you want to expose the port, or to use the IP of the main server.
    • I work around this issue with a development of mine, IPFloater: once I create the container, I get the internal IP on which the port is exposed and I create a redirection in IPFloater, so that I can access the container through a specific IP.
  • Consul fails A LOT. If I leave the swarm running for hours (e.g. 8 hours), consul will probably fail. If I then run a command like “docker run --rm=true swarm list consul://10.100.0.1:8500”, it reports a failure. Then I have to delete the consul container and create a new one.