How to move from a linear disk to an LVM disk and join the two disks into an LVM-like RAID-0

I recently needed to add a disk to an existing Ubuntu installation to make the / filesystem bigger. In such a case there are two possibilities: move the whole system to a new, bigger disk (and e.g. dispose of the original one), or convert the disk to an LVM volume and add a second disk so that the volume can grow. The first case was the subject of a previous post, but this time I learned…

How to move from a linear disk to an LVM disk and join the two disks into an LVM-like RAID-0

The starting point is simple:

  • I have one 14 GB disk (/dev/vda) with a single partition that is mounted in / (the disk has a GPT table and a UEFI layout, so it has extra partitions that we’ll keep as they are).
  • I have an 80 GB brand new disk (/dev/vdb).
  • I want to end up with one ~94 GB volume built from the two disks.
root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part /
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk /mnt
vdc 252:32 0 4G 0 disk [SWAP]

The steps are the following:

  1. Create a boot partition in /dev/vdb (this is needed because GRUB cannot boot from LVM and needs an ext or VFAT partition).
  2. Format the boot partition and copy the contents of the current /boot folder into it.
  3. Create an LVM volume using the remaining space in /dev/vdb and initialize it with an ext4 filesystem.
  4. Copy the contents of the current / folder into the new volume.
  5. Update GRUB to boot from the new disk.
  6. Update the mount points for our system.
  7. Reboot (and check).
  8. Add the previous disk to the LVM volume.

Let’s start…

Separate the /boot partition

When installing an LVM-based system, you need a /boot partition in a plain format (e.g. ext2 or ext4), because GRUB cannot read from LVM directly. GRUB reads the contents of that partition and then loads the proper modules to read the LVM volumes.

So we need to create the /boot partition. In our case we are using ext2, because it has no journal (we do not need one for the contents of /boot) and it is faster. We are using 1 GB for the /boot partition, but 512 MB would probably be enough:

root@somove:~# fdisk /dev/vdb

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (1-4, default 1):
First sector (2048-167772159, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-167772159, default 167772159): +1G

Created a new partition 1 of type 'Linux' and of size 1 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

root@somove:~# mkfs.ext2 /dev/vdb1
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 262144 4k blocks and 65536 inodes
Filesystem UUID: 24618637-d2d4-45fe-bf83-d69d37f769d0
Superblock backups stored on blocks:
32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

Now we’ll make a mount point for this partition, mount the partition and copy the contents of the current /boot folder to that partition:

root@somove:~# mkdir /mnt/boot
root@somove:~# mount /dev/vdb1 /mnt/boot/
root@somove:~# cp -ax /boot/* /mnt/boot/

Create an LVM volume in the extra space of /dev/vdb

First, we will create a new partition for our LVM system, using all the remaining free space:

root@somove:~# fdisk /dev/vdb

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): n
Partition type
p primary (1 primary, 0 extended, 3 free)
e extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (2-4, default 2):
First sector (2099200-167772159, default 2099200):
Last sector, +sectors or +size{K,M,G,T,P} (2099200-167772159, default 167772159):

Created a new partition 2 of type 'Linux' and of size 79 GiB.

Command (m for help): w
The partition table has been altered.
Syncing disks.

Now we will create a Physical Volume, a Volume Group and the Logical Volume for our root filesystem, using the new partition:

root@somove:~# pvcreate /dev/vdb2
Physical volume "/dev/vdb2" successfully created.
root@somove:~# vgcreate rootvg /dev/vdb2
Volume group "rootvg" successfully created
root@somove:~# lvcreate -l +100%free -n rootfs rootvg
Logical volume "rootfs" created.

If you want to learn about LVM to better understand what we are doing, you can read my previous post.

Now we initialize the filesystem of the new /dev/rootvg/rootfs volume using ext4, and then copy the existing filesystem except for the special folders and the /boot folder (which we have separated into the other partition):

root@somove:~# mkfs.ext4 /dev/rootvg/rootfs
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 20708352 4k blocks and 5177344 inodes
Filesystem UUID: 47b4b698-4b63-4933-98d9-f8904ad36b2e
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done

root@somove:~# mkdir /mnt/rootfs
root@somove:~# mount /dev/rootvg/rootfs /mnt/rootfs/
root@somove:~# rsync -aHAXx --delete --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/boot/*,/lost+found} / /mnt/rootfs/

Update the system to boot from the new /boot partition and the LVM volume

At this point we have our /boot partition (/dev/vdb1) and the / filesystem (/dev/rootvg/rootfs). Now we need to prepare GRUB to boot using these new resources. And here comes the magic…

root@somove:~# mount --bind /dev /mnt/rootfs/dev/
root@somove:~# mount --bind /sys /mnt/rootfs/sys/
root@somove:~# mount -t proc /proc /mnt/rootfs/proc/
root@somove:~# chroot /mnt/rootfs/

We are binding the special mount points /dev and /sys to the same folders in the new filesystem which is mounted in /mnt/rootfs. We are also creating the /proc mount point which holds the information about the processes. You can find some more information about why this is needed in my previous post on chroot and containers.

Intuitively, we are somehow “in the new filesystem” and now we can update things as if we had already booted into it.

At this point, we need to update the mount points in /etc/fstab so that the proper devices are mounted once the system boots. So we get the UUIDs of our partitions:

root@somove:/# blkid
/dev/vda1: LABEL="cloudimg-rootfs" UUID="135ecb53-0b91-4a6d-8068-899705b8e046" TYPE="ext4" PARTUUID="b27490c5-04b3-4475-a92b-53807f0e1431"
/dev/vda14: PARTUUID="14ad2c62-0a5e-4026-a37f-0e958da56fd1"
/dev/vda15: LABEL="UEFI" UUID="BF99-DB4C" TYPE="vfat" PARTUUID="9c37d9c9-69de-4613-9966-609073fba1d3"
/dev/vdb1: UUID="24618637-d2d4-45fe-bf83-d69d37f769d0" TYPE="ext2"
/dev/vdb2: UUID="Uzt1px-ANds-tXYj-Xwyp-gLYj-SDU3-pRz3ed" TYPE="LVM2_member"
/dev/mapper/rootvg-rootfs: UUID="47b4b698-4b63-4933-98d9-f8904ad36b2e" TYPE="ext4"
/dev/vdc: UUID="3377ec47-a0c9-4544-b01b-7267ea48577d" TYPE="swap"

And we update /etc/fstab to mount /dev/mapper/rootvg-rootfs as the / folder. We also need to mount partition /dev/vdb1 in /boot. In our example, the /etc/fstab file will be the following:

UUID="47b4b698-4b63-4933-98d9-f8904ad36b2e" / ext4 defaults 0 0
UUID="24618637-d2d4-45fe-bf83-d69d37f769d0" /boot ext2 defaults 0 0
LABEL=UEFI /boot/efi vfat defaults 0 0
UUID="3377ec47-a0c9-4544-b01b-7267ea48577d" none swap sw,comment=cloudconfig 0 0

We are using the UUIDs to mount the / and /boot folders because the devices may change their names or location, and that could break our system.
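
As a side note, if you only need the UUID of a single device (for example, to paste it into /etc/fstab), blkid can print just that value:

root@somove:/# blkid -s UUID -o value /dev/vdb1
24618637-d2d4-45fe-bf83-d69d37f769d0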

And now we are ready to mount our /boot partition, update GRUB, and install it on the /dev/vda disk (because we are keeping both disks).

root@somove:/# mount /boot
root@somove:/# update-grub
Generating grub configuration file ...
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Found linux image: /boot/vmlinuz-4.15.0-43-generic
Found initrd image: /boot/initrd.img-4.15.0-43-generic
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Found Ubuntu 18.04.1 LTS (18.04) on /dev/vda1
done
root@somove:/# grub-install /dev/vda
Installing for i386-pc platform.
Installation finished. No error reported.

Reboot and check

We are almost done; now we exit the chroot and reboot:

root@somove:/# exit
root@somove:~# reboot

And the result should look like this:

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk
├─vdb1 252:17 0 1G 0 part /boot
└─vdb2 252:18 0 79G 0 part
└─rootvg-rootfs 253:0 0 79G 0 lvm /
vdc 252:32 0 4G 0 disk [SWAP]

root@somove:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 676K 394M 1% /run
/dev/mapper/rootvg-rootfs 78G 993M 73G 2% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/vdb1 1008M 43M 915M 5% /boot
/dev/vda15 105M 3.6M 101M 4% /boot/efi
tmpfs 395M 0 395M 0% /run/user/1000

We have our / filesystem mounted from the new LVM logical volume /dev/rootvg/rootfs, the /boot partition from /dev/vdb1, and /boot/efi from the existing partition (just in case we need it).

Add the previous disk to the LVM volume

Here we face the easiest part, which is to integrate the original /dev/vda1 partition into the LVM volume group.
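
Before wiping /dev/vda1 it is worth re-checking that nothing was left behind. A minimal sketch of such a check, assuming the old root filesystem is still intact on /dev/vda1 and using /mnt/oldroot as an arbitrary mount point; the dry-run (-n), itemized (-i) output lists the files that rsync would still need to copy (a few differences such as /etc/fstab or recent logs are expected, since the new root has already been modified):

root@somove:~# mkdir -p /mnt/oldroot
root@somove:~# mount -o ro /dev/vda1 /mnt/oldroot
root@somove:~# rsync -aHAXxni --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/boot/*,/lost+found} /mnt/oldroot/ / | less
root@somove:~# umount /mnt/oldroot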

Once we have double-checked that every file was copied from the original / filesystem in /dev/vda1, we can initialize it for use in LVM:

WARNING: This step wipes the content of /dev/vda1.

root@somove:~# pvcreate /dev/vda1
WARNING: ext4 signature detected on /dev/vda1 at offset 1080. Wipe it? [y/n]: y
Wiping ext4 signature on /dev/vda1.
Physical volume "/dev/vda1" successfully created.

Finally, we can integrate the new partition in our volume group and extend the logical volume to use the free space:

root@somove:~# vgextend rootvg /dev/vda1
Volume group "rootvg" successfully extended
root@somove:~# lvextend -l +100%free /dev/rootvg/rootfs
Size of logical volume rootvg/rootfs changed from <79.00 GiB (20223 extents) to 92.88 GiB (23778 extents).
Logical volume rootvg/rootfs successfully resized.
root@somove:~# resize2fs /dev/rootvg/rootfs
resize2fs 1.44.1 (24-Mar-2018)
Filesystem at /dev/rootvg/rootfs is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 12
The filesystem on /dev/rootvg/rootfs is now 24348672 (4k) blocks long.

And now we have the new ~94 GB / filesystem, which is built from /dev/vda1 and /dev/vdb2:

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part
│ └─rootvg-rootfs 253:0 0 92.9G 0 lvm /
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk
├─vdb1 252:17 0 1G 0 part /boot
└─vdb2 252:18 0 79G 0 part
└─rootvg-rootfs 253:0 0 92.9G 0 lvm /
vdc 252:32 0 4G 0 disk [SWAP]
root@somove:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 676K 394M 1% /run
/dev/mapper/rootvg-rootfs 91G 997M 86G 2% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/vdb1 1008M 43M 915M 5% /boot
/dev/vda15 105M 3.6M 101M 4% /boot/efi
tmpfs 395M 0 395M 0% /run/user/1000

(optional) Having the /boot partition in /dev/vda

In case we wanted to have the /boot partition in /dev/vda, the procedure would be a bit different:

  1. Instead of splitting /dev/vdb into a /boot partition and an LVM volume, create a single ext4 partition /dev/vdb1 (no separation of /boot and / is needed at this stage).
  2. Once /dev/vdb1 is created, copy the filesystem from /dev/vda1 to /dev/vdb1 and prepare to boot from /dev/vdb1 (chroot, adjust mount points, update-grub, grub-install…).
  3. Boot from the new partition and wipe the original /dev/vda1 partition.
  4. Create a partition /dev/vda1 for the new /boot, initialize it using ext2, and copy the contents of /boot according to the instructions in this post.
  5. Create a partition /dev/vda2, create the LVM volume, initialize it, and copy the contents of /dev/vdb1 except for /boot.
  6. Prepare to boot from /dev/vda (chroot, adjust mount points, mount /boot, update-grub, grub-install…).
  7. Boot from the new /boot + LVM / layout and decide whether you want to add /dev/vdb to the LVM volume or not.

Using this procedure, you get from a linear disk to LVM with a single disk. Then you can decide whether to let the LVM volume grow or not. Moreover, you may decide to create an LVM RAID (1, 5, …) with the new or other disks.

How to use LVM

LVM stands for Logical Volume Manager, and it provides logical volume management for Linux. It enables us to manage multiple physical disks from a single manager and to create logical volumes that take advantage of having multiple disks (e.g. RAID, thin provisioning, volumes that span across disks, etc.).

I have needed LVM multiple times, and it is of special interest when dealing with LVM-backed Cinder in OpenStack.

So this time I learned…

How to use LVM

LVM is available in most Linux distros, and they are usually LVM-aware so that they can boot from LVM volumes.

For the purpose of this post, we’ll consider LVM as a mechanism to manage multiple physical disks and to create logical volumes on top of them. LVM then presents the logical volumes to the operating system as if they were disks.

Testlab

LVM is intended for physical disks (e.g. /dev/sda, /dev/sdb, etc.), but we are creating a test lab to avoid the need to buy physical disks.

We are creating 4 fake disks of 256 MB each. To create each of them we simply create a file of the proper size (that will store the data), and then attach that file to a loop device:

root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.0 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.1 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.2 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.3 bs=1M count=256
...
root@s1:/tmp# losetup /dev/loop0 ./fake-disk-256.0
root@s1:/tmp# losetup /dev/loop1 ./fake-disk-256.1
root@s1:/tmp# losetup /dev/loop2 ./fake-disk-256.2
root@s1:/tmp# losetup /dev/loop3 ./fake-disk-256.3

And now we have 4 working disks for our tests:

root@s1:/tmp# fdisk -l
Disk /dev/loop0: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop1: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop2: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop3: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/sda: (...)

The system can use these devices as regular disks (e.g. format them, mount them, etc.):

root@s1:/tmp# mkfs.ext4 /dev/loop0
mke2fs 1.44.1 (24-Mar-2018)
...
Writing superblocks and filesystem accounting information: done

root@s1:/tmp# mkdir -p /tmp/mnt/disk0
root@s1:/tmp# mount /dev/loop0 /tmp/mnt/disk0/
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# touch this-is-a-file
root@s1:/tmp/mnt/disk0# ls -l
total 16
drwx------ 2 root root 16384 Apr 16 12:35 lost+found
-rw-r--r-- 1 root root 0 Apr 16 12:38 this-is-a-file
root@s1:/tmp/mnt/disk0# cd /tmp/
root@s1:/tmp# umount /tmp/mnt/disk0

Concepts of LVM

LVM has simple actors:

  • Physical volume: a physical disk (or partition).
  • Volume group: a set of physical volumes managed together.
  • Logical volume: a block device created inside a volume group.

Logical Volumes (LV) are stored in Volume Groups (VG), which are backed by Physical Volumes (PV).

PVs are managed using pv* commands (e.g. pvscan, pvs, pvcreate, etc.). VGs are managed using vg* commands (e.g. vgs, vgdisplay, vgextend, etc.). LVs are managed using lv* commands (e.g. lvdisplay, lvs, lvextend, etc.).

Simple workflow with LVM

To have an LVM system, we first have to initialize a physical volume. That is somehow “initializing a disk in LVM format”, and it wipes the content of the disk:

root@s1:/tmp# pvcreate /dev/loop0
WARNING: ext4 signature detected on /dev/loop0 at offset 1080. Wipe it? [y/n]: y
Wiping ext4 signature on /dev/loop0.
Physical volume "/dev/loop0" successfully created.

Now we have to create a volume group (we’ll call it test-vg):

root@s1:/tmp# vgcreate test-vg /dev/loop0
Volume group "test-vg" successfully created

And finally, we can create a logical volume:

root@s1:/tmp# lvcreate -l 100%vg --name test-vol test-vg
Logical volume "test-vol" created.

And now we have a simple LVM system that is built from one single physical disk (/dev/loop0) that contains one single volume group (test-vg) that holds a single logical volume (test-vol).

Examining things in LVM

  • The commands to examine PVs are pvs and pvdisplay. Each of them offers different information. pvscan also exists, but it is not needed in current versions of LVM.
  • The commands to examine VGs are vgs and vgdisplay. Each of them offers different information. vgscan also exists, but it is not needed in current versions of LVM.
  • The commands to examine LVs are lvs and lvdisplay. Each of them offers different information. lvscan also exists, but it is not needed in current versions of LVM.

Each command has several options, which we are not exploring here. We are just using the commands and we’ll present some options in the next examples.

At this point we should have a PV, a VG, and an LV, and we can see them by using pvs, vgs and lvs:

root@s1:/tmp# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 0
root@s1:/tmp# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 1 1 0 wz--n- 252.00m 0
root@s1:/tmp# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
test-vol test-vg -wi-a----- 252.00m

Now we can use test-vol as if it were a partition:

root@s1:/tmp# mkfs.ext4 /dev/mapper/test--vg-test--vol
mke2fs 1.44.1 (24-Mar-2018)
...
Writing superblocks and filesystem accounting information: done

root@s1:/tmp# mount /dev/mapper/test--vg-test--vol /tmp/mnt/disk0/
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# touch this-is-my-file
root@s1:/tmp/mnt/disk0# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 241M 2.1M 222M 1% /tmp/mnt/disk0

Adding another disk to grow the filesystem

Imagine that we have filled our 241 MB volume on our 256 MB disk (/dev/loop0), and we need some more storage space. We could buy an extra disk (i.e. /dev/loop1) and add it to the LVM setup (using the command vgextend).

root@s1:/tmp# pvcreate /dev/loop1
Physical volume "/dev/loop1" successfully created.
root@s1:/tmp# vgextend test-vg /dev/loop1
Volume group "test-vg" successfully extended

And now we have two physical volumes added to a single volume group. The VG has a size of 504 MB, of which 252 MB are free:

root@s1:/tmp# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 0
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m
root@s1:/tmp# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 2 1 0 wz--n- 504.00m 252.00m

We can think of the VG as if it were a disk and the LVs as its partitions. So we can grow the LV within the VG and then grow the filesystem:

root@s1:/tmp/mnt/disk0# lvscan
ACTIVE '/dev/test-vg/test-vol' [252.00 MiB] inherit

root@s1:/tmp/mnt/disk0# lvextend -l +100%free /dev/test-vg/test-vol
Size of logical volume test-vg/test-vol changed from 252.00 MiB (63 extents) to 504.00 MiB (126 extents).
Logical volume test-vg/test-vol successfully resized.

root@s1:/tmp/mnt/disk0# resize2fs /dev/test-vg/test-vol
resize2fs 1.44.1 (24-Mar-2018)
Filesystem at /dev/test-vg/test-vol is mounted on /tmp/mnt/disk0; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 4
The filesystem on /dev/test-vg/test-vol is now 516096 (1k) blocks long.

root@s1:/tmp/mnt/disk0# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 485M 2.3M 456M 1% /tmp/mnt/disk0

root@s1:/tmp/mnt/disk0# ls -l
total 12
drwx------ 2 root root 12288 Apr 22 16:46 lost+found
-rw-r--r-- 1 root root 0 Apr 22 16:46 this-is-my-file

Now we have the LV with double the size.

Downsize the LV

Now that we have some free space again, imagine that we want to keep only 1 disk (e.g. /dev/loop0). So we can downsize the filesystem (e.g. to 200 MB), and then downsize the LV.

This method requires unmounting the filesystem. So if you want to resize the root partition, you would need to use a live system or pivot root to an unused filesystem as described [here](https://unix.stackexchange.com/a/227318).

First, unmount the filesystem and check it:

root@s1:/tmp# umount /tmp/mnt/disk0
root@s1:/tmp# e2fsck -ff /dev/mapper/test--vg-test--vol
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/test--vg-test--vol: 12/127008 files (0.0% non-contiguous), 22444/516096 blocks

Then change the size of the filesystem to the desired size:

root@s1:/tmp# resize2fs /dev/test-vg/test-vol 200M
resize2fs 1.44.1 (24-Mar-2018)
Resizing the filesystem on /dev/test-vg/test-vol to 204800 (1k) blocks.
The filesystem on /dev/test-vg/test-vol is now 204800 (1k) blocks long.

And now, we’ll reduce the logical volume to the new size and re-check the filesystem:

root@s1:/tmp# lvreduce -L 200M /dev/test-vg/test-vol
WARNING: Reducing active logical volume to 200.00 MiB.
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce test-vg/test-vol? [y/n]: y
Size of logical volume test-vg/test-vol changed from 504.00 MiB (126 extents) to 200.00 MiB (50 extents).
Logical volume test-vg/test-vol successfully resized.
root@s1:/tmp# lvdisplay
--- Logical volume ---
LV Path /dev/test-vg/test-vol
LV Name test-vol
VG Name test-vg
LV UUID xGh4cd-R93l-UpAL-LGTV-qnxq-vvx2-obSubY
LV Write Access read/write
LV Creation host, time s1, 2020-04-22 16:26:48 +0200
LV Status available
# open 0
LV Size 200.00 MiB
Current LE 50
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0

root@s1:/tmp# resize2fs /dev/test-vg/test-vol
resize2fs 1.44.1 (24-Mar-2018)
The filesystem is already 204800 (1k) blocks long. Nothing to do!

And now we are ready to use the disk with the new size:

root@s1:/tmp# mount /dev/test-vg/test-vol /tmp/mnt/disk0/
root@s1:/tmp# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 190M 1.6M 176M 1% /tmp/mnt/disk0
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# ls -l
total 12
drwx------ 2 root root 12288 Apr 22 16:46 lost+found
-rw-r--r-- 1 root root 0 Apr 22 16:46 this-is-my-file

Removing a PV

Now we want to remove /dev/loop0 (which was our original disk) and keep the replacement (/dev/loop1).

We just need to move any data off /dev/loop0 and remove it from the VG, so that we can safely remove it from the system. First, we check the PVs:

root@s1:/tmp/mnt/disk0# pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/loop0 test-vg lvm2 a-- 252.00m 52.00m 200.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m 0

We can see that /dev/loop0 is used, so we need to move its data to another PV:

root@s1:/tmp/mnt/disk0# pvmove /dev/loop0
/dev/loop0: Moved: 100.00%
root@s1:/tmp/mnt/disk0# pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m 0
/dev/loop1 test-vg lvm2 a-- 252.00m 52.00m 200.00m

Now /dev/loop0 is 100% free and we can remove it from the VG:

root@s1:/tmp/mnt/disk0# vgreduce test-vg /dev/loop0
Removed "/dev/loop0" from volume group "test-vg"
root@s1:/tmp/mnt/disk0# pvremove /dev/loop0
Labels on physical volume "/dev/loop0" successfully wiped.

Thin provisioning with LVM

Thin provisioning consists of giving the user the illusion of having a certain amount of storage space, while only actually storing the amount of space that is really used. It is similar to the qcow2 or vmdk disk formats for virtual machines. The idea is that if you have 1 GB of storage space used, the backend only stores that data, even if the volume is 10 GB. The total amount of storage space is consumed as it is requested.

First, you need to reserve effective storage space in the form of a “thin pool”. The next example reserves 200 MB (-L 200M) of thin storage (-T) as the thin pool thinpool in VG test-vg:

root@s1:/tmp/mnt# lvcreate -L 200M -T test-vg/thinpool
Using default stripesize 64.00 KiB.
Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
Logical volume "thinpool" created.

The result is a 200 MB volume that behaves like a regular volume but is marked as a thin pool.

Now we can create two thin-provisioned volumes of the same size:

root@s1:/tmp/mnt# lvcreate -V 200M -T test-vg/thinpool -n thin-vol-1
Using default stripesize 64.00 KiB.
Logical volume "thin-vol-1" created.
root@s1:/tmp/mnt# lvcreate -V 200M -T test-vg/thinpool -n thin-vol-2
Using default stripesize 64.00 KiB.
WARNING: Sum of all thin volume sizes (400.00 MiB) exceeds the size of thin pool test-vg/thinpool and the amount of free space in volume group (296.00 MiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "thin-vol-2" created.
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 0.00
thin-vol-2 200.00m 0.00
thinpool 200.00m 0.00 2

The result is that we have 2 volumes of 200 MB each, while actually having only 200 MB of backing storage. Each of the volumes is still empty. As we use the volumes, the space will be occupied:

root@s1:/tmp/mnt# mkfs.ext4 /dev/test-vg/thin-vol-1
(...)
Writing superblocks and filesystem accounting information: done

root@s1:/tmp/mnt# mount /dev/test-vg/thin-vol-1 /tmp/mnt/disk0/
root@s1:/tmp/mnt# dd if=/dev/random of=/tmp/mnt/disk0/randfile bs=1K count=1024
dd: warning: partial read (94 bytes); suggest iflag=fullblock
0+1024 records in
0+1024 records out
48623 bytes (49 kB, 47 KiB) copied, 0.0701102 s, 694 kB/s
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 5.56
thin-vol-2 200.00m 0.00
thinpool 200.00m 5.56 2
root@s1:/tmp/mnt# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-thin--vol--1 190M 1.6M 175M 1% /tmp/mnt/disk0

In this example the volume reports 5.56% of its space used (mostly filesystem metadata created by mkfs, plus our small 49 kB file). If we repeat the process for the other volume:

root@s1:/tmp/mnt# mkfs.ext4 /dev/test-vg/thin-vol-2
(...)
Writing superblocks and filesystem accounting information: done

root@s1:/tmp/mnt# mkdir -p /tmp/mnt/disk1
root@s1:/tmp/mnt# mount /dev/test-vg/thin-vol-2 /tmp/mnt/disk1/
root@s1:/tmp/mnt# dd if=/dev/random of=/tmp/mnt/disk1/randfile bs=1K count=1024
dd: warning: partial read (86 bytes); suggest iflag=fullblock
0+1024 records in
0+1024 records out
38821 bytes (39 kB, 38 KiB) copied, 0.0473561 s, 820 kB/s
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 5.56
thin-vol-2 200.00m 5.56
thinpool 200.00m 11.12 2
root@s1:/tmp/mnt# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-thin--vol--1 190M 1.6M 175M 1% /tmp/mnt/disk0
/dev/mapper/test--vg-thin--vol--2 190M 1.6M 175M 1% /tmp/mnt/disk1

We can see that the pool space is being consumed, but each volume still presents its full 200 MB.
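
Note that the thin pool itself is a regular LV as far as allocation is concerned: if it starts running out of real space, it can be grown like any other LV, provided the VG has free extents. A minimal example (assuming there is free space left in test-vg):

root@s1:/tmp/mnt# lvextend -L +100M test-vg/thinpool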

Using RAID with LVM

Apart from using LVM as a kind of RAID-0 (i.e. striping an LV across multiple physical devices), it is possible to create other types of RAID using LVM.

Some of the most popular RAID levels are RAID-1, RAID-10, and RAID-5. In the context of LVM, they intuitively mean:

  • RAID-1: mirror an LV across multiple PVs.
  • RAID-10: mirror an LV across multiple PVs, while also striping parts of the volume over different PVs.
  • RAID-5: distribute the LV over multiple PVs, using parity data to be able to continue working if a PV fails.

You can check more specific information on RAIDs in this link.

You can use other RAID utilities (such as on-board RAID controllers or software like mdadm), but by using LVM you will also be able to take advantage of the LVM features.

Mirroring an LV with LVM

The first use case is to get a volume that is mirrored across multiple PVs. We’ll get back to the 2-PV setup:

root@s1:/tmp/mnt# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m
root@s1:/tmp/mnt# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 2 0 0 wz--n- 504.00m 504.00m

And now we can create an LV in RAID-1. We can also set the number of copies that we want for the volume (using the flag -m). In this case, we’ll create the LV lv-mirror (-n lv-mirror), mirrored once (-m 1), with a size of 100 MB (-L 100M).

root@s1:/tmp/mnt# lvcreate --type raid1 -m 1 -L 100M -n lv-mirror test-vg
Logical volume "lv-mirror" created.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror 100.00 lv-mirror_rimage_0(0),lv-mirror_rimage_1(0)
[lv-mirror_rimage_0] /dev/loop1(1)
[lv-mirror_rimage_1] /dev/loop0(1)
[lv-mirror_rmeta_0] /dev/loop1(0)
[lv-mirror_rmeta_1] /dev/loop0(0)

As you can see, the LV lv-mirror is built from the lv-mirror_rimage_0, lv-mirror_rmeta_0, lv-mirror_rimage_1 and lv-mirror_rmeta_1 “devices”, and the output shows in which PV each part is located.

In the case of RAID-1, you can convert the LV to and from a linear volume. This feature is not (yet) implemented for other RAID types such as RAID-10 or RAID-5.

You can change the number of mirror copies for an LV (even turning the volume back to linear) by using the command lvconvert:

root@s1:/tmp/mnt# lvconvert -m 0 /dev/test-vg/lv-mirror
Are you sure you want to convert raid1 LV test-vg/lv-mirror to type linear losing all resilience? [y/n]: y
Logical volume test-vg/lv-mirror successfully converted.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror /dev/loop1(1)
root@s1:/tmp/mnt# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 152.00m

In this case, we have converted the volume to linear (i.e. zero copies).

But you can also get mirror capabilities for a linear volume:

root@s1:/tmp/mnt# lvconvert -m 1 /dev/test-vg/lv-mirror
Are you sure you want to convert linear LV test-vg/lv-mirror to raid1 with 2 images enhancing resilience? [y/n]: y
Logical volume test-vg/lv-mirror successfully converted.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror 100.00 lv-mirror_rimage_0(0),lv-mirror_rimage_1(0)
[lv-mirror_rimage_0] /dev/loop1(1)
[lv-mirror_rimage_1] /dev/loop0(1)
[lv-mirror_rmeta_0] /dev/loop1(0)
[lv-mirror_rmeta_1] /dev/loop0(0)

More on RAIDs

Using the command lvcreate it is also possible to create other types of RAID (e.g. RAID-10, RAID-5, RAID-6, etc.). The only requirement is to have enough PVs for the chosen RAID level. But bear in mind that it is not (yet) possible to convert these other RAID types to or from linear LVs.
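
Just to show the syntax, this is a sketch of how a RAID-5 LV with two data stripes plus parity could be created (it needs at least three PVs in the VG, so it will not work in our two-disk test lab):

root@s1:/tmp/mnt# lvcreate --type raid5 -i 2 -L 100M -n lv-r5 test-vg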

You can find much more information on LVM and RAIDs in this link.

More on LVM

There are a lot of LVM features and tweaks to adjust, but this post shows the basics (and a bit more) needed to deal with LVM. You are advised to check the file /etc/lvm/lvm.conf and its man page.

How to install Horizon Dashboard in OpenStack Rocky and upgrade noVNC

Some time ago I wrote a series of posts on installing OpenStack Rocky: Part 1, Part 2 and Part 3. That installation was usable from the command line, but now I learned…

How to install Horizon Dashboard in OpenStack Rocky and upgrade noVNC

In this post, I start from the working installation of OpenStack Rocky in Ubuntu created in the previous posts in this series.

If you have used the configuration settings that I suggested, installing the Horizon Dashboard is very simple (I’m following the official documentation). You just need to install the dashboard package and its dependencies by issuing the next command:

$ apt install openstack-dashboard

And now you need to configure the dashboard settings in the file /etc/openstack-dashboard/local_settings.py. The basic configuration can be made with the following commands:

$ sed -i 's/OPENSTACK_HOST = "[^"]*"/OPENSTACK_HOST = "controller"/g' /etc/openstack-dashboard/local_settings.py
$ sed -i 's/^\(CACHES = {\)/SESSION_ENGINE = "django.contrib.sessions.backends.cache"\n\1/' /etc/openstack-dashboard/local_settings.py
$ sed -i "s/'LOCATION': '127\.0\.0\.1:11211',/'LOCATION': 'controller:11211'/" /etc/openstack-dashboard/local_settings.py
$ sed -i 's/^\(#OPENSTACK_API_VERSIONS = {\)/OPENSTACK_API_VERSIONS = {\n"identity": 3,\n"image": 2,\n"volume": 2,\n}\n\1'
$ sed -i 's/^OPENSTACK_KEYSTONE_DEFAULT_ROLE = "[^"]*"/OPENSTACK_KEYSTONE_DEFAULT_ROLE = "user"/'
$ sed -i "/^#OPENSTACK_KEYSTONE_DEFAULT_DOMAIN =.*$/aOPENSTACK_KEYSTONE_DEFAULT_DOMAIN='Default'" /etc/openstack-dashboard/local_settings.py
$ sed -i 's/^TIME_ZONE = "UTC"/TIME_ZONE = "Europe\/Madrid"/' /etc/openstack-dashboard/local_settings.py

Each line does the following:

  1. Set the address of the controller (we set the name “controller” in the /etc/hosts file).
  2. Set a cache engine for the pages.
  3. Set the memcached server (we installed it in the controller).
  4. Set the versions of the APIs that we installed.
  5. Set the default role for the users that log into the portal to “user” instead of the default one (which is “member”).
  6. Set the default domain to “Default” to avoid asking users for the domain.
  7. Set the timezone of the site (you can check the code for your timezone here).

In our installation, we used the self-service option for the networks. If you changed it, please make sure that the variable OPENSTACK_NEUTRON_NETWORK matches your platform.
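
As a reference, for self-service networks the relevant block in /etc/openstack-dashboard/local_settings.py looks roughly like this (a sketch; keep the keys that your packaged file already contains and only adjust the values):

OPENSTACK_NEUTRON_NETWORK = {
    # ... keep the rest of the keys from the packaged file ...
    'enable_router': True,
    'enable_quotas': True,
    'enable_distributed_router': False,
    'enable_ha_router': False,
    'enable_fip_topology_check': True,
}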

And that’s all on the basic configuration of Horizon. Now we just need to check that the file /etc/apache2/conf-available/openstack-dashboard.conf contains the following line:

WSGIApplicationGroup %{GLOBAL}

And finally, you need to restart apache:

$ service apache2 restart

Now you should be able to log in to the horizon portal by using the address https://controller.my.server/horizon in your web browser.

Please take into account that controller.my.server corresponds to the routable IP address of your server (if you followed the previous posts, it is 158.42.1.1).

Configuring VNC

One of the most common complaints about the Horizon dashboard is that VNC does not work. The effect is that you can reach the “console” tab of the instances, but you cannot see the console. And if you try to open the console in a new tab, you will probably find a “Not Found” web page.

The most common cause is that you have not configured the noVNC settings in the /etc/nova/nova.conf file on the compute nodes. So please check the [vnc] section in that file. It should look like the following:

[vnc]
enabled = true
server_listen = 0.0.0.0
server_proxyclient_address = $my_ip
novncproxy_base_url = http://controller.my.server:6080/vnc_auto.html

There are two keys in this configuration:

  • Make sure that controller.my.server corresponds to the routable IP address of your server (the same that you used in the web browser).
  • Make sure that the file vnc_auto.html exists in the folder /usr/share/novnc on the host where Horizon is installed.

Upgrading noVNC

OpenStack Rocky comes with a very old version of noVNC (0.4), while at the time of writing this post noVNC has already released version 1.1.0 (see here).

Updating noVNC is as easy as getting the release tarball and putting its contents in /usr/share/novnc on the front-end:

$ cd /tmp
$ wget https://github.com/novnc/noVNC/archive/v1.1.0.tar.gz -O novnc-v1.1.0.tar.gz
$ tar xfz novnc-v1.1.0.tar.gz
$ mv /tmp/noVNC-1.1.0 /usr/share
$ cd /usr/share
$ mv novnc novnc-0.4
$ ln -s noVNC-1.1.0 novnc

Now we need to configure the new settings in the compute nodes. So we need to update the file /etc/nova/nova.conf on each compute node, modifying the [vnc] section to match the new version of noVNC. The section will look like the following:

[vnc]
enabled = true
server_listen = 0.0.0.0
server_proxyclient_address = $my_ip
novncproxy_base_url = http://controller.my.server:6080/vnc_lite.html

Finally, you will need to restart nova-compute on each compute node:

$ service nova-compute restart

and restart apache2 on the server where Horizon is installed:

$ service apache2 restart

*WARNING* It is not guaranteed that the changes will apply to already running instances, but they will apply to new ones.
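
To check which console URL an instance actually gets (and therefore whether the new vnc_lite.html path is in use), you can ask for it explicitly; the instance name here is just an example:

$ openstack console url show --novnc my-instance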

How to install OpenStack Rocky – part 2

This is the second post on the installation of OpenStack Rocky in an Ubuntu based deployment.

In this post I am explaining how to install the essential OpenStack services in the controller. Please check the previous post How to install OpenStack Rocky – part 1 to learn how to prepare the servers to host our OpenStack Rocky platform.

Recap

In the last post, we prepared the network for both the controller node and the compute nodes. The setup is described in the next figure:

(Figure: network diagram of the controller and the compute nodes)

We also installed the prerequisites on the controller.

Installation of the controller

In this section we are installing keystone, glance, nova, neutron and the dashboard on the controller.

Repositories

First, we need to add the OpenStack repositories:

# apt install software-properties-common
# add-apt-repository cloud-archive:rocky
# apt update && apt -y dist-upgrade
# apt install -y python-openstackclient

Keystone

To install keystone, first we need to create the database:

# mysql -u root -p <<< "CREATE DATABASE keystone;
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'localhost' IDENTIFIED BY 'KEYSTONE_DBPASS';
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'%' IDENTIFIED BY 'KEYSTONE_DBPASS';"

And now, we’ll install keystone:

# apt install -y keystone apache2 libapache2-mod-wsgi

We are creating the minimal keystone.conf configuration, according to the basic deployment:

# cat > /etc/keystone/keystone.conf  <<EOT
[DEFAULT]
log_dir = /var/log/keystone
[database]
connection = mysql+pymysql://keystone:KEYSTONE_DBPASS@controller/keystone
[extra_headers]
Distribution = Ubuntu
[token]
provider = fernet
EOT

Now we need to execute some commands to prepare the keystone service:

# su keystone -s /bin/sh -c 'keystone-manage db_sync'
# keystone-manage fernet_setup --keystone-user keystone --keystone-group keystone
# keystone-manage credential_setup --keystone-user keystone --keystone-group keystone
# keystone-manage bootstrap --bootstrap-password "ADMIN_PASS" --bootstrap-admin-url http://controller:5000/v3/ --bootstrap-internal-url http://controller:5000/v3/ --bootstrap-public-url http://controller:5000/v3/ --bootstrap-region-id RegionOne

At this moment, we have to configure apache2, because it is used as the http backend.

# echo "ServerName controller" >> /etc/apache2/apache2.conf
# service apache2 restart

Finally we’ll prepare a file that contains a set of variables that will be used to access OpenStack. This file will be called admin-openrc and its content is the following:

# cat > admin-openrc <<EOT
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=ADMIN_PASS
export OS_AUTH_URL=http://controller:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2
EOT

And now we are almost ready to operate keystone; we just need to source that file:

# source admin-openrc

And now we are ready to issue commands in OpenStack. We test it by creating the project that will host the OpenStack services:

# openstack project create --domain default --description "Service Project" service
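
As a quick sanity check that keystone and the admin credentials work, we can request a token:

# openstack token issue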

Demo Project

In the OpenStack installation guide, a demo project is created. We are including the creation of this demo project, although it is not needed:

# openstack project create --domain default --description "Demo Project" myproject
# openstack user create --domain default --password "MYUSER_PASS" myuser
# openstack role create myrole
# openstack role add --project myproject --user myuser myrole

We are also creating the set of variables needed to execute commands in OpenStack using this demo user:

# cat > demo-openrc << EOT
export OS_PROJECT_DOMAIN_NAME=Default
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_NAME=myproject
export OS_USERNAME=myuser
export OS_PASSWORD=MYUSER_PASS
export OS_AUTH_URL=http://controller:5000/v3
export OS_IDENTITY_API_VERSION=3
export OS_IMAGE_API_VERSION=2"
EOT

If you want to use this demo user and project, you can either log in to the horizon portal (once it is installed in further steps) using the myuser/MYUSER_PASS credentials, or source the file demo-openrc to use the command line.

Glance

Glance is the OpenStack service dedicated to managing VM images. Using these steps, we will make a basic installation where the images are stored in the filesystem of the controller.

First we need to create a database and user in mysql:

# mysql -u root -p <<< "CREATE DATABASE glance;
GRANT ALL PRIVILEGES ON glance.* TO 'glance'@'localhost' IDENTIFIED BY 'GLANCE_DBPASS';
GRANT ALL PRIVILEGES ON glance.* TO 'glance'@'%' IDENTIFIED BY 'GLANCE_DBPASS';"

Now we need to create the user dedicated to running the service, and the endpoints in keystone; but first we’ll make sure that we have the proper env variables by sourcing the admin credentials:

# source admin-openrc
# openstack user create --domain default --password "GLANCE_PASS" glance
# openstack role add --project service --user glance admin
# openstack service create --name glance --description "OpenStack Image" image
# openstack endpoint create --region RegionOne image public http://controller:9292
# openstack endpoint create --region RegionOne image internal http://controller:9292
# openstack endpoint create --region RegionOne image admin http://controller:9292

Now we are ready to install the components:

# apt install -y glance

At the time of writing this post, there is an error in the glance package in the OpenStack repositories that makes (e.g.) the integration with cinder fail. The problem is that the file /etc/glance/rootwrap.conf and the folder /etc/glance/rootwrap.d end up inside the folder /etc/glance/glance. So the patch simply consists in executing:

$ mv /etc/glance/glance/rootwrap.* /etc/glance/

And now we are creating the basic configuration files, needed to run glance as in the basic installation:

# cat > /etc/glance/glance-api.conf  << EOT
[database]
connection = mysql+pymysql://glance:GLANCE_DBPASS@controller/glance
backend = sqlalchemy
[image_format]
disk_formats = ami,ari,aki,vhd,vhdx,vmdk,raw,qcow2,vdi,iso,ploop.root-tar
[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = glance
password = GLANCE_PASS
[paste_deploy]
flavor = keystone
[glance_store]
stores = file,http
default_store = file
filesystem_store_datadir = /var/lib/glance/images/
EOT

And the glance registry configuration file:

# cat > /etc/glance/glance-registry.conf  << EOT
[database]
connection = mysql+pymysql://glance:GLANCE_DBPASS@controller/glance
backend = sqlalchemy
[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = service
username = glance
password = GLANCE_PASS
[paste_deploy]
flavor = keystone
EOT

The backend for storing the files is the folder /var/lib/glance/images/ in the controller node. If you want to change this folder, please update the variable filesystem_store_datadir in the file glance-api.conf.

We have created the files that result from following the official documentation, and now we are ready to start glance. First we’ll prepare the database:

# su -s /bin/sh -c "glance-manage db_sync" glance

And finally we will restart the services:

# service glance-registry restart
# service glance-api restart

At this point, we are creating our first image (the common cirros image):

# wget -q http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-x86_64-disk.img -O /tmp/cirros-0.4.0-x86_64-disk.img
# openstack image create "cirros" --file /tmp/cirros-0.4.0-x86_64-disk.img --disk-format qcow2 --container-format bare --public
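
A quick check that the image has been registered correctly (it should be listed with status “active”):

# openstack image list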

Nova (i.e. compute)

Nova is the set of services dedicated to the compute service. As we are installing the controller, this server will not run any VM; instead, it will coordinate the creation of the VMs in the working nodes.

First we need to create the databases and users in mysql:

# mysql -u root -p <<< "CREATE DATABASE nova_api;
CREATE DATABASE nova;
CREATE DATABASE nova_cell0;
CREATE DATABASE placement;
GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'localhost' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'%' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'localhost' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'%' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'localhost' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON nova_cell0.* TO 'nova'@'%' IDENTIFIED BY 'NOVA_DBPASS';
GRANT ALL PRIVILEGES ON placement.* TO 'placement'@'localhost' IDENTIFIED BY 'PLACEMENT_DBPASS';
GRANT ALL PRIVILEGES ON placement.* TO 'placement'@'%' IDENTIFIED BY 'PLACEMENT_DBPASS';"

And now we will create the users that will manage the services and the endpoints in keystone. But first we’ll make sure that we have the proper env variables by sourcing the admin credentials:

# source admin-openrc
# openstack user create --domain default --password "NOVA_PASS" nova
# openstack role add --project service --user nova admin
# openstack service create --name nova --description "OpenStack Compute" compute
# openstack endpoint create --region RegionOne compute public http://controller:8774/v2.1
# openstack endpoint create --region RegionOne compute internal http://controller:8774/v2.1
# openstack endpoint create --region RegionOne compute admin http://controller:8774/v2.1
# openstack user create --domain default --password "PLACEMENT_PASS" placement
# openstack role add --project service --user placement admin
# openstack service create --name placement --description "Placement API" placement
# openstack endpoint create --region RegionOne placement public http://controller:8778
# openstack endpoint create --region RegionOne placement internal http://controller:8778
# openstack endpoint create --region RegionOne placement admin http://controller:8778

Now we’ll install the services:

# apt -y install nova-api nova-conductor nova-consoleauth nova-novncproxy nova-scheduler nova-placement-api

Once the services have been installed, we create the basic configuration file:

# cat > /etc/nova/nova.conf  <<\EOT
[DEFAULT]
lock_path = /var/lock/nova
state_path = /var/lib/nova
transport_url = rabbit://openstack:RABBIT_PASS@controller
my_ip = 192.168.1.240
use_neutron = true
firewall_driver = nova.virt.firewall.NoopFirewallDriver
[api]
auth_strategy = keystone
[api_database]
connection = mysql+pymysql://nova:NOVA_DBPASS@controller/nova_api
[cells]
enable = False
[database]
connection = mysql+pymysql://nova:NOVA_DBPASS@controller/nova
[glance]
api_servers = http://controller:9292
[keystone_authtoken]
auth_url = http://controller:5000/v3
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = nova
password = NOVA_PASS
[neutron]
url = http://controller:9696
auth_url = http://controller:5000
auth_type = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = neutron
password = NEUTRON_PASS
service_metadata_proxy = true
metadata_proxy_shared_secret = METADATA_SECRET
[oslo_concurrency]
lock_path = /var/lib/nova/tmp
[placement]
os_region_name = openstack
region_name = RegionOne
project_domain_name = Default
project_name = service
auth_type = password
user_domain_name = Default
auth_url = http://controller:5000/v3
username = placement
password = PLACEMENT_PASS
[placement_database]
connection = mysql+pymysql://placement:PLACEMENT_DBPASS@controller/placement
[scheduler]
discover_hosts_in_cells_interval = 300
[vnc]
enabled = true
server_listen = $my_ip
server_proxyclient_address = $my_ip
EOT

In this file, the most important value to tweak is “my_ip”, which corresponds to the internal IP address of the controller.

Also remember that we are using simple passwords to make them easy to track. If you need to make the deployment more secure, please set secure passwords.

At this point we need to synchronize the databases and create the OpenStack cells:

# su -s /bin/sh -c "nova-manage api_db sync" nova
# su -s /bin/sh -c "nova-manage cell_v2 map_cell0" nova
# su -s /bin/sh -c "nova-manage cell_v2 create_cell --name=cell1 --verbose" nova
# su -s /bin/sh -c "nova-manage db sync" nova

Finally we need to restart the compute services:

# service nova-api restart
# service nova-consoleauth restart
# service nova-scheduler restart
# service nova-conductor restart
# service nova-novncproxy restart

We have to take into account that this is the controller node, and it will not host any virtual machine.
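
Even without compute nodes, we can check that the controller-side services registered correctly; nova-scheduler, nova-conductor and nova-consoleauth should appear as “up”:

# openstack compute service list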

Neutron

Neutron is the networking service in OpenStack. In this post we are installing the “self-service networks” option, so that the users will be able to create their own isolated networks.

First, we create the database for the neutron service:

# mysql -u root -p <<< "CREATE DATABASE neutron;
GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'localhost' IDENTIFIED BY 'NEUTRON_DBPASS';
GRANT ALL PRIVILEGES ON neutron.* TO 'neutron'@'%' IDENTIFIED BY 'NEUTRON_DBPASS'
"

Now we will create the openstack user and endpoints, but first we need to ensure that we set the env variables:

# source admin-openrc 
# openstack user create --domain default --password "NEUTRON_PASS" neutron
# openstack role add --project service --user neutron admin
# openstack service create --name neutron --description "OpenStack Networking" network
# openstack endpoint create --region RegionOne network public http://controller:9696
# openstack endpoint create --region RegionOne network internal http://controller:9696
# openstack endpoint create --region RegionOne network admin http://controller:9696

Now we are ready to install the packages related to neutron:

# apt install -y neutron-server neutron-plugin-ml2 neutron-linuxbridge-agent neutron-l3-agent neutron-dhcp-agent neutron-metadata-agent

And now we need to create the configuration files for neutron. First, the general file /etc/neutron/neutron.conf:

# cat > /etc/neutron/neutron.conf <<\EOT
[DEFAULT]
core_plugin = ml2
service_plugins = router
allow_overlapping_ips = true
transport_url = rabbit://openstack:RABBIT_PASS@controller
auth_strategy = keystone
notify_nova_on_port_status_changes = true
notify_nova_on_port_data_changes = true
[agent]
root_helper = "sudo /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf"
[database]
connection = mysql+pymysql://neutron:NEUTRON_DBPASS@controller/neutron
[keystone_authtoken]
www_authenticate_uri = http://controller:5000
auth_url = http://controller:5000
memcached_servers = controller:11211
auth_type = password
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = NEUTRON_PASS
[nova]
auth_url = http://controller:5000
auth_type = password
project_domain_name = default
user_domain_name = default
region_name = RegionOne
project_name = service
username = nova
password = NOVA_PASS
[oslo_concurrency]
lock_path = /var/lock/neutron
EOT

Now the file /etc/neutron/plugins/ml2/ml2_conf.ini, which will be used to instruct neutron on how to create the LANs:

# cat > /etc/neutron/plugins/ml2/ml2_conf.ini <<\EOT
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
mechanism_drivers = linuxbridge,l2population
extension_drivers = port_security
[ml2_type_flat]
flat_networks = provider
[ml2_type_vxlan]
vni_ranges = 1:1000
[securitygroup]
enable_ipset = true
EOT

Now the file /etc/neutron/plugins/ml2/linuxbridge_agent.ini, because we are using linux bridges in this setup:

# cat > /etc/neutron/plugins/ml2/linuxbridge_agent.ini <<\EOT
[linux_bridge]
physical_interface_mappings = provider:eno3
[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver
enable_security_group = true
[vxlan]
enable_vxlan = true
local_ip = 192.168.1.240
l2_population = true
EOT

In this file, it is important to tweak the value “eno3” in physical_interface_mappings to match the physical interface that has access to the provider’s network (i.e. the public network). It is also essential to set the proper value for “local_ip“, which is the IP address of the internal interface used to communicate with the compute hosts.
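
If you are not sure about the interface names or the internal IP addresses on a node, a quick way to list them is:

# ip -br addr show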

Now we have to create the files corresponding to the l3 agent and the dhcp agent:

# cat > /etc/neutron/l3_agent.ini <<EOT
[DEFAULT]
interface_driver = linuxbridge
EOT
# cat > /etc/neutron/dhcp_agent.ini <<EOT
[DEFAULT]
interface_driver = linuxbridge
dhcp_driver = neutron.agent.linux.dhcp.Dnsmasq
enable_isolated_metadata = true
dnsmasq_dns_servers = 8.8.8.8
EOT

Finally we need to create the file /etc/neutron/metadata_agent.ini:

# cat > /etc/neutron/metadata_agent.ini <<EOT
[DEFAULT]
nova_metadata_host = controller
metadata_proxy_shared_secret = METADATA_SECRET
EOT

Once the configuration files have been created, we are synchronizing the database and restarting the services related to neutron.

# su -s /bin/sh -c "neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/ml2_conf.ini upgrade head" neutron
# service nova-api restart
# service neutron-server restart
# service neutron-linuxbridge-agent restart
# service neutron-dhcp-agent restart
# service neutron-metadata-agent restart
# service neutron-l3-agent restart

And that’s all.

At this point we have the controller node installed according to the OpenStack documentation. It is possible to issue any command, but it will not be possible to start any VM, because we have not installed any compute node yet.
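
A quick way to verify that the neutron agents came up properly on the controller (the linuxbridge, DHCP, metadata and L3 agents should be listed as alive):

# openstack network agent list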

How to dynamically create on-demand services to respond to incoming TCP connections

Some time ago I had the problem of dynamically starting virtual machines when an incoming connection was received on a port. The exact problem was to have a VM that was powered off, start it whenever an incoming ssh connection was received, and then forward the network traffic to that VM to serve the ssh request. In this way, I could have a server in a cloud provider (e.g. Amazon) and not spend money while I was not using it.

This problem has been named “the sleeping beauty”, because of the tale. It is like having a sleeping virtual infrastructure (i.e. the sleeping beauty) that is awakened when an incoming connection (i.e. the kiss) is received from the user (i.e. the prince).

Now I have figured out how to solve that problem, and that is why this time I learned

How to dynamically create on-demand services to respond to incoming TCP connections

The way to solve it is very straightforward, as it is fully based on the socat application.

socat is “a relay for bidirectional data transfer between two independent data channels”. It can be used to forward the traffic received on a port to another IP:PORT pair.

A simple example is:

$ socat tcp-listen:10000 tcp:localhost:22 &

And now we can SSH to localhost in the following way:

$ ssh localhost -p 10000

The interesting thing is that socat is able to exec a command upon receiving a connection (using the address type EXEC or SYSTEM as the destination of the relay). But the most important thing is that socat will establish the communication using stdin and stdout.

So it is possible to make this funny thing:

$ socat tcp-listen:10000 SYSTEM:'echo "hello world"' &
[1] 11136
$ wget -q -O- http://localhost:10000
hello world
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'echo "hello world"'

Now that we know that the communication is established using stdin and stdout, we can somehow abuse socat and try this even funnier thing:

$ socat tcp-listen:10000 SYSTEM:'echo "$(date)" >> /tmp/sshtrack.log; socat - "TCP:localhost:22"' &
[1] 27421
$ ssh localhost -p 10000 cat /tmp/sshtrack.log
mié feb 27 14:36:45 CET 2019
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'echo "$(date)" >> /tmp/sshtrack.log; socat - "
TCP:localhost:22"'

The effect is that we can execute commands and redirect the connection to an arbitrary IP:PORT.

Now, it is easy to figure out how to dynamically spawn servers to serve the incoming TCP requests. An example that spawns a one-shot web server on port 8080 to serve requests received on port 10000 is the next one:

$ socat tcp-listen:10000 SYSTEM:'(echo "hello world" | nc -l -p 8080 -q 1 > /dev/null &) ; socat - "TCP:localhost:8080"' &
[1] 31586
$ wget -q -O- http://localhost:10000
hello world
$
[1]+ Done socat tcp-listen:10000 SYSTEM:'(echo "hello world" | nc -l -p 8080 -q 1 > /dev/null &) ; socat - "TCP:localhost:8080"'

And now you can customize your scripts to create the effective servers on demand.
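
For instance, a minimal sketch of an on-demand ssh relay could look like the next one (the start-vm.sh helper and the VM address are hypothetical, just to illustrate the idea):

$ cat > on-demand-ssh <<\EOT
#!/bin/bash
# hypothetical helper that boots (or resumes) the VM through the provider API
./start-vm.sh
# once the VM is up, relay the connection to its ssh port (address is an assumption)
socat - "TCP:192.168.1.100:22"
EOT
$ chmod +x on-demand-ssh
$ socat tcp-listen:10000,fork SYSTEM:"./on-demand-ssh" &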

The sleeping beauty application

I have used these proofs of concept to create the sleeping-beauty application. It is open source, and you can get it in github.

The sleeping beauty is a system that helps to implement serverless infrastructures: you have the servers asleep (or not even created), and they are awakened (or created) as they are needed. Later, they go back to sleep (or they are disposed of).

In the sleeping-beauty, you can configure services that listen on a port, and the commands that socat should use to start, check the status of, or stop the effective services. Moreover, it implements an idle-detection mechanism that is able to check whether the effective service is idle and, if it has been idle for a period of time, stop it to save resources.

Example: in the description of the use case, the command used to start the service will contact Amazon AWS and start a VM. The command to stop the service will contact Amazon AWS to stop the VM. And the command to check whether the service is idle or not will ssh into the VM and execute the command ‘who’.
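
As an illustration only (the instance id, hostname and user are placeholders, and the exact commands depend on your provider), the three commands could look like:

$ aws ec2 start-instances --instance-ids i-0123456789abcdef0   # start command
$ aws ec2 stop-instances --instance-ids i-0123456789abcdef0    # stop command
$ ssh ubuntu@my-vm who                                         # idle-check command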

How to install OpenStack Rocky – part 1

This is the first post of a series in which I am describing the installation of an OpenStack site using the latest distribution at the time of writing: Rocky.

My project is very ambitious, because I have 2 virtualization nodes (each with a different GPU), 10GbE, a lot of memory and disk, and I want to offer the GPUs to the VMs. The front-end is a 24 core server, with 32 Gb. RAM and 6 Tb. disk, with 4 network ports (2x10GbE+2x1GbE), that will also act as the block device server.

We’ll be using Ubuntu 18.04 LTS for all the nodes, and I’ll try to follow the official documentation. But I will try to be very straightforward in the configuration… I want to make it work, and I will try to explain how things work instead of tuning the configuration.

How to install OpenStack Rocky – part 1

My setup for the OpenStack installation is the next one:

[figure “horsemen”: the servers, their network interfaces and IP addresses]

In the figure I have annotated the most relevant data to identify the servers: the IP addresses for each interface, which is the volume server and the virtualization nodes that will share their GPUs.

At the end, the server horsemen will host the next services: keystone, glance, cinder, neutron and horizon. On the other side, fh01 and fh02 will host the services compute and neutron-related.

In each of the servers we need a network interface (eno1, enp1s0f1 and enp1f0f1) which is intended for administration purposes (i.e. the network 192.168.1.0/24). That interface has a gateway (192.168.1.220) that enables access to the internet via NAT. From now on, we’ll refer to these interfaces as the “private interfaces“.

We need an additional interface that is connected to the provider network (i.e. to the internet). That network will hold the publicly routable IP addresses. In my case, I have the network 158.42.1.1/24, that is publicly routable. It is a flat network with its own network services (e.g. gateway, nameservers, dhcp servers, etc.). From now on, we’ll call these interfaces the “public interfaces“.

One note on the “eno4” interface in horsemen: I am using this interface for accessing horizon. In case you do not have a spare interface, you can use interface aliasing or provide the IP address in the ifupdown “up” script.
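
A minimal sketch of that second option, added to the eno3 stanza in /etc/network/interfaces (the address is the one used in this post; adjust it to your provider network), would be:

auto eno3
iface eno3 inet manual
up ip link set dev $IFACE up
up ip addr add 158.42.1.1/24 dev $IFACE
down ip link set dev $IFACE down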

An extra note on “eno2” interface in horsemen: It is an extra interface in the node. It will be left unused during this installation, but it will be configured to work in bond mode with “eno1”.

IMPORTANT: In the whole series of posts, I am using the passwords as they appear: RABBIT_PASS, NOVA_PASS, NOVADB_PASS, etc. You should change them according to a secure password policy, but they are set as-is to make the installation easier to understand. Anyway, most of them will be fine if you have an isolated network and the services listen only on the management network (e.g. mysql will only be configured to listen on the management interface).

Some words on the Openstack network (concepts)

The basic installation of Openstack considers two networks: the provider network and the management network. The provider network means “the network that is attached to the provider”, i.e. the network where the VMs can have publicly routable IP addresses. On the other hand, the management network is a private network that is (probably) isolated from the provider one. The computers in that network have private IP addresses (e.g. 192.168.0.0/16, 10.0.0.0/8, 172.16.0.0/12).

The basic deployment of Openstack considers that the controller node does not need to have a routable IP address. Instead, it can be accessed by the admin through the management network. That is why the “eno3” interface does not have an IP address.

In the Openstack architecture, horizon is a separate piece, and it is horizon which needs a routable IP address. As I want to install horizon also in the controller, I need a routable IP address on it, and that is why I put a publicly routable IP address on “eno4” (158.42.1.1).

In my case, I had a spare network interface (eno4), but if you do not have one, you can create a bridge, add your “interface connected to the provider network” (i.e. “eno3”) to that bridge, and then add a publicly routable IP address to the bridge.

IMPORTANT: this is not part of my installation. Although it may be part of your installation.

brctl addbr br-public
brctl addif br-public eno3
ip link set dev br-public up
ip addr add 158.42.1.1/16 dev br-public

Configuring the network

One of the first things that we need to set up is to configure the network for the different servers.

Ubuntu 18.04 has moved to netplan but, at the time of writing this text, I have not found any mechanism to bring an interface up without providing an IP address for it using netplan. Moreover, when trying to use ifupdown, netplan is not totally disabled and interferes with options such as dns-nameservers for the static addresses. In the end I needed to install ifupdown and make a mix of configuration using both netplan and ifupdown.

It is very important to disable IPv6 for any of the servers, because if not, you will probably face a problem when using the public IP addresses. You can read more in this link.

To disable IPv6, we need to execute the following lines in all the servers (as root):

# sysctl -w net.ipv6.conf.all.disable_ipv6=1
# sysctl -w net.ipv6.conf.default.disable_ipv6=1
# cat >> /etc/default/grub << EOT
GRUB_CMDLINE_LINUX_DEFAULT="ipv6.disable=1"
GRUB_CMDLINE_LINUX="ipv6.disable=1"
EOT
# update-grub

We’ll disable IPv6 for the current session, and persist it by disabling at boot time. If you have customized your grub, you should check the options that we are setting.

Configuring the network in “horsemen”

You need to install ifupdown to be able to bring up an interface that is connected to the provider network but has no IP address; it will later be used by the neutron-related services:

# apt update && apt install -y ifupdown

Edit the file /etc/network/interfaces and adjust it with a content like the next one:

auto eno3
iface eno3 inet manual
up ip link set dev $IFACE up
down ip link set dev $IFACE down

Now edit the file /etc/netplan/50-cloud-init.yaml to set the private IP address:

network:
  ethernets:
    eno4:
      dhcp4: true
    eno1:
      addresses:
        - 192.168.1.240/24
      gateway4: 192.168.1.221
      nameservers:
        addresses: [ 192.168.1.220, 8.8.8.8 ]
  version: 2

When you save these settings, you can issue the next commands:

# netplan generate
# netplan apply
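
You can check that the address and the default route have been applied with:

# ip addr show eno1
# ip route show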

Now we’ll edit the file /etc/hosts, and will add the addresses of each server. My file is the next one:

127.0.0.1       localhost.localdomain   localhost
192.168.1.240   horsemen controller
192.168.1.241   fh01
192.168.1.242   fh02

I have removed the entry 127.0.1.1 because I read that it may interfere, and I also removed all the IPv6 entries because I disabled IPv6.

Configuring the network in “fh01” and “fh02”

Here is the short version of the configuration of fh01:

# apt install -y ifupdown
# cat >> /etc/network/interfaces << \EOT
auto enp1s0f0
iface enp1s0f0 inet manual
up ip link set dev $IFACE up
down ip link set dev $IFACE down
EOT

Here is my file /etc/netplan/50-cloud-init.yaml for fh01:

network:
  ethernets:
    enp1s0f1:
      addresses:
        - 192.168.1.241/24
      gateway4: 192.168.1.221
      nameservers:
        addresses: [ 192.168.1.220, 8.8.8.8 ]
  version: 2

Here is the file /etc/hosts for fh01:

127.0.0.1 localhost.localdomain localhost
192.168.1.240 horsemen controller
192.168.1.241 fh01
192.168.1.242 fh02

You can export this configuration to fh02 by adjusting the IP address in the /etc/netplan/50-cloud-init.yaml file.

Reboot and test

Now it is a good moment to reboot your systems and test that the network is properly configured. If it is not, please make sure that it works before continuing; otherwise, the next steps will make no sense.

From each of the hosts you should be able to ping the outside world and to ping the other hosts. These are the tests from horsemen, but you need to be able to repeat them from each of the servers.

root@horsemen# ping -c 2 www.google.es
PING www.google.es (172.217.17.3) 56(84) bytes of data.
64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=1 ttl=54 time=7.26 ms
64 bytes from mad07s09-in-f3.1e100.net (172.217.17.3): icmp_seq=2 ttl=54 time=7.26 ms

--- www.google.es ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 7.262/7.264/7.266/0.002 ms
root@horsemen# ping -c 2 fh01
PING fh01 (192.168.1.241) 56(84) bytes of data.
64 bytes from fh01 (192.168.1.241): icmp_seq=1 ttl=64 time=0.180 ms
64 bytes from fh01 (192.168.1.241): icmp_seq=2 ttl=64 time=0.113 ms

--- fh01 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1008ms
rtt min/avg/max/mdev = 0.113/0.146/0.180/0.035 ms
root@horsemen# ping -c 2 fh02
PING fh02 (192.168.1.242) 56(84) bytes of data.
64 bytes from fh02 (192.168.1.242): icmp_seq=1 ttl=64 time=0.223 ms
64 bytes from fh02 (192.168.1.242): icmp_seq=2 ttl=64 time=0.188 ms

--- fh02 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1027ms
rtt min/avg/max/mdev = 0.188/0.205/0.223/0.022 ms

Prerequisites for Openstack in the server (horsemen)

Remember: for simplicity, I will use obvious passwords like SERVICE_PASS or SERVICEDB_PASS (e.g. RABBIT_PASS). You should change these passwords, although most of them will be fine if you have an isolated network and the services listen only on the management network.

First of all, we are installing the prerequisites. We will start with the NTP server, which will keep the clocks synchronized between the controller (horsemen) and the virtualization servers (fh01 and fh02). We’ll install chrony (recommended in the Openstack documentation) and allow any computer in our private network to connect to this new NTP server:

# apt install -y chrony
# cat >> /etc/chrony/chrony.conf << EOT
allow 192.168.1.0/24
EOT
# service chrony restart
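
To check that chrony is running and synchronizing against some time source, you can use:

# chronyc sources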

Now we are installing and configuring the database server (we’ll use mariadb as it is used in the basic installation):

# apt install mariadb-server python-pymysql
# cat > /etc/mysql/mariadb.conf.d/99-openstack.cnf << EOT
[mysqld]
bind-address = 192.168.1.240

default-storage-engine = innodb
innodb_file_per_table = on
max_connections = 4096
collation-server = utf8_general_ci
character-set-server = utf8
EOT
# service mysql restart
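
A quick way to make sure that mariadb picked up the new configuration and is listening on the management IP address (and not on localhost only) is to check the listening sockets:

# ss -tnlp | grep 3306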

Now we are installing rabbitmq, that will be used to orchestrate message interchange between services (please change RABBIT_PASS).

# apt install rabbitmq-server
# rabbitmqctl add_user openstack "RABBIT_PASS"
# rabbitmqctl set_permissions openstack ".*" ".*" ".*"

At this moment, we have to install memcached and configure it to listen in the management interface:

# apt install memcached

# echo "-l 192.168.1.240" >> /etc/memcached.conf

# service memcached restart

Finally we need to install etcd and configure it to be accessible by openstack

# apt install etcd
# cat >> /etc/default/etcd << EOT
ETCD_NAME="controller"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-01"
ETCD_INITIAL_CLUSTER="controller=http://192.168.1.240:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.240:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.240:2379"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.240:2379"
EOT
# systemctl enable etcd
# systemctl start etcd
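
A simple check that etcd is up and listening on the client URL is to query its version endpoint (assuming curl is installed; /version is part of the etcd HTTP API):

# curl http://192.168.1.240:2379/version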

Now we are ready to continue with the installation of the OpenStack Rocky packages… (continue to part 2)

How to create compressed and/or encrypted bash scripts

The immediate use of bash scripting is to automate tasks, but there are a lot of command-line applications and tools that are powerful enough, and easy enough to use, to remove the need for other languages such as python, perl, etc., which are harder to learn just to implement workflows of other tools.

When you try to make these scripts usable by third parties, their complexity increases as the amount of code that makes checks, verifications, etc. grows. In the end, you have created scripts that are reasonably big. Moreover, when you want to transfer these applications to others, you may want to avoid their re-distribution.

So this time I explored…

How to create compressed and/or encrypted bash scripts

TL;DR: An extended version of the code that gzips and encrypts a script, so that it runs only when a password (or license) is provided, is in the repository named LSC – License Shell Code.

Bash scripting is a powerful language to develop applications that manage servers, but also to automate processes in Linux, to execute batch tasks, or for other tasks that can be executed from the Linux command line, e.g. implementing workflows in scientific applications (processing the output from one application to prepare it for another application).

But if your script grows a lot, maybe you want to compress it. Moreover, compressing the code somehow obfuscates it, so it reduces the chance of your code being re-used in other applications without permission (at least for common end-users that do not master bash scripting).

Compressing the script and running it

Having a bash script in file example, compressing it is very simple. You can just see the next commands that create the example script and generates a gzipped version:

$ cat > example <<EOT
#!/bin/bash
echo "hello world"
EOT
$ cat > compressedscript <<EOT
#!/bin/bash
eval "\$(echo "$(cat example | gzip | base64)" | base64 -d | gunzip)"
EOT

Now the compressed script looks like this one:

#!/bin/bash
eval "$(echo "H4sIAOMSmVsAA1NW1E/KzNNPSizO4EpNzshXUMpIzcnJVyjPL8pJUeICAFeChBYfAAAA" | base64 -d | gunzip)"

And executing it is as simple as chmodding it and running it:

$ chmod +x compressedscript
$ ./compressedscript
hello world

As you can see, the process is very simple. We just embedded the gzipped source code (*) in a new script and took advantage of the eval built-in.

(*) We needed to base64-encode the gzipped file to be able to embed it in a plain text file.
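
To avoid doing this by hand for every script, a minimal sketch of a generator (a hypothetical gzscript helper, not part of LSC) could be the next one; it reads any script and prints the self-extracting version to stdout:

$ cat > gzscript <<\EOT
#!/bin/bash
# usage: ./gzscript <script>  (prints the compressed, self-evaluating version to stdout)
PAYLOAD="$(gzip < "$1" | base64 | tr -d '\n')"
echo '#!/bin/bash'
echo "eval \"\$(echo \"$PAYLOAD\" | base64 -d | gunzip)\""
EOT
$ chmod +x gzscript
$ ./gzscript example > compressedscript && chmod +x compressedscript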

Encrypting the script

Now that we know that the eval function works for our purposes, we can generalize the function of our proof of concept, to embed more complex things. For example, we can encrypt our script…

To encrypt things, we can use openssl as shown in this post. Briefly, we can encrypt a file with a command like the next one:

$ openssl enc -aes-256-cbc -salt -in file.txt -out file.txt.enc

And decrypt an encrypted file with a command like the next one:

$ openssl enc -aes-256-cbc -d -in file.txt.enc -out file.txt

We apply encryption to our case, in addition to gzipping our script. The resulting code is very similar to the previous case, but adds the encryption step:

$ cat > compressedscript <<EOT
#!/bin/bash
read -p "Please provide a password: " PASSWORD
eval "\$(echo "$(cat example | gzip | openssl enc -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "mypasswd" | base64 | tr -d '\n')" | base64 -d | openssl enc -d -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "\$PASSWORD" | gunzip)"
EOT

The resulting content for the new compressedscript file is the next one:

#!/bin/bash
read -p "Please provide a password: " PASSWORD
eval "$(echo "U2FsdGVkX18gnaQ3jFPBfhalu0/riWaRirtWHcFWgFqGFuzf3s98T/Y65Km2oe4jqGaAXlDCBX6+oWwWgqMIOBfG/O3P7qgmLTRpkzvShwc=" | base64 -d | openssl enc -d -aes-256-cbc -in /dev/stdin -out /dev/stdout -k "$PASSWORD" | gunzip)"

If we run that script, it will prompt for a password to decode the script (in our case it is “mypasswd”), and then it will be run:

$ ./compressedscript
Please provide a password: mypasswd
hello world
$ ./compressedscript
Please provide a password: otherpasswd
bad decrypt
140040196646552:error:06065064:digital envelope routines:EVP_DecryptFinal_ex:bad decrypt:evp_enc.c:529:

gzip: stdin: not in gzip format

As shown above, if a different password is introduced, the code will not be run.

Further reading

An extended version of the code of this post is included in LSC – License Shell Code. LSC is an application that generates shell applications (from existing ones) that need a license code to be run.

NOTES: Providing your users with a LICENSE code gives them a more professional distribution and a more tailored solution. Requiring the code also implies that they know what they are doing, so it also acts as a first barrier (both from the point of view of knowledge and of ethics).

How to run Docker containers using common Linux tools (without Docker)

Containers are a current virtualization and application delivery trend, and I am working on them. If you search Google for them, you can find tons of how-tos, information, tech guides, etc. As with anything in IT, there are flavors of containers. In this case, the players are Docker, LXC/LXD (on which Docker was once based), CoreOS rkt, OpenVZ, etc. If you have a look at Google Trends, you’ll notice that the undisputed winner of the hype is Docker, and “the others” try to fight against it.

But as there are several alternatives, I wanted to learn about the underlying technology, and it seems that all of them are simply based on a set of kernel features: mainly the Linux namespaces and the cgroups. The most important differences are the utilities that they provide to automate the procedures (the repository of images, container management and other parts of the ecosystem of a particular product).

Disclaimer: This is not a research blog, and so I am not going in depth on when namespaces were introduced in the kernel, which namespaces exist, how they work, what is copy on write, what are cgroups, etc. The purpose of this post is simply “fun with containers” 🙂

At the end, the “hard work” (i.e. the execution of a containerized environment) is made by the Linux kernel. And so this time I learned…

How to run Docker containers using common Linux tools (without Docker).

We start from a scenario in which we have one container running in Docker, and we want to run it using standard Linux tools. We will mainly act as a common user that has permissions to run Docker containers (i.e. in the case of Ubuntu, my user calfonso is in group ‘docker’), to see that we can run containers in the user space.

TL;DR

To run a contained environment with its own namespaces, using standard Linux tools you can follow the next procedure:

calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar
calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/
calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user chroot $PWD/rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

At this point you need to set up the network devices (from outside the container) and deal with the cgroups (if you need to).

In first place, we are preparing a folder for our tests (handcontainer) and then we will dump the filesystem of the container:

calfonso:~$ mkdir handcontainer
calfonso:~$ cd handcontainer
calfonso:handcontainer$ docker export blissful_goldstine -o dockercontainer.tar

If we check the tar file produced, we’ll see that it is the whole filesystem of the container

Let’s extract it in a new folder (called rootfs)

calfonso:handcontainer$ mkdir rootfs
calfonso:handcontainer$ tar xf dockercontainer.tar --ignore-command-error -C rootfs/

This action will raise an error, because only the root user can use the mknod application and it is needed for the /dev folder, but it will be fine for us because we are not dealing with devices.

If we check the contents of rootfs, the filesystem is there and we can chroot to that filesystem to verify that we can use it (more or less) as if it was the actual system.

The chroot technique is well known and it was enough in the early days, but we have no isolation in such a system. This becomes evident if we use the next commands:

/ # ip link
/ # mount -t proc proc /proc && ps -ef
/ # hostname

In these cases, we can manipulate the network of the host, interact with the processes of the host or manipulate the hostname.

This is because using chroot only changes the root filesystem for the current session, but it takes no other action.

Some words on namespaces

Part of the “magic” of containers is the namespaces (you can read more on this in this link). Namespaces make a process have its own particular view of “things” in several areas. The namespaces that are currently available in Linux are the next ones:

  • Mounts namespace: mount points.
  • PID namespace: process number.
  • IPC namespace: Inter Process Communication resources.
  • UTS namespace: hostname and domain name.
  • Network namespace: network resources.
  • User namespace: User and Group ID numbers.

Namespaces are handled in the Linux kernel, and any process is already in one namespace (i.e. the root namespace). So changing the namespaces of one particular process does not introduce additional complexity for the processes.

Creating particular namespaces for particular processes means that one process will have its own view of the resources in that namespace. As an example, if one process is started with its own PID namespace, the PID number of the process will be 1 and its children will have the next PID numbers. Or if one process is started with its own NET namespace, it will have its own stack of network devices.
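
A quick way to see the PID namespace behaviour (assuming util-linux’s unshare and root privileges; --mount-proc also remounts /proc so that the tools inside see the new namespace) is:

$ sudo unshare --pid --fork --mount-proc /bin/bash
# echo $$
1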

The parent namespace of one namespace is able to manipulate the nested namespace… It is a “hard” sentence, but what this means is that the root namespace is always able to manipulate the resources in the nested namespaces. So the root of one host has the whole vision of the namespaces.

Using namespaces

Now that we know about namespaces, we want to use them 😉

We can think of a container as one process (e.g. a /bin/bash shell) that has its particular root filesystem, its particular network, its particular hostname, its particular PIDs and users, etc. And this can be achieved by creating all these namespaces and spawning the /bin/bash processes inside of them.

The Linux kernel includes the system calls clone, setns and unshare, which make it possible to manipulate the namespaces of processes. The common Linux distributions also provide the commands unshare and nsenter, which make it possible to manipulate the namespaces of processes and applications from the command line.

If we get back to the main host, we can use the command unshare to create a process with its own namespaces:

calfonso:handcontainer$ unshare --mount --uts --ipc --net --pid --fork --user --map-root-user /bin/bash

It seems that nothing happened, except that we are “root”, but if we start using commands that manipulate the features in the host, we’ll see what happened.

If we echo the PID of the current process ($$), we can see that it is 1 (the main process), the current user has UID and GID 0 (i.e. root), we do not have any network device, and we can manipulate the hostname…

If we check the processes in the host, from another terminal, we’ll see that even though we are shown as ‘root’ inside, outside the namespace our process is executed under the credentials of our regular user.

This is the magic of the PID namespace, that makes that one process has different PID numbers, depending on the namespace.

Back in our “unshared” environment, if we try to show the processes that are currently running, we’ll still get the view of the processes in the host.

This is because of how Linux works: the processes are exposed in the /proc mount point and, in our environment, we still have access to the existing mount points. But as we have our own mount namespace, we can mount our own /proc filesystem.

From outside the container, we will be able to create a network device and put it into the namespace.
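
A minimal sketch of this step, run as root in the host (the PID is the one of the unshared process as seen from the host, and it is assumed that the host has no device named eth0), would be:

# create a pair of connected virtual interfaces
sudo ip link add veth-host type veth peer name eth0
# move one of the ends into the network namespace of the unshared process
sudo ip link set eth0 netns <pid-of-the-unshared-bash>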

And if we get back to our “unshared” environment, we’ll see that we have a new network device.

The network setup is incomplete, and we will not be able to reach anywhere (the peer of our eth0 is not connected to any network). This falls out of the scope of this post, but the main idea is that you will need to connect the peer to some bridge, set an IP address for the eth0 inside the unshared environment, set up NAT in the host, etc.
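
Just to sketch the idea (names and addresses are assumptions, not part of the original setup):

# in the host: create a bridge, attach the host end of the veth pair and enable NAT
sudo brctl addbr br-cont
sudo ip addr add 10.0.3.1/24 dev br-cont
sudo ip link set br-cont up
sudo brctl addif br-cont veth-host
sudo ip link set veth-host up
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.0.3.0/24 ! -o br-cont -j MASQUERADE
# inside the unshared environment: give eth0 an address and a default route
ip addr add 10.0.3.2/24 dev eth0
ip link set eth0 up
ip route add default via 10.0.3.1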

Obtaining the filesystem of the container

Now that we are in an “isolated” environment, we want to have the filesystem, utilities, etc. from the container that we started. And this can be done with our old friend “chroot” and some mounts:

root:handcontainer# chroot rootfs ash
root:# mount -t proc none /proc
root:# mount -t sysfs none /sys
root:# mount -t tmpfs none /tmp

Using chroot, the filesystem changes and we can use all the new mount points, commands, etc. in that filesystem. So now we have the vision of being inside an isolated environment with an isolated filesystem.

Now we have finished setting up a “hand made container” from an existing Docker container.

Further work

Apart from the “contained environment”, Docker containers are also managed inside cgroups. Cgroups make it possible to account for and to limit the resources that the processes are able to use (i.e. CPU, I/O, memory and devices), and that is interesting to better control the resources that the processes will be allowed to use (and how).

It is possible to explore the cgroups in the path /sys/fs/cgroup. In that folder you will find the different cgroups that are managed in the system. Dealing with cgroups is a bit “obscure” (creating subfolders, adding PIDs to files, etc.), and will be left for an eventual future post, although a minimal example is sketched below.
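
A minimal illustration with the memory controller (assuming cgroup v1, which is what Ubuntu mounted at the time; the group name is arbitrary):

# create a new cgroup in the memory controller
sudo mkdir /sys/fs/cgroup/memory/handcontainer
# limit the memory of the group to 256 MB
echo $((256*1024*1024)) | sudo tee /sys/fs/cgroup/memory/handcontainer/memory.limit_in_bytes
# add the PID of our contained shell to the group (its children will inherit it)
echo <pid-of-the-unshared-bash> | sudo tee /sys/fs/cgroup/memory/handcontainer/tasks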

Another feature that Docker offers is the layered filesystem. The layered filesystem is used in Docker basically to have a common filesystem and only track the modifications. So there is a set of common layers for different containers (that will not be modified), and each of the containers has a layer that makes its filesystem different from the others.

In our case, we used a simple flat filesystem for the container, that we used as root filesystem for our contained environment. Dealing with layered filesystem will be a new post 😉

And now…

Well, in this post we tried to understand how the containers work and see that it is a relatively simple feature that is offered by the kernel. But it involves a lot of steps to have a properly configured container (remember that we left the cgroups out of this post).

We did these steps just because we could… just to better understand containers.

My advice is to use the existing technologies to be able to use well-built containers (e.g. Docker).

Further reading

As in other posts, I wrote this just to arrange my concepts following a very simple step-by-step procedure. But you can find a lot of resources about containers using your favourite search engine. The most useful resources that I found are:

How to (securely) contain users using Docker containers

Docker containers have proved to be very useful to deliver applications. They make it possible to pack all the libraries and dependencies needed by an application and to run it in any system. One of the main drawbacks argued by Docker competitors is that the Docker daemon runs as root and it may introduce security threats.

I have searched for the security problems of Docker (e.g. sysdig, the Black Hat conference, CVEs, etc.) and I could only find privilege escalation by running privileged containers (--privileged), files that are written with root permissions, using the communication socket, using block devices, poisoned images, etc. But all of these problems are related to letting the users start their own containers.

So I think that Docker can be used by sysadmins to provide a different or contained environment to the users, e.g. having a CentOS 7 front-end but letting some users run an Ubuntu 16.04 environment. This is why this time I learned…

How to (securely) contain users using Docker containers

TL;DR

You can find the results of this tests in this repo: https://github.com/grycap/dosh

The repository contains DoSH (which stands for Docker SHell), which is a development that uses Docker containers to run the shell of the users in your Linux system. It is an in-progress project that aims at providing a configurable and secure mechanism so that, when a user logs in to a Linux system, a customized (or standard) container is created for him. This makes it possible to limit the resources that the user is able to use, the applications, etc., but also to provide a custom Linux flavour for each user or group of users (i.e. users with CentOS 7 can coexist with users with Ubuntu 16.04 on the same server).

The Docker SHell

In a multi-user system it would be nice to offer a feature like providing different flavours of Linux, depending on the user. Or even including a “jailed” system for some specific users.

This could be achieved in a very easy way. You just need to create a script like the next one

root@onefront00:~# cat > /bin/dosh <<\EOF
docker run --rm -it alpine ash
EOF
root@onefront00:~# chmod +x /bin/dosh 
root@onefront00:~# echo "/bin/dosh" >> /etc/shells

And now you can change the shell of one user in /etc/passwd:

myuser:x:9870:9870::/home/myuser:/bin/dosh

And you simply have to allow myuser to run docker containers (e.g. in Ubuntu, by adding the user to the “docker” group).
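
In Ubuntu, for instance, that is a one-liner (myuser is the example user of this post); and, as an alternative to editing /etc/passwd by hand, chsh also works because /bin/dosh was added to /etc/shells:

root@onefront00:~# usermod -aG docker myuser
root@onefront00:~# chsh -s /bin/dosh myuser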

Now, when “myuser” logs in to the system, he will be inside a container with the Alpine flavour.

This is a simple solution that enables the user to have a specific Linux distribution… but also a specific Linux environment with special applications, libraries, etc.

But the user has no access to his home folder nor to other files that would be interesting to have, to give him the appearance of being in the real system. So we can just map his home folder (and other folders that we want to have inside the container, e.g. /tmp). A modified version of /bin/dosh is the next one:

#!/bin/bash
username="$(whoami)"
docker run --rm -v /home/$username:/home/$username -v /tmp:/tmp -it alpine ash

But if we log in as myuser, the result is that the user that logs in is… root. And everything he does, he does as root.

We need to run the container as the user and not as root. An updated version of the script is the next one:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

If myuser now logs in, the container has the permissions of this user.

We can double-check it by checking the running processes of the container

The problem now is that the name of the user (and the groups) are not properly resolved inside the container.

This is because the /etc/passwd and the /etc/group files are included in the container, and they do not know about the users or groups in the system. As we want to resemble the system in the container, we can share a readonly copy of /etc/passwd and /etc/group by modifying the /bin/dosh script:

#!/bin/bash
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"
docker run --rm -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -it alpine ash

And now the container has the permissions of the user and the username is resolved. So the user can access the resources in the filesystem under the same conditions as if he were accessing the host system.

Now we should add the mappings for the folders to which the user has to have permissions to access (e.g. scratch, /opt, etc.).

Using this script as-is, the user will have a different environment for each of the sessions that he starts. That means that processes will not be shared between different sessions.

But we can create a more elaborate script that starts containers using different Docker images depending on the user or on the group to which the user belongs, or even create pseudo-persistent containers that start when the user logs in and stop when the user leaves (to allow multiple ttys for the same environment).

An example of this kind of script will be the next one:

#!/bin/bash

# identify the user that is logging in
username="$(whoami)"
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

# select a different image and shell depending on the user
case "$username" in
  myuser)
    CONTAINERIMAGE="ubuntu:16.04"
    CMD="/bin/bash";;
esac

# check whether the container already exists (docker inspect fails if it does not)
RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
  # the container does not exist: create it in detached mode
  docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
  if [ $? -ne 0 ]; then
    exit 1
  fi
else
  # the container exists but may be stopped: start it again
  if [ "$RUNNING" == "false" ]; then
    docker start "$CONTAINERNAME" > /dev/null
    if [ $? -ne 0 ]; then
      exit 1
    fi
  fi
fi
# attach an interactive session to the running container
docker exec -it "$CONTAINERNAME" "$CMD"

Using this script we start the user containers on demand and their processes are kept between log-ins. Moreover, the log-in will fail in case that the container fails to start.

In the event that the system is powered off, the container will be powered off although its contents are kept for future log-ins (the container will be restarted from the stop state).

The development of Docker SHell continues in this repository: https://github.com/grycap/dosh

Security concerns

The main problem of Docker related to security is that the daemon is running as root. So if I am able to run containers, I am able to run something like this:

$ docker run --privileged alpine ash -c 'echo 1 > /proc/sys/kernel/sysrq; echo o > /proc/sysrq-trigger'

And the host will be powered off, even though I am a regular user. Or simply…

$ docker run --rm -v /etc:/etc -it alpine ash
/ # adduser mynewroot -G root
...
/ # exit

And once you exit the container, you will have a new root user in the physical host.

This happens because the user inside the container is “root” that has UID=0, and it is root because the Docker daemon is root with UID=0.

We could change this behaviour by shifting the user namespace with the flag --userns-remap and the subuids, to make the Docker daemon not run the containers as UID=0, but this will also limit the features of Docker for the sysadmin. The first consequence is that the sysadmin will not be able to run Docker containers as root (nor privileged containers). If this is acceptable for your system, this will probably be the best solution for you, as it limits the possible security threats.
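
For reference, enabling the remapping is a matter of adding the option to the Docker daemon configuration and restarting it (a minimal sketch, assuming there is no previous /etc/docker/daemon.json; check the subuid/subgid mappings for your system before doing this):

# cat > /etc/docker/daemon.json <<\EOF
{
  "userns-remap": "default"
}
EOF
# systemctl restart docker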

If you are not experienced with the configuration of Docker, or you simply do not want (or do not know how) to use --userns-remap, you can still use DoSH.

On linux capabilities

If you add the flag --cap-drop=all (or a selective --cap-drop) to the command line that runs the Docker container, you can get an even more secure container that will never acquire certain Linux capabilities (e.g. to mount a device). You can learn more about capabilities in the Linux manpage, but we can easily verify the capabilities…

We will run a process inside the container, using the flag --cap-drop=all:

$ docker run --rm --cap-drop=all -u 1001:1001 -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/myuser:/home/myuser -v /tmp:/tmp -w /home/myuser -it alpine sleep 10000

Now we can check the capabilities of that process.

We should check the effective capabilities (CapEff) but also the bounding set (CapBnd), which determines which capabilities the process could acquire (e.g. using sudo or executing a setuid application). We can see that both are zero, which means that the process cannot get any capability.
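
To inspect them, you can read the capability sets from /proc and decode them with capsh (from the libcap tools); a sketch, assuming the sleep process started above is still running:

$ grep Cap /proc/$(pgrep -f "sleep 10000")/status
$ capsh --decode=0000000000000000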

Take into account that using --cap-drop=all will make commands such as ping stop working, because ping is an application that needs specific capabilities (in the case of ping, it needs cap_net_raw, and this is why it has suid permissions).

Dropping capabilities when spawning the container makes the sleep command inside the container even more secure than the regular one. You can check it by simply repeating the same procedure without the container. In that case, if you inspect the capabilities, you will find a different picture.

The effective capability set is still zero, but the CapBnd field shows that the user could escalate up to get any of the capabilities through a buggy application.

Executing the docker commands by non-root users

The actual problem is that the user needs to be allowed to use Docker to spawn the DoSH container, but you do not want to allow the user to run arbitrary docker commands.

We can consider that the usage of Docker is secure if the containers are run under the credentials of regular users, and the devices and other critical resources that are attached to the container are used under these credentials. So users can be allowed to run Docker containers if they are forced to include the flag -u <uid>:<gid> and the rest of the command line is controlled.

The solution is as easy as installing sudo (which is shipped in the default distribution of Ubuntu, and is a standard package in almost any distribution) and allowing users to run as sudo only a specific command that executes the docker commands, without allowing these users to modify those commands.

Once installed sudo, we can create the file /etc/sudoers.d/dosh

root@onefront00:~# cat > /etc/sudoers.d/dosh <<\EOF
> ALL ALL=NOPASSWD: /bin/shell2docker
> EOF
root@onefront00:~# chmod 440 /etc/sudoers.d/dosh

Now we must move the previous /bin/dosh script to /bin/shell2docker and then we can create the script /bin/dosh with the following content:

root@onefront00:~# mv /bin/dosh /bin/shell2docker
root@onefront00:~# cat > /bin/dosh <<\EOF
#!/bin/bash
sudo /bin/shell2docker
EOF
root@onefront00:~# chmod +x /bin/dosh

And finally, we will remove the ability to run docker containers to the user (e.g. in Ubuntu, by removing him from the “docker” group).

If you try to log in as the user, you will notice that now we have the problem that the user that runs the script is “root”, and so the container will be run as “root”. But we can modify the script to detect whether it has been run through sudo or by a regular user, and then grab the appropriate username. The updated script is the next one:

#!/bin/bash
if [ $SUDO_USER ]; then username=$SUDO_USER; else username="$(whoami)"; fi
uid="$(id -u $username)"
gid="$(id -g $username)"

CONTAINERNAME="container-${username}"
CONTAINERIMAGE="alpine"
CMD="ash"

case "$username" in
  myuser)
    CONTAINERIMAGE="ubuntu:16.04"
    CMD="/bin/bash";;
esac

RUNNING="$(docker inspect -f "{{.State.Running}}" "$CONTAINERNAME" 2> /dev/null)"
if [ $? -ne 0 ]; then
  docker run -h "$(hostname)" -u $uid:$gid -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v /home/$username:/home/$username -v /tmp:/tmp -w /home/$username -id --name "$CONTAINERNAME" "$CONTAINERIMAGE" "$CMD" > /dev/null
  if [ $? -ne 0 ]; then
    exit 1
  fi
else
  if [ "$RUNNING" == "false" ]; then
    docker start "$CONTAINERNAME" > /dev/null
    if [ $? -ne 0 ]; then
      exit 1
    fi
  fi
fi
docker exec -it "$CONTAINERNAME" "$CMD"

Now any user can execute the command that creates the Docker container as root (using sudo), but the user cannot run arbitrary Docker commands. So all the security relies again on the sysadmin, who must create “secure” containers.

This is an in-progress work that will continue in this repository: https://github.com/grycap/dosh