How to move from a linear disk to an LVM disk and join the two disks into an LVM-like RAID-0

I recently needed to add a disk to an existing Ubuntu installation to make the / filesystem bigger. In such a case, there are two possibilities: to move the whole system to a new, bigger disk (and e.g. dispose of the original disk), or to convert the disk to an LVM volume and add a second disk so that the volume can grow. The first case was the subject of a previous post, but this time I learned…

How to move from a linear disk to an LVM disk and join the two disks into an LVM-like RAID-0

The starting point is simple:

  • I have one 14 GB disk (/dev/vda) with a single partition that is mounted in / (the disk has a GPT table and boots with UEFI, so it has some extra partitions that we’ll keep as they are).
  • I have a brand new 80 GB disk (/dev/vdb).
  • I want to have one 94 GB volume built from the two disks.
root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part /
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk /mnt
vdc 252:32 0 4G 0 disk [SWAP]

The steps are the following:

  1. Create a boot partition in /dev/vdb (this is needed because GRUB cannot boot from LVM and needs an ext or VFAT partition)
  2. Format the boot partition and put the content of the current /boot folder
  3. Create an LVM volume using the extra space in /dev/vdb and initialize it using an ext4 filesystem
  4. Put the contents of the current / folder into the new partition
  5. Update grub to boot from the new disk
  6. Update the mount point for our system
  7. Reboot (and check)
  8. Add the previous disk to the LVM volume.

Let’s start…

Separate the /boot partition

When installing an LVM system, it is needed to have a /boot partition in a common format (e.g. ext2 or ext4), because GRUB cannot read from LVM. Then GRUB reads the contents of that partition and starts the proper modules to read the LVM volumes.

So we need to create the /boot partition. In our case, we are using the ext2 format, because it has no journal (we do not need one for the content of /boot) and it is faster. We are using 1 GB for the /boot partition, but 512 MB will probably be enough:

root@somove:~# fdisk /dev/vdb

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (1-4, default 1):
First sector (2048-167772159, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-167772159, default 167772159): +1G

Created a new partition 1 of type 'Linux' and of size 1 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

root@somove:~# mkfs.ext2 /dev/vdb1
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 262144 4k blocks and 65536 inodes
Filesystem UUID: 24618637-d2d4-45fe-bf83-d69d37f769d0
Superblock backups stored on blocks:
32768, 98304, 163840, 229376

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

Now we’ll make a mount point for this partition, mount the partition and copy the contents of the current /boot folder to that partition:

root@somove:~# mkdir /mnt/boot
root@somove:~# mount /dev/vdb1 /mnt/boot/
root@somove:~# cp -ax /boot/* /mnt/boot/

Create an LVM volume in the extra space of /dev/vdb

First, we will create a new partition for our LVM system, and we’ll get the whole free space:

root@somove:~# fdisk /dev/vdb

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): n
Partition type
p primary (1 primary, 0 extended, 3 free)
e extended (container for logical partitions)
Select (default p):

Using default response p.
Partition number (2-4, default 2):
First sector (2099200-167772159, default 2099200):
Last sector, +sectors or +size{K,M,G,T,P} (2099200-167772159, default 167772159):

Created a new partition 2 of type 'Linux' and of size 79 GiB.

Command (m for help): w
The partition table has been altered.
Syncing disks.

Now we will create a Physical Volume, a Volume Group and the Logical Volume for our root filesystem, using the new partition:

root@somove:~# pvcreate /dev/vdb2
Physical volume "/dev/vdb2" successfully created.
root@somove:~# vgcreate rootvg /dev/vdb2
Volume group "rootvg" successfully created
root@somove:~# lvcreate -l +100%free -n rootfs rootvg
Logical volume "rootfs" created.

If you want to learn about LVM to better understand what we are doing, you can read my previous post.

Now we will initialize the new /dev/rootvg/rootfs volume with an ext4 filesystem, and then we’ll copy the existing filesystem except for the special folders and the /boot folder (which we have separated into the other partition):

root@somove:~# mkfs.ext4 /dev/rootvg/rootfs
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 20708352 4k blocks and 5177344 inodes
Filesystem UUID: 47b4b698-4b63-4933-98d9-f8904ad36b2e
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done

root@somove:~# mkdir /mnt/rootfs
root@somove:~# mount /dev/rootvg/rootfs /mnt/rootfs/
root@somove:~# rsync -aHAXx --delete --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/boot/*,/lost+found} / /mnt/rootfs/

Update the system to boot from the new /boot partition and the LVM volume

At this point we have our /boot partition (/dev/vdb1) and the / filesystem (/dev/rootvg/rootfs). Now we need to prepare GRUB to boot using these new resources. And here comes the magic…

root@somove:~# mount --bind /dev /mnt/rootfs/dev/
root@somove:~# mount --bind /sys /mnt/rootfs/sys/
root@somove:~# mount -t proc /proc /mnt/rootfs/proc/
root@somove:~# chroot /mnt/rootfs/

We are binding the special mount points /dev and /sys to the same folders in the new filesystem, which is mounted in /mnt/rootfs. We are also mounting /proc, which holds the information about the running processes. You can find some more information about why this is needed in my previous post on chroot and containers.

Intuitively, we are somehow “in the new filesystem” and now we can update things as if we had already booted into it.

At this point, we need to update the mount point in /etc/fstab to mount the proper disks once the system boots. So we are getting the UUIDs for our partitions:

root@somove:/# blkid
/dev/vda1: LABEL="cloudimg-rootfs" UUID="135ecb53-0b91-4a6d-8068-899705b8e046" TYPE="ext4" PARTUUID="b27490c5-04b3-4475-a92b-53807f0e1431"
/dev/vda14: PARTUUID="14ad2c62-0a5e-4026-a37f-0e958da56fd1"
/dev/vda15: LABEL="UEFI" UUID="BF99-DB4C" TYPE="vfat" PARTUUID="9c37d9c9-69de-4613-9966-609073fba1d3"
/dev/vdb1: UUID="24618637-d2d4-45fe-bf83-d69d37f769d0" TYPE="ext2"
/dev/vdb2: UUID="Uzt1px-ANds-tXYj-Xwyp-gLYj-SDU3-pRz3ed" TYPE="LVM2_member"
/dev/mapper/rootvg-rootfs: UUID="47b4b698-4b63-4933-98d9-f8904ad36b2e" TYPE="ext4"
/dev/vdc: UUID="3377ec47-a0c9-4544-b01b-7267ea48577d" TYPE="swap"

Now we update /etc/fstab to mount /dev/mapper/rootvg-rootfs as the / filesystem, and to mount partition /dev/vdb1 in /boot. For our example, the /etc/fstab file will be this one:

UUID="47b4b698-4b63-4933-98d9-f8904ad36b2e" / ext4 defaults 0 0
UUID="24618637-d2d4-45fe-bf83-d69d37f769d0" /boot ext2 defaults 0 0
LABEL=UEFI /boot/efi vfat defaults 0 0
UUID="3377ec47-a0c9-4544-b01b-7267ea48577d" none swap sw,comment=cloudconfig 0 0

We are using the UUIDs to mount the / and /boot filesystems because the devices may change their names or order, and that could break our system.
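
If we only want the UUID of a specific device (for example, to paste it into /etc/fstab), blkid can print just that field. A minimal sketch, using the devices of this example:

blkid -s UUID -o value /dev/mapper/rootvg-rootfs   # UUID for the / entry
blkid -s UUID -o value /dev/vdb1                   # UUID for the /boot entry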

And now we are ready to mount our /boot partition, update GRUB, and install it on the /dev/vda disk (because we are keeping both disks and the machine boots from /dev/vda).

root@somove:/# mount /boot
root@somove:/# update-grub
Generating grub configuration file ...
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Found linux image: /boot/vmlinuz-4.15.0-43-generic
Found initrd image: /boot/initrd.img-4.15.0-43-generic
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Found Ubuntu 18.04.1 LTS (18.04) on /dev/vda1
done
root@somove:/# grub-install /dev/vda
Installing for i386-pc platform.
Installation finished. No error reported.

Reboot and check

We are almost done; now we exit the chroot and reboot:

root@somove:/# exit
root@somove:~# reboot

And the result should be the following:

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk
├─vdb1 252:17 0 1G 0 part /boot
└─vdb2 252:18 0 79G 0 part
└─rootvg-rootfs 253:0 0 79G 0 lvm /
vdc 252:32 0 4G 0 disk [SWAP]

root@somove:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 676K 394M 1% /run
/dev/mapper/rootvg-rootfs 78G 993M 73G 2% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/vdb1 1008M 43M 915M 5% /boot
/dev/vda15 105M 3.6M 101M 4% /boot/efi
tmpfs 395M 0 395M 0% /run/user/1000

We have our / filesystem mounted from the new LVM logical volume /dev/rootvg/rootfs, the /boot partition from /dev/vdb1, and /boot/efi from the existing partition (just in case we need it).
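
We can also double-check the LVM state after the reboot with the usual reporting commands (just a quick sketch; the exact output will vary):

pvs   # /dev/vdb2 should appear as the only physical volume for now
vgs   # rootvg, with about 79 GiB
lvs   # the rootfs logical volume that is mounted in /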

Add the previous disk to the LVM volume

Here we face the easiest part, which is to integrate the original /dev/vda1 partition into the LVM volume group.

Once we have double-checked that every file from the original / filesystem in /dev/vda1 has been copied, we can initialize the partition to be used by LVM. A quick way to perform that check is sketched below.
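
For example, we could mount the old partition read-only and compare a few key folders (this is only a sketch; /mnt/oldroot is a hypothetical mount point, and files that changed after the rsync will show up as differences):

mkdir -p /mnt/oldroot
mount -o ro /dev/vda1 /mnt/oldroot
diff -rq /mnt/oldroot/etc /etc       # repeat for the folders you care about
umount /mnt/oldroot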

WARNING: This step wipes the content of /dev/vda1.

root@somove:~# pvcreate /dev/vda1
WARNING: ext4 signature detected on /dev/vda1 at offset 1080. Wipe it? [y/n]: y
Wiping ext4 signature on /dev/vda1.
Physical volume "/dev/vda1" successfully created.

Finally, we can integrate the new partition in our volume group and extend the logical volume to use the free space:

root@somove:~# vgextend rootvg /dev/vda1
Volume group "rootvg" successfully extended
root@somove:~# lvextend -l +100%free /dev/rootvg/rootfs
Size of logical volume rootvg/rootfs changed from <79.00 GiB (20223 extents) to 92.88 GiB (23778 extents).
Logical volume rootvg/rootfs successfully resized.
root@somove:~# resize2fs /dev/rootvg/rootfs
resize2fs 1.44.1 (24-Mar-2018)
Filesystem at /dev/rootvg/rootfs is mounted on /; on-line resizing required
old_desc_blocks = 10, new_desc_blocks = 12
The filesystem on /dev/rootvg/rootfs is now 24348672 (4k) blocks long.

And now we have the new ~93 GiB / filesystem, which is built from /dev/vda1 and /dev/vdb2:

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
├─vda1 252:1 0 13.9G 0 part
│ └─rootvg-rootfs 253:0 0 92.9G 0 lvm /
├─vda14 252:14 0 4M 0 part
└─vda15 252:15 0 106M 0 part /boot/efi
vdb 252:16 0 80G 0 disk
├─vdb1 252:17 0 1G 0 part /boot
└─vdb2 252:18 0 79G 0 part
└─rootvg-rootfs 253:0 0 92.9G 0 lvm /
vdc 252:32 0 4G 0 disk [SWAP]
root@somove:~# df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 395M 676K 394M 1% /run
/dev/mapper/rootvg-rootfs 91G 997M 86G 2% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/vdb1 1008M 43M 915M 5% /boot
/dev/vda15 105M 3.6M 101M 4% /boot/efi
tmpfs 395M 0 395M 0% /run/user/1000

(optional) Having the /boot partition in /dev/vda

In case we wanted to have the /boot partition in /dev/vda, the procedure would be a bit different:

  1. Instead of splitting /dev/vdb into a /boot partition and an LVM volume, create a single ext4 partition /dev/vdb1, which does not yet imply the separation of /boot and /.
  2. Once /dev/vdb1 is created, copy the filesystem in /dev/vda1 to /dev/vdb1 and prepare to boot from /dev/vdb1 (chroot, adjust mount points, update-grub, grub-install…).
  3. Boot from the new partition and wipe the original /dev/vda1 partition.
  4. Create a partition /dev/vda1 for the new /boot and initialize it using ext2, copy the contents of /boot according to the instructions in this post.
  5. Create a partition /dev/vda2, create the LVM volume, initialize it and copy the contents of /dev/vdb1 except from /boot
  6. Prepare to boot from /dev/vda (chroot, adjust mount points, mount /boot, update-grub, grub-install…)
  7. Boot from the new /boot + LVM layout and decide whether you want to add /dev/vdb to the LVM volume or not.

Using this procedure, you will move from linear to LVM with a single disk. Then you can decide whether to grow the LVM or not, and you may also decide whether to create an LVM RAID (1, 5, …) with the new or other disks.

How to move an existing installation of Ubuntu to another disk

Under some circumstances, we may need to move a working installation of Ubuntu to another disk. The most common case is when the current disk runs out of space and you want to move to a bigger one. But you could also want to move to an SSD or to create an LVM RAID…

So this time I learned…

How to move an existing installation of Ubuntu to another disk

I have a 14 GB disk that contains my / partition (vda), and I want to move to a new 80 GB disk (vdb).

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
└─vda1 252:1 0 13.9G 0 part /
vdb 252:16 0 80G 0 disk
vdc 252:32 0 4G 0 disk [SWAP]

First of all, I will create a partition for my / system in /dev/vdb.

root@somove:~# fdisk /dev/vdb

Welcome to fdisk (util-linux 2.31.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-167772159, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-167772159, default 167772159):

Created a new partition 1 of type 'Linux' and of size 80 GiB.

Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.

NOTE: The inputs from the user are: n for a new partition, then the defaults (i.e. pressing return) for every setting to get the whole disk, and finally w to write the partition table.
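
As a non-interactive alternative (not part of the original session), sfdisk should be able to create the same single partition from a one-line description; double-check the target device before running anything like this:

echo ',,L' | sfdisk /dev/vdb   # one partition, whole disk, type Linux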

Now that we have the new partition, we’ll create the filesystem (ext4):

root@somove:~# mkfs.ext4 /dev/vdb1
mke2fs 1.44.1 (24-Mar-2018)
Creating filesystem with 20971264 4k blocks and 5242880 inodes
Filesystem UUID: ea7ee2f5-749e-4e74-bcc3-2785297291a4
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000

Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done

We have to transfer the content of the running filesystem to the new disk. But first, we’ll make sure that any mount point other than / is unmounted (to avoid copying files that live on other disks):

root@somove:~# umount -a
umount: /run/user/1000: target is busy.
umount: /sys/fs/cgroup/unified: target is busy.
umount: /sys/fs/cgroup: target is busy.
umount: /: target is busy.
umount: /run: target is busy.
umount: /dev: target is busy.
root@somove:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=2006900k,nr_inodes=501725,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=403912k,mode=755)
/dev/vda1 on / type ext4 (rw,relatime,data=ordered)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=403908k,mode=700,uid=1000,gid=1000)
root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 14G 0 disk
└─vda1 252:1 0 13.9G 0 part /
vdb 252:16 0 80G 0 disk
└─vdb1 252:17 0 80G 0 part
vdc 252:32 0 4G 0 disk [SWAP]

Now we will create a mount point for the new filesystem and we’ll copy everything from / to it, except for the special folders (i.e. /tmp, /sys, /dev, etc.). Once completed, we’ll create the Linux special folders:

root@somove:~# mkdir /mnt/vdb1
root@somove:~# mount /dev/vdb1 /mnt/vdb1/
root@somove:~# rsync -aHAXx --delete --exclude={/dev/*,/proc/*,/sys/*,/tmp/*,/run/*,/mnt/*,/media/*,/lost+found} / /mnt/vdb1/

Instead of using rsync, we could use cp -ax /bin /etc /home /lib /lib64 …, but then you need to make sure that all folders and files are copied, and that the special folders are created by running mkdir /mnt/vdb1/{boot,mnt,proc,run,tmp,dev,sys}. The rsync version is easier to control and to understand.

Now that we have the same directory tree, we just need to do the chroot magic to prepare the new disk:

root@somove:~# mount --bind /dev /mnt/vdb1/dev
root@somove:~# mount --bind /sys /mnt/vdb1/sys
root@somove:~# mount -t proc /proc /mnt/vdb1/proc
root@somove:~# chroot /mnt/vdb1/

We need to make sure that the new system will try to mount the new partition (i.e. /dev/vdb1) in /, but we cannot use the /dev/vdb1 device name, because if we remove the other disk its name will change to /dev/vda1. So we are using the UUID of the partition. To get it, we can use blkid:

root@somove:/# blkid
/dev/vda1: UUID="135ecb53-0b91-4a6d-8068-899705b8e046" TYPE="ext4"
/dev/vdb1: UUID="eb8d215e-d186-46b8-bd37-4b244cbb8768" TYPE="ext4"

And now we have to update /etc/fstab to mount the proper UUID in /. The new /etc/fstab for our example is the following:

UUID="eb8d215e-d186-46b8-bd37-4b244cbb8768" / ext4 defaults 0 0

At this point, we need to update GRUB to match our disks (it will pick up the UUIDs or labels), and install it on the new disk:

root@somove:/# update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.15.0-43-generic
Found initrd image: /boot/initrd.img-4.15.0-43-generic
WARNING: Failed to connect to lvmetad. Falling back to device scanning.
Found Ubuntu 18.04.1 LTS (18.04) on /dev/vda1
done
root@somove:/# grub-install /dev/vdb
Installing for i386-pc platform.
Installation finished. No error reported.

WARNING: In case we get the error “error: will not proceed with blocklists.”, please check the last section of this post.

WARNING: If you plan to keep the original disk in its place (e.g. a Virtual Machine in Amazon or OpenStack), you must install grub in /dev/vda. Otherwise, it will boot the previous system.

Finally, we can exit the chroot, power off, remove the old disk, and boot using the new one. The result will be the following:

root@somove:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:16 0 80G 0 disk
└─vda1 252:17 0 80G 0 part /
vdc 252:32 0 4G 0 disk [SWAP]

What if we get error “error: will not proceed with blocklists.”

If we get this error (and only if we get this error), we’ll need to wipe the gap between the start of the disk and the first partition, and then we’ll be able to install GRUB on the disk.

WARNING: make sure that you know what you are doing, or that the disk is new, because this can potentially erase the data on /dev/vdb.

$ grub-install /dev/vdb
Installing for i386-pc platform.
grub-install: warning: Attempting to install GRUB to a disk with multiple partition labels. This is not supported yet..
grub-install: warning: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However, blocklists are UNRELIABLE and their use is discouraged..
grub-install: error: will not proceed with blocklists.

In this case, we need to check the partition table of /dev/vdb:

root@somove:/# fdisk -l /dev/vdb
Disk /dev/vdb: 80 GiB, 85899345920 bytes, 167772160 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device Boot Start End Sectors Size Id Type
/dev/vdb1 2048 167772159 167770112 80G 83 Linux

And now we will write zeros to /dev/vdb (skipping the first sector, where the partition table is stored) up to the sector before where our partition starts (in our case, partition /dev/vdb1 starts at sector 2048, so we will zero 2047 sectors):

root@somove:/# dd if=/dev/zero of=/dev/vdb seek=1 count=2047
2047+0 records in
2047+0 records out
1048064 bytes (1.0 MB, 1.0 MiB) copied, 0.0245413 s, 42.7 MB/s

If this was the problem, now you should be able to install grub:

root@somove:/# grub-install /dev/vdb
Installing for i386-pc platform.
Installation finished. No error reported.

How to use LVM

LVM stands for Logical Volume Manager, and it provides logical volume management for the Linux kernel. It enables us to manage multiple physical disks from a single manager and to create logical volumes that take advantage of having multiple disks (e.g. RAID, thin provisioning, volumes that span across disks, etc.).

I have needed to use LVM multiple times, and it is of special interest when dealing with LVM-backed Cinder in OpenStack.

So this time I learned…

How to use LVM

LVM is available in multiple Linux distros, and they are usually LVM-aware so that they are able to boot from LVM volumes.

For the purpose of this post, we’ll consider LVM as a mechanism to manage multiple physical disks and to create logical volumes on them. Then LVM will show the operating system the logical volumes as if they were disks.

Testlab

LVM is intended for physical disks (e.g. /dev/sda, /dev/sdb, etc.). But we will create a test lab to avoid the need to buy physical disks.

We will create 4 fake disks of 256 MB each. To create each of them, we simply create a file of the proper size (which will store the data) and then attach that file to a loop device:

root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.0 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.1 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.2 bs=1M count=256
...
root@s1:/tmp# dd if=/dev/zero of=/tmp/fake-disk-256.3 bs=1M count=256
...
root@s1:/tmp# losetup /dev/loop0 ./fake-disk-256.0
root@s1:/tmp# losetup /dev/loop1 ./fake-disk-256.1
root@s1:/tmp# losetup /dev/loop2 ./fake-disk-256.2
root@s1:/tmp# losetup /dev/loop3 ./fake-disk-256.3
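
As a side note, losetup can also pick the first free loop device for you and print the name it used (a small sketch):

losetup -f --show ./fake-disk-256.0   # prints the device it attached, e.g. /dev/loop0
losetup -d /dev/loop0                 # detaches the device when we are done with it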

And now you have 4 working disks for our tests:

root@s1:/tmp# fdisk -l
Disk /dev/loop0: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop1: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop2: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/loop3: 256 MiB, 268435456 bytes, 524288 sectors
...
Disk /dev/sda: (...)

For the system, these devices can be used as regular disks (e.g. formatted, mounted, etc.):

root@s1:/tmp# mkfs.ext4 /dev/loop0
mke2fs 1.44.1 (24-Mar-2018)
...
Writing superblocks and filesystem accounting information: done

root@s1:/tmp# mkdir -p /tmp/mnt/disk0
root@s1:/tmp# mount /dev/loop0 /tmp/mnt/disk0/
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# touch this-is-a-file
root@s1:/tmp/mnt/disk0# ls -l
total 16
drwx------ 2 root root 16384 Apr 16 12:35 lost+found
-rw-r--r-- 1 root root 0 Apr 16 12:38 this-is-a-file
root@s1:/tmp/mnt/disk0# cd /tmp/
root@s1:/tmp# umount /tmp/mnt/disk0

Concepts of LVM

LVM has three simple actors:

  • Physical Volume (PV): a physical disk (or a partition).
  • Volume Group (VG): a set of physical volumes managed together.
  • Logical Volume (LV): a block device carved out of a volume group.

Logical Volumes (LV) are stored in Volume Groups (VG), which are backed by Physical Volumes (PV).

PVs are managed using pv* commands (e.g. pvscan, pvs, pvcreate, etc.). VGs are managed using vg* commands (e.g. vgs, vgdisplay, vgextend, etc.). LVs are managed using lv* commands (e.g. lvdisplay, lvs, lvextend, etc.).

Simple workflow with LVM

To have an LVM system, we have to first initialize a physical volume. That is somehow “initializing a disk in LVM format”, and that wipes the content of the disk:

root@s1:/tmp# pvcreate /dev/loop0
WARNING: ext4 signature detected on /dev/loop0 at offset 1080. Wipe it? [y/n]: y
Wiping ext4 signature on /dev/loop0.
Physical volume "/dev/loop0" successfully created.

Now we have to create a volume group (we’ll call it test-vg):

root@s1:/tmp# vgcreate test-vg /dev/loop0
Volume group "test-vg" successfully created

And finally, we can create a logical volume:

root@s1:/tmp# lvcreate -l 100%vg --name test-vol test-vg
Logical volume "test-vol" created.

And now we have a simple LVM system that is built from one single physical disk (/dev/loop0) that contains one single volume group (test-vg) that holds a single logical volume (test-vol).

Examining things in LVM

  • The commands to examine PVs: pvs and pvdisplay. Each of them offers different information. pvscan also exists, but it is not needed in current versions of LVM.
  • The commands to examine VGs: vgs and vgdisplay. Each of them offers different information. vgscan also exists, but it is not needed in current versions of LVM.
  • The commands to examine LVs: lvs and lvdisplay. Each of them offers different information. lvscan also exists, but it is not needed in current versions of LVM.

Each command has several options, which we are not exploring here. We are just using the commands and we’ll present some options in the next examples.
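
As an example of those options, the reporting commands accept -o to select the columns to show (a small sketch; the field names are standard LVM report fields):

pvs -o pv_name,vg_name,pv_size,pv_free
vgs -o vg_name,pv_count,lv_count,vg_size,vg_free
lvs -o lv_name,vg_name,lv_size,segtype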

At this time we should have a PV, a VG, and an LV, and we can see them by using pvs, vgs and lvs:

root@s1:/tmp# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 0
root@s1:/tmp# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 1 1 0 wz--n- 252.00m 0
root@s1:/tmp# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
test-vol test-vg -wi-a----- 252.00m

Now we can use the test-vol as if it was a partition:

root@s1:/tmp# mkfs.ext4 /dev/mapper/test--vg-test--vol
mke2fs 1.44.1 (24-Mar-2018)
...
Writing superblocks and filesystem accounting information: done

root@s1:/tmp# mount /dev/mapper/test--vg-test--vol /tmp/mnt/disk0/
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# touch this-is-my-file
root@s1:/tmp/mnt/disk0# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 241M 2.1M 222M 1% /tmp/mnt/disk0

Adding another disk to grow the filesystem

Imagine that we have filled our 241 MB volume on our 256 MB disk (/dev/loop0) and we need some more storage space. We could buy an extra disk (i.e. /dev/loop1) and add it to the LVM setup (using the command vgextend).

 

root@s1:/tmp# pvcreate /dev/loop1
Physical volume "/dev/loop1" successfully created.
root@s1:/tmp# vgextend test-vg /dev/loop1
Volume group "test-vg" successfully extended

And now we have two physical volumes in a single volume group. The VG has a size of 504 MB and there are 252 MB free.

root@s1:/tmp# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 0
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m
root@s1:/tmp# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 2 1 0 wz--n- 504.00m 252.00m

We can think of the VG as if it were a disk and of the LVs as its partitions. So we can grow the LV within the VG and then grow the filesystem:

root@s1:/tmp/mnt/disk0# lvscan
ACTIVE '/dev/test-vg/test-vol' [252.00 MiB] inherit

root@s1:/tmp/mnt/disk0# lvextend -l +100%free /dev/test-vg/test-vol
Size of logical volume test-vg/test-vol changed from 252.00 MiB (63 extents) to 504.00 MiB (126 extents).
Logical volume test-vg/test-vol successfully resized.

root@s1:/tmp/mnt/disk0# resize2fs /dev/test-vg/test-vol
resize2fs 1.44.1 (24-Mar-2018)
Filesystem at /dev/test-vg/test-vol is mounted on /tmp/mnt/disk0; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 4
The filesystem on /dev/test-vg/test-vol is now 516096 (1k) blocks long.

root@s1:/tmp/mnt/disk0# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 485M 2.3M 456M 1% /tmp/mnt/disk0

root@s1:/tmp/mnt/disk0# ls -l
total 12
drwx------ 2 root root 12288 Apr 22 16:46 lost+found
-rw-r--r-- 1 root root 0 Apr 22 16:46 this-is-my-file

Now we have the LV at double its original size.

Downsize the LV

Now imagine we have freed some space and we want to keep only 1 disk (e.g. /dev/loop0). So we can downsize the filesystem (e.g. to 200 MB) and then downsize the LV.

This method requires unmounting the filesystem. So if you want to resize the root partition, you would need to use a live system or to pivot root to an unused filesystem, as described here: https://unix.stackexchange.com/a/227318.
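
As a side note, lvreduce (and lvextend) accept the -r/--resizefs flag to resize the filesystem together with the logical volume in a single command. A sketch of the equivalent operation (ext4 still cannot be shrunk while mounted):

lvreduce --resizefs -L 200M /dev/test-vg/test-vol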

First, unmount the filesystem and check it:

root@s1:/tmp# umount /tmp/mnt/disk0
root@s1:/tmp# e2fsck -ff /dev/mapper/test--vg-test--vol
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/test--vg-test--vol: 12/127008 files (0.0% non-contiguous), 22444/516096 blocks

Then change the size of the filesystem to the desired size:

root@s1:/tmp# resize2fs /dev/test-vg/test-vol 200M
resize2fs 1.44.1 (24-Mar-2018)
Resizing the filesystem on /dev/test-vg/test-vol to 204800 (1k) blocks.
The filesystem on /dev/test-vg/test-vol is now 204800 (1k) blocks long.

And now, we’ll reduce the logical volume to the new size and re-check the filesystem:

root@s1:/tmp# lvreduce -L 200M /dev/test-vg/test-vol
WARNING: Reducing active logical volume to 200.00 MiB.
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce test-vg/test-vol? [y/n]: y
Size of logical volume test-vg/test-vol changed from 504.00 MiB (126 extents) to 200.00 MiB (50 extents).
Logical volume test-vg/test-vol successfully resized.
root@s1:/tmp# lvdisplay
--- Logical volume ---
LV Path /dev/test-vg/test-vol
LV Name test-vol
VG Name test-vg
LV UUID xGh4cd-R93l-UpAL-LGTV-qnxq-vvx2-obSubY
LV Write Access read/write
LV Creation host, time s1, 2020-04-22 16:26:48 +0200
LV Status available
# open 0
LV Size 200.00 MiB
Current LE 50
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:0

root@s1:/tmp# resize2fs /dev/test-vg/test-vol
resize2fs 1.44.1 (24-Mar-2018)
The filesystem is already 204800 (1k) blocks long. Nothing to do!

And now we are ready to use the disk with the new size:

root@s1:/tmp# mount /dev/test-vg/test-vol /tmp/mnt/disk0/
root@s1:/tmp# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-test--vol 190M 1.6M 176M 1% /tmp/mnt/disk0
root@s1:/tmp# cd /tmp/mnt/disk0/
root@s1:/tmp/mnt/disk0# ls -l
total 12
drwx------ 2 root root 12288 Apr 22 16:46 lost+found
-rw-r--r-- 1 root root 0 Apr 22 16:46 this-is-my-file

Removing a PV

Now we want to remove /dev/loop0 (which was our original disk) and keep the replacement (/dev/loop1).

We just need to move the data off /dev/loop0 and remove it from the VG, in order to be able to safely remove it from the system. First, we check the PVs:

root@s1:/tmp/mnt/disk0# pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/loop0 test-vg lvm2 a-- 252.00m 52.00m 200.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m 0

We can see that /dev/loop0 is used, so we need to move its data to another PV:

root@s1:/tmp/mnt/disk0# pvmove /dev/loop0
/dev/loop0: Moved: 100.00%
root@s1:/tmp/mnt/disk0# pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m 0
/dev/loop1 test-vg lvm2 a-- 252.00m 52.00m 200.00m

Now /dev/loop0 is 100% free and we can remove it from the VG:

root@s1:/tmp/mnt/disk0# vgreduce test-vg /dev/loop0
Removed "/dev/loop0" from volume group "test-vg"
root@s1:/tmp/mnt/disk0# pvremove /dev/loop0
Labels on physical volume "/dev/loop0" successfully wiped.

Thin provisioning with LVM

Thin provisioning consists of giving the user the illusion of a certain amount of storage space, while only storing the data that is actually used. It is similar to the qcow2 or vmdk disk formats for virtual machines. The idea is that, if you have 1 GB of storage space used, the backend will only store that data, even if the volume is 10 GB. The total amount of storage space will be consumed as it is requested.

First, you need to reserve the effective storage space in the form of a “thin pool”. The next example reserves 200 MB (-L 200M) as thin storage (-T), creating the thin pool thinpool in VG test-vg:

root@s1:/tmp/mnt# lvcreate -L 200M -T test-vg/thinpool
Using default stripesize 64.00 KiB.
Thin pool volume with chunk size 64.00 KiB can address at most 15.81 TiB of data.
Logical volume "thinpool" created.

The result is that we’ll have a 200 MB volume which behaves like any other volume but is marked as a thin pool.

Now we can create two thin-provisioned volumes of the same size:

root@s1:/tmp/mnt# lvcreate -V 200M -T test-vg/thinpool -n thin-vol-1
Using default stripesize 64.00 KiB.
Logical volume "thin-vol-1" created.
root@s1:/tmp/mnt# lvcreate -V 200M -T test-vg/thinpool -n thin-vol-2
Using default stripesize 64.00 KiB.
WARNING: Sum of all thin volume sizes (400.00 MiB) exceeds the size of thin pool test-vg/thinpool and the amount of free space in volume group (296.00 MiB).
WARNING: You have not turned on protection against thin pools running out of space.
WARNING: Set activation/thin_pool_autoextend_threshold below 100 to trigger automatic extension of thin pools before they get full.
Logical volume "thin-vol-2" created.
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 0.00
thin-vol-2 200.00m 0.00
thinpool 200.00m 0.00 2

The result is that we have 2 volumes of 200 MB each, while actually having only 200 MB of backing storage. For now, each of the volumes is empty; as we use them, the space will be consumed:

root@s1:/tmp/mnt# mkfs.ext4 /dev/test-vg/thin-vol-1
(...)
Writing superblocks and filesystem accounting information: done

root@s1:/tmp/mnt# mount /dev/test-vg/thin-vol-1 /tmp/mnt/disk0/
root@s1:/tmp/mnt# dd if=/dev/random of=/tmp/mnt/disk0/randfile bs=1K count=1024
dd: warning: partial read (94 bytes); suggest iflag=fullblock
0+1024 records in
0+1024 records out
48623 bytes (49 kB, 47 KiB) copied, 0.0701102 s, 694 kB/s
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 5.56
thin-vol-2 200.00m 0.00
thinpool 200.00m 5.56 2
root@s1:/tmp/mnt# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-thin--vol--1 190M 1.6M 175M 1% /tmp/mnt/disk0

In this example, the filesystem creation plus the 49 kB file account for 5.56% of the volume. If we repeat the process for the other volume:

root@s1:/tmp/mnt# mkfs.ext4 /dev/test-vg/thin-vol-2
(...)
Writing superblocks and filesystem accounting information: done

root@s1:/tmp/mnt# mkdir -p /tmp/mnt/disk1
root@s1:/tmp/mnt# mount /dev/test-vg/thin-vol-2 /tmp/mnt/disk1/
root@s1:/tmp/mnt# dd if=/dev/random of=/tmp/mnt/disk1/randfile bs=1K count=1024
dd: warning: partial read (86 bytes); suggest iflag=fullblock
0+1024 records in
0+1024 records out
38821 bytes (39 kB, 38 KiB) copied, 0.0473561 s, 820 kB/s
root@s1:/tmp/mnt# lvs -o name,lv_size,data_percent,thin_count
LV LSize Data% #Thins
thin-vol-1 200.00m 5.56
thin-vol-2 200.00m 5.56
thinpool 200.00m 11.12 2
root@s1:/tmp/mnt# df -h
Filesystem Size Used Avail Use% Mounted on
(...)
/dev/mapper/test--vg-thin--vol--1 190M 1.6M 175M 1% /tmp/mnt/disk0
/dev/mapper/test--vg-thin--vol--2 190M 1.6M 175M 1% /tmp/mnt/disk1

We can see that the space is being consumed, but each of the volumes still appears as 200 MB.
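
To avoid silently running out of real space, we can keep an eye on the pool usage and (optionally) let LVM grow the pool automatically. A sketch, using the reporting command from above and the autoextend settings that the earlier warning referred to (the values are just an example):

lvs -o name,lv_size,data_percent,metadata_percent test-vg

# in the activation section of /etc/lvm/lvm.conf:
#   thin_pool_autoextend_threshold = 80   # extend the pool when it is 80% full
#   thin_pool_autoextend_percent = 20     # grow it by 20% each time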

Using RAID with LVM

Apart from using LVM as RAID-0 (i.e. striping an LV across multiple physical devices), it is possible to create other types of RAID using LVM.

Some of the most popular RAID types are RAID-1, RAID-10, and RAID-5. In the context of LVM, they intuitively mean:

  • RAID-1: mirror an LV across multiple PVs.
  • RAID-10: mirror an LV across multiple PVs, while also striping parts of the volume across different PVs.
  • RAID-5: distribute the LV across multiple PVs, using some parity data to be able to continue working if a PV fails.

You can check more specific information on RAIDs in this link.

You can use other RAID utilities (such as on-board RAIDs or software like mdadm), but by using LVM you will also be able to take advantage of the other LVM features.

Mirroring an LV with LVM

The first use case is to get a volume that is mirrored across multiple PVs. We’ll go back to the 2-PV set-up:

root@s1:/tmp/mnt# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 252.00m
root@s1:/tmp/mnt# vgs
VG #PV #LV #SN Attr VSize VFree
test-vg 2 0 0 wz--n- 504.00m 504.00m

And now we can create an LV in RAID1, setting the number of extra copies that we want for the volume (using the -m flag). In this case, we’ll create the LV lv-mirror (-n lv-mirror), mirrored once (-m 1), with a size of 100 MB (-L 100M).

root@s1:/tmp/mnt# lvcreate --type raid1 -m 1 -L 100M -n lv-mirror test-vg
Logical volume "lv-mirror" created.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror 100.00 lv-mirror_rimage_0(0),lv-mirror_rimage_1(0)
[lv-mirror_rimage_0] /dev/loop1(1)
[lv-mirror_rimage_1] /dev/loop0(1)
[lv-mirror_rmeta_0] /dev/loop1(0)
[lv-mirror_rmeta_1] /dev/loop0(0)

As you can see, the LV lv-mirror is built from the lv-mirror_rimage_0, lv-mirror_rmeta_0, lv-mirror_rimage_1 and lv-mirror_rmeta_1 “devices”, and the output shows in which PV each part is located.

In the case of RAID1, you can convert the LV to and from a linear volume. This feature is not (yet) implemented for other RAID types such as RAID10 or RAID5.

You can change the number of mirror copies of an LV (even converting the volume back to linear) by using the lvconvert command:

root@s1:/tmp/mnt# lvconvert -m 0 /dev/test-vg/lv-mirror
Are you sure you want to convert raid1 LV test-vg/lv-mirror to type linear losing all resilience? [y/n]: y
Logical volume test-vg/lv-mirror successfully converted.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror /dev/loop1(1)
root@s1:/tmp/mnt# pvs
PV VG Fmt Attr PSize PFree
/dev/loop0 test-vg lvm2 a-- 252.00m 252.00m
/dev/loop1 test-vg lvm2 a-- 252.00m 152.00m

In this case, we have converted the volume to linear (i.e. zero copies).

But you can also get mirror capabilities for a linear volume:

root@s1:/tmp/mnt# lvconvert -m 1 /dev/test-vg/lv-mirror
Are you sure you want to convert linear LV test-vg/lv-mirror to raid1 with 2 images enhancing resilience? [y/n]: y
Logical volume test-vg/lv-mirror successfully converted.
root@s1:/tmp/mnt# lvs -a -o name,copy_percent,devices
LV Cpy%Sync Devices
lv-mirror 100.00 lv-mirror_rimage_0(0),lv-mirror_rimage_1(0)
[lv-mirror_rimage_0] /dev/loop1(1)
[lv-mirror_rimage_1] /dev/loop0(1)
[lv-mirror_rmeta_0] /dev/loop1(0)
[lv-mirror_rmeta_1] /dev/loop0(0)

More on RAIDs

Using the lvcreate command it is also possible to create other types of RAID (e.g. RAID10, RAID5, RAID6, etc.). The only requirement is to have enough PVs for the type of RAID. But you must also keep in mind that it is not possible to convert these other RAID types to or from a linear LV.
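
For example, a RAID-5 LV striped across two data devices plus parity needs at least three PVs in the VG. A hedged sketch (our test lab would need an extra loop device for this to work):

lvcreate --type raid5 -i 2 -L 100M -n lv-raid5 test-vg   # -i sets the number of data stripes; one more PV holds the parity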

You can find much more information on LVM and RAIDs in this link.

More on LVM

There are a lot of features and tweaks to adjust in LVM, but this post shows the basics (and a bit more) needed to deal with it. You are advised to check the file /etc/lvm/lvm.conf and the LVM man pages.

How to deal with the Union File Systems that use Docker (OverlayFS and AUFS)

Containers are a modern application delivery mechanism (very interesting for software reproducibility). As I commented in my previous post, Docker is undoubtedly the winner of the hype. Most container implementations (e.g. Docker) are supported by the Linux kernel, using namespaces, cgroups, other technologies… and a chroot to the filesystem of the system being virtualized.

Using lightweight virtualization increases the density of the virtualized units, but many containers may share the same base system (e.g. the plain OS with some utilities installed) and only modify a few files (e.g. installing one application or updating the configuration files). That is why Docker and others use Union File Systems to implement the filesystems of the virtualized units.

The recent publication of the NIST “Application Container Security Guide” suggests that “An image should only include the executables and libraries required by the app itself; all other OS functionality is provided by the OS kernel within the underlying host OS. Images often use techniques like layering and copy-on-write (in which shared master images are read only and changes are recorded to separate files) to minimize their size on disk and improve operational efficiency“. This strengthens the case for using Union File Systems for containers, as Docker has done for as long as I can remember (AUFS, OverlayFS, Overlay2, etc.).

The conclusion is that Union File Systems are actually used in Docker, and I need to understand how to deal with them if anything fails. So this time I learned…

How to deal with the Union File Systems that use Docker (AUFS, OverlayFS and Overlay2)

AUFS

First of all, it is important to know how AUFS works. It is intuitively simple, and I think it is well explained in the Docker documentation. The next image is from the Docker documentation (just in case the URL changes):

[Image: aufs_layers]

The idea is to have a set of layers, each consisting of a different directory tree, which are combined to show a single tree that is the result of the ordered combination of all of them. The order is important, because if one file is present in more than one directory tree, you will only see the version in the “upper” layer.

There are several readonly layers and a working layer that gathers the modifications to the resulting filesystem: if some files are modified (or added), the new version will appear in the working layer; if a file is deleted, some metadata is added to the working layer to instruct the kernel to hide the file in the resulting filesystem (we’ll see some practical examples).

Working with AUFS

We will prepare a test example to see how AUFS works and how easy it is to understand and work with.

We’ll have 2 base folders (named layer1 and layer2), a working folder (named upperlayer), and a folder named mountedfs that will hold the combined filesystem. The next commands create the starting scenario:

root:root# cd /tmp
root:tmp# mkdir aufs-test
root:tmp# cd aufs-test/
root:aufs-test# mkdir layer1 layer2 upperlayer mountedfs
root:aufs-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:aufs-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:aufs-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:aufs-test# echo "content for file3.txt in layer2" > layer2/file3.txt

Both layer1 and layer2 have a file with the same name (file1.txt) with different content, and there are different files in each layer. The result is shown in the next figure:

[Image: aufs-r1]

Now we’ll mount the filesystem using the basic syntax:

root:aufs-test# mount -t aufs -o br:upperlayer:layer1:layer2 none mountedfs

The result is that folder mountedfs contains the union of the files that we have created:

[Image: aufs-r2]

The whole syntax and all the options are explained in the aufs manpage (i.e. man aufs), but we’ll be using just the basic options.

The key for us is the br option, in which we set the branches (i.e. layers) that will be unioned in the resulting filesystem. They take precedence from left to right. That means that if one file exists in two layers, the version shown in the AUFS filesystem will be the one from the leftmost layer.

In the next figure we can see the contents of the files in the mounted AUFS folder:

[Image: aufs-t3]

In our case, file1.txt contains “content for file1.txt in layer1”, as expected because of the order of the layers.

Now if we create a new file (file4.txt) with the content “new content for file4.txt”, it will be created in the folder upperlayer:

[Image: aufs-r3]

If we delete the file “file1.txt”, it will be kept in each of the layers (i.e. folders layer1 and layer2). But it will be marked as deleted in folder upperlayer by including some control files (although these files will not be shown in the resulting mounted filesystem).

[Image: aufs-r4]

The key for the AUFS driver is the set of files named .wh.*. In this case, we can see that the deletion is reflected in the upperlayer folder by creating the file .wh.file1.txt. That file instructs AUFS to hide the file in the resulting mount point. If we create the file again, it will reappear and the control file for the deletion will be removed.

[Image: aufs-r5]

Of course, the content of file1.txt in layer1 and layer2 folders is kept.
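
From the host, we can see the whiteout directly in the upper layer (a quick sketch, using the file names of this example):

ls -a upperlayer/      # should list .wh.file1.txt next to file4.txt
cat layer1/file1.txt   # the original content is still there in the lower layer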

Docker and AUFS

Docker makes use of AUFS in Ubuntu (although it is being replaced by overlay2). We’ll explore a bit by running a container and looking for its filesystem…

[Image: aufs-d1]

We can see that our container ID is d5afc60dbfd7. If we check the mounts by simply typing the command “mount”, we’ll see that we have a new AUFS mount point:

[Image: aufs-d2]

Well… we are smart and we know that they are related… but how? We need to check the folder /var/lib/docker/image/aufs/layerdb/mounts/, where we will find a folder named after the ID of our container (d5afc60dbfd7…), and several files in it:

[Image: aufs-d3]

The mount-id file contains an ID that corresponds to a folder in /var/lib/docker/aufs/mnt/, which holds the unioned filesystem that is the root filesystem of container d5afc60dbfd7. That folder corresponds to the mount point exposed when we inspected the mount points before.

[Image: aufs-d4]
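
The same exploration can be scripted. A sketch using the paths described above (it assumes the aufs storage driver and the container ID of this example):

docker ps --no-trunc                                                               # full container ID
MOUNT_ID=$(cat /var/lib/docker/image/aufs/layerdb/mounts/d5afc60dbfd7*/mount-id)
ls /var/lib/docker/aufs/mnt/$MOUNT_ID                                              # the unioned root filesystem of the container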

In the folder /var/lib/docker/aufs/layers we can inspect the information about the layers. In particular, we can see the content of the file that corresponds to the ID of our mount point:

[Image: aufs-d5]

That content corresponds to the layers that have been used to create the mount point in /var/lib/docker/aufs/mnt/. The directory trees of these layers are stored in folders with the corresponding names inside /var/lib/docker/aufs/diff. In the case of our container, if we create a new file, we can see that it appears in the working layer.

[Image: aufs-d6]

OverlayFS and Overlay2

While AUFS is only supported in some distributions (Debian, Gentoo, etc.), OverlayFS is included in the mainline Linux kernel.

The schema of OverlayFS is shown in the next image (obtained from Docker docs):

[Image: overlay_constructs]

The underlying idea is the same as in AUFS, and the concepts are almost the same: layers that are combined to build a unioned filesystem. The lowerdir corresponds to the readonly layers of AUFS, while the upperdir corresponds to the read/write layer.

OverlayFS needs an extra folder named workdir, which is used to support atomic operations on the filesystem (the role of the .wh.* hidden files of AUFS is played by whiteout entries stored in the upperdir).

The main difference between OverlayFS and Overlay2 (as Docker storage drivers) is that OverlayFS only supported merging a single readonly layer with a single read/write layer (although overlaying could be nested by overlaying already overlayed layers), while Overlay2 supports up to 128 lower layers.

Working with OverlayFS

We will prepare an equivalent test example to see how OverlayFS works and how easy it is to understand and work with.

root:root# cd /tmp
root:tmp# mkdir overlay-test
root:tmp# cd overlay-test/
root:overlay-test# mkdir layer1 layer2 upperlayer workdir mountedfs
root:overlay-test# echo "content for file1.txt in layer1" > layer1/file1.txt
root:overlay-test# echo "content for file1.txt in layer2" > layer2/file1.txt
root:overlay-test# echo "content for file2.txt in layer1" > layer1/file2.txt
root:overlay-test# echo "content for file3.txt in layer2" > layer2/file3.txt

Compared with the AUFS example, we had to include an extra folder (workdir), which is needed for OverlayFS to work.

[Image: overlay-1]

Now we’ll mount the filesystem using the basic syntax:

root:overlay-test# mount -t overlay -o lowerdir=layer1:layer2,upperdir=upperlayer,workdir=workdir overlay mountedfs

The result is that folder mountedfs contains the union of the files that we have created:

[Image: overlay-2]

As expected, we get the union of the 2 existing layers, and the contents of the files are the expected ones. The lowerdir folders are interpreted from left to right for precedence, so if one file exists in different lowerdirs, the unioned filesystem will show the version from the leftmost lowerdir.

Now if we create a new file, the new contents will be created in the upperdir folder, while the contents in the other folders will be kept.

[Image: overlay-3]
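
Again, we can check this from the host (a small sketch, using the file names of this example). Note that deletions behave slightly differently from AUFS: OverlayFS records a deleted file as a character-device “whiteout” in the upperdir:

echo "new content for file4.txt" > mountedfs/file4.txt
ls upperlayer/        # file4.txt appears here; the lower layers are untouched
rm mountedfs/file1.txt
ls -l upperlayer/     # file1.txt should now show up as a character device (the whiteout)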

What can I do with this?

Apart from understanding how to better debug products such as Docker, you will be able to start containers the Docker way, using layered filesystems, with the tools shown in a previous post.