Part One: Prep Work #
Background #
(Ashe) So. We recently murdered a server’s terminal via do_distro_upgrade.
(Tammy) Was it really that bad?
(Ashe) Yes.
% man 7z
WARNING: terminal is not fully functional
- (press RETURN)%
It was in fact that bad. So we figured, well, we can spend a few hours, days, whatever fixing this…
(Tammy) Or we could just build a new server!
(Ashe) Right.
So, after asking some friends about their opinions, we settled on Alpine Linux. And why not also migrate all of our pm2 workloads to containers while we’re at it? We’ve been meaning to learn more about containers for a while now.
So off we go!
Prep Work #
We need a few things before we actually set up rootless containers. We’ll be following along with the Official Rootless Containers Tutorial, making adjustments as necessary.
Login Information #
Most Rootless Container implementations use $XDG_RUNTIME_DIR to find the user’s ID and where their runtime lives (usually some subdir of /run/user/).
Systemd-based Linux distros will handle this automatically, but Alpine uses
OpenRC, which does not do this automatically.
While Alpine doesn’t provide a tutorial for Rootless Containers, we can adapt some of the prep work done for Wayland to get OpenRC to set $XDG_RUNTIME_DIR for us.
We just create /etc/profile.d/xdg_runtime_dir.sh like so:
if test -z "${XDG_RUNTIME_DIR}"; then
  export XDG_RUNTIME_DIR=/tmp/$(id -u)-runtime-dir
  if ! test -d "${XDG_RUNTIME_DIR}"; then
    mkdir "${XDG_RUNTIME_DIR}"
    chmod 0700 "${XDG_RUNTIME_DIR}"
  fi
fi
And, log out and then back in…
~ ❯ env
[...]
XDG_RUNTIME_DIR=/tmp/1000-runtime-dir
[...]
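If you’d rather verify the result than eyeball the output of env, a quick check along these lines can help. This is only a sketch: check_runtime_dir is a name we made up for illustration, and the checks (exists, is a directory, owned by us, mode 0700) simply mirror what the profile script above sets up rather than anything a particular container tool mandates.

```python
import os
import stat


def check_runtime_dir(path, uid):
    """Return a list of problems with a candidate XDG_RUNTIME_DIR (empty if fine)."""
    if not path:
        return ["XDG_RUNTIME_DIR is not set"]
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path} does not exist"]

    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path} is not a directory")
    if st.st_uid != uid:
        problems.append(f"{path} is owned by uid {st.st_uid}, expected {uid}")
    if stat.S_IMODE(st.st_mode) != 0o700:
        problems.append(f"{path} has mode {stat.S_IMODE(st.st_mode):04o}, expected 0700")
    return problems


if __name__ == "__main__":
    issues = check_runtime_dir(os.environ.get("XDG_RUNTIME_DIR"), os.getuid())
    print("\n".join(issues) if issues else "XDG_RUNTIME_DIR looks fine")
```

Run as the user who will own the containers, it either prints the problems it found or reports that the directory looks fine.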
With that done, we can move on to our next steps.
Sysctl #
There’s some sysctl config required for older distros, but this isn’t required for Alpine.
User Namespace Configuration #
Rootless Containers use User Namespaces, subUIDs, and subGIDs, so we’ll need to have those working.
The apk package shadow-subids provides that functionality for us.
~ ❯ apk info shadow-subids
shadow-subids-4.10-r3 description:
Utilities for using subordinate UIDs and GIDs
shadow-subids-4.10-r3 webpage:
https://github.com/shadow-maint/shadow
shadow-subids-4.10-r3 installed size:
140 KiB
Sub-ID Counts #
Rootless Containers generally expect /etc/subuid and /etc/subgid to contain at least 65,536 sub-IDs for each user.
shadow-subids does create these files for us, but leaves them empty by default, so let’s go ahead and populate them.
The page on subIDs provides a handy Python script to do that for us, which we’ll edit slightly so it’s not writing directly to system files:
# Write one 65,536-ID range per UID into local files rather than /etc.
for name in ("subuid", "subgid"):
    with open(name, "w") as f:
        for uid in range(1000, 65536):
            f.write("%d:%d:65536\n" % (uid, uid * 65536))
This is probably overkill for our use-case, but that’s also fine.
(Doll) So this one just runs script and copies to /etc/?
(Ashe) Yes Doll, that’s right.
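Before copying the generated files into /etc, it doesn’t hurt to sanity-check them. The sketch below is ours, not part of the original script: parse_subid and find_problems are hypothetical helper names, and the only rules checked are the ones mentioned above (at least 65,536 IDs per entry) plus the assumption that ranges shouldn’t overlap.

```python
def parse_subid(text):
    """Parse name:start:count lines into (name, start, count) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, start, count = line.strip().split(":")
        entries.append((name, int(start), int(count)))
    return entries


def find_problems(entries, minimum=65536):
    """Report ranges that are too small or that overlap a neighbouring range."""
    problems = []
    for name, start, count in entries:
        if count < minimum:
            problems.append(f"{name}: only {count} sub-IDs")
    ordered = sorted(entries, key=lambda e: e[1])
    for (n1, s1, c1), (n2, s2, _) in zip(ordered, ordered[1:]):
        if s1 + c1 > s2:
            problems.append(f"{n1} overlaps {n2}")
    return problems


# The first two lines the generator script would produce.
sample = "1000:65536000:65536\n1001:65601536:65536\n"
print(find_problems(parse_subid(sample)))  # prints []
```

To check the real output, read the generated subuid and subgid files and pass their contents to parse_subid instead of the sample string.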
With that done, we can move on to the last prep step.
CGroups V2 #
To limit resources that a container can use, we need to enable CGroups V2.
In OpenRC, this can be done by changing some options in /etc/rc.conf.
To enable CGroups in general, we need to set rc_controller_cgroups to YES:
# This switch controls whether or not cgroups version 1 controllers are
# individually mounted under
# /sys/fs/cgroup in hybrid or legacy mode.
rc_controller_cgroups="YES"
From here, we can enable CGroups V2 by setting rc_cgroup_mode to unified:
# This sets the mode used to mount cgroups.
# "hybrid" mounts cgroups version 2 on /sys/fs/cgroup/unified and
# cgroups version 1 on /sys/fs/cgroup.
# "legacy" mounts cgroups version 1 on /sys/fs/cgroup
# "unified" mounts cgroups version 2 on /sys/fs/cgroup
rc_cgroup_mode="unified"
(Doll) Doll confused.
(Ashe) So was I, for a bit. Despite what rc.conf says, cgroups V2 does not seem to be enabled on Alpine unless rc_cgroup_mode is set to unified. The Alpine Wiki seems to agree here, but isn’t super clear. We’ll find out if this is sufficient.
Next step is configuring the controllers we want to use:
# This is a list of controllers which should be enabled for cgroups version 2
# when hybrid mode is being used.
# Controllers listed here will not be available for cgroups version 1.
rc_cgroup_controllers="cpuset cpu io memory hugetlb pids"
Finally, we can add cgroups to a runlevel so that it’s started automatically at boot:
rc-update add cgroups
From here, we can reboot, and continue on. If you don’t want to reboot, you can start the cgroup service manually:
rc-service cgroups start
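Since we weren’t sure the rc.conf changes would be enough, here’s one way to check once the service is up. A cgroups V2 hierarchy exposes a cgroup.controllers file at its root, so the sketch below just looks for that. The function name cgroup_v2_controllers is our own, and the default path assumes unified mode, where V2 is mounted directly on /sys/fs/cgroup.

```python
from pathlib import Path


def cgroup_v2_controllers(root="/sys/fs/cgroup"):
    """Return the controllers listed at the cgroup root, or None if it isn't a v2 mount."""
    controllers_file = Path(root) / "cgroup.controllers"
    if not controllers_file.is_file():
        return None
    return controllers_file.read_text().split()


if __name__ == "__main__":
    found = cgroup_v2_controllers()
    if found is None:
        print("no cgroup.controllers at the root: not a unified (v2) mount")
    else:
        print("cgroups v2 controllers:", " ".join(found))
```

If everything worked, the printed list should include the controllers we set in rc_cgroup_controllers.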
Creating a group for our container users #
We’ll quickly create a group for all users who’ll be using rootless containers here. In Alpine, this is as simple as doas addgroup ctr. We’ll make use of this later.
Installing containerd and friends #
First up we’ll need to install containerd (to host our containers) and slirp4netns (to provide user-mode networking inside the container’s network namespace, with lower overhead than VPNKit), so we just:
doas apk add containerd
doas apk add slirp4netns
Next, we need to install nerdctl and rootlesskit. Both of these are currently only found inside the testing repo for Alpine. We can pull them in without subscribing to the entire testing repo like so:
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ nerdctl
doas apk add -X https://dl-cdn.alpinelinux.org/alpine/edge/testing/ rootlesskit
Configuring the Rootless containerd service #
We’ll be using nerdctl as our containerd controller of choice. It comes with a rootless containerd.service, but since Alpine doesn’t use systemd, we’ll have to adapt this into an rc service.
We spent some time trying to adapt the install script nerdctl provides to our purposes, but it’s a bit excessive for what we need, so we’ll just do it the “hard way”.
(Tammy) Wait, this isn’t the “hard way”, is it?
(Ashe) Nope. Adapting a 500 line script would be hard and annoying. We’re better served by just doing it manually, and providing instructions for anyone following along. So in that vein:
Getting containerd running in rootlesskit #
First, let’s get containerd running at the CLI, and then we can make it into an OpenRC Script.
We’ll need a config.toml, but it can be pretty minimal:
version = 2
root = "/home/tammy/.local/share/containerd"
state = "/tmp/1000-runtime-dir/containerd"
[grpc]
address = "/tmp/1000-runtime-dir/containerd/containerd.sock"
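The paths above are hard-coded for our user (tammy, UID 1000). If you’re following along as someone else, a small generator saves some typos. This is a sketch under our own assumptions: render_config is a made-up helper, and the fallback runtime directory copies the /tmp/UID-runtime-dir convention from our profile script.

```python
import os


def render_config(home, runtime_dir):
    """Build the minimal config.toml shown above for a given home and runtime dir."""
    return "\n".join([
        "version = 2",
        f'root = "{home}/.local/share/containerd"',
        f'state = "{runtime_dir}/containerd"',
        "[grpc]",
        f'address = "{runtime_dir}/containerd/containerd.sock"',
        "",
    ])


if __name__ == "__main__":
    home = os.path.expanduser("~")
    runtime_dir = os.environ.get("XDG_RUNTIME_DIR", f"/tmp/{os.getuid()}-runtime-dir")
    print(render_config(home, runtime_dir), end="")
```

Redirect the output into config.toml and it should match the file above, with your own paths substituted.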
First try:
~ ❯ rootlesskit --net=slirp4netns --copy-up=/etc --copy-up=/run \
--state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
sh -c "rm -f /run/containerd; exec containerd -c config.toml"
BusyBox v1.35.0 (2022-08-01 15:14:44 UTC) multi-call binary.
Usage: ip [OPTIONS] address|route|link|tunnel|neigh|rule [ARGS]
OPTIONS := -f[amily] inet|inet6|link | -o[neline]
ip addr add|del IFADDR dev IFACE | show|flush [dev IFACE] [to PREFIX]
ip route list|flush|add|del|change|append|replace|test ROUTE
ip link set IFACE [up|down] [arp on|off] [multicast on|off]
[promisc on|off] [mtu NUM] [name NAME] [qlen NUM] [address MAC]
[master IFACE | nomaster] [netns PID]
ip tunnel add|change|del|show [NAME]
[mode ipip|gre|sit] [remote ADDR] [local ADDR] [ttl TTL]
ip neigh show|flush [to PREFIX] [dev DEV] [nud STATE]
ip rule [list] | add|del SELECTOR ACTION
[rootlesskit:parent] error: failed to setup network &{logWriter:0xc00014aa00 binary:slirp4netns mtu:65520 ipnet:<nil> disableHostLoopback:true apiSocketPath: enableSandbox:false enableSeccomp:false enableIPv6:false ifname:tap0 infoMu:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} info:<nil>}: setting up tap tap0: executing [[nsenter -t 28611 -n -m -U --preserve-credentials ip tuntap add name tap0 mode tap] [nsenter -t 28611 -n -m -U --preserve-credentials ip link set tap0 up]]: exit status 1
[rootlesskit:child ] error: parsing message from fd 3: EOF
(Doll) That looks like it broke, Miss.
(Ashe) sigh, yeah, that’s broken alright. That output looks like ip didn’t like the command supplied to it, so let’s find out what that was.
Some troubleshooting later, it looks like this is to do with BusyBox’s implementation of the ip commands. We’ve raised an issue, and we’ll see how that goes. In the meantime, we’ll just have to use native networking. This means we can’t apply firewall rules per-container, which is moderately annoying, but won’t actually hinder deployment. It just makes securing the deployment more annoying.
So let’s try without --net=slirp4netns (omitting anything that’s INFO):
~ ❯ rootlesskit --copy-up=/etc --copy-up=/run \
--state-dir=/tmp/1000-runtime-dir/rootlesskit-containerd --disable-host-loopback \
sh -c "rm -f /run/containerd; exec containerd -c config.toml"
WARN[2022-11-03T11:32:53.207241941+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T11:32:53.227691744+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
A few things of note here:
WARN[2022-11-03T11:32:53.233006449+11:00] failed to load plugin io.containerd.internal.v1.opt error="mkdir /opt/containerd: permission denied"
[...]
ERRO[2022-11-03T11:32:53.235151641+11:00] failed to load cni during init, please check CRI plugin status before setting up network for pods error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
The warning tells us that it tried to create /opt/containerd, but was unable to. This is easy enough to fix:
~ ❯ doas mkdir /opt/containerd
~ ❯ doas chmod 2770 /opt/containerd
~ ❯ doas chown root:ctr /opt/containerd #Replace the username and group here as necessary
The error is more interesting. CRI here stands for Container Runtime Interface, and it seems to be used for Kubernetes. Since we won’t be using Kubernetes here, we can just disable it by adding
disabled_plugins = ["io.containerd.grpc.v1.cri"]
to our config.toml.
(Tammy) If you are interested in Kubernetes, make sure to check out our Home Server Build-Out series. We’re planning on setting up an entire cloud environment there.
Let’s try that again (cutting out any info stuff):
[...]
WARN[2022-11-03T16:18:35.425339343+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
WARN[2022-11-03T16:18:35.427868986+11:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
ERRO[2022-11-03T16:18:35.430061527+11:00] failed to initialize a tracing processor "otlp" error="no OpenTelemetry endpoint: skip plugin"
containerd successfully booted in 0.024502s
[...]
That’s cleaned up those issues, but we still have two warnings about devmapper, and containerd couldn’t find an OpenTelemetry endpoint.
We’ll be skipping OpenTelemetry for now, but that sounds like a fun topic for a second blog post alongside setting up Grafana.
(Doll) Doll will remember! Will remind Miss’ to make a post about this!
Setting up devmapper #
devmapper is one of a few snapshotters that containerd can use. It’s not the most performant (that honour goes to overlayfs), but it is one of the most robust, and least likely to break. This is more important to us than pure performance.
If you’re following along at home, you’ll have to decide which storage driver is best for your use-case.
Following the setup guide, we’ll need dmsetup installed. Under Alpine, this is provided by the device-mapper package, which we already have installed.
We’ve also got a 100GB block device attached to this VPS, so let’s get that provisioned too.
Mounting and Formatting our block device #
We can use fdisk to partition our block device. fdisk -l lists all devices and partitions.
~ ❯ doas fdisk -l
Disk /dev/vda: 25 GB, 26843545600 bytes, 52428800 sectors
52012 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Device Boot StartCHS EndCHS StartLBA EndLBA Sectors Size Id Type
/dev/vda1 * 2,0,33 205,3,19 2048 206847 204800 100M 83 Linux
/dev/vda2 205,3,20 1023,15,63 206848 52428799 52221952 24.9G 8e Linux LVM
Disk /dev/vdb: 100 GB, 107374182400 bytes, 209715200 sectors
208050 cylinders, 16 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/vdb doesn't contain a valid partition table
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/dm-1: 23 GB, 24670896128 bytes, 48185344 sectors
2999 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
Disk /dev/dm-1 doesn't contain a valid partition table
We know that our VPS has a 25GB disk, so /dev/vdb is our 100GB block device. We can partition it with doas fdisk /dev/vdb. Let’s see how we do that:
~ ❯ doas fdisk /dev/vdb
Device contains neither a valid DOS partition table, nor Sun, SGI, OSF or GPT disklabel
Building a new DOS disklabel. Changes will remain in memory only,
until you decide to write them. After that the previous content
won't be recoverable.
The number of cylinders for this disk is set to 208050.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): n
Partition type
p primary partition (1-4)
e extended
p
Partition number (1-4): 1
First sector (63-209715199, default 63):
Using default value 63
Last sector or +size{,K,M,G,T} (63-209715199, default 209715199):
Using default value 209715199
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table
Running fdisk -l again:
[...]
Device Boot StartCHS EndCHS StartLBA EndLBA Sectors Size Id Type
/dev/vdb1 0,1,1 1023,15,63 63 209715199 209715137 99.9G 83 Linux
Disk /dev/dm-0: 1968 MB, 2063597568 bytes, 4030464 sectors
250 cylinders, 255 heads, 63 sectors/track
Units: sectors of 1 * 512 = 512 bytes
[...]
Looks like that worked.
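If you want to double-check fdisk’s numbers, the arithmetic is short. The values below are copied straight from the fdisk output above; the variable names are ours.

```python
SECTOR_BYTES = 512

disk_bytes = 107374182400              # /dev/vdb size reported by fdisk -l
start_lba, end_lba = 63, 209715199     # /dev/vdb1 StartLBA and EndLBA

sectors = end_lba - start_lba + 1
partition_bytes = sectors * SECTOR_BYTES
cylinder_bytes = 16 * 63 * SECTOR_BYTES  # heads * sectors/track * bytes/sector

print(sectors)                          # 209715137, matching the Sectors column
print(partition_bytes / 2**30)          # just under 100 GiB
print(cylinder_bytes // 1024)           # 504 (KiB per cylinder)
print(disk_bytes // cylinder_bytes)     # 208050, matching the cylinder count
```

The partition comes out a hair under 100 GiB because it starts at sector 63 rather than sector 0, which fdisk displays as 99.9G.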
Adding the formatted block device into LVM #
Let’s get this added into LVM. First, we need to create a physical volume with the pvcreate command:
~ ❯ doas pvcreate /dev/vdb1
Physical volume "/dev/vdb1" successfully created.
Let’s create a new Volume Group for our workload data. There are two reasons for this:
- This will make it easier to extend in the future; and
- Our block device is spinning rust, and we don’t necessarily want to mix SSDs with spinning rust.
With that in mind, we’ll leave the existing VG, vg0, as the volume group for programs and container images:
~ ❯ doas vgcreate data /dev/vdb1
Volume group "data" successfully created
~ ❯ doas vgdisplay data
--- Volume group ---
VG Name data
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 1
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 0
Open LV 0
Max PV 0
Cur PV 1
Act PV 1
VG Size <100.00 GiB
PE Size 4.00 MiB
Total PE 25599
Alloc PE / Size 0 / 0
Free PE / Size 25599 / <100.00 GiB
VG UUID 679FIe-aF9e-yBRy-bRH6-wRlY-KPgz-yUpXL9
(Doll) Is it working Miss? Doll wants to see websites in treasure chests go zoom!
(Ashe) Containers, dear Doll. And yes, yes it is. Only a few more steps and we’ll be ready to start bringing things online, don’t worry.
Speaking of, next we need to create our logical volumes. We’ll create two. One for our container scratch storage, and one for persistent storage. We’ll size scratch at 30GiB, and persistent at 70GiB. Let’s get that done:
~ ❯ doas lvcreate -n persist --size 70G data
Logical volume "persist" created.
~ ❯ doas lvcreate -n scratch --size 30G data
Volume group "data" has insufficient free space (7679 extents): 7680 required.
(Selene) Oh interesting. What happened there?
(Ashe) Our theoretically 100GiB device has one extent less than 100GiB, so we couldn’t divide it into exactly 30/70.
(Tammy) Wait is that why fdisk said the device was 99.9G?
(Ashe) Good catch. Yeah. 100GiB doesn’t divide evenly into 504KiB cylinders (16 heads × 63 sectors × 512 bytes), so we end up with one cylinder too few, and therefore…
(Tammy) One extent too few! Sneaky!
(Ashe) Yup. Actually, now that I look at it again, I forgot to make space for the metadata, so this works out nicely.
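The extent arithmetic behind that lvcreate failure can be reproduced from the vgdisplay output. A small sketch, with the numbers taken from above and a helper name (extents_for_gib) of our own invention:

```python
PE_MIB = 4          # "PE Size 4.00 MiB" from vgdisplay
TOTAL_PE = 25599    # "Total PE" from vgdisplay


def extents_for_gib(gib):
    """Number of physical extents needed for a size given in GiB."""
    return gib * 1024 // PE_MIB


persist = extents_for_gib(70)
scratch = extents_for_gib(30)
free_after_persist = TOTAL_PE - persist

print(persist, scratch, free_after_persist)  # 17920 7680 7679
```

So a 30GiB volume needs 7680 extents, and after carving out persist we only have 7679 left, exactly as the error message says.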
Creating our nerdctl thin pool #
Docker and nerdctl can control a block device directly to use as a storage driver via device-mapper, so we’ll be letting nerdctl do that for its mainline storage, and using our “persistent” pool for nerdctl volumes (which are persistent).
For this we’ll need device-mapper, lvm2-dmeventd, and thin-provisioning-tools, so we’ll apk add those in.
(Ashe) I’m going to skip showing the terminal output for installing packages from here on in to save space. I’m sure you’ve gotten the idea by now.
First up is creating a thin pool, which we’ll do as follows:
~ ❯ doas lvcreate --wipesignatures y -n scratch data -l 95%FREE
Logical volume "scratch" created.
~ ❯ doas lvcreate --wipesignatures y -n scratchmeta data -l 10%FREE
Logical volume "scratchmeta" created.
~ ❯ doas lvconvert -y --zero n -c 512K --thinpool data/scratch --poolmetadata data/scratchmeta
Thin pool volume with chunk size 512.00 KiB can address at most 126.50 TiB of data.
WARNING: Converting data/scratch and data/scratchmeta to thin pool's data and metadata volumes with metadata wiping.
THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Converted data/scratch and data/scratchmeta to thin pool.
~ ❯
So what did we do here?
(Doll) Ooh! Ooh! Doll knows! Miss created one LV, umm, Logical Volume, taking up 95% of the free space, and one taking up 10% of the free space… remaining free space? So ummm, ummm, 152 MiB?
(Ashe) That’s right! What next?
(Doll) We umm. Combine the two into one? This one is confuse.
(Ashe) Okay, I’ll try to keep it simple. A normal (thick) pool allocates all of its data when we create it. So all the space is reserved ahead of time. You can write to whatever bit of it you want, whenever you want. Imagine something like a notebook you bought. A thin pool isn’t like that. It initialises a small area with zeroes, but otherwise leaves the rest of the device alone. Like you have a page, and you ask the store for another blank page every time you get close to filling up your page. So, what would happen if I wrote a 100M file that was all zeroes?
(Selene) Let’s see if I understand. Well, you’d write the file metadata, and allocate some space… Wait who’s keeping track of the size of the volume?
(Ashe) Precisely, Selene. You need a metadata volume that contains information about the assigned blocks in the thin pool, since it wasn’t allocated all at once. So we create a volume for that, and then combine the two into our final thin pool.
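Doll’s 152 MiB figure checks out, by the way. Here’s the working, starting from the 7679 free extents reported earlier. One assumption on our part: when a percentage doesn’t land on a whole extent, we round down, which is consistent with the sizes LVM reported to us.

```python
PE_MIB = 4                      # extent size from vgdisplay
free_pe = 7679                  # free extents reported by the failed lvcreate

# Assumption: a percentage that doesn't land on a whole extent is rounded down.
scratch_pe = free_pe * 95 // 100
remaining_pe = free_pe - scratch_pe
meta_pe = remaining_pe * 10 // 100

print(scratch_pe, scratch_pe * PE_MIB / 1024)   # 7295 extents, about 28.5 GiB
print(meta_pe, meta_pe * PE_MIB)                # 38 extents, 152 MiB
```

The key detail is that the second -l 10%FREE applies to what is left after scratch was created, not to the original free space.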
That done, we can configure autoextension by creating /etc/lvm/profile/data-scratch.profile:
activation {
  thin_pool_autoextend_threshold=80
  thin_pool_autoextend_percent=10
}
Apply said profile with doas lvchange --metadataprofile data-scratch data/scratch, and check if the thin pool is being monitored:
~ ❯ doas lvs -o+seg_monitor
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Monitor
persist data -wi-a----- 70.00g
scratch data twi---t--- <28.50g
lv_root vg0 -wi-ao---- <22.98g
lv_swap vg0 -wi-ao---- 1.92g
Looks good. Were the LV not monitored, we would see not monitored at the end of the scratch data line. Were that the case, we could fix that with doas lvchange --monitor y data/scratch.
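To get a feel for what the two profile settings mean, here’s a rough model of a single autoextend check: once usage goes above the threshold percentage, the pool grows by the given percentage of its current size. This is a simplification we wrote for illustration (autoextend is our own function, not an LVM API), and it ignores whether the volume group actually has free extents left to grow into.

```python
def autoextend(size_mib, used_mib, threshold=80, percent=10):
    """Return the pool size after one autoextend check (unchanged below threshold)."""
    if used_mib * 100 > size_mib * threshold:
        return size_mib + size_mib * percent // 100
    return size_mib


pool = 29180  # 7295 extents of 4 MiB, roughly our scratch pool
print(autoextend(pool, 20000))  # 29180: about 69% used, no change
print(autoextend(pool, 24000))  # 32098: about 82% used, grows by 10%
```

With our settings, nothing happens until the pool is more than 80% full, and each extension adds 10% of the pool’s size at that moment.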
Formatting the new Logical Volume #
Our final step is to format the LV we’ll be using for persistent volumes. We’ll be using plain old ext4 for this, as I neither need nor want to get fancy here.
~ ❯ doas mkfs.ext4 /dev/data/persist
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 18349056 4k blocks and 4587520 inodes
Filesystem UUID: c0a59a7b-1969-4476-9d2c-11af32628337
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424
Allocating group tables: done
Writing inode tables: done
Creating journal (131072 blocks): done
Writing superblocks and filesystem accounting information: done
Mounting our new logical drives and setting up automount #
Final step. Mounting the drive is relatively simple:
~ ❯ doas mkdir /data
~ ❯ doas chmod 2770 /data
~ ❯ doas mount /dev/data/persist /data
~ ❯ doas chown root:ctr /data -R
From here, we can configure /etc/fstab so it’s automatically mounted at boot.
To achieve that, we’ll add the following line to /etc/fstab:
/dev/data/persist /data ext4 rw,relatime 0 0
We don’t need to mount the scratch LV (Logical Volume) as containerd will be controlling that directly.
And we should be good to go.
Last thing to do is add a minimal devmapper config to our config.toml:
[...]
[plugins]
[plugins."io.containerd.snapshotter.v1.devmapper"]
root_path = "/opt/containerd/devmapper"
pool_name = "data-scratch"
base_image_size = "1024MB"
[...]
Let’s see what happens when we launch containerd again:
WARN[2022-11-07T00:33:26.218437232+11:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="dmsetup version
error: Library version: 1.02.170 (2020-03-24)
/dev/mapper/control: open failed: Permission denied
Failure to communicate with kernel device-mapper driver.
Check that device-mapper is available in the kernel.
Incompatible libdevmapper 1.02.170 (2020-03-24) and kernel driver (unknown version).
Command failed.
: exit status 1"
(Tammy) That doesn’t look great.
(Ashe) No. It does not. Hmm. Let’s investigate.
(Ashe) Ah. Found it. Looks like devmapper isn’t supported in rootless configs. Now we know.
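The dmsetup error above complains about /dev/mapper/control, so one quick way to see the problem for yourself is to ask whether your user can open that node at all. A minimal sketch, assuming the device path from the log; can_open_rw is a name we made up:

```python
import os


def can_open_rw(path):
    """True if the current user may open path for reading and writing."""
    return os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    control = "/dev/mapper/control"
    if can_open_rw(control):
        print(f"{control} is accessible; dmsetup should be able to talk to the kernel")
    else:
        print(f"{control} is not accessible to uid {os.getuid()}")
```

Run as our unprivileged container user, we’d expect this to report that the node is not accessible, which lines up with the Permission denied in the log.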
An Interlude, Some Tea, And a Break #
(Octavia) And on that bombshell, I think it’s about time we wrapped this up. Ashe is looking pretty grumpy. Looks like we’ll have to make this into a series.
(Tammy) Hopefully we’ll have better luck next time.
(Doll) This one hopes you’ll all join us next time!