HTTP 500 - Very internal server error

As far as downtimes go, most suck. Then there are some that are more like happy little opportunities to learn something new. When Gitlab in my homelab started returning 500s, I had no idea how blessed I was. I was at best annoyed.

Practically speaking, out of all the services I'm running at home, this is one of the few where downtime affects just me, as I'm the only user. So it is indeed a great opportunity to learn, where I can take my time investigating the problem. In the beginning I did not appreciate this gift for what it was. In fact, at first I just restarted the service, made sure it was working fine and went on with my day. But then the next day the 500s are back and now they have my attention.

What is going on with Gitlab?🔗

Let's check the logs first. Oddly enough, most of the services in the omnibus Docker deployment still run fine. Except there's this one service that is very obviously in a bad state:

Failed to create a scheduled pipeline. schedule_id: 31 message: 13:starting process [..redacted process details..]: open /dev/null: operation not permitted.

My first instinct is to ignore the /dev/null part of the message, because surely, out of all the things, there's no way a service is crashing on permission issues accessing that device! In fact I'm convinced that the individual parts of the log message are likely not related and that the permission error relates to something else. Anything but null suddenly becoming inaccessible. I'm still in the Anatoly Dyatlov mode, seeing 3.6 R/h on the dosimeter.

However ignorant I might have been, at least I have a very specific line to search for in the logs to see when Gitlab went down. Looking back in history, there's one specific CI job running at the time. I search back another day and sure enough, the same daily job broke Gitlab the last time as well. To confirm this I restart Gitlab again, run the job manually and B💣💣M - as soon as the job ends, the 500s start.

Now we have the trigger, but still no explanation of what's going on. To make this whole situation even more bizarre, the job I'm looking at is a regular Ansible run. And it reports zero changes. The job is running in a separate Docker container external to the Gitlab services, but it's still running on the same host.

The omnibus Gitlab container is essentially a small VM. It definitely does not follow the Docker best practice of service separation, but I like it for its convenience. However, that also means that I need to exec inside the container to check individual services, which is where I hit another issue:

# docker exec -ti gitlab sh
OCI runtime exec failed: exec failed: unable to start container process: open /dev/ptmx: operation not permitted: unknown

🤔 Well, that's something I haven't seen so far. The container is in a pretty bad state.

I decide to experiment some more with running jobs to see whether I can narrow down the trigger to something more specific. I try some other jobs that finish without problems. I try a different Ansible job, which also runs fine. I run the problematic Ansible job skipping the Gitlab host itself by adding the --limit parameter, and Gitlab does not break this time. 💡 On top of that, I run the exact same playbook manually from my desktop instead of CI and this too breaks the container. Gitlab itself was a red herring.
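
For the record, the run that keeps Gitlab alive is the limited one - something along these lines (the playbook and host names here are made-up placeholders):

# skip the Gitlab host itself - with this limit in place the container survives the run
ansible-playbook site.yml --limit 'all:!gitlab'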

What is going on with Ansible?🔗

Remember that Ansible reports zero changes at the end of the run. Yet somehow, it breaks the Gitlab container. This makes zero sense.

But fortunately, now I have a pretty good test for "is Gitlab broken?" - the failing docker exec. For good measure I add a check for /dev/null accessibility (at this stage I'm somewhat willing to admit there might be something seriously wrong with the container and perhaps the null device really is broken) and I get this:

while sleep 1
do 
    date
    docker exec -ti gitlab /bin/bash -c 'cat /dev/null' || echo BROKEN
done

This will tell me the breaking point down to a second, so with that running in a separate terminal, I start poking the playbook run again. The timing turns out to align with the end of the playbook run, so it's either the last Ansible role that breaks Gitlab, or Ansible itself somehow?

I try to shuffle the roles around a bit and it turns out it's one specific role. Ansible itself was a red herring.

What's going on with the role?🔗

There are only a couple of tasks in this role. The failure manifests almost exactly when the role run ends. So it's either the role itself or the last task. I can't really shuffle the tasks around like I could with the roles, because tasks are much more dependent on specific order. Fortunately the tasks are tagged into groups of logical steps and the last task already has a unique tag.

So as a test, I skip the task using the --skip-tags parameter when running ansible-playbook and... everything is fine as long as the last task does not run. The role itself was a red herring.
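
For reference, the skipping run looks roughly like this (the tag name is a made-up placeholder, the real one doesn't matter):

# skip the last task by its tag - with this the container survives
ansible-playbook site.yml --skip-tags service_state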

What's going on with the task?🔗

First, remember the task reports zero changes. Second, it does look like it should actually do nothing:

- name: Ensure service is started and enabled
  become: true
  systemd:
    daemon_reload: true
    name: "{{ service_name }}"
    state: started
    enabled: true

The service in question is already enabled and started, so there really should be no change there. To test the individual systemd module parameters, I run the module directly using ansible (not ansible-playbook), providing the parameters on the command line:

ansible -bm systemd -a 'daemon_reload=true name=service_name state=started enabled=true'  hostname

After a couple of attempts I figure out that everything is fine as long as I leave out daemon_reload: true. This parameter makes Ansible run systemctl daemon-reload before doing any other operations, to make sure systemd has read any changes. And when I run this command without Ansible, it also breaks the Gitlab container. The task itself was a red herring.
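
At this point the reproduction doesn't involve Ansible, CI or the playbook at all - just these two commands on the Docker host:

# reload systemd units...
systemctl daemon-reload
# ...and the previously healthy container immediately loses access to /dev/null
docker exec -ti gitlab /bin/bash -c 'cat /dev/null'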

What's going on with systemd?🔗

I can already hear the pitchforks and the anti-systemd crowd in the background. But let's first have a look at what's really going on here and how systemd is affecting running containers.

Looking at the output of systemctl list-units we can see that Docker containers show up as systemd scopes:

# systemctl list-units docker-* | awk '/docker/{print $1}'
docker-358b65de6fd638fab9d5de6a1f604681fe6a08dd66e0499d24043a05c0f4237c.scope
docker-4e31f1238903afdfb098ef1191bf924b968912e6d1765e851ea3b7a809cfddbc.scope
docker-8bfdd758db3a438886ec8921e94e0c9f74bdea123b0de8f3ff4567dc03fe0961.scope

So it's not just systemd messing with random containers - there actually is a scope configuration for this container. (and all the others as well) Why is that? Is Docker using systemd to run its containers? Well, not quite:

# docker info | grep Runtime:
 Default Runtime: runc

Docker is actually using runc as a runtime to do the heavy lifting. Abstractions upon abstractions, such is the way of things. Does runc use systemd then? Indeed it does - at least in some configurations - like mine, which is a pretty default setup.
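
If you're not sure which camp your own machine falls into, the cgroup driver Docker reports is the tell - systemd here means runc delegates the cgroup (and scope) setup to systemd:

# "Cgroup Driver: systemd" means runc asks systemd to create the container scopes
docker info | grep -i cgroup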

Okay, so now we've established that there is indeed a transient scope configuration for each container and that somehow systemctl daemon-reload breaks the container. Is it pitchfork time yet?

Let's see what systemd thinks about our container. Let's also focus on device access as this is what seems to be broken:

# systemctl cat docker-<long-id>.scope | grep DeviceAllow
# /run/systemd/transient/docker-longidhere.scope.d/50-DeviceAllow.conf
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

There isn't much there - the first, empty DeviceAllow= resets the list, the following lines then allowlist devices. Remembering open /dev/null: operation not permitted from the very beginning of my journey, the DeviceAllow option is my prime suspect. Let's check how this configuration looks locally on my desktop:

# systemctl cat docker-<long-id>.scope | grep DeviceAllow
# /run/systemd/transient/docker-longidhere.scope.d/50-DeviceAllow.conf
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

💡Aha! We're onto something. The configuration is different! But if the container is misconfigured from the very beginning, how would it work in the first place?
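
And not just different in some random way: /dev/null is character device major 1, minor 3, so the DeviceAllow=/dev/char/1:3 line that my desktop has and the Docker host lacks is exactly the permission the original operation not permitted error was complaining about. Easy to check:

# the "1, 3" in the output is the major, minor pair - in other words /dev/char/1:3
ls -l /dev/null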

Also, if Gitlab is a Docker container like any other, why is it broken but the others aren't? Well, I had the same question and that's how I realized that every single container in my homelab is indeed broken. 😱 It's just that Gitlab happens to be the only one where this manifests as an obvious service malfunction - because it is actually trying - and failing - to access /dev/null. But I can't docker exec into any container where systemctl daemon-reload ran previously - which is effectively everywhere. 🤯 Now I feel a bit like Anatoly Andreyevich Sitnikov staring from the top of Block C into the remains of the now infamous reactor.

Okay, so how does the container even work if the configuration is wrong from the moment it starts?

What are containers anyway?🔗

If you're using any recent Linux distribution, chances are you're using cgroups v2 to run containers. (Or for service isolation in systemd) To cut the kernel documentation for this API down to the bare minimum, the TL;DR version is that a container is a process running in some control group (hence the name cgroup), where this group has (typically) a couple of controllers assigned which decide what kind of resources are available to the processes in the group.
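
You can see this for yourself - a container's main process is just a regular PID on the host, sitting in a particular cgroup (the container name is the one from earlier):

# main PID of the container, as seen from the host
pid=$(docker inspect --format '{{.State.Pid}}' gitlab)
# ...and the cgroup that process lives in
cat /proc/$pid/cgroup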

This is what provides the containerization. The container is a cgroup with particularly configured controllers to give an "illusion" of a (on the surface) pretty simple pseudo VM. Working with Docker containers one feels like there are two very separate worlds - the one inside the container (the guest) and the one where you run your shell (the host). In reality these are closer than you might think - it's even possible to move a running process from a container to the cgroup where your shell might be running. In fact you can move individual process threads and can end up with a single process that has multiple threads running in multiple cgroups. (and as a result in multiple containers 🤯) What is important for our investigation is that containers are really not that simple. It's a pretty involved setup which tools like Docker, Podman, LXC (or some of the software under these) abstract away for us into a very understandable docker run command.

Most of the controllers have a very nice filesystem-based API to manage them on a per-cgroup basis. In fact cgroup v2 is just a virtual filesystem mounted somewhere in your tree structure and you can create a cgroup just by making a subdirectory in it. And just like directories, cgroups also have a tree-like structure. (Which is how you're able to run Docker in Docker, for example) Individual controllers then also have a simple file-based API for setting the cgroup limits. For example you can echo "1G" > memory.max to set a memory limit for a cgroup. You can configure pretty much all of the controllers this way - all of the ones Docker containers normally use - except the one we are actually interested in: the device controller.
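
Before we get to that one, here is a minimal sketch of what the file-based API looks like for the well-behaved controllers, assuming cgroup v2 is mounted at /sys/fs/cgroup (the usual place) and a root shell:

# creating a directory creates a new cgroup
mkdir /sys/fs/cgroup/demo
# the file-based API: cap memory usage of everything in this cgroup at 1G
echo "1G" > /sys/fs/cgroup/demo/memory.max
# move the current shell into the new cgroup
echo $$ > /sys/fs/cgroup/demo/cgroup.procs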

The device controller🔗

This is the controller that manages access to device files. It is the one responsible for the open /dev/null: operation not permitted error we started with, many layers above.

And unlike the rest, it's not so much configured as it is controlled by little bpf programs. You can think of these as simple executable code written in a special assembly-like language. The program is executed on device access with a couple of parameters describing which device is being accessed and what kind of access is performed (read, write, mknod). The program can return 0 to block the access, or any other value to allow it.

Still holding a pitchfork? Systemd's DeviceAllow= option syntax suddenly doesn't look so bad huh? 😉

So what does our container device configuration look like? First let's check which bpf programs are attached to our container cgroup before it breaks:

# bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<some_long_id>.scope/
ID       AttachType      AttachFlags     Name
3814     device          multi

We can see there's one with ID 3814. What does it look like? Let's dump the code:

# bpftool prog dump xlated id 3814
(61) r2 = *(u32 *)(r1 +0)
(54) w2 &= 65535
(61) r3 = *(u32 *)(r1 +0)
(74) w3 >>= 16
(61) r4 = *(u32 *)(r1 +4)
(61) r5 = *(u32 *)(r1 +8)
(55) if r2 != 0x1 goto pc+5
(bc) w1 = w3
(54) w1 &= 1
(5d) if r1 != r3 goto pc+2
(b4) w0 = 1
(95) exit
(55) if r2 != 0x2 goto pc+5
(bc) w1 = w3
(54) w1 &= 1
(5d) if r1 != r3 goto pc+2
(b4) w0 = 1
(95) exit
(55) if r2 != 0x2 goto pc+4
(55) if r4 != 0x1 goto pc+3
(55) if r5 != 0x3 goto pc+2
(b4) w0 = 1
(95) exit
(55) if r2 != 0x2 goto pc+4
(55) if r4 != 0x1 goto pc+3
(55) if r5 != 0x5 goto pc+2
(b4) w0 = 1
(95) exit
... SNIP ...
(55) if r2 != 0x2 goto pc+3
(55) if r4 != 0x88 goto pc+2
(b4) w0 = 1
(95) exit
(b4) w0 = 0
(95) exit

The real code is a bit longer, I have cut out the middle part. Now I'm pretty sure you didn't expect to learn about BPF assembly when this article started with Gitlab throwing 500s, but here we are now. It might look scary at first sight. Impenetrable, even, for people like me who haven't worked with assembly in the past. But remember these wise words:

Stare into the darkness and the shadows will disappear. Never be afraid of the darkness, but stare it down. -- Masih Alinejad

Let's stare this code down and some patterns start to emerge. Skipping the initial part, we can see this:

(55) if r2 != 0x2 goto pc+4
(55) if r4 != 0x1 goto pc+3
(55) if r5 != 0x3 goto pc+2
(b4) w0 = 1
(95) exit
(55) if r2 != 0x2 goto pc+4
(55) if r4 != 0x1 goto pc+3
(55) if r5 != 0x5 goto pc+2
(b4) w0 = 1
(95) exit

These two blocks inspect some properties of the access request - the values loaded into r4 and r5 at the top of the program are the device major and minor numbers (and we already know that major 1, minor 3 is /dev/null; 1, 5 is /dev/zero) - and if everything matches, some value is set to 1 and the program exits. If any of the conditions does not match, the code skips ahead by a couple of lines to the next block.

This isn't so scary anymore, right? 🙂 Now what about the value set to 1? Let's remember what we've learned just a couple of minutes ago: the program returns 0 for access denied and any other value to allow access. Let's look at the end of the code:

... SNIP ...
(55) if r2 != 0x2 goto pc+3
(55) if r4 != 0x88 goto pc+2
(b4) w0 = 1
(95) exit
(b4) w0 = 0
(95) exit

💡Aha! The code just goes device by device. When it finds a match, it sets the return value to 1 and exits. If none of the blocks match, the code eventually hits the last two lines that set the return value to 0 before exiting - which we know means operation not permitted.

Now, this is how the code looks before the container is broken. How does it look after systemctl daemon-reload?

# bpftool cgroup list /sys/fs/cgroup/system.slice/docker-<some_long_id>.scope/
ID       AttachType      AttachFlags     Name
3814     device          multi
3815     device          multi

Oh, so now there are two programs attached! The 3814 still looks the same, let's inspect the new one:

# bpftool prog dump xlated id 3815
(61) r2 = *(u32 *)(r1 +0)
(54) w2 &= 65535
(61) r3 = *(u32 *)(r1 +0)
(74) w3 >>= 16
(61) r4 = *(u32 *)(r1 +4)
(61) r5 = *(u32 *)(r1 +8)
(bc) w1 = w3
(54) w1 &= 1
(5d) if r1 != r3 goto pc+2
(55) if r2 != 0x1 goto pc+1
(05) goto pc+10
(bc) w1 = w3
(54) w1 &= 1
(5d) if r1 != r3 goto pc+2
(55) if r2 != 0x2 goto pc+1
(05) goto pc+5
(55) if r2 != 0x2 goto pc+2
(55) if r4 != 0x88 goto pc+1
(05) goto pc+2
(b7) r0 = 0
(05) goto pc+1
(b7) r0 = 1
(95) exit

Now this one is much shorter! From what we understand about the bpf code now, this also looks much more in line with the systemd configuration we've observed earlier:

DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

I don't have to understand the code completely to see that systemctl daemon-reload makes systemd actually apply the device configuration, which breaks the container: with multiple programs attached, every one of them has to allow the access, and while the original program still returns 1 for /dev/null, the new one returns 0 and blocks it.

Where does the first program come from then? As is often the case with the Internet, it's much easier to find the answer when you already know most of it.

As it turns out, runc actually sets its own device controller configuration while also setting the systemd one.

So systemd is just doing exactly what it is told to do; it's actually runc setting it up wrong. You can put the pitchforks down now. Systemd itself was a red herring.

What is going on with runc?🔗

The issue I've linked above was actually fixed in version 1.1.4 of runc, which is already installed on my system. So we're almost at the root cause, but not quite there yet.

Looking at the patch, we see that it loops through devices in /dev/char/, skipping any that do not exist, to avoid systemd warnings about non-existent devices. Which is where the root cause of my issues actually is. The problematic systems in my homelab are LXC containers. For some reason (which is beside the point, but let's just say this is the default when you lxc launch ... a container) the system does not have a /dev/char directory at all. Which means that runc skips generating all the per-device rules, creating the incorrect configuration which then waits like a land mine for someone to step on it with systemctl daemon-reload.
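
If you want to check whether a host of yours is sitting on the same land mine, there are two quick things to look at:

# the runc version matters - 1.1.4 is the release that generates the broken rules when /dev/char is missing
runc --version
# if this directory is missing entirely (as on my LXC-based hosts), runc skips all the per-device rules
ls -ld /dev/char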

As it turns out, I'm not the only one affected by the issue. And literally as I'm investigating the problem and asking around before reporting a bug, runc 1.1.5 is released. And it contains a fix for the bug I'm observing.

Upgrading runc and restarting the containers to let runc generate the correct configuration solves the problem for good. 🎉

Final thoughts🔗

When I started the investigation, I just hoped to fix Gitlab. As it turns out the root cause was somewhere else altogether. Which is how I got much more from this than I ever hoped for. The real treasure isn't the runc bug fix, it's the knowledge we gained along the way.

There's no comment system on this page, but I do accept feedback. If you are interested in commenting, send me a message and I may publish your comments, in edited form.