LXC and the mystery of lost memory

In my homelab I use LXD/LXC quite a bit. Linux containers provide a pretty decent alternative to full-blown virtualization, with smaller overhead and some extra perks like easier local filesystem access. All things considered, I’ve been happy with LXD for many years.

However, there are places where the illusion of completely separate virtual machines shows some cracks when you look closely enough. This is the story of one such crack that haunted me for months.

After years of using LXD, one of the containers started having something that on the surface looked like memory issues. The container had an 8GB memory limit, and the workload running on it needed around 5GB of RAM. It should run within these constraints just fine - and in fact it did for many years. But then one day...

Service runs out of memory

A couple of weeks after the last reboot, the service starts to misbehave in a weird way. The web interface is super slow to load, and eventually just won’t load at all. This is a non-critical service in my homelab, and by the time I get to investigate, I can’t even ssh in. Fortunately this is where the convenience of containers helps, so I can still lxc exec in and have a look. Still, I can hardly run anything. The system behaves as if it’s absolutely out of memory, and sure enough, free -m reports 0MB free, yet over 4GB of RAM as available, most of it in buff/cache according to the utility.

Now if you try to search online for why there’s low free memory on Linux, the most likely answers you’re going to find are "this is normal", "it’s expected behavior of Linux" and "free RAM is wasted RAM". I know that, we all know that. It is such a common question from Linux newbies that there is an actual webpage dedicated to this problem. So when I try to search for why free actually is zero, my chances of finding anything relevant are drowned in a sea of "this is how Linux memory management works" answers.

Understandably so, that would be my first reaction as well.

Eventually the system is so starved of memory that I fail to do anything reasonable within the container shell, and I decide to just restart the container. The problem disappears with the reboot and all is well with the world, up until a couple of weeks later.

Over the next few months this repeats about once every 5-ish weeks. I’m super busy at the time, so I just reboot the container to quickly restore the service and promise myself to have a look later. Which I avoid for a while.

Where is the memory actually used

Finally there’s a free evening and I still feel fresh enough to do some investigation, so I log into the container and have a look. This time the service is still running just fine and the system is not running out of memory at all, so I wonder if there will be anything to find. free memory is about where I’d expect it (a couple hundred MB), so I peek at available, sitting somewhere in the 3GB area. Again, nothing unusual. But now I’m in discovery mode. free -m probably won’t tell me much more, so I turn to the contents of /proc/meminfo. There’s a bunch of metrics there, so I open the documentation to confirm what each one means. Some are pretty obvious, some can be a bit tricky.

One particular sentence caught my eye:

Shmem: Total memory used by shared memory (shmem) and tmpfs

🤔 Hmmm..

Is tmpfs eating my memory?

It sure looks like that’s the case! df tells me there’s a tmpfs mounted on /run with almost 2GB (at the time of investigation) of space used. This memory is then reported in the buff/cache column and also as available. Which sounds wrong, but it actually makes sense - this memory can be pushed out to swap if needed. So even if it’s not really a "buffer", it can sort of be freed like one. The thing is, you need to have swap, which I don’t (for reasons).
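To make the Shmem observation a bit more concrete, here is a minimal Python sketch that parses /proc/meminfo-style text and works out how much of the Cached figure is really shmem/tmpfs. The function names and the sample values are made up for illustration (the sample only roughly mirrors the numbers in this post), and it assumes Cached includes Shmem, which is how the kernel documentation describes it.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def shmem_share_of_cached(fields):
    """Fraction of 'Cached' that is shmem/tmpfs (assumes Cached includes Shmem)."""
    cached = fields.get("Cached", 0)
    return fields.get("Shmem", 0) / cached if cached else 0.0


# Hypothetical sample, NOT real output from the container in this post.
SAMPLE = """\
MemTotal:        8388608 kB
MemFree:          204800 kB
MemAvailable:    3145728 kB
Cached:          2936012 kB
Shmem:           1887436 kB
"""

info = parse_meminfo(SAMPLE)
print(f"Shmem: {info['Shmem'] / 1024:.0f} MB, "
      f"{shmem_share_of_cached(info):.0%} of Cached")
```

To try it on a real system, replace SAMPLE with the contents of /proc/meminfo. A large Shmem share on a machine without swap means a good chunk of that "cache" cannot actually be reclaimed.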

So now I know: it’s tmpfs eating my memory, and without swap there is no way to free it up when needed.

But why?

That explains why memory management behaved the way it did, but why is it a problem now and not for the last couple of years? I can very confidently rule out the application, knowing it does not use /run at all. Something else must have changed.

A quick inspection with du points to the journal using up all this tmpfs space with its logs. I can already hear the systemd hate pitchforks getting ready as I’m typing this, but surely the journald folks can’t be this reckless and just use all the RAM? And of course they aren’t. In fact the internal logic is pretty sane: by default it uses up to 10% of the filesystem’s size for logs while making sure 15% stays free, with a hard limit of 4GB.

Yet as I was looking at this, journald was already using almost 2GB of RAM - way above the 10% I’d expect (which with 8GB of RAM should be around 800MB). And that is even before the service is impacted; looking at the graphs, journald was peaking at 4GB of space used.

Huh?

Well, the explanation for that ends up being quite simple, if maybe a bit unexpected. When mounting tmpfs, an optional size option can be provided. This sets the maximum size for the mount. The default is "half of your physical RAM without swap". The somewhat surprising (although very logical) fact is that this default value does not in any way reflect the amount of memory available to a specific container. So on my server with 96GB of RAM, /run is created with 48GB of capacity:

$ df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            48G  1.8G   46G   3% /run

💡 Aha!

So when journald starts, it looks at /run and sees 48GB of capacity. 10% of that is just under 5GB, so the 4GB maximum is used instead. Looking back at the graphs from when the service was impacted, journald really was near the 4GB limit - almost half of the container’s memory occupied by logs. That leaves the rest to the system and apps, which is just not enough.

Note
At no stage is the container using more memory than allocated. Even if we tried to fill up the whole /run with files, Linux wouldn’t let the container use more than the 8GB allocated and tmpfs would simply fail writes with file system full.

But why now?

So now that we’ve come all the way from a failing web interface down to the default size of tmpfs, there’s just one question left unanswered. Why did the LXD container run happily for years and only start failing a couple of months back?

Well, you see, I managed to buy some cheap memory (it’s an older DDR3 system) a couple of months back and bumped the server’s RAM from 48GB to 96GB. This in turn changed the tmpfs default from 24GB to 48GB, which bumped the journald limit up from about 2.4GB to 4GB. And the 1.6GB difference was juuuust enough to not fit together with the application anymore.
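The chain of defaults is easy to check with a bit of arithmetic. The Python sketch below is a simplified model, not journald’s actual logic: it assumes tmpfs defaults to half of host RAM, that journald caps its runtime logs at 10% of the filesystem size with a 4GB hard limit, and it ignores the 15% keep-free rule. The function names are made up for illustration, and the 5GB application footprint is the rough figure from earlier in the post.

```python
GIB = 2**30


def tmpfs_default_size(host_ram):
    """Default tmpfs size: half of the host's physical RAM."""
    return host_ram // 2


def journald_runtime_cap(fs_size):
    """Simplified journald default: 10% of the filesystem size, at most 4 GiB."""
    return min(fs_size // 10, 4 * GIB)


def headroom(host_ram, container_limit=8 * GIB, app_usage=5 * GIB):
    """Memory left in the container once the app and a full journal are resident."""
    journal = journald_runtime_cap(tmpfs_default_size(host_ram))
    return container_limit - app_usage - journal


for host_gib in (48, 96):
    host_ram = host_gib * GIB
    run_size = tmpfs_default_size(host_ram)
    cap = journald_runtime_cap(run_size)
    print(f"host {host_gib}G: /run {run_size / GIB:.0f}G, "
          f"journal cap {cap / GIB:.1f}G, "
          f"headroom {headroom(host_ram) / GIB:+.1f}G")
```

Under these assumptions the headroom is positive with the old amount of host RAM and negative with the new one, even though the container’s own 8GB limit never changed.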

So here’s the punchline: my container ran out of memory because I added more RAM to the server.