27 Nov 2025

Linux-Yocto Kernel 6.17 Memory Issue

Nov 26, 2025

I have a small board farm containing one of every device I have deployed, connected to many of the same sensors I have deployed, which I use to test new builds before rolling them out. I've had too many cases where an update causes a device to fail (some configuration doesn't align with some new software) or to become unreachable (a network device's name changes and the interface doesn't come up), so I want to be able to test updates on devices within easy reach before deploying to locations that would be harder to reach physically.

The devices are mostly Rockchip-based: rock-pi-e, rock-pi-s, and radxa zero-3e. Each one is connected to a different set of sensors, mostly from Adafruit, measuring properties such as temperature, humidity, light, and barometric pressure: the bh1750, sht30 (indoor), sht30 (outdoor), shtc3, aht20, lps25, mcp3002, ds18b20 (regular temperature), and ds18b20 (high temperature), plus some experiments with others.

A "server" device (rock 5b) polls each unit for its data, and this data (along with the poll latency) is recorded on the server. Recently I started adding more data to the list of values each device was reporting such as cpu temperature, uptime, loadavg, and various pieces of memory (RAM) information. Luckily I started adding RAM information just in time before one of my devices performed a kernel panic due to running out of memory (provided below).

Most of the boards can be purchased with various amounts of RAM. For example, the rock-pi-s can have either 256MB or 512MB; the boards I bought have 256MB. The rock-pi-e and zero-3e devices I use come with 1GB of RAM.

I had noticed memory issues with the latest kernel (6.17) on my rock-pi-s devices and decided to configure them to keep using the 6.12 kernel. systemd tends to use a lot more RAM than sysvinit, so I had made some tweaks to the amount of memory it is allowed to use. To see whether these changes had a positive effect, I updated the rock-pi-s in my board farm to use the 6.17 kernel again. Unfortunately it still throws an out-of-memory kernel panic on the 6.17 kernel despite my systemd RAM-usage tweaks.
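
The tweaks themselves aren't the point of this post; purely for illustration, limits of this general kind can be set with systemd drop-in files (example values only, not my actual settings):

# cap how much RAM/disk journald may use
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/size.conf <<'EOF'
[Journal]
RuntimeMaxUse=16M
SystemMaxUse=32M
EOF
# per-service limits work the same way: a drop-in under
# /etc/systemd/system/<service>.service.d/ containing
# MemoryHigh=/MemoryMax= in its [Service] section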

It is worth pointing out that in all of my tests below, the only difference between the updates that encountered issues and those that did not is the version of the kernel. All of the user-space software remained the same and is run in the same way. The only difference between one update and the previous one is that the kernel went from 6.12.55 to 6.17.6.

Here is a graph of the memory use of my test-rockpis boardfarm system (with 256MB of RAM) running the 6.12.55 kernel, pretty much with the in-kernel defconfig. In this graph the system ran for almost 16 days. It didn't crash; after 16 days I simply uploaded a new RAUC bundle. The purple line is the total memory (230428 kB), the blue line is the amount of available memory, and the green line is the amount of free memory:


Here's a graph of the same device on the next update, also with a 6.12.55 kernel, which ran for almost 22 days:


For the next update the only change I made was to switch from the 6.12.55 kernel to the 6.17.6 kernel. The total memory decreased from 230428 kB to 229636 kB:



It only ran for about 4.5 days before issuing the following kernel panic:

[508644.063311] Kernel panic - not syncing: System is deadlocked on memory
[508644.063928] CPU: 2 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.6-yocto-standard-00102-g5a817ec7a796-dirty #1 PREEMPT 
[508644.064898] Hardware name: Radxa ROCK Pi S (DT)
[508644.065319] Call trace:
[508644.065557]  show_stack+0x18/0x30 (C) 
[508644.065921]  dump_stack_lvl+0x60/0x80
[508644.066277]  dump_stack+0x18/0x24
[508644.066602]  vpanic+0x124/0x2e8
[508644.066909]  abort+0x0/0x4
[508644.067176]  out_of_memory+0x560/0x580
[508644.067540]  __alloc_frozen_pages_noprof+0xc24/0xcf4
[508644.068007]  alloc_pages_mpol+0xb4/0x1a4
[508644.068384]  alloc_frozen_pages_noprof+0x44/0xc0
[508644.068822]  new_slab+0x328/0x3b0
[508644.069142]  ___slab_alloc+0x5dc/0x9c0
[508644.069503]  __slab_alloc.isra.0+0x34/0x68
[508644.069893]  __kmalloc_cache_noprof+0x168/0x2c0
[508644.070322]  copy_verifier_state+0x1bc/0x1f8
[508644.070732]  push_stack+0x7c/0x100
[508644.071062]  check_cond_jmp_op+0x3d8/0x13cc
[508644.071463]  do_check_common+0x28ac/0x2cec
[508644.071856]  bpf_check+0x247c/0x3220
[508644.072204]  bpf_prog_load+0x620/0xbe0
[508644.072564]  __sys_bpf+0x7b8/0x205c
[508644.072902]  __arm64_sys_bpf+0x20/0x30
[508644.073264]  invoke_syscall.constprop.0+0x40/0xf0
[508644.073712]  el0_svc_common.constprop.0+0x38/0xd8
[508644.074160]  do_el0_svc+0x1c/0x28
[508644.074485]  el0_svc+0x34/0xe8
[508644.074789]  el0t_64_sync_handler+0xa0/0xe4
[508644.075189]  el0t_64_sync+0x198/0x19c
[508644.075545] SMP: stopping secondary CPUs
[508644.076001] Kernel Offset: 0x391170c00000 from 0xffff800080000000
[508644.076560] PHYS_OFFSET: 0xfff1000000000000
[508644.076952] CPU features: 0x000000,00010000,20002000,0400421b
[508644.077484] Memory Limit: none
[508644.077787] ---[ end Kernel panic - not syncing: System is deadlocked on memory ]---


For comparison, here is a plot of the memory usage of my rock-pi-e boardfarm device (1GB RAM) using a 6.17 kernel running for over 21 days:

[purple=total    blue=available    green=free]


Here's a plot of a zero-3e device (1GB RAM) on a 6.17 kernel running for the same 21-22 days:


Nov 27, 2025

Today I noticed that my "server" device (4GB RAM, rock-5b, on the 6.17 kernel) did not throw a kernel panic, but it briefly went haywire (very high loadavg) after running for almost 9 days (it is on a different update schedule than the devices in the boardfarm), and the log shows that it invoked the OOM killer aggressively:

Nov 27 05:28:38 server kernel: Out of memory: Killed process 2244 (systemd) total-vm:17296kB, anon-rss:1792kB, file-rss:608kB, shmem-rss:0kB, UID:0 pgtables:72kB oom_score_adj:100
Nov 27 05:28:37 server (sd-pam)[2247]: pam_systemd(systemd-user:session): Failed to release session: No session '1' known
Nov 27 05:41:55 server kernel: Out of memory: Killed process 2985663 (png_shelly_humi) total-vm:36624kB, anon-rss:15360kB, file-rss:1636kB, shmem-rss:0kB, UID:0 pgtables:112kB oom_score_a>
Nov 27 05:55:45 server kernel: Out of memory: Killed process 3047554 (png_temp_1w_out) total-vm:32000kB, anon-rss:10752kB, file-rss:444kB, shmem-rss:0kB, UID:0 pgtables:104kB oom_score_ad>
Nov 27 05:56:35 server kernel: Out of memory: Killed process 3050620 (png_light_separ) total-vm:31824kB, anon-rss:10112kB, file-rss:852kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_ad>
Nov 27 05:59:19 server kernel: Out of memory: Killed process 3053441 (convert) total-vm:20760kB, anon-rss:3328kB, file-rss:812kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
Nov 27 05:59:20 server kernel: Out of memory: Killed process 3053435 (convert) total-vm:20760kB, anon-rss:3328kB, file-rss:636kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
Nov 27 05:59:22 server kernel: Out of memory: Killed process 3053443 (convert) total-vm:20760kB, anon-rss:3328kB, file-rss:656kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
Nov 27 05:59:22 server kernel: Out of memory: Killed process 3053436 (convert) total-vm:20760kB, anon-rss:3456kB, file-rss:632kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
Nov 27 05:59:22 server kernel: Out of memory: Killed process 3053439 (convert) total-vm:20900kB, anon-rss:3500kB, file-rss:516kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0
Nov 27 05:59:22 server kernel: Out of memory: Killed process 3053438 (convert) total-vm:31900kB, anon-rss:14716kB, file-rss:416kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646649 (convert) total-vm:23604kB, anon-rss:6144kB, file-rss:132kB, shmem-rss:0kB, UID:0 pgtables:84kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646648 (convert) total-vm:22028kB, anon-rss:4648kB, file-rss:156kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646646 (convert) total-vm:31900kB, anon-rss:9968kB, file-rss:404kB, shmem-rss:0kB, UID:0 pgtables:92kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646643 (convert) total-vm:31900kB, anon-rss:14252kB, file-rss:116kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646642 (convert) total-vm:31900kB, anon-rss:14360kB, file-rss:40kB, shmem-rss:0kB, UID:0 pgtables:100kB oom_score_adj:0
Nov 27 12:33:05 server kernel: Out of memory: Killed process 646645 (convert) total-vm:31900kB, anon-rss:14288kB, file-rss:84kB, shmem-rss:0kB, UID:0 pgtables:108kB oom_score_adj:0

This system did not panic and halt, but it was clearly showing memory problems.

This issue is not unique to the rock-pi-s board I was using; it's just that, with only 256MB of RAM, that is where the issue was spotted earliest. The graphs above (from Nov 26) of my 1GB devices running the 6.17.6 kernel are, in retrospect, showing the same problem I saw on the rock-pi-s, just much more slowly, to the point where their graphs look better than the graph for the 256MB device.

Arnd, on #armlinux, suggested I start tracking slabinfo, slabtop -s c, and lsof. I did. He also suggested I flush the cache before recording any information:

if you run 'echo 2 > /proc/sys/vm/drop_caches' to free all reclaimable slab caches before looking at /proc/slabinfo, you can rule out the ones that would be freed before oom
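
A rough sketch of the kind of periodic snapshot I started recording (file names and interval are arbitrary):

#!/bin/sh
# drop reclaimable slab caches first, then record slab and open-file state
ts=$(date +%Y%m%d-%H%M%S)
echo 2 > /proc/sys/vm/drop_caches
cp /proc/slabinfo "/var/log/slabinfo.$ts"
slabtop -o -s c | head -n 20 > "/var/log/slabtop.$ts"
lsof -n | wc -l > "/var/log/lsof-count.$ts"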


Dec 01, 2025

After letting my system run for a couple of days, I started by running the cache-flush line recommended by Arnd, then:

# slabtop -s c
 Active / Total Objects (% used)    : 483354 / 495272 (97.6%)
 Active / Total Slabs (% used)      : 19450 / 19450 (100.0%)
 Active / Total Caches (% used)     : 108 / 161 (67.1%)
 Active / Total Size (% used)       : 89564 / 92540 (96.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.19K / 8.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
218820 218820 100%    0.19K  10420       21      41680 filp
 60564  60555  99%    0.19K   2884       21      11536 kmalloc-cg-192
 17136  16958  98%    0.25K   1071       16       4284 maple_node
 20958  20841  99%    0.19K    998       21       3992 kmalloc-192
  5575   5575 100%    0.62K    223       25       3568 debugfs_inode_cache
...

This indicates that the kernel's filp slab cache (which holds open-file structures) has 218,820 active objects; in other words, the kernel thinks that many files are currently open on the system. However:

# lsof -n | wc -l
1260

lsof is only aware of 1,260 open files on the system; a bit of a discrepancy. Thanks to Arnd for helping me interpret this information.
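
If you just want to track that one number over time, the filp line can be read straight out of /proc/slabinfo without the rest of the slabtop output:

# first two numeric columns are active_objs and num_objs for the
# open-file (struct file) cache
grep '^filp ' /proc/slabinfo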

 

Dec 02, 2025

On #kernelnewbies, silurian_invader suggested I enable CONFIG_DEBUG_KMEMLEAK. I did, updated my system, and rebooted (Yocto makes this cycle so easy!).
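
For reference, enabling it in a Yocto build is just a kernel config fragment pulled in through a bbappend. A sketch (layer paths are illustrative; kmemleak's interface lives in debugfs, so CONFIG_DEBUG_FS needs to be enabled and debugfs mounted):

# run from the layer's recipes-kernel/linux/ directory
mkdir -p linux-yocto
printf 'CONFIG_DEBUG_KMEMLEAK=y\n' > linux-yocto/kmemleak.cfg
cat > linux-yocto_%.bbappend <<'EOF'
FILESEXTRAPATHS:prepend := "${THISDIR}/${PN}:"
SRC_URI += "file://kmemleak.cfg"
EOF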

After about 2 hours I did the drop_caches thing, then:

# cat /sys/kernel/debug/kmemleak | grep "^unreferenced object" | wc -l
6324

After only 2 hours the kernel is already aware of 6,324 unreferenced kernel objects! One of them looks like this:

unreferenced object 0xffff00000246b6c0 (size 192):
  comm "systemd", pid 1, jiffies 4294893229
  hex dump (first 32 bytes):
    02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 133687fb):
    kmemleak_alloc+0x38/0x44
    kmem_cache_alloc_noprof+0x214/0x300
    prepare_creds+0x24/0x338
    copy_creds+0x2c/0x1c0
    copy_process+0x354/0x14b4
    kernel_clone+0x68/0x36c
    __do_sys_clone3+0xe0/0x140
    __arm64_sys_clone3+0x14/0x20
    invoke_syscall.constprop.0+0x40/0xf0
    el0_svc_common.constprop.0+0x38/0xd8
    do_el0_svc+0x1c/0x28
    el0_svc+0x34/0xe8
    el0t_64_sync_handler+0xa0/0xe4
    el0t_64_sync+0x198/0x19c

I'm guessing the answer lies in examining the backtraces. Most of them look quite similar, so I'll try to summarize them to see where the similarities and differences show up.

On #mm, dhansen and heat had a look. One thing that was suggested was:

# cat /sys/kernel/debug/kmemleak | sort | uniq -c | sort -rn | head -n50
  22356     kmemleak_alloc+0x38/0x44
  22356     invoke_syscall.constprop.0+0x40/0xf0
  22356     el0t_64_sync_handler+0xa0/0xe4
  22356     el0_svc+0x34/0xe8
  22356     do_el0_svc+0x1c/0x28
  22277     el0_svc_common.constprop.0+0xb8/0xd8
  21193   hex dump (first 32 bytes):
  21193     kmem_cache_alloc_noprof+0x214/0x300
  19464     el0t_64_sync+0x198/0x19c
  16704     path_openat+0x48/0xfd0
  16704     do_filp_open+0xa8/0x170
  16704     alloc_empty_file+0x54/0x11c
  16704     00 00 00 00 1d 80 4a 0c 40 3e b4 44 47 a0 ff ff  ......J.@>.DG...
  12550     do_sys_openat2+0x90/0xf8
  12550     __arm64_sys_openat+0x68/0xc0
   6179     __arm64_sys_execve+0x40/0x5c
   4917     do_execveat_common.isra.0+0x1a0/0x1e0
   4489     prepare_creds+0x24/0x338
   4489     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   4154     do_open_execat+0x64/0x16c
   3675     02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   2892     open_exec+0x30/0x70
   2892     bprm_execve+0x280/0x4a4
   2077     load_elf_binary+0x1d8/0x1578
   2077     98 29 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .)..............
   2076     a8 43 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .C..............
   2025     prepare_exec_creds+0x14/0x58
   2025     bprm_execve+0x44/0x4a4
   1262     do_execveat_common.isra.0+0x6c/0x1e0
   1262     alloc_bprm+0x28/0x240
   1230     set_current_groups+0x1c/0x90
   1230     __arm64_sys_setgroups+0x16c/0x23c
   1228     18 6f a2 02 00 00 ff ff 00 00 00 00 00 00 00 00  .o..............
   1227     00 48 5e 05 00 00 ff ff 00 00 00 00 00 00 00 00  .H^.............
   1163     __kvmalloc_node_noprof+0x3ac/0x4fc
   1163     __arm64_sys_setgroups+0x8c/0x23c
    846     18 6f 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .o..............
    834     e0 13 ba 02 00 00 ff ff 00 00 00 00 00 00 00 00  ................
    822     kernel_clone+0x68/0x36c
    822     copy_process+0x354/0x14b4
    822     copy_creds+0x2c/0x1c0
    815     load_script+0x1fc/0x2e0
    807     __do_sys_clone+0x70/0xb4
    807     __arm64_sys_clone+0x1c/0x28
    791   hex dump (first 8 bytes):
    432     c8 77 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .w..............
    432     10 62 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .b..............
    430     c0 6a 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .j..............
    429     b0 50 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  .P..............
    428     60 59 96 01 00 00 ff ff 00 00 00 00 00 00 00 00  `Y..............
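
The hex-dump lines in that output are mostly noise; restricting the count to the backtrace entries (they all contain a +0x<offset> suffix) gives a slightly cleaner picture:

grep '+0x' /sys/kernel/debug/kmemleak | sort | uniq -c | sort -rn | head -n 30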

 

Dec 04, 2025

At this point I am quite certain that the problem is in the linux-yocto kernel specifically, and not in upstream linux-stable (on which linux-yocto is based). I created a linux-stable recipe so I could build kernels directly from upstream linux-stable (without any Yocto patches applied), and I was unable to get any of them to reproduce the memory leak.

If you are running a 6.17-based linux-yocto kernel on your device, try the slabtop -s c command demonstrated above. If the top item is filp, your kernel is leaking memory; specifically, it is leaking open-file structures that are never properly closed and cleaned up.
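
In other words, something like this (drop the reclaimable caches first so freeable slabs don't muddy the picture):

echo 2 > /proc/sys/vm/drop_caches
slabtop -o -s c | head -n 12    # -o prints once and exits, -s c sorts by cache size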

Now the bisection begins...


Dec 06, 2025

Bisecting linux-yocto, in my spare time, I landed on https://git.yoctoproject.org/linux-yocto/commit/?h=v6.17/standard/base&id=ca4826d81209e0cd0a5521dbdb194de3a40ec650

This patch is part of a series that adds aufs support to the linux-yocto kernel. Aufs is one of several separate solutions that provide union mount capabilities on Linux, the others being UnionFS and OverlayFS. UnionFS was the original, then aufs and OverlayFS came along. Aufs, however, was never merged into the mainline kernel. Several distributions carry the out-of-tree aufs patches, and, seemingly, the linux-yocto kernel does as well, but these patches are not maintained with the rest of the kernel and appear to have developed a leak.

A filesystem union mount allows you to take two directories (for example), from two completely different filesystems, and mount them on top of each other so that they are presented to the user as one filesystem containing the union of the individual files on which it is built. Union mounts are useful in several situations. For example, if you have a read-only filesystem (e.g. SquashFS) but you need to write to it from time to time, you can mount a second, writable, filesystem on top of it. You can configure it so that the writes end up in the second filesystem, leaving the underlying read-only filesystem unchanged.

Union mounts are also useful for A/B update mechanisms. In a full-disk A/B update system (such as RAUC can be configured to provide), the entire filesystem is replaced on each update, and you boot from one partition or the other. So, for example, if you add a user and set their password (which is stored in /etc), then update and boot into the new bundle, those changes to /etc exist only on the old partition and are lost. Configuring a union overlay so that changes like these are stored on a non-updated data partition, then overlaid onto whichever partition's filesystem is currently running, means any configuration changes performed in one bundle are available in the next.
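
As a concrete illustration of the /etc example above, this is roughly what such a mount looks like with OverlayFS (paths are made up for the example; aufs uses a different mount syntax to achieve the same thing):

# writes to /etc land in /data/etc/upper; the read-only lower /etc is untouched
mkdir -p /data/etc/upper /data/etc/work
mount -t overlay overlay \
      -o lowerdir=/etc,upperdir=/data/etc/upper,workdir=/data/etc/work \
      /etc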

In any case, this out-of-tree patch set will need some attention, or it will need to be dropped from linux-yocto.

WARNING: Speculation, more investigation needed

I did not notice any kernel memory leaks when running the previous linux-yocto kernel, 6.12. Maybe the leak was there but just slower? I don't know and will need to investigate.

Also, note that I am not (currently) using any union/overlay filesystem mechanism on my devices, including aufs. In fact, I don't even think aufs is enabled in my 6.17 kernel. So it appears as though simply having this patch series applied to the kernel source tree (even without enabling or using it in any way) is enough to trigger the leak.
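
A quick way to check whether aufs is even available on a running system:

# a registered (built-in or loaded) aufs would show up here
grep aufs /proc/filesystems
# and, if the kernel exposes its config (CONFIG_IKCONFIG_PROC):
zcat /proc/config.gz | grep -i aufs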