Mailing List Archive

Machine hangs up with out of memory
Hi,

I have an issue with a machine where I'm not able to detect the real
root cause. It hangs up totally. It seems like it was running out of
memory - but why? Hopefully somebody can give me some insight. As far I
can see right now, it hangs up a few hours after an `emerge --update
--newuse --deep --with-bdeps=y @world`.

The machine is an Intel Atom with 8 GB RAM (physical, max) and 24 GB
swap (a file). So 32 GB RAM in total. It has a 250GB SSD. It runs
gentoo-sources-4.14.83 build with genkernel. Portage uses the stable
tree only. It basically provides the hardware for a qemu VM which does
the network management: primary ns, dhcp, apache ssl proxy. This VM uses
4 GB RAM and has 8 GB swap file. The VM works smoothly. The atom machine
itself acts further as basic nfs server to an independent dedicated
server - which does the (re)exports - and as secondary ns. For this I'm
convinced that 32GB RAM total have to be enough - correct me if I'm
wrong!

The issue starts round about in February (IIRC). The update of gcc-10.2
did fail. I have /var/tmp/portage on tmpfs - I did increase the size in
fstab from 8 to 16 GB. Afterwards gcc build successfully.

After two hang-ups I did increase the swap from 8 to 24 GB. It doesn't
help. Here is a complete log from /var/log/messages:

Apr 28 05:35:57 Syrin kernel: [1454017.499919] isc-net-0000 invoked
oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD),
nodemask=(null), order=0, oom_score_adj=0
Apr 28 05:35:57 Syrin kernel: [1454017.499925] isc-net-0000 cpuset=/
mems_allowed=0
Apr 28 05:35:57 Syrin kernel: [1454017.499933] CPU: 0 PID: 27685 Comm:
isc-net-0000 Not tainted 4.14.83-gentoo #1
Apr 28 05:35:57 Syrin kernel: [1454017.499935] Hardware name: MSI
MS-7877/J1900I, BIOS V1.2 03/25/2014
Apr 28 05:35:57 Syrin kernel: [1454017.499936] Call Trace:
Apr 28 05:35:57 Syrin kernel: [1454017.499948] dump_stack+0x67/0x98
Apr 28 05:35:57 Syrin kernel: [1454017.499954] dump_header+0x94/0x20c
Apr 28 05:35:57 Syrin kernel: [1454017.499958]
oom_kill_process+0x24a/0x420
Apr 28 05:35:57 Syrin kernel: [1454017.499962] ?
oom_badness.part.9+0xd3/0x150
Apr 28 05:35:57 Syrin kernel: [1454017.499965] out_of_memory+0xf9/0x290
Apr 28 05:35:57 Syrin kernel: [1454017.499968]
__alloc_pages_nodemask+0xf48/0xff0
Apr 28 05:35:57 Syrin kernel: [1454017.499974]
filemap_fault+0x294/0x4c0
Apr 28 05:35:57 Syrin kernel: [1454017.499979]
ext4_filemap_fault+0x2c/0x40
Apr 28 05:35:57 Syrin kernel: [1454017.499983] __do_fault+0x1f/0xb0
Apr 28 05:35:57 Syrin kernel: [1454017.499986]
__handle_mm_fault+0x3ed/0xad0
Apr 28 05:35:57 Syrin kernel: [1454017.499991]
handle_mm_fault+0xaa/0x1f0
Apr 28 05:35:57 Syrin kernel: [1454017.499996]
__do_page_fault+0x250/0x4f0
Apr 28 05:35:57 Syrin kernel: [1454017.500000] ? page_fault+0x2f/0x50
Apr 28 05:35:57 Syrin kernel: [1454017.500003] page_fault+0x45/0x50
Apr 28 05:35:57 Syrin kernel: [1454017.500005] RIP: 0000:
(null)
Apr 28 05:35:57 Syrin kernel: [1454017.500007] RSP:
12f83750:0000000000000001 EFLAGS: 7ffa12f837a0
Apr 28 05:35:57 Syrin kernel: [1454017.500010] Mem-Info:
Apr 28 05:35:57 Syrin kernel: [1454017.500017] active_anon:1694713
inactive_anon:211859 isolated_anon:0
Apr 28 05:35:57 Syrin kernel: [1454017.500017] active_file:328
inactive_file:344 isolated_file:32
Apr 28 05:35:57 Syrin kernel: [1454017.500017] unevictable:1374 dirty:0
writeback:0 unstable:0
Apr 28 05:35:57 Syrin kernel: [1454017.500017] slab_reclaimable:4480
slab_unreclaimable:7449
Apr 28 05:35:57 Syrin kernel: [1454017.500017] mapped:1071 shmem:3
pagetables:16352 bounce:0
Apr 28 05:35:57 Syrin kernel: [1454017.500017] free:11655 free_pcp:534
free_cma:0
Apr 28 05:35:57 Syrin kernel: [1454017.500021] Node 0
active_anon:6778852kB inactive_anon:847436kB active_file:1312kB
inactive_file:1376kB unevictable:5496kB isolated(anon):0kB
isolated(file):128kB mapped:4284kB dirty:0kB writeback:0kB shmem:12kB
writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
Apr 28 05:35:57 Syrin kernel: [1454017.500026] DMA free:15836kB min:20kB
low:32kB high:44kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:15920kB
managed:15836kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
Apr 28 05:35:57 Syrin kernel: [1454017.500027] lowmem_reserve[]: 0 2664
7647 7647
Apr 28 05:35:57 Syrin kernel: [1454017.500036] DMA32 free:23732kB
min:3892kB low:6620kB high:9348kB active_anon:2319992kB
inactive_anon:363260kB active_file:0kB inactive_file:92kB
unevictable:0kB writepending:0kB present:2825512kB managed:2734888kB
mlocked:0kB kernel_stack:140kB pagetables:22572kB bounce:0kB
free_pcp:1136kB local_pcp:476kB free_cma:0kB
Apr 28 05:35:57 Syrin kernel: [1454017.500037] lowmem_reserve[]: 0 0
4982 4982
Apr 28 05:35:57 Syrin kernel: [1454017.500045] Normal free:7052kB
min:7284kB low:12384kB high:17484kB active_anon:4458860kB
inactive_anon:484176kB active_file:1368kB inactive_file:1568kB
unevictable:5496kB writepending:0kB present:5242880kB managed:5102420kB
mlocked:5496kB kernel_stack:2276kB pagetables:42836kB bounce:0kB
free_pcp:1000kB local_pcp:724kB free_cma:0kB
Apr 28 05:35:57 Syrin kernel: [1454017.500046] lowmem_reserve[]: 0 0 0 0
Apr 28 05:35:57 Syrin kernel: [1454017.500050] DMA: 1*4kB (U) 1*8kB (U)
1*16kB (U) 0*32kB 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB
(U) 1*2048kB (M) 3*4096kB (M) = 15836kB
Apr 28 05:35:57 Syrin kernel: [1454017.500070] DMA32: 268*4kB (UE)
1562*8kB (UE) 645*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 23888kB
Apr 28 05:35:57 Syrin kernel: [1454017.500084] Normal: 293*4kB (UME)
490*8kB (UME) 152*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 7524kB
Apr 28 05:35:57 Syrin kernel: [1454017.500100] Node 0 hugepages_total=0
hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Apr 28 05:35:57 Syrin kernel: [1454017.500101] 1818 total pagecache
pages
Apr 28 05:35:57 Syrin kernel: [1454017.500110] 86 pages in swap cache
Apr 28 05:35:57 Syrin kernel: [1454017.500112] Swap cache stats: add
13659627, delete 13659513, find 1129210/1793572
Apr 28 05:35:57 Syrin kernel: [1454017.500113] Free swap = 0kB
Apr 28 05:35:57 Syrin kernel: [1454017.500114] Total swap = 25165820kB
Apr 28 05:35:57 Syrin kernel: [1454017.500115] 2021078 pages RAM
Apr 28 05:35:57 Syrin kernel: [1454017.500116] 0 pages
HighMem/MovableOnly
Apr 28 05:35:57 Syrin kernel: [1454017.500117] 57792 pages reserved
Apr 28 05:35:57 Syrin kernel: [1454017.500118] 0 pages hwpoisoned
Apr 28 05:35:57 Syrin kernel: [1454017.500119] [ pid ] uid tgid
total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Apr 28 05:35:57 Syrin kernel: [1454017.500125] [ 4009] 0 4009
21207 320 7 3 20 0 apcupsd
Apr 28 05:35:57 Syrin kernel: [1454017.500128] [ 4043] 0 4043
54371 48 12 3 220 0 rsyslogd
Apr 28 05:35:57 Syrin kernel: [1454017.500131] [ 4084] 0 4084
1938 178 6 3 18 0 fcron
Apr 28 05:35:57 Syrin kernel: [1454017.500133] [ 4400] 0 4400
17733 1376 8 3 0 0 ntpd
Apr 28 05:35:57 Syrin kernel: [1454017.500136] [ 4429] 0 4429
2789 0 7 3 103 0 rsync
Apr 28 05:35:57 Syrin kernel: [1454017.500139] [ 4460] 0 4460
1689 241 6 3 112 -1000 sshd
Apr 28 05:35:57 Syrin kernel: [1454017.500142] [ 4693] 0 4693
1067352 80035 1546 7 664044 0 qemu-system-x86
Apr 28 05:35:57 Syrin kernel: [1454017.500145] [ 4863] 0 4863
1905 465 7 3 43 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500148] [ 4864] 0 4864
1905 433 7 3 44 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500151] [ 4865] 0 4865
1905 431 7 3 43 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500153] [ 4866] 0 4866
1905 443 7 3 43 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500156] [ 4867] 0 4867
1905 453 7 3 43 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500159] [ 4868] 0 4868
1905 433 8 3 43 0 agetty
Apr 28 05:35:57 Syrin kernel: [1454017.500162] [27439] 0 27439
675 295 5 3 60 0 rpcbind
Apr 28 05:35:57 Syrin kernel: [1454017.500164] [27509] 0 27509
750 419 5 3 80 0 rpc.idmapd
Apr 28 05:35:57 Syrin kernel: [1454017.500167] [27520] 65534 27520
693 421 5 3 53 0 rpc.statd
Apr 28 05:35:57 Syrin kernel: [1454017.500170] [27583] 0 27583
819 365 5 3 108 0 rpc.mountd
Apr 28 05:35:57 Syrin kernel: [1454017.500173] [27684] 40 27684
7570889 1823291 14584 32 5626031 0 named
Apr 28 05:35:57 Syrin kernel: [1454017.500176] [10479] 0 10479
3262 449 7 3 182 0 udevd
Apr 28 05:35:57 Syrin kernel: [1454017.500179] [ 2923] 0 2923
1938 318 6 3 10 0 fcron
Apr 28 05:35:57 Syrin kernel: [1454017.500182] [ 2924] 0 2924
981 406 5 3 0 0 backup.cron
Apr 28 05:35:57 Syrin kernel: [1454017.500184] [ 2930] 0 2930
981 462 5 3 0 0 rsync.sh
Apr 28 05:35:57 Syrin kernel: [1454017.500187] [ 2965] 0 2965
13185 1261 30 3 0 0 rsync
Apr 28 05:35:57 Syrin kernel: [1454017.500190] [ 2966] 0 2966
574 158 5 3 0 0 tee
Apr 28 05:35:57 Syrin kernel: [1454017.500192] [ 2968] 0 2968
2038 482 7 3 0 0 rsync
Apr 28 05:35:57 Syrin kernel: [1454017.500195] [ 2974] 0 2974
14142 1189 31 3 0 0 rsync
Apr 28 05:35:57 Syrin kernel: [1454017.500198] [ 2995] 0 2995
1938 271 6 3 7 0 fcron
Apr 28 05:35:57 Syrin kernel: [1454017.500201] [ 2996] 0 2996
981 68 6 3 0 0 sh
Apr 28 05:35:57 Syrin kernel: [1454017.500203] Out of memory: Kill
process 27684 (named) score 904 or sacrifice child
Apr 28 05:35:57 Syrin kernel: [1454017.500234] Killed process 27684
(named) total-vm:30283556kB, anon-rss:7293164kB, file-rss:0kB,
shmem-rss:0kB
Apr 28 05:36:00 Syrin kernel: [1454019.937636] oom_reaper: reaped
process 27684 (named), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

All the processes after udevd are just from this day (backup job).

For comparison, the last call trace before was nearly the same:

Apr 15 18:26:37 Syrin kernel: [376620.710330] Call Trace:
Apr 15 18:26:37 Syrin kernel: [376620.710341] dump_stack+0x67/0x98
Apr 15 18:26:37 Syrin kernel: [376620.710347] dump_header+0x94/0x20c
Apr 15 18:26:37 Syrin kernel: [376620.710352]
oom_kill_process+0x24a/0x420
Apr 15 18:26:37 Syrin kernel: [376620.710355] ?
oom_badness.part.9+0xd3/0x150
Apr 15 18:26:37 Syrin kernel: [376620.710358] out_of_memory+0xf9/0x290
Apr 15 18:26:37 Syrin kernel: [376620.710361]
__alloc_pages_nodemask+0xf48/0xff0
Apr 15 18:26:37 Syrin kernel: [376620.710367] filemap_fault+0x294/0x4c0
Apr 15 18:26:37 Syrin kernel: [376620.710372]
ext4_filemap_fault+0x2c/0x40
Apr 15 18:26:37 Syrin kernel: [376620.710376] __do_fault+0x1f/0xb0
Apr 15 18:26:37 Syrin kernel: [376620.710380]
__handle_mm_fault+0x3ed/0xad0
Apr 15 18:26:37 Syrin kernel: [376620.710385]
handle_mm_fault+0xaa/0x1f0
Apr 15 18:26:37 Syrin kernel: [376620.710390]
__do_page_fault+0x250/0x4f0
Apr 15 18:26:37 Syrin kernel: [376620.710394] ? page_fault+0x2f/0x50
Apr 15 18:26:37 Syrin kernel: [376620.710396] page_fault+0x45/0x50
Apr 15 18:26:37 Syrin kernel: [376620.710400] RIP:
21e50088:0x7f41aedee150
Apr 15 18:26:37 Syrin kernel: [376620.710401] RSP:
21e50070:0000000000000000 EFLAGS: 7f41aedee150
Apr 15 18:26:37 Syrin kernel: [376620.710405] Mem-Info:
Apr 15 18:26:37 Syrin kernel: [376620.710411] active_anon:1695970
inactive_anon:212020 isolated_anon:0
Apr 15 18:26:37 Syrin kernel: [376620.710411] active_file:339
inactive_file:320 isolated_file:0
Apr 15 18:26:37 Syrin kernel: [376620.710411] unevictable:1374 dirty:0
writeback:0 unstable:0
Apr 15 18:26:37 Syrin kernel: [376620.710411] slab_reclaimable:4311
slab_unreclaimable:7552
Apr 15 18:26:37 Syrin kernel: [376620.710411] mapped:1059 shmem:3
pagetables:16264 bounce:0
Apr 15 18:26:37 Syrin kernel: [376620.710411] free:11840 free_pcp:0
free_cma:0

Unfortunately I don't understand all the details. Any help is highly
appreciated.

I assume it has something to do with tmpfs which will not be freed. Just
an assumption, I'm searching for clarity, not try and error.

Thanks
Kai
Re: Machine hangs up with out of memory [ In reply to ]
On 2021-04-29 08:59, J.O. Aho wrote:
> Your named is taking up 7G or memory, are you sure your configuration
> is correct?
Thanks, good point. There were errors in the logs. I will investigate
and monitor it.

> --
>
> //Aho
Re: Machine hangs up with out of memory [ In reply to ]
On Wed, Apr 28, 2021 at 7:58 PM Kai Peter <kp@lists.qware.org> wrote:

> Hi,
>
> I have an issue with a machine where I'm not able to detect the real
> root cause. It hangs up totally. It seems like it was running out of
> memory - but why? Hopefully somebody can give me some insight. As far I
> can see right now, it hangs up a few hours after an `emerge --update
> --newuse --deep --with-bdeps=y @world`.
>
> The machine is an Intel Atom with 8 GB RAM (physical, max) and 24 GB
> swap (a file). So 32 GB RAM in total.
>

Might be worth adding zswap using zstd or lz4 to your config (uses more CPU
for less IO)
https://www.kernel.org/doc/html/latest/vm/zswap.html

If you want to see which processes are using the most memory, run top then
type 'M' to have top sort by memory instead of CPU. You can also type 'm'
to make top show the memory numbers as an ascii bar graph. The man page
explains what VIRT RES SHR mean.
Re: Machine hangs up with out of memory [ In reply to ]
On Friday, 30 April 2021 02:30:51 BST Adam Carter wrote:
> On Wed, Apr 28, 2021 at 7:58 PM Kai Peter <kp@lists.qware.org> wrote:
> > Hi,
> >
> > I have an issue with a machine where I'm not able to detect the real
> > root cause. It hangs up totally. It seems like it was running out of
> > memory - but why? Hopefully somebody can give me some insight. As far I
> > can see right now, it hangs up a few hours after an `emerge --update
> > --newuse --deep --with-bdeps=y @world`.
> >
> > The machine is an Intel Atom with 8 GB RAM (physical, max) and 24 GB
> > swap (a file). So 32 GB RAM in total.
>
> Might be worth adding zswap using zstd or lz4 to your config (uses more CPU
> for less IO)
> https://www.kernel.org/doc/html/latest/vm/zswap.html

Zswap will help optimise the swapping of cold pages out of RAM into the
compressed zswap cache and eventually push these out to the disk. This won't
help with the problem of not having enough RAM to start with for compiling
packages with large resource requirements, like e.g. Chromium.

With monster packages where more RAM is needed even for a single compilation
job, after all cold pages have been swapped out, hot pages will start being
swapped out/in. I think at this point the same I/O race condition will ensue
as if no zswap were available, plus a compress/decompress CPU load. I guess
the point of starting to thrash the on-disk swap trying to free RAM may start
sooner, since RAM which would otherwise be available for the single
prioritised compilation job is now being used by zswap. So, there would be a
sweet spot in using zswap to improve I/O performance, but it could end up
becoming a disbenefit on compilation of RAM hungry packages.

Please correct my reasoning above if I have misunderstood how zswap works.

However, the OP problem here seems to be with a leaky BIND?

I found this mentioned upstream - but have not check BGO:

https://gitlab.isc.org/isc-projects/bind9/-/issues/446


> If you want to see which processes are using the most memory, run top then
> type 'M' to have top sort by memory instead of CPU. You can also type 'm'
> to make top show the memory numbers as an ascii bar graph. The man page
> explains what VIRT RES SHR mean.
Re: Machine hangs up with out of memory [ In reply to ]
On 2021-04-30 12:09, Michael wrote:
>
> However, the OP problem here seems to be with a leaky BIND?
>
> I found this mentioned upstream - but have not check BGO:
>
> https://gitlab.isc.org/isc-projects/bind9/-/issues/446

Thanks for reply. rndc runs on other machines daily, but not on this
one.

Right now I have corrected my bind config and the memory usage is stable
by 700MB. The issue here was the statement "listen-on-v6 { any; };".
bind was trying repeatedly to bind ipv6 to br0 and tap{0,1} devices -
and fails.

The memory usage is stable since 3 days:

$>ps axuww | grep named
named 14414 0.2 1.4 735040 116952 ? Ssl Apr29 4:38
/usr/sbin/named -u named

$>free -h
total used free shared buff/cache
available
Mem: 7.5Gi 480Mi 55Mi 0.0Ki 7.0Gi
6.9Gi
Swap: 23Gi 378Mi 23Gi

Lets see what happens after the next update cycle.
Re: Machine hangs up with out of memory [ In reply to ]
On Fri, Apr 30, 2021 at 8:09 PM Michael <confabulate@kintzios.com> wrote:

> On Friday, 30 April 2021 02:30:51 BST Adam Carter wrote:
> > On Wed, Apr 28, 2021 at 7:58 PM Kai Peter <kp@lists.qware.org> wrote:
> > > Hi,
> > >
> > > I have an issue with a machine where I'm not able to detect the real
> > > root cause. It hangs up totally. It seems like it was running out of
> > > memory - but why? Hopefully somebody can give me some insight. As far I
> > > can see right now, it hangs up a few hours after an `emerge --update
> > > --newuse --deep --with-bdeps=y @world`.
> > >
> > > The machine is an Intel Atom with 8 GB RAM (physical, max) and 24 GB
> > > swap (a file). So 32 GB RAM in total.
> >
> > Might be worth adding zswap using zstd or lz4 to your config (uses more
> CPU
> > for less IO)
> > https://www.kernel.org/doc/html/latest/vm/zswap.html
>
> Zswap will help optimise the swapping of cold pages out of RAM into the
> compressed zswap cache and eventually push these out to the disk. This
> won't
> help with the problem of not having enough RAM to start with for compiling
> packages with large resource requirements, like e.g. Chromium.
>

Yep, but the system is already at its full RAM, so we're in
workaround/tactical solution territory. Any system that needs 24gig swap to
support 8gig ram is going to really suck.

>
> With monster packages where more RAM is needed even for a single
> compilation
> job, after all cold pages have been swapped out, hot pages will start
> being
> swapped out/in. I think at this point the same I/O race condition will
> ensue
> as if no zswap were available, plus a compress/decompress CPU load. I
> guess
> the point of starting to thrash the on-disk swap trying to free RAM may
> start
> sooner, since RAM which would otherwise be available for the single
> prioritised compilation job is now being used by zswap. So, there would
> be a
> sweet spot in using zswap to improve I/O performance, but it could end up
> becoming a disbenefit on compilation of RAM hungry packages.
>
> Please correct my reasoning above if I have misunderstood how zswap works.
>

What you've said sounds reasonable to me.