
I run a batch (snakemake -j 1) of memory-heavy operations in Python: subtracting two arrays of up to 15 GB each, then calculating norms of the differences. Surprisingly, my system started to misbehave:

  • Thunderbird crashes,
  • my graphic environment (XFCE with lightdm) crashes (effectively killing the screen sessions with the batch running),
  • after the graphic environment respawned, it swapped my monitors (pun intended) and did not allow me to re-swap them with Display settings; a service lightdm restart was necessary,
  • my snakemake pipeline (bash + Python + numpy + pandas) tends to fail with segmentation faults when processing the biggest arrays,
  • yesterday I discovered I lost audio from Firefox,
  • recently, after a pipeline and graphic session crash, one of the bash processes went wild (100% CPU usage).
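For reference, the memory-heavy step can be sketched roughly like this (a minimal illustration with small stand-in arrays, not my actual pipeline code; the in-place variant is one way to cut the peak footprint of the step):

```python
import numpy as np

# Small stand-in arrays; the real jobs use arrays up to ~15 GB each.
rng = np.random.default_rng(0)
a = rng.random(1_000_000)
b = rng.random(1_000_000)

# Straightforward version: allocates a third array for the difference.
diff = a - b
norm = np.linalg.norm(diff)

# In-place variant: reuses a's buffer instead of allocating a new one,
# lowering the peak memory footprint of this step.
np.subtract(a, b, out=a)
norm_inplace = np.linalg.norm(a)
```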

I have plenty (932 GB) of swap available, so it is not that my system suddenly ran out of memory. The RAM chips also seem to work (17 passes of Memtest86+ revealed no errors).

I am asking about the reason behind the crashes/misbehaviour of other programs (Thunderbird, the screen session, the graphic environment). Even if my programs were poorly written, I would expect their impact to be limited to extensive swapping. A total XFCE session restart is something that definitely should not happen. And by restart I mean restart, not freezing or slowdown due to swapping.

abukaj
  • 465

2 Answers

3

Active memory pages are not swappable ... They have to be inactive, i.e. not currently actively used/needed by any process, in order to be candidates for being moved to swap.

Therefore, swap is not necessarily going to be used for all memory pages, and the availability of free swap space doesn't mean that your system's physical memory is safe from filling up in the meantime.

"segmentation faults" are most likely due to applications not being able to allocate memory addresses due to insufficient free memory.

Bottom line: swap is managed only by the kernel and not by userspace applications, and the kernel only swaps inactive memory pages ... When one application actively uses a large amount of memory at once, the kernel will not selectively swap any of that memory ... So, fixing your application's memory usage is the way to work around this issue.
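As a rough illustration of that point: on Linux, a process can check `MemAvailable` in `/proc/meminfo` before attempting a huge allocation, since free swap alone does not guarantee the allocation will fit in physical memory (a sketch with a hypothetical guard; the 15 GB figure is taken from the question):

```python
import os

def mem_available_bytes(path="/proc/meminfo"):
    """Return the MemAvailable value from /proc/meminfo in bytes (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # Line format: "MemAvailable:   12345678 kB"
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found in " + path)

# Hypothetical guard for one pipeline step holding two ~15 GB operands.
needed = 2 * 15 * 2**30
if os.path.exists("/proc/meminfo") and mem_available_bytes() < needed:
    print("not enough free RAM for both arrays; expect heavy swapping")
```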

Maybe it's time to stop thinking about fixing Ubuntu to work with your application and start thinking about fixing your own application's memory usage, and a good deep look using a memory profiler is a good starting point ... See for example:

Check memory usage of process which exits immediately
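For a quick first look without installing anything, Python's standard-library `tracemalloc` can report the peak allocation of a step and the top allocation sites (a generic sketch, not the asker's actual code; recent NumPy versions report their array buffers to `tracemalloc` as well):

```python
import tracemalloc
import numpy as np

tracemalloc.start()

# Stand-in for one pipeline step: allocate, subtract, take a norm.
a = np.ones(500_000)
b = np.zeros(500_000)
norm = np.linalg.norm(a - b)

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")

# Show the top allocation sites, to find where the big arrays come from.
for stat in tracemalloc.take_snapshot().statistics("lineno")[:3]:
    print(stat)

tracemalloc.stop()
```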

Raffa
  • 32,237
  • That may be the explanation I am looking for. Could you elaborate on how Ubuntu defines active pages? I thought it uses some kind of LRU algorithm to determine which pages are considered inactive. – abukaj Sep 29 '23 at 10:10
  • 2
    @abukaj That's part of it (see 10.1, 10.3) ... More about swap here. – Raffa Sep 29 '23 at 10:28
  • It was a system problem. Bad blocks on the swap partition. – abukaj Sep 29 '23 at 16:00
  • @abukaj If I were you, I would take that with a pinch of salt … Bad blocks might be a concern when reading from them, i.e. swapping back to RAM, and not the other way around, and it’s not so easy for the OS to hit a bad block accidentally or even see them these days (unless the drive is dying too fast), as SMART will most likely remap them to good ones automatically … Also, swap is very reliable, see e.g. https://unix.stackexchange.com/q/269098 … Moreover, swap might not activate at all if it can’t reasonably do its job, as it has a tolerance level for bad blocks AFAIK – Raffa Sep 29 '23 at 18:08
  • @abukaj Please read "Some newer file systems such as Btrfs and ZFS do not have a bad-block avoidance feature at all" here: https://en.wikipedia.org/wiki/Bad_sector#Operating_system … to get what I mean. – Raffa Sep 29 '23 at 18:33
  • The main reason to have swap is the possibility of restoring pages back to RAM, isn't it? badblocks found 207 bad blocks at 0.1% of the scan, then I replaced the old swap HDD. Out of curiosity I have just rerun a batch of the same jobs; in 1-2 hours I should know if my system is unstable. – abukaj Sep 29 '23 at 19:14
  • 4h later no misbehavior. :) – abukaj Sep 29 '23 at 23:20
  • @abukaj That many bad blocks at only 0.1% of the scan is way above the tolerable level … Assuming a 4K block size, that might add up to around 1G total bad-block size, which is a very clear sign of a rapidly deteriorating drive surface, most likely deteriorating faster than SMART’s passive monitoring mechanism can handle safely … Yes, you get a point there, and if changing the drive solves it as it appears, then that’s a good thing, but the kernel swapping mechanism still has its limits that can’t be exceeded :-) – Raffa Sep 30 '23 at 10:03
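For what it's worth, the size estimate in the last comment can be reproduced with a quick back-of-the-envelope calculation (assuming a 4 KiB block size and the 0.1% scan coverage mentioned two comments up):

```python
found = 207        # bad blocks reported by badblocks
scanned = 0.001    # the scan covered ~0.1% of the partition
block_size = 4096  # assumed 4 KiB blocks

# Extrapolate the observed rate to the whole partition.
projected_blocks = found / scanned            # ~207,000 blocks
projected_bytes = projected_blocks * block_size
print(f"~{projected_bytes / 2**30:.2f} GiB")  # roughly 0.8 GiB, i.e. "around 1G"
```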
1

It might happen when the swap partition contains bad blocks.

Memory load leads to heavy swapping, which drastically increases the chance of hitting a bad block.

(Thanks to matigo, whose comment advising me to check the RAM inspired me to check the swap partition too.)

abukaj
  • 465
  • I’d rather say the drive is rapidly dying, more than just having some bad blocks at an acceptable count and rate that can be handled by SMART or ignored by swap. – Raffa Sep 30 '23 at 10:07