I too have been on a journey figuring out what might be causing these issues.
Often mentioned are kernel parameters, drivers and BIOS settings. And yes, they are all involved in some way. But what exactly is happening when the system suspends and then tries to come back from that?
The BIOS offers options for power management. These should be set in such a way that the OS can take full control, so generally they should be disabled.
With that out of the way, the OS can do its thing: it acts on a suspend trigger, then works through a list of devices to power off and state to save for when the system needs to power on again. Somewhere along the way, something goes wrong.
Since it seems to be specifically the NVIDIA graphics that goes wrong, that seems to me the thing to investigate. I then came across this interesting part of the NVIDIA docs: https://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/powermanagement.html
The introduction gives a very plausible cause for our problems:
The NVIDIA Linux driver includes support for the suspend
(suspend-to-RAM) and hibernate (suspend-to-disk) system power
management operations, such as ACPI S3 and S4 on the x86/x86_64
platforms. When the system suspends or hibernates, the NVIDIA kernel
drivers prepare in-use GPUs for the sleep cycle, saving state required
to return these GPUs to normal operation when the system is later
resumed.
The GPU state saved by the NVIDIA kernel drivers includes allocations
made in video memory. However, these allocations are collectively
large, and typically cannot be evicted. Since the amount of system
memory available to drivers at suspend time is often insufficient to
accommodate large copies of video memory, the NVIDIA kernel drivers
are designed to act conservatively, and normally only save essential
video memory allocations.
The resulting loss of video memory contents is partially compensated
for by the user-space NVIDIA drivers, and by some applications, but
can lead to failures such as rendering corruption and application
crashes upon exit from power management cycles.
And NVIDIA has come up with a solution:
To better support power management with these types of applications,
the NVIDIA Linux driver provides a custom power management interface
intended for integration with system management tools like systemd.
When I looked at the files shipped with the driver (proprietary drivers on Arch), they do indeed include these systemd unit files. They are not enabled by default, as they are considered 'experimental'. But I gave it a shot by simply enabling the mentioned services for suspend, resume and hibernate.
In short (on Arch):
sudo systemctl enable nvidia-suspend.service
sudo systemctl enable nvidia-hibernate.service
sudo systemctl enable nvidia-resume.service
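To confirm that the three units actually ended up enabled, something like the following should do. This is only a sketch: the function name nvidia_pm_unit_states is made up for illustration, and it relies on nothing more than systemctl is-enabled, which prints the enablement state of a unit.

```shell
# Hypothetical helper: print "<unit>: <state>" for each NVIDIA power
# management unit. `systemctl is-enabled` prints states such as
# "enabled" or "disabled" and exits non-zero for anything not enabled,
# hence the `|| true`.
nvidia_pm_unit_states() {
    for unit in nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service; do
        state=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        echo "$unit: ${state:-unknown}"
    done
}

nvidia_pm_unit_states
```

After the enable commands above, each line should report "enabled"; "unknown" here just means systemctl printed nothing for that unit.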
After trying it out for some time, most issues were gone, but the system would still occasionally show a black screen on resume. Looking through the logs, the errors were now clearly pointing to modesetting. I had tried out a few kernel parameters concerning this before, but had removed them all to be able to determine what works and what doesn't.
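For digging through the logs, a small filter helps. This is a minimal sketch, assuming the kernel messages are available through journalctl (-k limits output to kernel messages, -b to the current boot). The helper name filter_gpu_lines and the search pattern are chosen for illustration and do not come from the NVIDIA docs.

```shell
# Hypothetical helper: keep only lines mentioning "nvidia" or "modeset"
# (case-insensitive). It reads stdin, so it works with any log source.
filter_gpu_lines() {
    grep -iE 'nvidia|modeset'
}

# Only query the journal if journalctl is available on this system.
# `|| true` keeps the script going when nothing matches.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b --no-pager | filter_gpu_lines || true
fi
```

Because the filter reads stdin, it can just as well be fed from dmesg or a saved log file.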
So, focusing on the modesetting, I added the following parameter to the kernel command line:
nvidia-drm.modeset=1
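How the parameter gets onto the command line depends on the bootloader, so it is worth checking after a reboot that it really arrived. A rough sketch: the function name modeset_on_cmdline is hypothetical, and the only assumption is that the running kernel's command line can be read from /proc/cmdline.

```shell
# Hypothetical helper: succeeds if the given kernel command line file
# (default: /proc/cmdline) contains nvidia-drm.modeset=1 as its own word.
# The kernel treats "-" and "_" in module names alike, so both are matched.
modeset_on_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'nvidia[-_]drm\.modeset=1'
}

if modeset_on_cmdline; then
    echo "nvidia-drm.modeset=1 is on the kernel command line"
else
    echo "nvidia-drm.modeset=1 is NOT on the kernel command line"
fi
```

Splitting the command line into one parameter per line keeps the match exact, so a value such as modeset=10 would not count.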
And apparently I did not yet have the specific NVIDIA kernel module installed, so:
sudo pacman -S linux-headers nvidia-dkms
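Once the module has been built and loaded (a reboot is the simplest way to be sure), the loaded NVIDIA modules can be listed from /proc/modules. This is a sketch with a made-up helper name, loaded_nvidia_modules. With the proprietary driver the list typically includes nvidia, nvidia_modeset and nvidia_drm, but that expectation is an assumption and not something stated above.

```shell
# Hypothetical helper: list loaded modules whose name starts with "nvidia",
# reading a /proc/modules-style file (the first column is the module name).
loaded_nvidia_modules() {
    modules_file="${1:-/proc/modules}"
    awk '$1 ~ /^nvidia/ { print $1 }' "$modules_file"
}

# Only run against the live system if /proc/modules is readable.
if [ -r /proc/modules ]; then
    loaded_nvidia_modules
fi
```

An empty result means no module with an nvidia prefix is currently loaded.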
Now I no longer get the errors about modesetting, and resume works great. It is also faster and cleaner than before.