
I recently started getting the following messages printed in "kern.log" and syslog.

Jan 29 10:28:19 server kernel: [82515.307047] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.315021] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.322996] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.330971] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.338944] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.346923] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.354905] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.362875] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.370855] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.378837] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.386824] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.394788] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.402766] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.410765] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.418722] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.426707] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.434693] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.442670] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.450634] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.458628] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.466590] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.474561] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.482551] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.490528] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.498500] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.506492] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.514463] Page fault failed for pfn[0] = 0x0
Jan 29 10:28:19 server kernel: [82515.522435] Page fault failed for pfn[0] = 0x0

I have no idea what they mean, but they go on and on for a very long time, making the logs extremely large, and it usually ends with the system becoming unresponsive.

Could it be related to bad RAM? I haven't changed anything related to RAM for a while now, and the system had been running fine for a few months until now.

Braiam

1 Answer


The message comes from this piece of code in the AMDGPU driver:

for (i = 0; i < ttm->num_pages; i++) {
    /* FIXME: The pages cannot be touched outside the notifier_lock */
    pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
    if (unlikely(!pages[i])) {
        pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
               i, range->pfns[i]);
        r = -ENOMEM;

        goto out_free_pfns;
    }
}

Here unlikely() is only a branch-prediction hint, so the condition is true simply because pages[i] is NULL. pages[i] holds the result of hmm_device_entry_to_page(), whose parameters are documented as "the range use to decode device entry value" and "device entry value to get corresponding struct page from". When no page comes back for entry i, the driver prints the message and fails with an out-of-memory error (-ENOMEM). Basically, the GPU driver hit a memory error and is reporting it as out of memory.
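
To make the control flow easier to follow, here is a minimal user-space sketch of the same pattern. It is not the driver code: struct fake_page, entry_to_page() and collect_pages() are hypothetical stand-ins I made up for illustration, and treating an entry of 0 as "no valid page" is an assumption. The unlikely() macro is defined the way it is commonly written on top of the GCC/Clang builtin __builtin_expect.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Branch-prediction hint only: it evaluates to the truth value of x
 * and does not change which branch is taken. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the kernel's struct page. */
struct fake_page {
    uint64_t pfn;
};

/* Hypothetical stand-in for hmm_device_entry_to_page(): an entry of 0
 * (or one outside the table) is assumed to mean "no valid page". */
static struct fake_page *entry_to_page(struct fake_page *table,
                                       size_t table_len, uint64_t entry)
{
    if (entry == 0 || entry >= table_len)
        return NULL;
    return &table[entry];
}

/* Mirrors the shape of the driver loop: stop at the first entry that
 * has no page, log it, and report -ENOMEM to the caller. */
static int collect_pages(struct fake_page *table, size_t table_len,
                         const uint64_t *pfns, size_t n,
                         struct fake_page **pages)
{
    for (size_t i = 0; i < n; i++) {
        pages[i] = entry_to_page(table, table_len, pfns[i]);
        if (unlikely(!pages[i])) {
            fprintf(stderr, "Page fault failed for pfn[%zu] = 0x%llx\n",
                    i, (unsigned long long)pfns[i]);
            return -ENOMEM;
        }
    }
    return 0;
}
```

Calling collect_pages() with a pfns array whose first element is 0 takes the error path on the first iteration, which matches the pfn[0] = 0x0 shape of the lines in the log.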

Braiam