I have a lab setup with 16 HP Z620 systems, all alike (purchased at the same time), with exactly the same Ubuntu 12.04 installation running the current kernel, 3.13.0-44-generic. Well, not quite all alike: 15 of these have BIOS version J61 v03.06, and the 16th has BIOS version J61 v03.18. All have static IP addresses, with network-manager, avahi-daemon, and cups-browsed disabled.
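
The BIOS versions above can be confirmed from a shell with dmidecode:

    sudo dmidecode -s bios-version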

The bizarre thing is that the 15 systems show load averages much less than 1 (as I write this, uptime shows a load average of 0.00), but the 16th system always shows a load average of 1.00 or above. Here's a top snapshot:

    top - 13:13:04 up 25 min,  3 users,  load average: 1.00, 1.03, 0.91
    Tasks: 203 total,   1 running, 202 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.9 us,  0.3 sy,  0.0 ni, 97.5 id,  1.3 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem:  12232332 total,  1583716 used, 10648616 free,    63148 buffers
    KiB Swap: 12505084 total,        0 used, 12505084 free.   626708 cached Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
        1 root      20   0   33772   3024   1468 S   0.0  0.0   0:00.79 init
        2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
        3 root      20   0       0      0      0 S   0.0  0.0   0:00.10 ksoftirqd/0
        4 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0
        5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:+
        7 root      20   0       0      0      0 S   0.0  0.0   0:01.64 rcu_sched
        8 root      20   0       0      0      0 S   0.0  0.0   0:00.28 rcuos/0
        9 root      20   0       0      0      0 S   0.0  0.0   0:00.23 rcuos/1
       10 root      20   0       0      0      0 S   0.0  0.0   0:00.20 rcuos/2
       11 root      20   0       0      0      0 S   0.0  0.0   0:01.95 rcuos/3
       12 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
       13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/0
       14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/1
       15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/2
       16 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcuob/3

I'm baffled as to why the load average on this one box is always 1.00 or above. Any suggestions?

BTW, I upgraded the BIOS on system 16 to version 3.85, but this didn't change anything. I also installed Ubuntu 14.04, but I still get the same behavior.

1 Answer

When top does not show CPU usage or I/O wait accounting for the load average, the cause is typically a task or tasks in uninterruptible sleep (one task, in your case). Identify them with this command:

ps -e -o state,pid,cmd | grep ^D
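
A task stuck in uninterruptible sleep shows up with a "D" in the first column, something like this (a hypothetical example; the process will differ):

D    43 [khubd]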

vmstat can also be used, though it only gives the number of tasks in uninterruptible sleep, not which ones they are. Example:

doug@doug-64:~$ vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0  0    168  69980 124836 2525672    0    0   252    29    1    0  0  0 99  0

where the number of tasks in uninterruptible sleep appears in the "b" column under "procs".
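
Note that the Linux load average counts tasks in uninterruptible sleep as well as runnable tasks, which is exactly why a single stuck "D" state task pins the load average at 1.00 on an otherwise idle box. The raw numbers behind it can also be read directly; the fourth field is currently-runnable tasks over total tasks (the values shown here are illustrative):

cat /proc/loadavg
1.00 1.03 0.91 1/203 2711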

It would be highly unusual to observe a constant non-zero number in the "r" column (processes waiting for run time) without also observing CPU usage and/or I/O wait. Of the two examples below, the first is from an unloaded system and the second from a loaded system.

doug@s15:~/cse$ vmstat 10 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  99072 3302096 415420 7927960    0    0     2     9    4    0  2  0 98  0  0
 0  0  99072 3302320 415420 7927960    0    0     0     4   68  160  0  0 100  0  0
 0  0  99072 3302304 415420 7927960    0    0     0     9   62  132  0  0 100  0  0
 0  0  99072 3302064 415420 7927960    0    0     0     0   63  138  0  0 100  0  0
 0  0  99072 3302040 415420 7927960    0    0     0    12   66  155  0  0 100  0  0
 0  0  99072 3302024 415420 7927960    0    0     0    16   90  150  1  0 99  0  0
 0  0  99072 3302008 415420 7927960    0    0     0     6   61  131  0  0 100  0  0
 0  0  99072 3301868 415424 7927960    0    0     0     5   72  167  0  0 100  0  0
 0  0  99072 3301852 415432 7927956    0    0     0    13   66  145  0  0 100  0  0
 0  0  99072 3301836 415432 7927960    0    0     0    12   63  133  0  0 100  0  0

doug@s15:~/temp$ vmstat 10 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
13  0  99072 1271288 415992 9952884    0    0     2     9    4    0  2  0 98  0  0
13  0  99072 1271304 415992 9952884    0    0     0     0 2670 1198 99  0  1  0  0
13  0  99072 1270932 415996 9952884    0    0     0    12 2696 1240 99  0  1  0  0
13  0  99072 1270916 415996 9952884    0    0     0     1 2662 1166 99  0  1  0  0
13  0  99072 1270800 416000 9952884    0    0     0     1 2666 1205 99  0  1  0  0
13  0  99072 1270636 416004 9952884    0    0     0    18 2720 1264 99  0  1  0  0
13  0  99072 1270644 416004 9952884    0    0     0     3 2670 1170 99  0  1  0  0
13  0  99072 1270520 416004 9952884    0    0     0     0 2673 1218 99  0  1  0  0
14  0  99072 1269116 416008 9952888    0    0     0    14 2692 1250 99  0  1  0  0
14  0  99072 1271140 416008 9952888    0    0     0     1 2662 1168 99  0  1  0  0

doug@s15:~/temp$ uptime
 14:46:47 up 12 days, 22:23,  4 users,  load average: 12.59, 12.15, 8.31
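
Note how the load average of 12.59 in the loaded example tracks the 13-14 tasks in the "r" column. As a rough cross-check, the "r" column can be averaged over a run with a small pipeline (a sketch; NR > 2 skips the two header lines):

vmstat 10 10 | awk 'NR > 2 { sum += $1; n++ } END { print sum / n }'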

If some sort of hung process in the run queue is suspected, try this to identify it:

ps -e -o state,pid,cmd | grep -v "ps -e -o" | grep ^R

Example (where I have 3 heavy processes running legitimately):

doug@s15:~/temp$ ps -e -o state,pid,cmd | grep -v "ps -e -o" | grep ^R

R  9827 ../c/consume 90.000000 50.000000 100.000000
R  9828 ../c/consume 90.000000 50.000000 100.000000
R  9829 ../c/consume 90.000000 50.000000 100.000000

Run the command a few times to help identify the real culprit, as there may well be legitimate processes transiently in the run state.
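
A quick shell loop makes the repeated sampling easier (a minimal sketch; adjust the count and interval as needed). A PID that appears in every sample is the one to investigate:

for i in 1 2 3 4 5; do ps -e -o state,pid,cmd | grep -v "ps -e -o" | grep ^R; sleep 2; done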

The last thing to try is to look through the entire process list for any anomalies. Example:

doug@s15:~/ubuntu-help$ ps -e -o state,pid,cmd
S   PID CMD
S     1 /sbin/init
S     2 [kthreadd]
...
S 17579 [kworker/u16:0]
R 17613 ps -e -o state,pid,cmd
S 22071 [kworker/0:0]

Anything other than "S" or "R" in the first column is of interest. Perhaps filter the list with:

ps -e -o state,pid,cmd | grep -v ^S
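
Since "R" entries (including the ps command itself) will still appear, a variant that drops both common states may be easier to scan (a sketch; this also drops the header line, and what remains, "D", "Z", "T", and so on, is exactly what deserves attention):

ps -e -o state,pid,cmd | grep -Ev "^[SR]"
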
Doug Smythies
  • None -- nada -- zip – Timothy Fossum Jan 30 '15 at 20:31
  • Does vmstat agree? Answer edited to include vmstat method. – Doug Smythies Jan 30 '15 at 22:10
  • Nothing in the 'b' column of vmstat:
    1 0 0 147184 197576 10454836 0 0 30 17 42 72 0 0 100 0 0
    – Timothy Fossum Jan 31 '15 at 19:53
  • Is there always a task waiting for some run time, as shown in the "r" column? While not unusual to see non-zero numbers sometimes, it would be unusual to see it all the time without other usages. I'll edit my answer to show examples. – Doug Smythies Jan 31 '15 at 22:19
  • The ps -e -o ... | grep ^R command produces no output, repeatedly. vmstat 10 10 produces lines with zeros in the b column and mostly zeros in the r column. System 16 was loaded with the same software as the other 15 systems, and has the same usage pattern. How can this system always report a load average at or above 1.00 when the others report load averages near zero? – Timothy Fossum Feb 01 '15 at 20:03
  • O.K. I am out of ideas. I have edited my answer with one last thing to try. – Doug Smythies Feb 01 '15 at 21:46
  • Interesting ... all of our "normal" systems have the following ps ax entry: 43 ? S 0.00 [khubd] ..., but the "abnormal" system has 43 ? D 3.09 [khubd] .... I have no idea why this didn't turn up earlier when I looked for D entries. So now, how can I get rid of this?? – Timothy Fossum Feb 02 '15 at 00:20
  • Something is wrong with some USB stuff. It will take me a while (perhaps days) to figure out what to suggest next. It might be worth swapping any external USB stuff with your other computers in an attempt to isolate the issue or problem component. – Doug Smythies Feb 02 '15 at 03:45
  • @Timothy Fossum Is there any information in the /var/log/kern.log file that might give insight as to the root issue? Or any other log file? (One way to scan is sketched below.) – Doug Smythies Feb 05 '15 at 00:24
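
A generic way to scan the logs for USB-related errors in a case like this (a sketch, not specific to this thread):

grep -iE "usb|khubd" /var/log/kern.log | tail -n 20
dmesg | grep -iE "usb|hub"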