
My application is a compute-intensive task (i.e. video encoding). When it runs on Linux kernel 4.9 (Ubuntu 16.04), the CPU usage is 3300%, but when it runs on Linux kernel 5.4 (Ubuntu 20.04), the CPU usage is only 2850%. I can assure you the processes are doing exactly the same job.

So I wonder: did the Linux kernel gain some CPU scheduling optimization or related work between 4.9 and 5.4? Could you give any advice on how to investigate the reason?

For your information,

  1. It is confirmed that the performance gain comes from Linux kernel 5.4, because the performance on Linux kernel 5.3 is the same as on Linux kernel 4.9.
  2. It is confirmed that the performance gain is unrelated to libc, because on Linux kernel 5.10, whose libc is 2.23, the performance is the same as on Linux kernel 5.4, whose libc is 2.31.

CPU Info:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:              7
CPU MHz:               2200.000
BogoMIPS:              4401.69
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              14080K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
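
For reference, counter output like the blocks below is what perf stat prints when attached to a running process. A minimal sketch of how it was presumably collected (the PID matches the 4.9 output below; the 95-second measurement window is just an example):

    # Attach to the already-running encoder process ('perf stat -p'
    # attaches to an existing PID) and count events until the
    # trailing 'sleep' command exits, bounding the measurement window.
    sudo perf stat -p 32504 -- sleep 95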
Output of perf stat on Linux Kernel 4.9

Performance counter stats for process id '32504':

3146297.833447      cpu-clock (msec)          #   32.906 CPUs utilized          
     1,718,778      context-switches          #    0.546 K/sec                  
       574,717      cpu-migrations            #    0.183 K/sec                  
     2,796,706      page-faults               #    0.889 K/sec                  

 6,193,409,215,015      cycles                    #    1.968 GHz                      (30.76%)
 6,948,575,328,419      instructions              #    1.12  insn per cycle           (38.47%)
   540,538,530,660      branches                  #  171.801 M/sec                    (38.47%)
    33,087,740,169      branch-misses             #    6.12% of all branches          (38.50%)
 1,966,141,393,632      L1-dcache-loads           #  624.906 M/sec                    (38.49%)
   184,477,765,497      L1-dcache-load-misses     #    9.38% of all L1-dcache hits    (38.47%)
     8,324,742,443      LLC-loads                 #    2.646 M/sec                    (30.78%)
     3,835,471,095      LLC-load-misses           #   92.15% of all LL-cache hits     (30.76%)
   <not supported>      L1-icache-loads                                             
   187,604,831,388      L1-icache-load-misses                                         (30.78%)
 1,965,198,121,190      dTLB-loads                #  624.607 M/sec                    (30.81%)
       438,496,889      dTLB-load-misses          #    0.02% of all dTLB cache hits   (30.79%)
     7,139,892,384      iTLB-loads                #    2.269 M/sec                    (30.79%)
       260,660,265      iTLB-load-misses          #    3.65% of all iTLB cache hits   (30.77%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

  95.615072142 seconds time elapsed

Output of perf stat on Linux Kernel 5.4

 Performance counter stats for process id '3355137':

      2,718,192.32 msec cpu-clock                 #   29.184 CPUs utilized          
         1,719,910      context-switches          #    0.633 K/sec                  
           448,685      cpu-migrations            #    0.165 K/sec                  
         3,884,586      page-faults               #    0.001 M/sec                  
 5,927,930,305,757      cycles                    #    2.181 GHz                      (30.77%)
 6,848,723,995,972      instructions              #    1.16  insn per cycle           (38.47%)
   536,856,379,853      branches                  #  197.505 M/sec                    (38.47%)
    32,245,288,271      branch-misses             #    6.01% of all branches          (38.48%)
 1,935,640,517,821      L1-dcache-loads           #  712.106 M/sec                    (38.47%)
   177,978,528,204      L1-dcache-load-misses     #    9.19% of all L1-dcache hits    (38.49%)
     8,119,842,688      LLC-loads                 #    2.987 M/sec                    (30.77%)
     3,625,986,107      LLC-load-misses           #   44.66% of all LL-cache hits     (30.75%)
   <not supported>      L1-icache-loads                                             
   184,001,558,310      L1-icache-load-misses                                         (30.76%)
 1,934,701,161,746      dTLB-loads                #  711.760 M/sec                    (30.74%)
       676,618,636      dTLB-load-misses          #    0.03% of all dTLB cache hits   (30.76%)
     6,275,901,454      iTLB-loads                #    2.309 M/sec                    (30.78%)
       391,706,425      iTLB-load-misses          #    6.24% of all iTLB cache hits   (30.78%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      93.139551411 seconds time elapsed
  • Check your math. I don't see any statistically significant differences, especially when you look at the details and work out the math. This doesn't seem to be an identical task, because the first example has 6.9 trillion instructions and the second has 6.8 trillion. The end result is a difference of ~2.5 seconds. Both values are around a 1-2% difference. Some variability is expected, especially since this task is not run in a vacuum: other tasks are also running on your PC. Also, how can your CPU usage exceed 100%? Wouldn't that defy physics? – Nmath May 19 '22 at 04:08
  • Also consider that the kernel version is not the only difference between 16.04 and 20.04. These two versions have different repositories with different versions of software across the board. It's not unusual for software updates to include optimizations and other minor changes that can improve performance. I think you've come to the wrong conclusion about this. If you really want to compile a list of all optimisations in all related software between 16.04 and 20.04, that will take all day. That much work to track down a variability of ~1% is, IMO, not a very good use of time. – Nmath May 19 '22 at 04:13
  • Ubuntu 16.04 LTS has reached the end of its standard support life and is thus now off-topic here unless your question is specific to helping you move to a supported release of Ubuntu. Ubuntu 16.04 ESM support is available, but not on-topic here; see https://askubuntu.com/help/on-topic and https://ubuntu.com/blog/ubuntu-16-04-lts-transitions-to-extended-security-maintenance-esm – guiverc May 19 '22 at 04:27
  • Ubuntu 16.04 didn't use the 4.9 kernel: it used 4.4 with the GA kernel stack and 4.15 with the HWE kernel stack. 4.9 was a Debian-supported kernel, so was your off-topic system really Ubuntu 16.04? Your details imply it was a respin or an altered system. Kernel 5.10 was likewise a Debian-supported kernel (though Ubuntu did use it for some OEM installations). – guiverc May 19 '22 at 04:28
  • @Nmath Thanks for your comments. I said the tasks are identical from a user's point of view: the two tasks transcode the same video and do no other work. The CPU usage exceeds 100% because it's a multi-core server. I also installed kernel 5.4 on Ubuntu 16.04, and the result is the same as on Ubuntu 20.04 with kernel 5.4. So the performance gain very likely comes from kernel 5.4. – Zachary May 19 '22 at 06:52
  • @guiverc I had no idea that Ubuntu uses specific kernel versions. That's important information for me. Where could I find more details? – Zachary May 19 '22 at 07:00
  • Refer to https://askubuntu.com/questions/517136/list-of-ubuntu-versions-with-corresponding-linux-kernel-version for kernel versions. Ubuntu LTS releases offer two kernel stacks: GA, the original kernel stack, and HWE, which uses the non-LTS kernel stacks from the next cycle until the next development cycle reaches its GA kernel stack. OEM is more complex, but see https://wiki.ubuntu.com/Kernel/LTSEnablementStack for more detail on understanding HWE. (A release cycle has 4 minor dev cycles: 18.10, 19.04, 19.10 and finally the LTS of 20.04; the LTS is always the last of a major development cycle.) – guiverc May 19 '22 at 07:20
  • To really isolate the difference down to the commit level, you would have to bisect the kernel between 5.3 and 5.4 (see the sketch below). It would take approximately 18 (a guess) kernel compiles to complete. Myself, I would first try a more recent kernel, even mainline 5.18-rc7, to see if things have improved again. – Doug Smythies May 19 '22 at 14:51
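
Following the bisect suggestion in the last comment, here is a minimal sketch of what that would look like, assuming you can build and boot test kernels, and that testing a revision means rerunning the encoding job and recording its CPU usage:

    # Bisect mainline between v5.3 (old behavior) and v5.4 (improved).
    # "bad" here simply marks kernels that show the new, faster
    # behavior, so bisect converges on the first commit introducing it.
    git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
    cd linux
    git bisect start
    git bisect good v5.3    # behaves like 4.9 (~3300% CPU)
    git bisect bad v5.4     # behaves like 5.4 (~2850% CPU)
    # For each revision git checks out: build it, install it, reboot
    # into it, rerun the workload, then report the result:
    #   git bisect good     # performance unchanged
    #   git bisect bad      # performance improved
    # After roughly 13-18 rounds git prints the first "bad" commit.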

1 Answer


It seems the performance gain comes from this fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de53fd7aedb100f03e5d2231cfce0e4993282425
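
For context, that commit is "sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices", merged for 5.4. It changes CFS bandwidth control, so it should mainly matter when the workload runs under a CPU quota (e.g. in a cgroup or container). A quick way to check whether a process is subject to such throttling, assuming cgroup v1 with the cpu controller mounted at the usual path (the cgroup name below is just a placeholder):

    # Find which cgroup the encoder process belongs to:
    cat /proc/32504/cgroup
    # Inspect that cgroup's bandwidth statistics; nonzero nr_throttled
    # and throttled_time indicate CFS quota throttling occurred:
    cat /sys/fs/cgroup/cpu/<your-cgroup>/cpu.stat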

– Zachary