

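  ⑴Next I went on to analyze another error that appeared in the kernel log: "BUG: soft lockup - CPU#N stuck for s! [qmgr/master:PID]". I have lightly normalized the message quoted here: the N after CPU# stands for a specific CPU number, which differs from server to server, and the process name and PID in the trailing square brackets also vary, although the process is always either qmgr or master. The statistics are as follows: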





  ⑹master qmgr
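
  A tally like this can be produced with a few lines of Python. The following is only a minimal sketch: the function name count_soft_lockups and the sample lines are made up for illustration, and the regular expression assumes the message layout quoted above. The digits before "s!" are matched as optional because the value is not shown in this article.

```python
import re
from collections import Counter

# Layout assumed from the message quoted in the article:
#   BUG: soft lockup - CPU#N stuck for <secs>s! [<process>:<pid>]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d*)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)

def count_soft_lockups(lines):
    """Return a Counter mapping process name -> number of soft lockup reports."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[match.group("proc")] += 1
    return counts

# Hypothetical sample lines, only to show the shape of the input.
sample = [
    "kernel: BUG: soft lockup - CPU#3 stuck for 10s! [qmgr:1234]",
    "kernel: BUG: soft lockup - CPU#0 stuck for 10s! [master:987]",
    "kernel: BUG: soft lockup - CPU#5 stuck for 10s! [qmgr:4321]",
    "kernel: some unrelated message",
]
print(count_soft_lockups(sample))
```

  In practice the lines would come from the kernel log of each server rather than from a hard-coded list.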





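  ⑾Continuing with the kernel error logs above, I noticed one striking thing they all had in common: the value s. First, what this value means: normally, if a CPU goes more than s without "feeding the dog" (that is, without running the watchdog routine), the kernel raises a soft lockup error and hangs. Yet here the value was s, and it was identical in every report, so it could reasonably be treated as one fixed, reproducible error. To check this idea I searched the Red Hat website for the error message and, to my great excitement, found the very same bug (url: ). I then compared the affected Red Hat release and kernel version, and both matched ours (redhat . corresponds to CentOS .). The error description and the solution are as follows: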

  ⑿Does Red Hat Enterprise Linux or have a reboot problem which is caused by sched_clock() overflow around . days?

  ⒀(Updated Feb , : AM GMT. KCS Solution content by Marc Milgram.)

  ⒆ Issue

  ⒇•Linux Kernel panics when sched_clock() overflows after an uptime of around . days.

  ⒈•Red Hat Enterprise Linux . system reboots with sched_clock() overflow after an uptime of around . days

  ⒉•This symptom may happen on the systems using the CPU which has TSC.

  ⒊•A process showing BUG: soft lockup - CPU#N stuck for s!


  ⒌•Red Hat Enterprise Linux

  ⒍◦Red Hat Enterprise Linux ., . and . are affected

  ⒎◦several kernels affected, see below

  ⒏◦TSC clock source - **see root cause

  ⒐•Red Hat Enterprise Linux

  ⒑◦Red Hat Enterprise Linux ., ., .: please refer to the resolution section for affected kernels

  ⒒◦Red Hat Enterprise Linux ., ,, ., ., . ,.: all kernels affected

  ⒓◦Red Hat Enterprise Linux . and later are not affected

  ⒔◦TSC clock source - **see root cause

  ⒕•An approximate uptime of around . days.


  ⒗•Red Hat Enterprise Linux

  ⒘◦Red Hat Enterprise Linux .x: update to kernel-..-.el (from RHSA-- or later). This kernel is already part of RHEL.GA. This fix was implemented with (private) bz.

  ⒙◦Red Hat Enterprise Linux .: update to kernel-..-...el (from RHBA-- or later). This fix was implemented with (private) bz.

  ⒚◦Red Hat Enterprise Linux . Extended Update Support: update to kernel-..-...el (from RHBA-- or later). This fix was implemented with (private) bz.

  ⒛•Red Hat Enterprise Linux

  ①◦architecture x_/bit

  ②■Red Hat Enterprise Linux .x: upgrade to kernel-..-.el (from RHBA-- or later). RHEL.GA and later already contain this fix.

  ③■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA-- or later).

  ④■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA-- or later).

  ⑤■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHBA-- or later).

  ⑥◦architecture x/bit

  ⑦■Red Hat Enterprise Linux .x: upgrade to kernel-..-.el (from RHBA-- or later). RHEL.GA and later already contain this fix.

  ⑧■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA-- or later).

  ⑨■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA-- or later).

  ⑩■Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHBA-- or later).
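
  Each resolution entry boils down to "run this kernel release or a later one". A rough way to check a fleet against such a threshold is sketched below. It is a simplification and not RPM's real version comparison: it drops everything from the ".elN" suffix onward and compares only the remaining integer fields. The helper names and the release strings are hypothetical placeholders, since the actual fixed versions are not shown in this article.

```python
import re

def version_key(kernel_release):
    """Turn a release string into a tuple of its integer fields.

    Everything from the ".elN" distribution suffix onward is dropped so
    that the distribution number and architecture do not affect ordering.
    """
    core = re.split(r"\.el\d", kernel_release)[0]
    return tuple(int(n) for n in re.findall(r"\d+", core))

def has_fix(running, first_fixed):
    """True when the running release is the first fixed release or newer."""
    return version_key(running) >= version_key(first_fixed)

# Placeholder release strings, NOT the versions from the Red Hat article.
FIRST_FIXED = "9.9.9-100.4.2.el9"
for release in ("9.9.9-100.el9", "9.9.9-100.4.2.el9", "9.9.9-101.el9"):
    print(release, has_fix(release, FIRST_FIXED))
```

  On a real host the running release would be taken from `uname -r`, and the threshold from the advisory that matches the installed distribution and architecture.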

  ⅠRoot Cause

  Ⅱ•An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.

  Ⅲ•This problem will occur only when system uptime becomes . days or exceeds . days.

  Ⅳ•This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.

  Ⅴ•On Red Hat Enterprise , this problem is a timing issue and very very rare to happen.

  Ⅵ•**Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.
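
  To see why an overflow of this kind shows up after a fixed uptime rather than at random, here is a small worked calculation. It rests on an assumption that is not stated in the quoted article: that the affected x86 kernels converted TSC cycles to nanoseconds as ns = (cycles * scale) >> 10, with scale = (10^6 << 10) / cpu_khz and the intermediate product held in an unsigned 64-bit integer. The function name overflow_uptime_days and the CPU frequencies are illustrative.

```python
# Assumption: ns = (cycles * scale) >> SHIFT, where the product must fit
# in an unsigned 64-bit integer and scale = (10**6 << SHIFT) // cpu_khz.
SHIFT = 10
U64_MAX = 2**64 - 1

def overflow_uptime_days(cpu_khz):
    """Uptime in days at which cycles * scale no longer fits in 64 bits."""
    scale = (10**6 << SHIFT) // cpu_khz       # fixed-point ns per cycle
    max_cycles = U64_MAX // scale             # largest cycle count that still fits
    seconds = max_cycles / (cpu_khz * 1000)   # cycles -> seconds at this frequency
    return seconds / 86400

# Illustrative TSC frequencies in kHz.
for khz in (2_000_000, 2_666_000, 3_000_000):
    print(khz, round(overflow_uptime_days(khz), 1))
```

  Under this assumption the CPU frequency almost cancels out: the product wraps once the nanosecond value reaches about 2^54, a little over 208 days of uptime, with small differences that come only from rounding scale down to an integer.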

  ⅦDiagnostic Steps

  ⅧThis issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.
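
  Since the article ties the problem to two conditions, the TSC clock source and a long uptime, a quick host check can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are made up, and threshold_days is left to the caller because the exact figure is not shown in this article. The two paths are the standard Linux locations for uptime and for the active clocksource.

```python
from pathlib import Path

CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"

def parse_uptime_days(proc_uptime_text):
    """Convert the first field of /proc/uptime (seconds since boot) to days."""
    return float(proc_uptime_text.split()[0]) / 86400

def is_exposed(uptime_days, clocksource, threshold_days):
    """True when the host uses the TSC and its uptime has reached the threshold."""
    return clocksource.strip() == "tsc" and uptime_days >= threshold_days

def read_host_state():
    """Read the uptime in days and the active clocksource from the running host."""
    uptime = parse_uptime_days(Path("/proc/uptime").read_text())
    clocksource = Path(CLOCKSOURCE_PATH).read_text()
    return uptime, clocksource
```

  On a Linux host, `is_exposed(*read_host_state(), threshold_days=...)` would flag machines that should be patched or rebooted before they reach the critical uptime; the parsing and decision functions are kept separate from the file reads so they can be tested anywhere.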


