November 2024: How to Troubleshoot Linux Kernel Crashes (Part 2)


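Next I analyzed another error that appeared in the kernel log: "BUG: soft lockup - CPU#N stuck for s! [qmgr/master:PID]". I have lightly normalized the message. The N after CPU# stands for a specific CPU number, which differs from server to server, and the process name and PID inside the trailing brackets also vary, although the process is always either qmgr or master. The occurrences are summarized below: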

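IP | Time | Error log type and process

[Table rows not recoverable: the IP addresses, timestamps, and error type values are missing. The surviving process entries, row by row, are: qmgr, master; qmgr, qmgr; master, qmgr; master, qmgr; master, qmgr; master, qmgr.]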

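One error type is the one mentioned earlier, which does not cause the kernel to hang. The other is the error analyzed here, which does lead to a Linux kernel panic. The summary shows that only two of the listed entries did not hang at the time.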
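A tally like the one above can be produced with a short script. The following is a minimal sketch, assuming the log lines follow the "BUG: soft lockup - CPU#N stuck for Ns! [process:PID]" shape quoted earlier; the function name `tally_soft_lockups` and the sample values are hypothetical and not taken from the servers discussed here.

```python
import re
from collections import Counter

# Matches lines such as: BUG: soft lockup - CPU#3 stuck for 67s! [qmgr:1234]
# The character class also accepts an en dash, which shows up in some copies.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup [-\u2013] CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)


def tally_soft_lockups(lines):
    """Count soft lockup reports per process name and per reported duration."""
    by_proc = Counter()
    by_secs = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft lockup line
        by_proc[match.group("proc")] += 1
        by_secs[int(match.group("secs"))] += 1
    return by_proc, by_secs


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = [
        "kernel: BUG: soft lockup - CPU#3 stuck for 67s! [qmgr:1234]",
        "kernel: BUG: soft lockup - CPU#1 stuck for 67s! [master:999]",
        "kernel: some unrelated message",
    ]
    procs, durations = tally_soft_lockups(sample)
    print(dict(procs))
    print(dict(durations))
```

If every host reports the same duration, `by_secs` ends up with a single key, which is the kind of pattern that suggests a deterministic bug rather than a real stall.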

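Continuing with the kernel error logs above, one thing stood out: the duration reported in "stuck for s!" is the same in every entry. First, what the value means. Normally, if a CPU goes longer than the watchdog threshold without feeding the dog (that is, without running the watchdog program), the kernel reports a soft lockup and hangs. Here, however, the reported value is identical across all the logs, so it can reasonably be treated as a deterministic error rather than a real stall. To test this idea I searched the Red Hat website for the error message and, to my excitement, found the same bug documented there. The affected Red Hat release and kernel versions matched ours (the Red Hat release corresponds to our CentOS release). The error description and resolution are as follows: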

Does Red Hat Enterprise Linux or have a reboot problem which is caused by sched_clock() overflow around . days?

Issue

• Linux Kernel panics when sched_clock() overflows after an uptime of around . days.
• Red Hat Enterprise Linux . system reboots with sched_clock() overflow after an uptime of around . days
• This symptom may happen on the systems using the CPU which has TSC.
• A process showing BUG: soft lockup - CPU#N stuck for s!

Environment

• Red Hat Enterprise Linux
    ◦ Red Hat Enterprise Linux ., . and . are affected
    ◦ several kernels affected, see below
    ◦ TSC clock source - **see root cause
• Red Hat Enterprise Linux
    ◦ Red Hat Enterprise Linux ., ., .: please refer to the resolution section for affected kernels
    ◦ Red Hat Enterprise Linux ., ,, ., ., . ,.: all kernels affected
    ◦ Red Hat Enterprise Linux . and later are not affected
    ◦ TSC clock source - **see root cause
• An approximate uptime of around . days.

Resolution

• Red Hat Enterprise Linux
    ◦ Red Hat Enterprise Linux .x: update to kernel-..-.el (from RHSA--) or later. This kernel is already part of RHEL.GA. This fix was implemented with (private) bz.
    ◦ Red Hat Enterprise Linux .: update to kernel-..-...el (from RHBA--) or later. This fix was implemented with (private) bz.
    ◦ Red Hat Enterprise Linux . Extended Update Support: update to kernel-..-...el (from RHBA--) or later. This fix was implemented with (private) bz.
• Red Hat Enterprise Linux
    ◦ architecture x86_64/64bit
        ■ Red Hat Enterprise Linux .x: upgrade to kernel-..-.el (from RHBA--) or later. RHEL.GA and later already contain this fix.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA--) or later.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA--) or later.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHBA--) or later.
    ◦ architecture x86/32bit
        ■ Red Hat Enterprise Linux .x: upgrade to kernel-..-.el (from RHBA--) or later. RHEL.GA and later already contain this fix.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA--) or later.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHSA--) or later.
        ■ Red Hat Enterprise Linux ..z: upgrade to kernel-..-...el (from RHBA--) or later.

Root Cause

• An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.
• This problem will occur only when system uptime becomes . days or exceeds . days.
• This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.
• On Red Hat Enterprise , this problem is a timing issue and very very rare to happen.
• **Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.

Diagnostic Steps

This issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.

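From the information above we can be confident that this is a Linux kernel bug. Its cause is described in detail in the quoted article: on the affected kernel versions for the x86_64 architecture, sched_clock() overflows once the system uptime passes the threshold given there.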
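The uptime threshold can be sanity-checked with a little arithmetic. The sketch below rests on an assumption that is not stated in the quoted article: that sched_clock() converts TSC cycles to nanoseconds with a 64-bit multiply followed by a 10-bit right shift, so the intermediate product wraps once the nanosecond count reaches 2^54. The constant and function names are illustrative only.

```python
NS_PER_DAY = 86_400 * 10**9  # nanoseconds in one day

# Assumptions (not from the quoted article): the cycles-to-nanoseconds
# conversion uses 64-bit integers and a fixed-point scale with a 10-bit shift.
WORD_BITS = 64
SCALE_SHIFT = 10


def overflow_uptime_days(word_bits=WORD_BITS, scale_shift=SCALE_SHIFT):
    """Uptime, in days, at which the scaled product no longer fits the word.

    The product is roughly ns * 2**scale_shift, so it wraps when the
    nanosecond count reaches 2**(word_bits - scale_shift).
    """
    overflow_ns = 2 ** (word_bits - scale_shift)
    return overflow_ns / NS_PER_DAY


if __name__ == "__main__":
    print(f"{overflow_uptime_days():.1f} days")  # prints "208.5 days"
```

Under these assumptions the wrap happens at the same uptime on every affected machine regardless of CPU frequency, which fits the observation that all the servers failed in the same way.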

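Although this information confirmed the cause of the kernel panic, I wanted to know whether Taobao's kernel engineers had run into the same problem, so I contacted a Taobao kernel engineer on QQ whom I had talked with before. It turned out that they had hit the same error, that they could not reproduce it either, and that their fix was likewise to upgrade the kernel.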
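Until the kernel is upgraded, it helps to know which hosts are close to the overflow. The following is a rough sketch, assuming a Linux host that exposes /proc/uptime and the sysfs clocksource file; the helper names are hypothetical, and the threshold in days is deliberately left as a command-line argument rather than hard-coded.

```python
import sys

# Standard Linux locations for uptime and the active clock source.
UPTIME_PATH = "/proc/uptime"
CLOCKSOURCE_PATH = "/sys/devices/system/clocksource/clocksource0/current_clocksource"


def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime (first field, seconds) to days."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86_400


def uses_tsc(clocksource_text):
    """True if the active clock source reported by sysfs is the TSC."""
    return clocksource_text.strip() == "tsc"


def days_remaining(current_days, threshold_days):
    """Days left before the uptime threshold; negative if already past it."""
    return threshold_days - current_days


def read_text(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        threshold = float(sys.argv[1])  # threshold in days, chosen by the operator
        current = uptime_days(read_text(UPTIME_PATH))
        tsc = uses_tsc(read_text(CLOCKSOURCE_PATH))
        left = days_remaining(current, threshold)
        print(f"uptime={current:.1f}d tsc={tsc} days_remaining={left:.1f}")
    else:
        print("usage: check_uptime.py THRESHOLD_DAYS")
```

Run on each server, this shows how long a host has before a planned reboot or kernel upgrade becomes urgent. Hosts that do not use the TSC clock source fall outside the environment the Red Hat article describes.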

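That concludes this walkthrough of troubleshooting a Linux kernel crash. As it shows, diagnosing kernel problems is difficult and takes both patience and technical skill.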