Introduction
Kernel softlockups are critical system-level issues in Linux in which code running in kernel mode holds a CPU core for too long without yielding, so nothing else can be scheduled on that core and the system may become unstable or, if configured to panic, crash. This post covers the symptoms, root causes, diagnostic techniques, and resolution strategies for softlockups, with a focus on kernel developers and system administrators.
Symptoms of Kernel Softlockup
Identifying the Signs
Softlockups often manifest as:
- System becomes unresponsive to user input or remote connections
- One CPU pegged at 100% by a single task, even when the system is otherwise idle
- Kernel logs (dmesg) displaying messages like “BUG: soft lockup - CPU#0 stuck for 22s!”
- A “watchdog:” or “NMI watchdog:” prefix on those entries in the system log, depending on the kernel version
Common Scenarios
These issues frequently occur in environments with:
- Custom kernel modules (e.g., drivers or resource-intensive tasks)
- High load on single-core systems or symmetric multiprocessing (SMP) configurations
- Kernel bugs in scheduling or timer mechanisms
Root Cause Analysis
What Triggers a Softlockup?
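A softlockup arises when code running in kernel mode fails to relinquish the CPU for a prolonged period, so the scheduler cannot run anything else on that core. The Linux kernel’s lockup watchdog detects this with a per-CPU high-resolution timer: if the watchdog task on a CPU has not been scheduled for longer than the soft lockup threshold, which is twice watchdog_thresh (20 seconds with the default watchdog_thresh of 10), the kernel reports a soft lockup. Common causes include: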
- Busy-wait loops without yield points in kernel code
- Deadlocks in kernel synchronization primitives (e.g., spinlocks)
- Unpatched kernel vulnerabilities or race conditions
- Misconfigured or faulty hardware drivers
Kernel Mechanisms Involved
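Soft lockup detection does not depend on NMIs: a per-CPU hrtimer interrupt checks whether the high-priority watchdog task on that CPU has run recently. The NMI watchdog is the related hard lockup detector, which uses non-maskable interrupts (NMIs) to catch CPUs that are stuck with interrupts disabled. When a softlockup is detected, the kernel logs a warning with a stack trace and, if configured, triggers a panic. The watchdog’s behavior is controlled by kernel parameters such as softlockup_panic, watchdog_thresh, and watchdog_cpumask.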
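As a rough sketch of how these parameters can be inspected on a running system, the script below reads the watchdog knobs from /proc/sys/kernel and derives the soft lockup threshold as twice watchdog_thresh. SYSCTL_DIR, read_knob, and soft_threshold are names made up for this example; knobs that a given kernel build does not expose are printed as n/a.

```shell
#!/bin/sh
# Print the watchdog-related sysctls and the derived soft lockup threshold.
# SYSCTL_DIR is a hypothetical override so the script can be pointed elsewhere.
SYSCTL_DIR="${SYSCTL_DIR:-/proc/sys/kernel}"

read_knob() {
    # Print the value of one knob, or "n/a" if it is not readable here.
    if [ -r "$SYSCTL_DIR/$1" ]; then
        cat "$SYSCTL_DIR/$1"
    else
        echo "n/a"
    fi
}

soft_threshold() {
    # The soft lockup threshold is twice watchdog_thresh, in seconds.
    echo $(( $1 * 2 ))
}

for knob in watchdog soft_watchdog watchdog_thresh softlockup_panic watchdog_cpumask; do
    printf '%-18s %s\n' "$knob" "$(read_knob "$knob")"
done

thresh="$(read_knob watchdog_thresh)"
case "$thresh" in
    ''|*[!0-9]*) echo "soft lockup threshold: unknown" ;;
    *)           echo "soft lockup threshold: $(soft_threshold "$thresh")s" ;;
esac
```

Reading the values needs no special privileges; changing them (for example with sysctl -w kernel.watchdog_thresh=...) requires root.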
Diagnosis Tools and Procedures
Key Diagnostic Utilities
Use the following tools to identify softlockups:
- dmesg: Inspect the kernel ring buffer for softlockup messages
- perf: Profile CPU usage and find the functions a stuck CPU is spinning in
- top or htop: Identify processes consuming excessive CPU
- /proc/sys/kernel/watchdog_thresh and /proc/sys/kernel/soft_watchdog: Examine the watchdog threshold and whether the soft lockup detector is enabled
- top -H: Check for kernel threads (e.g., ksoftirqd) consuming CPU
Example Diagnosis Command
dmesg | grep -i 'soft lockup'
This command filters kernel logs for softlockup-related entries, such as:
[12345.67890] BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1234]
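When many such entries pile up, it can help to reduce each one to its key fields. The following is a small sketch that pulls the CPU number, stall duration, task name, and PID out of lines in the format shown above; parse_softlockup is a name invented for this example, and the sample line is the one from this post.

```shell
#!/bin/sh
# Extract CPU, stall duration, task name and PID from soft lockup lines.
# Lines that do not match the pattern are dropped (sed -n ... p).
parse_softlockup() {
    sed -n 's/.*soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\(.*\):\([0-9][0-9]*\)].*/cpu=\1 secs=\2 comm=\3 pid=\4/p'
}

sample='[12345.67890] BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:1234]'
printf '%s\n' "$sample" | parse_softlockup
# prints: cpu=0 secs=22 comm=modprobe pid=1234
```

On a live system the same function can be fed from the kernel log, for example dmesg | parse_softlockup.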
Example Code Demonstrating the Issue
Minimal Kernel Module for Testing
/* softlockup_test.c */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/delay.h>
#include <linux/preempt.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Admin");
MODULE_DESCRIPTION("Test module to trigger softlockup");

static int __init softlockup_init(void)
{
	printk(KERN_INFO "Softlockup test module loaded\n");
	/* Keep the scheduler off this CPU, even on a preemptible kernel. */
	preempt_disable();
	while (1) {
		/* mdelay() busy-waits; msleep() would sleep and yield the CPU. */
		mdelay(1000);
	}
	/* Not reached: insmod never returns and the CPU stays stuck. */
	return 0;
}

static void __exit softlockup_exit(void)
{
	printk(KERN_INFO "Softlockup test module unloaded\n");
}

module_init(softlockup_init);
module_exit(softlockup_exit);
Compiling and Loading the Module
make -C /lib/modules/$(uname -r)/build M=$(pwd) modules
insmod softlockup_test.ko
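Loaded as root, a module whose init function spins without ever yielding triggers a softlockup report on the CPU running insmod once the watchdog threshold (20 seconds by default) has elapsed, not immediately. Run it only on a disposable test machine or VM, because insmod never returns and a reboot is needed to recover.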
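The make command above assumes a Kbuild Makefile sits next to softlockup_test.c, which the post does not show. A minimal sketch of creating one follows; write_kbuild_makefile and demo_dir are names invented here, and the example writes into a scratch directory rather than the real source directory.

```shell
#!/bin/sh
# Write the one-line Kbuild Makefile that "make -C ... M=<dir> modules"
# expects to find in the module source directory.
write_kbuild_makefile() {
    # $1 = directory containing softlockup_test.c
    printf 'obj-m += softlockup_test.o\n' > "$1/Makefile"
}

# Demonstrated on a scratch directory; pass "." to use the current directory.
demo_dir="$(mktemp -d)"
write_kbuild_makefile "$demo_dir"
cat "$demo_dir/Makefile"
```

Building also requires the headers for the running kernel to be installed under /lib/modules/$(uname -r)/build.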
Step-by-Step Solution
1. Analyze Kernel Logs
Run dmesg -T to view human-readable timestamps. Look for softlockup warnings or panics. Example output:
[Mon Apr 3 12:00:00 2023] BUG: soft lockup - CPU#0 stuck for 22s! [ksoftirqd/0:123]
2. Identify the Culprit Process
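Use ps -e -o pid,comm,psr to list each process with the CPU it last ran on (the psr column) and pick out those tied to the affected CPU. strace or gdb can show what a user-space process is doing; for a task stuck inside the kernel, the stack trace printed with the softlockup warning, or /proc/<pid>/stack read as root, is usually more informative.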
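Filtering the ps output by CPU can be sketched as below. tasks_on_cpu is a helper name invented for this example, and the sample rows are made up to mirror the format of ps -e -o pid,comm,psr so the filter can be tried without a stuck system.

```shell
#!/bin/sh
# Keep the header plus every row whose last column (PSR) equals the given CPU.
# The last field is used because command names may contain spaces.
tasks_on_cpu() {
    # $1 = CPU number; reads ps output on stdin.
    awk -v cpu="$1" 'NR == 1 || $NF == cpu'
}

# Made-up sample in the format of: ps -e -o pid,comm,psr
sample='  PID COMMAND         PSR
    1 systemd           2
   13 ksoftirqd/0       0
 1234 modprobe          0'

printf '%s\n' "$sample" | tasks_on_cpu 0
```

On a live system the equivalent is ps -e -o pid,comm,psr | tasks_on_cpu 0. Note that psr reports where a task last ran, so for sleeping tasks it is only a hint.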
3. Check Kernel Modules
Run lsmod to list loaded modules. Use modinfo on suspected modules. If a custom module is involved, unload it with rmmod and test stability.
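One extra hint when deciding which modules to suspect is the kernel taint mask in /proc/sys/kernel/tainted, which records whether an out-of-tree or unsigned module has been loaded and whether a soft lockup has already occurred since boot. The sketch below decodes only those three bits; decode_taint is a name invented for this example.

```shell
#!/bin/sh
# Decode the taint bits most relevant to module-induced soft lockups.
decode_taint() {
    # $1 = numeric value of /proc/sys/kernel/tainted
    t="$1"
    if [ $(( ($t >> 12) & 1 )) -eq 1 ]; then echo "O: out-of-tree module loaded"; fi
    if [ $(( ($t >> 13) & 1 )) -eq 1 ]; then echo "E: unsigned module loaded"; fi
    if [ $(( ($t >> 14) & 1 )) -eq 1 ]; then echo "L: soft lockup occurred"; fi
}

# Decode the running kernel's mask when it is readable.
if [ -r /proc/sys/kernel/tainted ]; then
    decode_taint "$(cat /proc/sys/kernel/tainted)"
fi
```

An untainted kernel reports 0 and the function prints nothing.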
4. Monitor with Perf
Execute perf top to profile CPU usage. Look for functions with disproportionately high CPU cycles. Example:
perf top -C 0 -d 10
This restricts profiling to CPU 0 and refreshes the display every 10 seconds.
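perf top is interactive; to capture a fixed window of samples that can be examined afterwards, perf record and perf report are an alternative. The wrapper below is only a sketch: profile_cpu is a hypothetical helper name, and the script defines it without running it, since profiling a whole CPU usually needs root.

```shell
#!/bin/sh
# Record call graphs on one CPU for a fixed time, then print a text report.
profile_cpu() {
    # $1 = CPU number, $2 = duration in seconds, $3 = output file
    perf record -C "$1" -g -o "$3" -- sleep "$2" &&
        perf report -i "$3" --stdio
}

# Only report availability here; call profile_cpu by hand when needed.
if command -v perf >/dev/null 2>&1; then
    echo "usage: profile_cpu <cpu> <seconds> <output-file>"
else
    echo "perf is not installed"
fi
```

For example, profile_cpu 0 10 /tmp/cpu0.data samples CPU 0 for 10 seconds, bounded by the sleep command that perf record runs.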
5. Apply Workarounds and Patches
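Implement fixes such as:
- Updating the kernel to a version with the relevant patch
- Setting softlockup_panic=0 on the kernel command line in /etc/default/grub to disable panics temporarily
- Modifying module code to include cond_resched(), schedule(), or msleep() yield points in long-running loops
- Disabling the soft lockup detector with nosoftlockup, or every lockup watchdog with nowatchdog, on the kernel command line (nmi_watchdog=0 turns off only the NMI-based hard lockup detector)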
6. Validate with Kernel Parameters
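Edit /etc/default/grub to add softlockup_panic=1 to GRUB_CMDLINE_LINUX and regenerate the GRUB config with update-grub (Debian and Ubuntu; other distributions typically use grub2-mkconfig). After a reboot, this forces a panic on softlockup, aiding in crash analysis.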
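The edit to the GRUB defaults file can be scripted. The following sketch appends a parameter to the GRUB_CMDLINE_LINUX line only if it is not already there; add_kernel_param and demo_file are names invented for this example, and it runs against a scratch file with made-up contents rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX unless already present.
add_kernel_param() {
    # $1 = parameter such as softlockup_panic=1, $2 = path to the defaults file
    param="$1"
    file="$2"
    # Nothing to do if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX=\".*$param" "$file"; then
        return 0
    fi
    # Insert before the closing quote; the original is kept as <file>.bak.
    sed -i.bak "s/^GRUB_CMDLINE_LINUX=\"\(.*\)\"/GRUB_CMDLINE_LINUX=\"\1 $param\"/" "$file"
}

# Demonstrated on a scratch file with made-up contents.
demo_file="$(mktemp)"
printf 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="quiet"\n' > "$demo_file"
add_kernel_param softlockup_panic=1 "$demo_file"
add_kernel_param softlockup_panic=1 "$demo_file"   # second call changes nothing
grep '^GRUB_CMDLINE_LINUX=' "$demo_file"
# prints: GRUB_CMDLINE_LINUX="quiet softlockup_panic=1"
```

The check is a plain substring match, so it does not replace an existing softlockup_panic=0; remove a conflicting value by hand before regenerating the GRUB config.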