Understanding Linux Kernel Soft Lockup: Symptoms and Diagnosis
Symptoms of a Soft Lockup
A soft lockup occurs when code running in kernel mode keeps a CPU busy for an extended period without giving other tasks a chance to run, typically due to a loop that never reschedules or improper spinlock usage. Common symptoms include:
- System unresponsiveness or complete freeze
- Kernel log messages such as “BUG: soft lockup - CPU#0 stuck for 22s!” in dmesg
- High CPU usage by a single process with no apparent cause
- No hard lockup report from the NMI watchdog, because the stuck CPU still services interrupts
Root Cause Analysis
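Soft lockups are reported by the kernel’s watchdog mechanism, which monitors whether each CPU is still scheduling tasks. By default, the watchdog reports a soft lockup when a CPU has stayed in kernel mode for more than 20 seconds (twice the kernel.watchdog_thresh value, which defaults to 10 seconds) without letting other tasks run. The 120-second figure often quoted in this context belongs to the separate hung task detector. Key root causes include: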
- Improperly held spinlocks in kernel modules or drivers
- Busy-wait loops without yielding the CPU
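- Threads spinning while they wait for a resource held by a task in uninterruptible sleep (D-state); the sleeping task itself uses no CPU and is reported by the hung task detector rather than as a soft lockup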
- Kernel bugs in scheduling or interrupt handling
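The detector is compiled in through the CONFIG_LOCKUP_DETECTOR kernel option (with CONFIG_SOFTLOCKUP_DETECTOR on recent kernels; exact option names vary between versions), and its threshold is tuned at runtime through the kernel.watchdog_thresh sysctl. When a CPU fails to schedule other tasks during this window, the kernel logs a soft lockup warning.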
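The relationship between the sysctl and the reporting window can be sketched in user-space C. This is a minimal sketch, assuming the mainline behavior in which the soft lockup window is twice kernel.watchdog_thresh and the value is exposed at /proc/sys/kernel/watchdog_thresh; the function names are hypothetical and not part of any kernel API.

```c
#include <stdio.h>

/* Soft lockup reporting window in seconds: twice watchdog_thresh
 * (assumed mainline behavior; a value of 0 disables the detector). */
int soft_lockup_threshold_secs(int watchdog_thresh)
{
    return 2 * watchdog_thresh;
}

/* Read an integer sysctl such as kernel.watchdog_thresh from procfs.
 * Returns the value, or -1 if the file cannot be read or parsed. */
int read_watchdog_thresh(const char *path)
{
    FILE *f = fopen(path, "r");
    int thresh = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &thresh) != 1)
        thresh = -1;
    fclose(f);
    return thresh;
}
```

A caller would pass "/proc/sys/kernel/watchdog_thresh" to read_watchdog_thresh(), check for -1, and feed the result to soft_lockup_threshold_secs(); with the default setting of 10 the window is 20 seconds.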
Example Code: A Faulty Kernel Module
Consider the following kernel module snippet that causes a soft lockup:
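#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/err.h>

static struct task_struct *lockup_task;

/* Thread function: spins without ever offering to reschedule. */
static int long_loop(void *data)
{
    /* Log once; printing inside the loop would flood the kernel log. */
    pr_info("Soft lockup test module running...\n");
    while (!kthread_should_stop()) {
        /* No yield or preemption point: this is the bug. */
    }
    return 0;
}

static int __init softlock_init(void)
{
    /* kthread_run() creates the thread and wakes it in one call. */
    lockup_task = kthread_run(long_loop, NULL, "softlock_test");
    return PTR_ERR_OR_ZERO(lockup_task);
}

static void __exit softlock_exit(void)
{
    /* Ask the thread to exit and wait for it before unloading. */
    kthread_stop(lockup_task);
    pr_info("Soft lockup module unloaded.\n");
}

module_init(softlock_init);
module_exit(softlock_exit);
MODULE_LICENSE("GPL");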
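This module spawns a kernel thread that loops without ever sleeping or calling cond_resched(). On kernels built without full preemption (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY), nothing else can run on that CPU, so the watchdog reports a soft lockup once the threshold passes. On a fully preemptible kernel the scheduler can still preempt the thread, and the lockup appears only if the loop also runs with preemption disabled.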
Diagnosis Tools and Techniques
Use the following tools to identify and troubleshoot soft lockups:
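- dmesg to check kernel logs for soft lockup warnings
- perf for real-time CPU profiling and identifying hot loops
- ps and top to inspect processes in D-state
- /proc/sys/kernel/watchdog_thresh, together with the kernel.watchdog and kernel.softlockup_panic sysctls, for the watchdog configuration
- gdb, attached to the target through kgdb, for kernel debugging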
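What ps and top report as D-state comes from the state field of /proc/<pid>/stat, and reading it directly can be sketched in C. This is a minimal sketch, assuming the documented "pid (comm) state ..." layout of that file; the helper names are hypothetical.

```c
#include <string.h>

/* Extract the state character from the contents of /proc/<pid>/stat.
 * The layout is "pid (comm) state ..."; comm may itself contain spaces
 * or parentheses, so search for the LAST ')' rather than the first.
 * Returns the state character, or '?' if the line is malformed. */
char proc_stat_state(const char *stat_line)
{
    const char *rparen = strrchr(stat_line, ')');

    if (!rparen || rparen[1] != ' ' || rparen[2] == '\0')
        return '?';
    return rparen[2];
}

/* A task in uninterruptible sleep is reported with state 'D'. */
int is_uninterruptible(const char *stat_line)
{
    return proc_stat_state(stat_line) == 'D';
}
```

Searching from the right is the design choice that matters here: a process can set its own name, so a name containing ")" would mislead a parser that stops at the first closing parenthesis.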
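For example, dmesg output might contain a line like the following, usually followed by register state and a call trace (the timestamp, process name, and PID shown here are illustrative):
[  123.456789] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [long_loop:1234]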
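When scanning many logs, the CPU number and stuck time can be pulled out of such a line programmatically. The following is a minimal sketch, assuming the mainline message format "BUG: soft lockup - CPU#<n> stuck for <secs>s! [<comm>:<pid>]"; the function name is hypothetical, and other kernel versions may word the message differently.

```c
#include <stdio.h>
#include <string.h>

/* Parse a soft lockup report line from the kernel log.
 * On success, stores the CPU number and stuck time (in seconds)
 * and returns 1; returns 0 if the line is not a soft lockup report. */
int parse_soft_lockup(const char *line, int *cpu, unsigned int *secs)
{
    /* Skip the timestamp and any "watchdog: BUG: " prefix. */
    const char *p = strstr(line, "soft lockup - CPU#");

    if (!p)
        return 0;
    return sscanf(p, "soft lockup - CPU#%d stuck for %us!", cpu, secs) == 2;
}
```

Anchoring on the "soft lockup - CPU#" substring rather than on the start of the line keeps the parser indifferent to the timestamp and prefix, which differ between kernel versions and log collectors.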
Step-by-Step Solution: Fixing a Soft Lockup
To resolve a soft lockup issue, follow these steps:
- Identify the Affected CPU: Use mpstat -P ALL 1 or top to determine which CPU is experiencing the lockup; the warning in the kernel log also names the CPU.
- Check Kernel Logs: Run dmesg | grep -A 30 "soft lockup" to locate the timestamp of the issue and the stack trace that follows it.
- Analyze with perf: Execute perf record -a -g sleep 10 to capture CPU activity, then perf report to look for kernel loops that never reschedule.
- Examine Process States: Use ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/' to find processes in uninterruptible sleep that may be blocking resources.
- Debug with kgdb: Load the kernel’s vmlinux file into gdb, attach to the target through kgdb, and inspect the stack trace of the locked CPU.
- Modify the Code: Introduce rescheduling points (e.g., cond_resched(), schedule() or msleep()) in long-running loops to avoid prolonged CPU occupancy. These calls are valid only in contexts that may sleep, so they must not be made while holding a spinlock or with interrupts disabled.
- Test and Validate: Rebuild the kernel module, load it, and monitor with perf or top to ensure the issue is resolved.
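The "Modify the Code" step can be sketched as a variant of the example module. This is a minimal sketch rather than a definitive fix: it assumes the loop runs in process context (a kernel thread), where rescheduling is allowed, and the names polite_loop and polite_task are hypothetical. It builds only inside a kernel module build tree, not as a user-space program.

```c
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *polite_task;  /* hypothetical name */

/* Thread function: same loop shape, but with a rescheduling point. */
static int polite_loop(void *data)
{
    while (!kthread_should_stop()) {
        /* ... do one bounded chunk of work here ... */

        /* Let the scheduler run other tasks if any are waiting. */
        cond_resched();
    }
    return 0;
}

static int __init polite_init(void)
{
    polite_task = kthread_run(polite_loop, NULL, "polite_loop");
    return PTR_ERR_OR_ZERO(polite_task);
}

static void __exit polite_exit(void)
{
    /* Signal the thread to stop and wait until it has exited. */
    kthread_stop(polite_task);
}

module_init(polite_init);
module_exit(polite_exit);
MODULE_LICENSE("GPL");
```

cond_resched() is used here because it yields only when another task needs the CPU, so the loop keeps its throughput when the system is idle. If the work is periodic rather than continuous, msleep() between iterations may be the better fit.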