Understanding and Resolving Linux Kernel Soft Lockup Issues

Understanding Linux Kernel Soft Lockup: Symptoms and Diagnosis

Symptoms of a Soft Lockup

A soft lockup occurs when code running in kernel mode keeps a CPU busy for longer than the watchdog threshold without letting other tasks run, typically because of a loop with no preemption point or improper spinlock usage. Common symptoms include:

  • System unresponsiveness or complete freeze
  • Kernel log messages such as "BUG: soft lockup - CPU#0 stuck for 22s!" in dmesg
  • One CPU pinned at close to 100% system time by a single task with no apparent cause
  • No hard lockup report from the NMI watchdog, because the stuck CPU is still servicing interrupts

Root Cause Analysis

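Soft lockups are reported by the kernel's watchdog (the lockup detector). On each CPU, a high-resolution timer periodically checks whether a high-priority per-CPU watchdog task has been able to run. If that task has not run for longer than the soft lockup threshold, which is 2 × watchdog_thresh (20 seconds with the default watchdog_thresh of 10), the kernel reports a soft lockup. Key root causes include: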

  • Improperly held spinlocks in kernel modules or drivers
  • Busy-wait loops without yielding the CPU
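  • Contention on a resource held by a blocked task; note that tasks in uninterruptible sleep (D state) are not spinning on a CPU themselves and are reported by the separate hung task detector (120-second default timeout) rather than as soft lockups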
  • Kernel bugs in scheduling or interrupt handling

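The detector is built in with the CONFIG_LOCKUP_DETECTOR kernel option (CONFIG_SOFTLOCKUP_DETECTOR for the soft lockup part) and tuned at run time through the kernel.watchdog_thresh sysctl. When the watchdog task on a CPU does not get to run within the soft lockup threshold, the kernel logs a soft lockup warning with a stack trace of the task that was running; if kernel.softlockup_panic is set, it panics instead.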
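The relationship between the sysctl and the reporting threshold can be sketched in a few lines of userspace C. This is a minimal illustration, assuming the text of /proc/sys/kernel/watchdog_thresh has already been read into a string; the function names parse_watchdog_thresh and softlockup_threshold_secs are hypothetical helpers, not kernel or libc APIs.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse the text content of /proc/sys/kernel/watchdog_thresh
 * (for example "10\n"). Returns the value in seconds, or -1 if the
 * text is missing, not a number, or negative.
 */
int parse_watchdog_thresh(const char *text)
{
    char *end;
    long value;

    if (text == NULL)
        return -1;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno != 0 || value < 0 || value > 100000)
        return -1;

    return (int)value;
}

/*
 * The soft lockup threshold is twice watchdog_thresh.
 * A watchdog_thresh of 0 disables the detector, so 0 is returned as is.
 */
int softlockup_threshold_secs(int watchdog_thresh)
{
    if (watchdog_thresh < 0)
        return -1;
    return 2 * watchdog_thresh;
}
```

With the default value of 10, softlockup_threshold_secs(parse_watchdog_thresh("10\n")) yields 20, which matches the roughly 20-second durations usually seen in soft lockup messages.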

Example Code: A Faulty Kernel Module

Consider the following kernel module snippet that causes a soft lockup:


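#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/err.h>

static struct task_struct *softlock_task;

/* Thread function: spins without ever giving up the CPU. */
static int long_loop(void *data)
{
    pr_info("Soft lockup test module running...\n");
    while (!kthread_should_stop()) {
        /* BUG: no cond_resched(), schedule(), or sleep in this loop */
        cpu_relax();
    }
    return 0;
}

static int __init softlock_init(void)
{
    /* Modules create threads through the kthread API, not kernel_thread(). */
    softlock_task = kthread_run(long_loop, NULL, "softlock_test");
    if (IS_ERR(softlock_task))
        return PTR_ERR(softlock_task);
    return 0;
}

static void __exit softlock_exit(void)
{
    /* Ask the thread to leave its loop and wait for it to exit. */
    kthread_stop(softlock_task);
    pr_info("Soft lockup module unloaded.\n");
}

module_init(softlock_init);
module_exit(softlock_exit);
MODULE_LICENSE("GPL");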

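This module spawns a kernel thread that spins in a loop without ever yielding. On kernels built without full preemption (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY), nothing forces the thread off the CPU, so the watchdog task on that CPU is starved and a soft lockup is reported once the threshold is exceeded.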

Diagnosis Tools and Techniques

Use the following tools to identify and troubleshoot soft lockups:

  • dmesg to check kernel logs for soft lockup warnings
  • perf for real-time CPU profiling and identifying hot loops
  • ps and top to inspect processes in D-state
  • /proc/sys/kernel/watchdog_thresh and /proc/sys/kernel/softlockup_panic for the current watchdog settings
  • gdb attached through kgdb for live kernel debugging
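The D-state check that ps and top perform can also be done directly against /proc/<pid>/stat. The sketch below is a hedged userspace illustration: it assumes the stat line has already been read into a string, and the helper names proc_stat_state and is_uninterruptible are hypothetical. It searches for the last closing parenthesis because the command name field may itself contain spaces or parentheses.

```c
#include <string.h>

/*
 * Return the state character from one /proc/<pid>/stat line, e.g. 'R',
 * 'S', or 'D'. The line has the form "pid (comm) state ...".
 * Returns '\0' if the line is malformed.
 */
char proc_stat_state(const char *line)
{
    const char *close;

    if (line == NULL)
        return '\0';

    /* comm may contain ')' itself, so take the last one on the line. */
    close = strrchr(line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';

    return close[2];
}

/* Non-zero if the task is in uninterruptible sleep (D state). */
int is_uninterruptible(const char *line)
{
    return proc_stat_state(line) == 'D';
}
```

A caller would typically iterate over the numeric directories in /proc, read each stat file, and print the PIDs for which is_uninterruptible() returns non-zero.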

For example, a dmesg output might show:

[  123.456789] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [softlock_test:1234]
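When scanning many log lines, the CPU number and stuck duration can be pulled out programmatically. The following is a minimal sketch that assumes the kernel's "BUG: soft lockup - CPU#<n> stuck for <secs>s!" message format; parse_soft_lockup is a hypothetical helper name, and the function ignores the timestamp and task name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the CPU number and stuck duration from a soft lockup log line.
 * Returns 1 and fills *cpu and *secs on success, 0 if the line is not a
 * soft lockup report.
 */
int parse_soft_lockup(const char *line, int *cpu, unsigned int *secs)
{
    static const char marker[] = "BUG: soft lockup - CPU#";
    const char *p;

    if (line == NULL || cpu == NULL || secs == NULL)
        return 0;

    p = strstr(line, marker);
    if (p == NULL)
        return 0;

    /* Skip past the marker; what follows is "<n> stuck for <secs>s! ...". */
    p += sizeof(marker) - 1;
    return sscanf(p, "%d stuck for %us!", cpu, secs) == 2;
}
```

Feeding each line of dmesg output through this function gives a quick per-CPU tally of lockup reports, which helps tell a single misbehaving CPU apart from a system-wide problem.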

Step-by-Step Solution: Fixing a Soft Lockup

To resolve a soft lockup issue, follow these steps:

  1. Identify the Affected CPU: The soft lockup message names the CPU (CPU#N). Use mpstat -P ALL 1 or top (press 1 for the per-CPU view) to confirm which CPU is pinned in system time.
  2. Check Kernel Logs: Run dmesg -T | grep -i "soft lockup" to locate the timestamp, then read the stack trace that follows the message.
  3. Analyze with perf: Execute perf record -a -g -- sleep 10, followed by perf report, to capture CPU activity and find the kernel function that is spinning.
  4. Examine Process States: Use ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/' to list tasks in uninterruptible sleep that may be holding or waiting on a contended resource.
  5. Debug with kgdb: On a kernel with kgdb enabled, load the kernel's vmlinux file into gdb, attach to the target through kgdb, and inspect the stack trace of the locked CPU.
  6. Modify the Code: Introduce preemption points in long-running loops, such as cond_resched() in process context or msleep() when the loop is only waiting. Never sleep while holding a spinlock, and keep spinlock critical sections short.
  7. Test and Validate: Rebuild the kernel module, load it, and monitor dmesg, perf, or top to confirm that the warning no longer appears.
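The idea behind step 6 is to bound how much work runs between two preemption points. The sketch below shows that pattern in plain userspace C so it can be compiled and tested anywhere; it is an analogy under stated assumptions, not kernel code. The yield_hook_t type and run_with_yield_points function are hypothetical names, and in a real kernel loop the role of the hook would be played by a call such as cond_resched().

```c
#include <stddef.h>

/* Hypothetical hook: stands in for a preemption point such as cond_resched(). */
typedef void (*yield_hook_t)(void *ctx);

/*
 * Run `total` iterations of `work`, reaching a yield point after every
 * `batch` iterations so the loop never runs unbounded without giving
 * the scheduler a chance. Returns the number of yield points reached.
 */
size_t run_with_yield_points(size_t total, size_t batch,
                             void (*work)(size_t i, void *ctx),
                             yield_hook_t yield_hook, void *ctx)
{
    size_t yields = 0;
    size_t i;

    if (batch == 0)
        batch = 1; /* treat 0 as "yield after every iteration" */

    for (i = 0; i < total; i++) {
        if (work != NULL)
            work(i, ctx);

        if ((i + 1) % batch == 0) {
            if (yield_hook != NULL)
                yield_hook(ctx);
            yields++;
        }
    }
    return yields;
}
```

Choosing the batch size is a trade-off: a small batch keeps scheduling latency low at the cost of more hook calls, while a large batch reduces overhead but lets the loop hold the CPU longer between yield points.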