Understanding and Resolving Linux Kernel Soft Lockup Issues

Understanding Linux Kernel Soft Lockup: Symptoms and Diagnosis

Symptoms of a Soft Lockup

A soft lockup occurs when a CPU stays in kernel mode for an extended period (more than 20 seconds by default) without giving other tasks a chance to run, typically because of a long loop with no scheduling point or a spinlock that is held too long. Common symptoms include:

System unresponsiveness or complete freeze
Kernel log messages such as “watchdog: BUG: soft lockup - CPU#0 stuck for 22s!” in dmesg
High CPU usage by a single process with no apparent cause
No accompanying hard lockup report, because the stuck CPU still services interrupts

Root Cause Analysis

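Soft lockups are reported by the kernel’s watchdog mechanism, which monitors whether each CPU is still scheduling tasks. By default, the watchdog warns when a CPU has run in kernel mode for more than 20 seconds without rescheduling; this window is twice the kernel.watchdog_thresh setting, which defaults to 10 seconds. Key root causes include: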

Improperly held spinlocks in kernel modules or drivers
Busy-wait loops without yielding the CPU
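Threads spinning on a resource held by a stalled task (tasks in uninterruptible sleep, or D-state, are themselves reported by the separate hung task detector, by default after 120 seconds)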
Kernel bugs in scheduling or interrupt handling

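The watchdog is compiled in through the CONFIG_LOCKUP_DETECTOR kernel option (with CONFIG_SOFTLOCKUP_DETECTOR enabling the soft lockup check), and its threshold can be changed at runtime through the kernel.watchdog_thresh sysctl. When a CPU fails to schedule a task during this window, the kernel logs a soft lockup warning along with a stack trace of the task that was running.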
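
The relationship between the configured threshold and the reporting window can be sketched in user-space C. This is a minimal illustration, assuming the mainline behavior in which the soft lockup window is twice kernel.watchdog_thresh; the helper names read_int_file and soft_lockup_window_secs are hypothetical and are not kernel APIs.

```c
#include <stdio.h>

/* Read a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_int_file(const char *path, int *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Soft lockup reporting window in seconds: twice watchdog_thresh
 * (assumption: mainline behavior; hypothetical helper name). */
static int soft_lockup_window_secs(int watchdog_thresh)
{
    return 2 * watchdog_thresh;
}
```

Passing the value read from /proc/sys/kernel/watchdog_thresh to soft_lockup_window_secs() gives the window for the running system. A threshold of 0 means lockup detection is disabled.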

Example Code: A Faulty Kernel Module

Consider the following kernel module snippet that causes a soft lockup:

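#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/err.h>

static struct task_struct *softlock_task;

/* Thread body: spins in kernel mode and never yields the CPU. */
static int long_loop(void *data)
{
    printk(KERN_INFO "Soft lockup test module running...\n");

    while (!kthread_should_stop()) {
        /* No cond_resched() or sleep here, so nothing else gets to run */
    }

    return 0;
}

static int __init softlock_init(void)
{
    /* kthread_run() is the supported way for a module to start a kernel thread */
    softlock_task = kthread_run(long_loop, NULL, "softlock_test");
    if (IS_ERR(softlock_task))
        return PTR_ERR(softlock_task);

    return 0;
}

static void __exit softlock_exit(void)
{
    /* Ask the thread to leave its loop and wait for it to exit */
    kthread_stop(softlock_task);
    printk(KERN_INFO "Soft lockup module unloaded.\n");
}

module_init(softlock_init);
module_exit(softlock_exit);
MODULE_LICENSE("GPL");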

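This module spawns a kernel thread that spins without ever yielding. On a kernel built without full preemption (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY), the thread monopolizes its CPU until the watchdog reports a soft lockup. On a fully preemptible kernel the scheduler can still preempt the thread, so a reproducer there would also need to disable preemption around the loop.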

Diagnosis Tools and Techniques

Use the following tools to identify and troubleshoot soft lockups:

dmesg to check kernel logs for soft lockup warnings
perf for real-time CPU profiling and identifying hot loops
ps and top to inspect processes in D-state
/proc/sys/kernel/watchdog_thresh and /proc/sys/kernel/softlockup_panic for the watchdog settings
gdb or kgdb for kernel debugging
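
As a rough illustration of what ps and top do when they report D-state, the state of a task can be read from /proc/<pid>/stat. The sketch below only covers the parsing step; the function names stat_state and is_uninterruptible are hypothetical, and the caller is assumed to have already read one line of the stat file into a string.

```c
#include <string.h>

/* Extract the state character from one line of /proc/<pid>/stat.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')', so search for the LAST ')' instead of splitting on
 * whitespace. Returns '\0' if the line is malformed. */
static char stat_state(const char *stat_line)
{
    const char *close = strrchr(stat_line, ')');

    if (!close || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* True when the task is in uninterruptible sleep (state 'D'). */
static int is_uninterruptible(const char *stat_line)
{
    return stat_state(stat_line) == 'D';
}
```

Searching from the right matters because a task can name itself something like "my (odd) task", which would confuse a parser that splits on the first closing parenthesis.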

For example, dmesg output might show:

[  123.456789] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [some_task:1234]
[  123.456801] Call Trace: ...
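
Soft lockup reports in mainline kernels follow the pattern "BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]", which makes them easy to pick out of a log programmatically. The following is a minimal sketch of such a parser; parse_soft_lockup is a hypothetical helper name, and the exact prefix before "BUG:" is assumed to vary between kernel versions, so the code searches for the stable middle of the message.

```c
#include <stdio.h>
#include <string.h>

/* Parse the CPU number and stuck duration out of a soft lockup log line.
 * Returns 0 on success, -1 if the line is not a soft lockup report. */
static int parse_soft_lockup(const char *line, int *cpu, int *secs)
{
    const char *p = strstr(line, "soft lockup - CPU#");

    if (!p)
        return -1;
    if (sscanf(p, "soft lockup - CPU#%d stuck for %ds!", cpu, secs) != 2)
        return -1;
    return 0;
}
```

Feeding each line of dmesg output through parse_soft_lockup() yields the CPU to investigate and how long it had been stuck when the report was printed.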

Step-by-Step Solution: Fixing a Soft Lockup

To resolve a soft lockup issue, follow these steps:

Identify the Affected CPU: The warning itself names the CPU (for example, CPU#0). While the lockup is in progress, mpstat -P ALL or top can confirm which CPU is pinned in system time.
Check Kernel Logs: Run dmesg | grep -A 30 "soft lockup" to locate the timestamp of the issue and the stack trace that follows it.
Analyze with perf: Execute perf record -a -g -- sleep 10 to sample all CPUs, then run perf report and look for kernel functions that dominate the samples.
Examine Process States: Use ps -eo pid,stat,comm | awk '$2 ~ /^D/' to find processes in uninterruptible sleep that may be blocking resources.
Debug with kgdb: Load the kernel’s vmlinux file (built with debug info) into gdb, attach to the target over kgdb, and inspect the stack trace of the locked CPU.
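Modify the Code: Add scheduling points such as cond_resched() to long-running loops, or a sleep such as msleep() where a delay is acceptable. Never call these while holding a spinlock or in other atomic context; shorten the critical section instead.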
Test and Validate: Rebuild the kernel module, load it, and monitor with perf or top to ensure the issue is resolved.
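
The "Modify the Code" step can be sketched as a loop that offers the scheduler a chance to run at regular intervals. cond_resched() is the kernel's voluntary scheduling point; everything else here (process_items, YIELD_EVERY, and the user-space stand-in that counts yields) is an assumption made for illustration so that the loop structure can be compiled and checked outside the kernel.

```c
#ifdef __KERNEL__
#include <linux/sched.h>        /* cond_resched() */
#else
/* User-space stand-in: counts how often the loop would have yielded. */
static unsigned long yield_count;
static inline int cond_resched(void)
{
    yield_count++;
    return 0;
}
#endif

/* Hypothetical batch size: offer to reschedule once per 1024 items. */
#define YIELD_EVERY 1024UL

/* Process n_items units of work without monopolizing the CPU.
 * Must be called from process context with no spinlocks held. */
static unsigned long process_items(unsigned long n_items)
{
    unsigned long i, done = 0;

    for (i = 0; i < n_items; i++) {
        done++;                 /* placeholder for the real per-item work */

        if ((i + 1) % YIELD_EVERY == 0)
            cond_resched();     /* let other runnable tasks make progress */
    }

    return done;
}
```

Yielding once per batch rather than on every iteration keeps the overhead low while still bounding how long the loop can hold the CPU. The right batch size depends on how expensive each item is, so the value above is only a starting point.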