Diagnosing and Resolving Kernel-Level SoftIRQ Stack Overflow in Linux

Understanding the Symptom: SoftIRQ Stack Overflow

Symptoms and Observations

Users encountering this issue may observe system instability characterized by:

  • Kernel panic messages such as “softirq: softirq stack overflow”
  • Processes hanging or becoming unresponsive
  • High CPU usage on specific IRQ threads
  • Unpredictable reboots or crashes under heavy I/O or network load
  • Logs in /var/log/kern.log or dmesg showing “BUG: softirq stack overflow”

This typically occurs in Linux kernels with high interrupt traffic, where the per-CPU softirq stack is exhausted due to excessive processing in softirq handlers.

Root Cause Analysis

SoftIRQ Stack Architecture in Linux

The Linux kernel uses a per-CPU softirq stack to handle deferred processing for interrupts. Each CPU has a fixed-size stack (default is 8KB) for softirqs, which are processed in a serialized manner. A stack overflow occurs when a softirq handler exceeds this limit, either due to:

  • Recursive or deeply nested softirq processing
  • Prolonged execution in a softirq context (e.g., heavy computation or deadlock)
  • Missing preemption points in long-running softirq functions
  • Excessive use of per-CPU variables or large local data structures in handlers

For example, in kernels prior to 5.15, improper handling of network drivers could trigger this issue during high packet rates.

Common Triggers

SoftIRQ stack overflows often stem from:

  1. Kernel modules with inefficient softirq handlers (e.g., under-optimized NIC drivers)
  2. Missing workqueue offloading for tasks that should run in process context
  3. Improper use of spinlocks or other synchronization primitives in softirqs
  4. Custom kernel patches that alter softirq behavior without proper testing

The overflow can cause undefined behavior, including data corruption or kernel crashes, as the stack pointer overwrites adjacent memory.

Diagnosis Tools and Techniques

Log Analysis with dmesg and /var/log/kern.log

Check for messages like:
kernel: [IRQ] softirq: softirq stack overflow
These logs often include the CPU ID and the softirq type (e.g., NET_RX, TIMER). Use the following command to filter:
grep 'softirq' /var/log/kern.log

Performance Profiling with perf

Run perf top or perf report to identify high softirq CPU usage. Example:
perf stat -e softirq:softirq_entry /bin/true
This exposes which softirqs are being triggered frequently.

Per-CPU SoftIRQ Monitoring

Inspect /proc/softirqs to track softirq counts per CPU:
cat /proc/softirqs
Look for elevated values in the NET_RX or NET_TX columns, which indicate networking-related stack pressure.

Step-by-Step Resolution

1. Verify Kernel Version and Configuration

Check the kernel version with uname -r. Ensure CONFIG_DEBUG_VM and CONFIG_DEBUG_STACKOVERFLOW are enabled for stack validation. For kernels with limited softirq stack size, consider applying a patch or using a recent version.

2. Analyze SoftIRQ Handler Code

Identify the source of the overflow by inspecting the kernel source or loadable modules. Look for functions registered with open_softirq() or raise_softirq(). Example:

void my_softirq_handler(struct softirq_action *h) {
// Long-running or recursive logic here
}

If the handler lacks preemption points or offloads work, it risks stack overflow.

3. Adjust SoftIRQ Stack Size (Advanced)

Edit the kernel configuration to increase the softirq stack size. For example, modify:

CONFIG_SOFTIRQ_STACK_SIZE=16384

in .config and recompile the kernel. This is a last-resort fix and should be paired with code optimization.

4. Apply Workqueue Offloading

Move non-essential tasks from softirq context to workqueues. Example:

DECLARE_WORK(my_work, my_task_func);
raise_softirq_irqoff(NET_RX);
schedule_work(&my_work);

This reduces softirq stack pressure by delegating work to process context.

5. Update or Revert Drivers

If the issue is driver-specific, update to the latest version or revert to a stable release. For example, patching a network driver to avoid recursive softirq calls can resolve the issue.

Example Code and Debugging

Sample SoftIRQ Handler with Stack Overflow Risk

Below is a flawed example of a softirq handler:

void risky_softirq(struct softirq_action *h) {
for (int i = 0; i < 100000; i++) {
// Recursive or long-loop logic without yielding
}
}

This loops indefinitely in softirq context, exhausting the stack.

Debugging with kprobe and ftrace

Use kprobe to trace softirq handler entry/exit:

echo 'p:softirq_entry my_softirq_handler' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace

This identifies the handler’s execution path and potential bottlenecks.

Conclusion

Resolving softirq stack overflows requires a deep understanding of kernel scheduling and memory management. Prioritize offloading work to process context, ensure preemption in handlers, and monitor system logs for early detection. For developers, rigorous testing of softirq handlers under high load is critical to avoid such issues.

Scroll to Top