Understanding the Symptom: SoftIRQ Stack Overflow
Symptoms and Observations
Users encountering this issue may observe system instability characterized by:
- Kernel panic messages such as “softirq: softirq stack overflow”
- Processes hanging or becoming unresponsive
- High CPU usage in the per-CPU ksoftirqd/N kernel threads
- Unpredictable reboots or crashes under heavy I/O or network load
- Logs in /var/log/kern.log or dmesg showing “BUG: softirq stack overflow”
This typically occurs on Linux systems under high interrupt traffic, where the per-CPU softirq stack is exhausted by deep call chains or large stack frames in softirq handlers.
Root Cause Analysis
SoftIRQ Stack Architecture in Linux
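The Linux kernel uses a per-CPU softirq stack to handle deferred processing for interrupts. Each CPU has a fixed-size stack for softirqs, which are processed in a serialized manner. The size is set by the architecture rather than being a universal default: for example, 8 KB on 32-bit x86, while x86-64 runs softirqs on its 16 KB per-CPU IRQ stack. A stack overflow occurs when softirq processing exceeds this limit, typically because of: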
- Recursive or deeply nested softirq processing
- Prolonged execution in softirq context (e.g., heavy computation), which widens the window for hardware interrupts to nest on top of the handler
- Long-running softirq functions that never return and defer the remaining work
- Large local (on-stack) arrays or structures in handlers
For example, in kernels prior to 5.15, improper handling of network drivers could trigger this issue during high packet rates.
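The stack arithmetic behind these causes can be sketched with a small helper. This is only an illustration: stack_budget is a hypothetical function, the frame sizes passed to it are made-up numbers, and the 8192-byte limit is the 8 KB figure mentioned above, not a measurement of any particular kernel.

```shell
# Add up per-call frame sizes and report whether they fit in a fixed stack.
# Usage: stack_budget STACK_BYTES FRAME_BYTES...
stack_budget() {
    limit=$1
    shift
    used=0
    for frame in "$@"; do
        used=$((used + frame))
        if [ "$used" -gt "$limit" ]; then
            echo "overflow after $used bytes (limit $limit)"
            return 1
        fi
    done
    echo "ok: $used of $limit bytes used"
}
```

For instance, stack_budget 8192 4096 4096 512 models two calls that each place a 4 KB buffer on the stack, followed by one more modest frame. The point is that a couple of large locals can use up a small fixed stack long before recursion depth becomes a concern.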
Common Triggers
SoftIRQ stack overflows often stem from:
- Kernel modules with inefficient softirq handlers (e.g., under-optimized NIC drivers)
- Missing workqueue offloading for tasks that should run in process context
- Improper use of spinlocks or other synchronization primitives in softirqs
- Custom kernel patches that alter softirq behavior without proper testing
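The overflow can cause undefined behavior, including data corruption or kernel crashes, because stack growth past the end of the stack overwrites adjacent memory. On kernels built with virtually mapped stacks (CONFIG_VMAP_STACK), an overflow may instead hit a guard page and fault immediately.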
Diagnosis Tools and Techniques
Log Analysis with dmesg and /var/log/kern.log
Check for messages like:
kernel: [IRQ] softirq: softirq stack overflow
These logs often include the CPU ID and the softirq type (e.g., NET_RX, TIMER). Use the following command to filter:
grep 'softirq' /var/log/kern.log
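Because the exact wording of overflow reports varies by kernel version and architecture, a slightly broader filter can help. The sketch below is a hypothetical helper (scan_stack_msgs is not an existing tool), and its patterns are assumptions to adjust against what your kernel actually prints.

```shell
# Print log lines that look like kernel stack-overflow reports.
# Usage: scan_stack_msgs [FILE]   (reads stdin when FILE is omitted)
scan_stack_msgs() {
    # Illustrative patterns only; extend them for your kernel's messages.
    pattern='stack overflow|stack guard page'
    if [ "$#" -gt 0 ]; then
        grep -i -E "$pattern" "$1"
    else
        grep -i -E "$pattern"
    fi
}
```

Typical use would be scan_stack_msgs /var/log/kern.log, or dmesg | scan_stack_msgs on systems that do not keep a kern.log file.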
Performance Profiling with perf
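Run perf top, or perf record followed by perf report, to identify high softirq CPU usage. To count softirq activity directly, use the softirq tracepoints, which live in the irq event group (irq:softirq_raise, irq:softirq_entry, irq:softirq_exit). Example:
perf stat -e irq:softirq_entry -a sleep 10
This counts how often softirqs are entered system-wide during a 10-second window. It reports a total, not a breakdown by softirq type.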
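To see which softirq types dominate, the tracepoint output can be tallied by type. The sketch below assumes each event line contains a field of the form [action=NAME], which is how the irq:softirq_entry tracepoint is normally printed; tally_softirq_actions is a hypothetical helper name, and the format is worth confirming on your system first.

```shell
# Count softirq events per type from tracepoint text output.
# Usage: tally_softirq_actions [FILE...]   (reads stdin when no file is given)
tally_softirq_actions() {
    awk 'match($0, /action=[A-Z_]+/) {
        # Strip the 7-character "action=" prefix to keep only the type name.
        count[substr($0, RSTART + 7, RLENGTH - 7)]++
    }
    END { for (name in count) printf "%s %d\n", name, count[name] }' "$@" |
        sort -k2,2nr -k1,1
}
```

One way to feed it is perf record -e irq:softirq_entry -a sleep 10 followed by perf script | tally_softirq_actions. The output is one line per softirq type, most frequent first.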
Per-CPU SoftIRQ Monitoring
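Inspect /proc/softirqs to track softirq counts per CPU:
cat /proc/softirqs
Each row is a softirq type and each column is a CPU. Look for fast-growing values in the NET_RX or NET_TX rows, which indicate heavy networking softirq activity. The counters are cumulative since boot, so compare two readings taken a few seconds apart rather than judging a single snapshot.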
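Reading the raw table gets tedious on machines with many CPUs. As a rough aid, the sketch below sums each softirq type across all CPUs; softirq_totals is a hypothetical helper, and it assumes the usual layout of one header line followed by one "NAME: count count ..." line per type.

```shell
# Sum each softirq type across CPUs and print "NAME TOTAL", largest first.
# Usage: softirq_totals [FILE]   (defaults to /proc/softirqs)
softirq_totals() {
    awk 'NR > 1 {
        name = $1
        sub(/:$/, "", name)            # drop the trailing colon from the label
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        printf "%s %.0f\n", name, total
    }' "${1:-/proc/softirqs}" | sort -k2,2nr
}
```

Since the counts are cumulative since boot, running softirq_totals twice a few seconds apart and comparing the two outputs gives a better picture of current load than a single run.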
Step-by-Step Resolution
1. Verify Kernel Version and Configuration
Check the kernel version with uname -r. Ensure CONFIG_DEBUG_VM and CONFIG_DEBUG_STACKOVERFLOW are enabled for stack validation; note that CONFIG_DEBUG_STACKOVERFLOW is only offered on some architectures. For kernels with limited softirq stack size, consider applying a patch or using a recent version.
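Whether these options are enabled can be checked against the running kernel's config file. The sketch below is a hypothetical helper (check_kconfig), and where the config file lives is an assumption: many distributions install it as /boot/config-$(uname -r), while others only expose /proc/config.gz, which must be decompressed first.

```shell
# Report whether each named option is enabled (=y or =m) in a kernel config file.
# Usage: check_kconfig FILE OPTION...
check_kconfig() {
    cfg=$1
    shift
    for opt in "$@"; do
        if grep -q -E "^${opt}=(y|m)\$" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: not enabled"
        fi
    done
}
```

A typical invocation would be check_kconfig "/boot/config-$(uname -r)" CONFIG_DEBUG_VM CONFIG_DEBUG_STACKOVERFLOW. An option reported as not enabled may simply be unavailable on that architecture.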
2. Analyze SoftIRQ Handler Code
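Identify the source of the overflow by inspecting the kernel source or loadable modules. Look for handlers registered with open_softirq() and for the call sites that trigger them with raise_softirq(). Example:
void my_softirq_handler(struct softirq_action *h) {
    /* Long-running or recursive logic here */
}
If the handler runs long or recursive logic in place instead of offloading it to process context, it risks stack overflow. The handler signature shown here may differ between kernel versions, so check it against your tree.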
3. Adjust SoftIRQ Stack Size (Advanced)
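Increasing the softirq stack size requires rebuilding the kernel. The option below is shown for illustration; it does not appear to be a standard mainline Kconfig symbol, so confirm that your kernel tree provides it before relying on it. In mainline, stack sizes are set by architecture constants such as THREAD_SIZE and, on x86-64, IRQ_STACK_SIZE.
CONFIG_SOFTIRQ_STACK_SIZE=16384
If your tree supports it, set the value in .config and recompile the kernel. This is a last-resort fix and should be paired with code optimization.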
4. Apply Workqueue Offloading
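Move non-essential tasks from softirq context to workqueues. Example:
static void my_task_func(struct work_struct *work);
static DECLARE_WORK(my_work, my_task_func);
/* In the softirq handler: queue the heavy part instead of running it here. */
schedule_work(&my_work);
This reduces softirq stack pressure by delegating work to process context, where my_task_func() runs on a kernel worker thread with its own stack and is allowed to sleep.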
5. Update or Revert Drivers
If the issue is driver-specific, update to the latest version or revert to a stable release. For example, patching a network driver to avoid recursive softirq calls can resolve the issue.
Example Code and Debugging
Sample SoftIRQ Handler with Stack Overflow Risk
Below is a flawed example of a softirq handler:
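void risky_softirq(struct softirq_action *h) {
    for (int i = 0; i < 100000; i++) {
        /* Recursive or long-loop logic without yielding */
    }
}
The loop is bounded and does not consume stack by itself. The risk comes from what runs inside it: recursive or deeply nested calls and large local variables add stack frames, while the long loop keeps the CPU in softirq context where further interrupts can nest on top.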
Debugging with kprobe and ftrace
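Use kprobe events to trace softirq handler entry and exit. The event must be enabled before anything is recorded:
echo 'p:softirq_entry my_softirq_handler' > /sys/kernel/debug/tracing/kprobe_events
echo 'r:softirq_exit my_softirq_handler' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace
This shows when the handler is entered and when it returns, which helps spot long-running or nested invocations.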
Conclusion
Resolving softirq stack overflows requires a solid understanding of how the kernel handles interrupts and stack memory. Prioritize offloading work to process context, keep handlers short with shallow call chains and small stack frames, and monitor system logs for early detection. For developers, rigorous testing of softirq handlers under high load is critical to avoid such issues.