What hardware conditions can cause an atomic fetch_add (RMW) to significantly stall control flow?


In C++ (std::atomic::fetch_add), we often treat atomic Read-Modify-Write (RMW) operations as fast, "lock-free" primitives. However, I am interested in understanding the extreme latency bounds of these operations at the hardware level.

Beyond what I assume is the obvious case, high thread contention (cache line bouncing), I want to know: is it possible for a single fetch_add operation to stall the CPU's instruction pipeline for a long time? If so, what hardware or system-level conditions can trigger this, even when the atomic variable is not heavily contended?

For example:

    #include <thread>
    #include <atomic>

    int main() {
        std::atomic<int> v = 0;
        auto t1 = std::thread([&]() {
            v.fetch_add(1, std::memory_order::relaxed); // #1
        });
        auto t2 = std::thread([&]() {
            v.fetch_add(1, std::memory_order::relaxed); // #2
        });
        t1.join();
        t2.join();
    }

Can a complete invocation of either #1 or #2 be observed to take a long time?
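
To make "a long time" concrete, here is a minimal sketch of one way the per-call latency could be measured. It is only an illustration under assumptions: the names `LatencyStats` and `measure_fetch_add` are hypothetical, the loop count is arbitrary, and `std::chrono::steady_clock` is used for timing, so each sample also includes the cost of reading the clock and any interruption of the thread between the two reads. It therefore gives an upper bound on what a single fetch_add cost, not a precise hardware figure.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>
#include <thread>

// Hypothetical helper type (not part of the question): summary of the timings.
struct LatencyStats {
    std::int64_t min_ns;  // fastest observed call, in nanoseconds
    std::int64_t max_ns;  // slowest observed call, in nanoseconds
    int final_value;      // value of the counter after both threads finish
};

// Runs two threads that each perform `iterations` relaxed fetch_add calls on
// the same atomic, timing every call individually with steady_clock.
inline LatencyStats measure_fetch_add(int iterations) {
    assert(iterations > 0);
    std::atomic<int> v{0};
    std::int64_t min_ns[2];
    std::int64_t max_ns[2];

    auto worker = [&](int id) {
        std::int64_t lo = std::numeric_limits<std::int64_t>::max();
        std::int64_t hi = 0;
        for (int i = 0; i < iterations; ++i) {
            auto start = std::chrono::steady_clock::now();
            v.fetch_add(1, std::memory_order_relaxed);
            auto stop = std::chrono::steady_clock::now();
            // Elapsed time includes clock overhead and any preemption.
            std::int64_t ns =
                std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
            lo = std::min(lo, ns);
            hi = std::max(hi, ns);
        }
        // Each thread writes only its own slot, so there is no data race here.
        min_ns[id] = lo;
        max_ns[id] = hi;
    };

    std::thread t1(worker, 0);
    std::thread t2(worker, 1);
    t1.join();
    t2.join();

    return LatencyStats{std::min(min_ns[0], min_ns[1]),
                        std::max(max_ns[0], max_ns[1]),
                        v.load(std::memory_order_relaxed)};
}
```

Calling, for example, `measure_fetch_add(1000000)` from `main` and printing `max_ns` would show the slowest single call seen in that run. A large outlier in such a measurement does not by itself say whether the delay came from the hardware executing the RMW or from something else that happened between the two clock reads, which is exactly the distinction I am asking about.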
