In C++ (std::atomic::fetch_add), we often treat atomic Read-Modify-Write (RMW) operations as fast, "lock-free" primitives. However, I am interested in understanding the extreme latency bounds of these operations at the hardware level.
Beyond the obvious case of high thread contention (cache line bouncing), I want to know: is it possible to observe a single fetch_add operation stalling the CPU's instruction pipeline for a long time? If so, what hardware or system-level conditions can trigger this, even if the atomic variable is not heavily contended?
For example:
    #include <thread>
    #include <atomic>

    int main(){
        std::atomic<int> v = 0;
        auto t1 = std::thread([&](){
            v.fetch_add(1,std::memory_order::relaxed); // #1
        });
        auto t2 = std::thread([&](){
            v.fetch_add(1,std::memory_order::relaxed); // #2
        });
        t1.join();
        t2.join();
    }

Can we observe a complete invocation of either #1 or #2 taking a long time?
