It seems that it really is very hard to delete latches in C++: when arrive_and_wait() returns, all you know is that every other thread has also called arrive_and_wait(), but it's not actually safe to destroy the latch at that point. And unfortunately, there's no way to know when it does become safe to delete it.
From what I gather, latches are used as global variables to facilitate concurrent initialization of different things. But they aren't something you can use in each iteration of a loop.
Fortunately, it's not that hard to design the kind of latch that I want, which would be something like this:
    #include <atomic>
    #include <cassert>
    #include <thread>

    struct Latch {
        const int size_;
        std::atomic_int count_;

        explicit Latch(int size) : size_(size), count_(size) {}

        ~Latch() {
            int n = count_.load(std::memory_order_relaxed);
            if (n == -size_) return;
            assert(n <= 0);  // Can't delete before all threads have arrived
            // Wait for the stragglers still leaving arrive_and_wait().
            while (count_.load(std::memory_order_relaxed) != -size_)
                std::this_thread::yield();
        }

        void arrive_and_wait() {
            auto n = count_.fetch_sub(1, std::memory_order_release) - 1;
            if (n <= 0)
                count_.notify_all();
            else
                do {
                    count_.wait(n, std::memory_order_relaxed);
                    n = count_.load(std::memory_order_relaxed);
                } while (n > 0);
            count_.fetch_sub(1, std::memory_order_acquire);
        }
    };

On one hand, this likely causes more coherence misses than std::latch, which is presumably why std::latch doesn't behave that way. On the other hand, you do need some kind of acquire operation after all threads have hit the barrier, to ensure that everything any thread did before the barrier inter-thread happens before everything every other thread does after the barrier. So you could luck out and have the fetch_sub happen right after the load and not require an extra cache miss.
