Each CPU architecture has its own memory consistency model and its own means to control the visibility of memory operations among different cores and hardware threads. The memory consistency models of high-level programming languages, on the other hand, are an abstraction layer over the hardware models. They aim to provide portability while maintaining efficiency and, hopefully, ease of use.

C++ is among the few languages that have formalized their memory consistency models (another prominent example is Java). Unfortunately, by studying only the mathematics of high-level memory consistency models, it’s impossible to understand how they relate to the underlying hardware.

In this blog post I attempt to shed light on this relationship using a few simple C++ programs that I test on several popular CPU architectures.

What are C++ atomics?

In C++, unlike in managed environments such as the JVM or .NET, data races are prohibited and lead to undefined behavior. A data race occurs when two threads access the same variable at the same time and at least one of those accesses is a write. This definition and the associated undefined behavior mean that regular writes (stores) and reads (loads) can’t be used for inter-thread synchronization: such accesses themselves have to be synchronized by some other means, such as mutexes or atomics.

I will not discuss mutexes, condition variables and the like. I’m also only interested in atomic types that are lock-free (std::atomic<T>::is_lock_free() returns true). These are the specializations of std::atomic<T> for standard scalar types.
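Lock-freedom can be checked directly. Here is a minimal sketch, assuming any reasonably recent compiler; since C++17 the compile-time constant is_always_lock_free is also available:

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<int> counter{0};
    // Run-time query on the object itself:
    std::printf("lock-free: %d\n", static_cast<int>(counter.is_lock_free()));
    // Compile-time query for the specialization (C++17):
    std::printf("always lock-free: %d\n",
                static_cast<int>(std::atomic<int>::is_always_lock_free));
}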

The type std::atomic<T> for scalar type T provides several properties:

  • data race immunity
  • atomicity
  • memory ordering guarantees

Data race immunity means that no concurrent access to an atomic variable can lead to a data race. Atomic variables have to be used in all cases where concurrent access is not synchronized by other means.

Atomicity means that operations that simultaneously require both read and write access (increment or exchange, for example) are executed atomically, even if other threads attempt to change the variable’s value at the same time.

Memory ordering guarantees are the main subject of this post and will be discussed later.

Before turning our attention to memory orders, let’s discuss the first two properties by looking at a classical example of a multithreaded program: incrementing a counter from two threads. In the example below variable x is not atomic. The behavior of the program is thus undefined due to a data race.

IncrementRegular

C++17, require(_x == 2)
Thread 0: x++;
Thread 1: x++;

x86_64 FAILURE

incl 24(%rdx)

incl 24(%rdx)

ARMv7 32-bit FAILURE

ldr r3, [r1, #20]
add r3, r3, #1
str r3, [r1, #20]

ldr r3, [r1, #20]
add r3, r3, #1
str r3, [r1, #20]

ARMv8 64-bit FAILURE

ldr w10, [x9, #24]
add w10, w10, #1
str w10, [x9, #24]

ldr w10, [x9, #24]
add w10, w10, #1
str w10, [x9, #24]

MIPS 32-bit FAILURE

lw $2,20($4)
addiu $2,$2,1
sw $2,20($4)

lw $2,20($4)
addiu $2,$2,1
sw $2,20($4)

The table contains a C++ program with two threads, both incrementing the variable x, along with the assembly code generated for several architectures. For each architecture, the first listing is the code generated for Thread 0 and the second is the code for Thread 1.

I use the following compilers and platforms:

Architecture    Compiler  CPU/Platform
x86_64          GCC       Intel Core i7-6700K
ARMv7 (32-bit)  GCC       Broadcom BCM2835
ARMv8 (64-bit)  Clang     Qualcomm SDM660
MIPS (32-bit)   GCC       MediaTek MT7621

In this and the following C++ programs, all variables are initialized to zero. Both threads begin their execution at the same time, and after they both finish, the computed result (in this case the value of the variable _x) is checked against the expected value (_x == 2). The program is executed several thousand times, because many concurrency effects manifest themselves only occasionally1. If the expected result is produced in all runs, the architecture’s row is marked SUCCESS, otherwise FAILURE. Also note that a thousand successful runs can be a false positive, subject to timing, while even a single failure is always definitive.
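For concreteness, here is a minimal sketch of the kind of harness this description implies, using the IncrementRegular test from above. It is not the author’s actual test code (that is linked in the footnotes) and the iteration count is arbitrary:

#include <cassert>
#include <thread>

int main() {
    for (int run = 0; run < 100000; ++run) {
        int x = 0;                      // deliberately non-atomic: this is the data race
        std::thread t0([&] { x++; });
        std::thread t1([&] { x++; });
        t0.join();
        t1.join();
        assert(x == 2);                 // may fail only occasionally, if at all
    }
}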

Undefined behavior is a property of C++ programs. Once a program has been compiled, there is no undefined behavior left, even if the result is unexpected. The generated code is well-defined from the CPU’s standpoint.

In the example above, the lack of atomicity is clearly visible on all architectures except x86. An increment operation consists of the following steps:

  1. Load a value from a memory location.
  2. Increment the value.
  3. Store the result back to the memory location.

x86 does the same thing, but with a single instruction. The lack of atomicity becomes apparent once that instruction is decoded into micro-operations before being pushed into the pipeline.

None of the tested architectures provide atomicity for the non-atomic increment.

Now, we change the type of variable _x from int to std::atomic<int>. We also make all accesses relaxed to prevent the compiler from generating unnecessary instructions to provide memory ordering.

IncrementRelaxed

C++17, require(_x == 2)
Thread 0: x.fetch_add(1, std::memory_order_relaxed);
Thread 1: x.fetch_add(1, std::memory_order_relaxed);

x86_64 SUCCESS

lock incl 24(%rdx)

lock incl 24(%rdx)

ARMv7 32-bit SUCCESS

add r3, r3, #20
.L660:
ldrex r1, [r3]
add r1, r1, #1
strex r0, r1, [r3]
cmp r0, #0
bne .L660

add r3, r3, #20
.L1440:
ldrex r1, [r3]
add r1, r1, #1
strex r0, r1, [r3]
cmp r0, #0
bne .L1440

ARMv8 64-bit SUCCESS

add x9, x9, #24
.LBB37_17:
ldxr w10, [x9]
add w10, w10, #1
stxr w11, w10, [x9]
cbnz w11, .LBB37_17

add x9, x9, #24
.LBB40_17:
ldxr w10, [x9]
add w10, w10, #1
stxr w11, w10, [x9]
cbnz w11, .LBB40_17

MIPS 32-bit SUCCESS

sync
1:
ll $1,20($2)
addiu $1,$1,1
sc $1,20($2)
beq $1,$0,1b
nop
sync

sync
1:
ll $1,20($2)
addiu $1,$1,1
sc $1,20($2)
beq $1,$0,1b
nop
sync

Note that the generated code has changed significantly. x86 still uses a single instruction, but now with a lock prefix. The prefix provides a full memory barrier in addition to atomicity; there is no way on x86 to request atomicity alone. This lack of granularity is common: architectures often give you stronger guarantees than the ones you asked for.

The other architectures all introduce a loop and employ a kind of optimistic concurrency: first they load and modify the value, then they attempt to store it back. The store succeeds only if no one has changed that memory location in the meantime; otherwise the whole sequence is repeated. This mechanism, called load-link/store-conditional, is a more general way of providing atomicity than exposing a predefined set of atomic instructions.

Similarly to increment, other atomic operations (including compare-and-swap, or CAS) are either built into the instruction set (x86, and CAS on ARMv8) or are emulated using a load-link/store-conditional loop.
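To connect this back to C++: a relaxed fetch_add could, in principle, be written by hand as a CAS loop, which mirrors the LL/SC retry loops above. This is a sketch for illustration only; std::atomic already provides fetch_add, and the name fetch_add_via_cas is mine:

#include <atomic>

int fetch_add_via_cas(std::atomic<int>& x, int delta) {
    int expected = x.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously (like an strex/stxr/sc that
    // loses its reservation), so it is retried in a loop. On failure it
    // reloads 'expected' with the variable's current value.
    while (!x.compare_exchange_weak(expected, expected + delta,
                                    std::memory_order_relaxed)) {
        // retry
    }
    return expected;   // the value observed just before the successful update
}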

What is memory order?

Every function that reads or writes memory has an order of memory accesses encoded in its source code. When we write a piece of code, we expect the CPU to read and write memory in exactly the order written, or at least we are happy with the illusion that it does. This sequence of loads and stores is called the program order.

If we have several threads, each operating on the same set of shared variables, we might expect that there is a global modification order: an interleaving of each thread’s program orders. Each thread sees (using loads) the variables change in the same order. Programs that behave like that are said to be sequentially consistent2.

When looking at each thread individually, one might come up with many optimizations that eliminate unnecessary loads and stores, possibly reordering the remaining ones. What’s important is that those transformations maintain the illusion of program order for individual threads. These transformations are done by both the compiler and the CPU.

The following are some sources of memory order transformations:

  • Compiler optimizations.
  • CPU write buffers.
  • Speculative execution.
  • CPU memory caches.

Unfortunately, the illusion of program order breaks down when the same memory locations are modified concurrently. The compiler doesn’t take into account concurrent access when it optimizes regular (non-atomic) loads and stores. It is allowed to do so because any concurrent access to regular variables is a data race and hence undefined behavior. In addition, the compiler is free to reorder relaxed atomic accesses to different memory locations.

At the level of assembler instructions, all CPU architectures specify which memory reorders are allowed and which are prohibited. Additionally, they provide means to limit the allowed transformations, either partially or completely, allowing the programmer to achieve sequential consistency where it’s necessary. Or, at least, reduce the set of possible executions enough to make a particular algorithm implementation correct.

Here is an example of a program that demonstrates the effects of such transformations:

SpinUseRelaxed

C++17, require(r1 == 1)
Thread 0: y.store(1, std::memory_order_relaxed);
          x.store(1, std::memory_order_relaxed);
Thread 1: while (x.load(std::memory_order_relaxed) == 0) ;
          r1 = y.load(std::memory_order_relaxed);

x86_64 SUCCESS

movl $1, 28(%rdx)
movl $1, 24(%rdx)

leaq 24(%rdx), %rcx
.L1009:
movl (%rcx), %esi
testl %esi, %esi
je .L1009
movl 28(%rdx), %edx

ARMv7 32-bit SUCCESS

str r7, [r3, #24]
str r7, [r3, #20]

add r1, r3, #20
.L1088:
ldr r0, [r1]
cmp r0, #0
beq .L1088
ldr r3, [r3, #24]

ARMv8 64-bit FAILURE

str w23, [x9, #28]
str w23, [x9, #24]

.LBB67_17:
ldr w10, [x9, #24]
cbz w10, .LBB67_17
ldr w10, [x9, #28]

MIPS 32-bit SUCCESS

sw $19,24($2)
sw $19,20($2)

addiu $4,$2,20
$L466:
lw $5,0($4)
beq $5,$0,$L466
lw $2,24($2)

This is a simple synchronization protocol. Thread 0 stores 1 to the variable y and then signals that it has done its job by setting the flag x. Thread 1 waits until the flag is set and then reads the stored value from the variable y. The protocol ensures that Thread 0 executes completely before the loop in Thread 1 exits. Or at least it would, if the two stores in Thread 0 were guaranteed to execute in their program order (from Thread 1’s perspective). There are two reasons why r1 might not end up containing 1:

  1. The stores in Thread 0 are reordered (Store/Store reordering).
  2. The load of y after the loop in Thread 1 is executed before the loop exits, i.e. before the final load of x (Load/Load reordering).

Store/Store reordering means that two stores (not necessarily adjacent) are reordered. Similarly, Load/Load reordering means that two loads are reordered; here it would let the load of y be satisfied before the load of x that terminates the loop. x86 doesn’t allow either of these transformations, but the other architectures do.

Relaxed C++ atomics don’t prohibit any of the four possible reorders: Load/Store, Load/Load, Store/Load and Store/Store.

It’s very important to understand that weak (non-sequentially-consistent) atomics don’t have any global order on which different threads must agree, even within a single program execution. One thread might see one order of writes, while another thread sees a completely different order. The observed order of writes to different variables is relative to each thread. This relativity is another reason why programming effectively with weak atomics might require trading your sanity for the privilege.
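The classic IRIW (independent reads of independent writes) litmus test makes this relativity concrete. The sketch below is illustrative only and is not one of the tests run in this post:

#include <atomic>

std::atomic<int> x{0}, y{0};

void writer_x() { x.store(1, std::memory_order_release); }
void writer_y() { y.store(1, std::memory_order_release); }

void reader_xy(int& r1, int& r2) {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}

void reader_yx(int& r3, int& r4) {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

// With acquire/release (or relaxed) operations, the outcome
// r1 == 1, r2 == 0, r3 == 1, r4 == 0 is allowed: reader_xy sees the store
// to x before the store to y, while reader_yx sees the opposite order.
// Making every operation memory_order_seq_cst forbids this outcome.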

Before discussing C++ memory orders, we need to understand one set of ordering guarantees on which all other memory orders are built.

Cache coherence

Cache coherence is a property of most modern architectures. It says that accesses to a single memory location create a global modification order on which all threads agree. And this order respects the program order of each participating thread.

This definition is very similar to sequential consistency, but limits itself to a single memory location.

Without this property, making any other memory order guarantee is very difficult, and architectures that don’t provide cache coherence are nightmarish to program against.

The practical result of cache coherence is that CPU caches are made invisible to threads accessing a single memory location. There’s no need to worry about stale values or manual cache flushes (using memory barriers). All of this is taken care of by a cache coherence protocol, such as MOESI or similar.

C++ recognizes the almost3 universal availability of cache coherence on modern architectures, and explicitly forbids all reorders of memory accesses to the same atomic variable. In other words, the compiler guarantees that the program order of the C++ source code is preserved in the generated code (for accesses to the same variable).

Once again, there is no guarantee as to how the modification orders of different variables relate to each other.
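For a single variable, though, the guarantee is very concrete. A small illustrative sketch (not one of the tests above):

#include <atomic>

std::atomic<int> x{0};

void thread0() {
    // Two relaxed stores to the SAME variable keep their program order.
    x.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}

void thread1() {
    int r1 = x.load(std::memory_order_relaxed);
    int r2 = x.load(std::memory_order_relaxed);
    // Coherence forbids r1 == 2 && r2 == 1: once this thread has seen 2,
    // it can never observe the variable "go back" to 1.
    (void)r1; (void)r2;
}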

Atomics summary

The following summarizes the common properties of the different memory orders in C++.

Relaxed
Intuition: cache coherence.
Properties: atomicity, data race immunity.

Acquire-Release
Intuition: a release packages all preceding memory effects together with the stored value; an acquire loads the stored value and immediately sees all the packaged effects. Alternatively, an acquire-release pair creates a critical section, similar to a lock.
Properties: atomicity, data race immunity. A release (store) has to synchronize with an acquire (load), otherwise no synchronization happens.

Sequential Consistency
Intuition: an interleaving of the program orders of all threads.
Properties: atomicity, data race immunity. A single global modification order on which all threads agree.

Read-modify-write operations, such as increment, can have different memory orders for the read and write parts.
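For example, here is a sketch of how those orders are specified (flag is a hypothetical std::atomic<int>):

#include <atomic>

std::atomic<int> flag{0};

void example() {
    // memory_order_acq_rel on a read-modify-write makes its load part an
    // acquire and its store part a release.
    flag.fetch_add(1, std::memory_order_acq_rel);

    // compare_exchange takes two orders: one used when the exchange succeeds
    // (a read-modify-write) and a weaker one for the failure case, which is
    // only a load.
    int expected = 1;
    flag.compare_exchange_strong(expected, 2,
                                 std::memory_order_acq_rel,   // success
                                 std::memory_order_acquire);  // failure
}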

The additional release-consume memory order is widely implemented the same way as acquire-release and doesn’t provide performance benefits on any platform or compiler. Due to the lack of efficient implementations, the C++20 standard currently discourages the use of consume operations, recommending using acquire-release instead. I won’t discuss consume operations further.

Formal definitions

C++11 first introduced a formal set of memory orders: relaxed, acquire-release, release-consume and sequential consistency. Since then, a few problems have been discovered, most notably:

  • Consume operations turned out to be unusable.
  • Pairing acquire-release operations with sequentially consistent operations was shown to require more overhead on some architectures than had previously been thought.

As a result, C++20 revised its formal memory consistency model by introducing new definitions and making the model even less comprehensible in the process.

I don’t want to repeat the C++ standard. Especially since the formal definitions will most certainly evolve in the future4. Instead, I will briefly explain the intuition behind different memory orders and show a few examples.

Relaxed memory order

As already discussed above, relaxed atomics don’t guarantee any ordering of accesses to different memory locations, but they do provide the cache coherence guarantee. This lack of ordering severely reduces the applicability of relaxed atomics, mostly limiting their use to simple counters.

We’ve already seen that it’s impossible to implement even the simplest of synchronization protocols using relaxed atomics. Don’t use them in any context where there’s a need to order memory accesses.
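What relaxed atomics are good for is exemplified by a plain event counter. A minimal sketch:

#include <atomic>
#include <cstdint>

// A typical legitimate use of relaxed atomics: a statistics counter whose
// value is only ever read for reporting and which doesn't publish any other
// data to the readers.
std::atomic<std::uint64_t> events_handled{0};

void on_event() {
    // Atomicity and cache coherence are enough here; no ordering with
    // respect to other memory locations is required.
    events_handled.fetch_add(1, std::memory_order_relaxed);
}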

Acquire-release memory order

In this order, each release store to a variable also releases all values previously stored to other locations, by any type of store, including regular and relaxed stores. Any acquire load in another thread that reads the stored value also acquires all the memory effects released by that store, meaning that any subsequent read (of any type) after the acquire load sees the released values.

This visibility guarantee is only provided to threads that establish the acquire-release relation between themselves by writing and then reading the same variable. Other threads can’t make any assumptions about the order of stores and loads of any other threads. In other words, there is no global modification order across different variables and threads.

No previous store or load (in program order) of any type can pass a release-store. Similarly, no subsequent store or load can pass an acquire-load. It’s often said that release and acquire operations behave as one-way memory barriers.

Another popular metaphor for acquire-release pairs is that of a critical section. In this view an acquire load is analogous to locking a mutex and opens a critical section, while a release store is analogous to an unlock operation and closes the critical section. One can safely move additional loads and stores into that critical section, but one is prohibited from moving loads and stores out of it.

This metaphor can be misleading: with a real mutex the lock operation precedes the unlock operation and both are executed in the same thread, while in the acquire-release analogy the release operation logically precedes the associated acquire operation, and the two happen in different threads.
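The metaphor does point in the right direction, though: acquire and release are exactly what a lock implementation itself needs. A minimal spinlock sketch (illustrative only, not covered further in this post):

#include <atomic>

class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // The successful exchange is an acquire: it opens the critical section.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // spin; a production lock would back off or yield here
        }
    }
    void unlock() {
        // The release store closes the critical section, so nothing inside
        // it can be moved past the unlock.
        locked_.store(false, std::memory_order_release);
    }
};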

In practice, CPU architectures rarely provide acquire-release semantics directly5, nor do they provide one-way memory barriers, so implementations have to use stronger two-way barriers. Notably, x86 is strong enough to provide acquire-release semantics without any extra overhead for all loads and stores.

We’ve already tried implementing a simple synchronization protocol using relaxed atomics and failed. Turns out acquire-release semantics are exactly the right choice for this kind of synchronization:

SpinUseAcqRelRegular

C++17, require(r1 == 1)
Thread 0: regular_y = 1;
          x.store(1, std::memory_order_release);
Thread 1: while (x.load(std::memory_order_acquire) == 0) ;
          r1 = regular_y;

x86_64 SUCCESS

movl $1, 28(%rdx)
movl $1, 24(%rdx)

leaq 24(%rdx), %rcx
.L263:
movl (%rcx), %esi
testl %esi, %esi
je .L263
movl 28(%rdx), %edx

ARMv7 32-bit SUCCESS

str r7, [r3, #24]
dmb ish
str r7, [r3, #20]

add r1, r3, #20
.L677:
ldr r0, [r1]
dmb ish
cmp r0, #0
beq .L677
ldr r3, [r3, #24]

ARMv8 64-bit SUCCESS

str w23, [x9, #28]
add x9, x9, #24
stlr w23, [x9]

.LBB85_17:
add x10, x9, #24
ldar w10, [x10]
cbz w10, .LBB85_17
ldr w10, [x9, #28]

MIPS 32-bit SUCCESS

sw $19,24($2)
sync
sw $19,20($2)

addiu $4,$2,20
$L1131:
lw $5,0($4)
sync
beq $5,$0,$L1131
lw $2,24($2)

Note that on x86 the generated assembly is very similar to what we saw for relaxed atomics; no additional barriers are inserted. ARMv7 and MIPS add explicit barriers: the dmb and sync instructions. ARMv8 uses special load and store instructions (ldar and stlr) that provide acquire-release semantics. In fact, they give stronger guarantees than acquire-release requires: they additionally provide a global modification order and as such implement sequential consistency.
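The same acquire-release pattern scales beyond a single flag. Here is a sketch of publishing a whole object through a pointer (the Message type and the function names are hypothetical, not part of the tests above):

#include <atomic>
#include <string>

struct Message {
    int id;
    std::string payload;
};

std::atomic<Message*> mailbox{nullptr};

// Producer: fully construct the object, then publish the pointer with a
// release store; the construction is "packaged" together with the pointer.
void producer() {
    Message* m = new Message{42, "hello"};
    mailbox.store(m, std::memory_order_release);
}

// Consumer: an acquire load that observes the pointer also observes the
// fully constructed object it points to.
void consumer() {
    Message* m;
    while ((m = mailbox.load(std::memory_order_acquire)) == nullptr) {
        // spin until published
    }
    // Safe to read m->id and m->payload here.
    delete m;
}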

Even though a number of useful algorithms can be implemented using acquire-release semantics, there are situations where there’s a need for stronger guarantees. Consider the following program:

SeqCstAcqRel2

C++17, require(_r1 == 1 || _r2 == 1)
Thread 0: x.store(1, std::memory_order_release);
          r1 = y.load(std::memory_order_acquire);
Thread 1: y.store(1, std::memory_order_release);
          r2 = x.load(std::memory_order_acquire);

x86_64 FAILURE

movl $1, 24(%rdx)
movl 28(%rdx), %ecx

movl $1, 28(%rdx)
movl 24(%rdx), %ecx

ARMv7 32-bit FAILURE

dmb ish
str r7, [r3, #20]
ldr r0, [r3, #24]
dmb ish

dmb ish
str r7, [r3, #24]
ldr r0, [r3, #20]
dmb ish

ARMv8 64-bit SUCCESS

add x10, x9, #24
add x11, x9, #28
stlr w23, [x10]
ldar w10, [x11]

add x10, x9, #28
add x11, x9, #24
stlr w23, [x10]
ldar w10, [x11]

MIPS 32-bit SUCCESS

sync
sw $18,20($2)
lw $5,24($2)
sync

sync
sw $18,24($2)
lw $5,20($2)
sync

Intuitively, we assume that r1 and r2 can never both be zero: no interleaving of the program orders produces this result. But acquire-release isn’t strong enough to guarantee that. A release store followed by an acquire load of a different variable may still be reordered (the Store/Load reorder, which even x86 permits), and a load that reads zero doesn’t synchronize with anything, so no ordering is enforced.

Sequential consistency memory order

Sequential consistency is the most intuitive memory order. Any execution of a program is simply an interleaving of each thread’s program orders. Unfortunately, it’s also the slowest one.

This is the default memory order of C++ atomics. As long as no weaker memory order is used anywhere, C++ guarantees sequential consistency for data-race-free programs.
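In code, simply omitting the order argument gives you sequential consistency. A trivial sketch:

#include <atomic>

std::atomic<int> x{0};

void f() {
    x.store(1);                               // same as the next line
    x.store(1, std::memory_order_seq_cst);    // the explicit default
    int v = x.load();                         // loads default to seq_cst too
    (void)v;
}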

Sequential consistency provides the same guarantees as the acquire-release memory order, but additionally introduces a single global modification order across all threads. Sequentially consistent operations are strictly stronger than acquire-release operations and can be combined with them. For example, a release store can synchronize with a sequentially consistent load. Such a combination only provides acquire-release semantics, though: the guarantee of a global modification order is lost.

The example from the previous section works as expected using the sequential consistency memory order:

SeqCst2

C++17, require(_r1 == 1 || _r2 == 1)
Thread 0: x.store(1, std::memory_order_seq_cst);
          r1 = y.load(std::memory_order_seq_cst);
Thread 1: y.store(1, std::memory_order_seq_cst);
          r2 = x.load(std::memory_order_seq_cst);

x86_64 SUCCESS

movl $1, 24(%rdx)
mfence
movl 28(%rdx), %ecx

movl $1, 28(%rdx)
mfence
movl 24(%rdx), %ecx

ARMv7 32-bit SUCCESS

dmb ish
str r7, [r3, #20]
dmb ish
dmb ish
ldr r0, [r3, #24]
dmb ish

dmb ish
str r7, [r3, #24]
dmb ish
dmb ish
ldr r0, [r3, #20]
dmb ish

ARMv8 64-bit SUCCESS

add x10, x9, #24
add x11, x9, #28
stlr w23, [x10]
ldar w10, [x11]

add x10, x9, #28
add x11, x9, #24
stlr w23, [x10]
ldar w10, [x11]

MIPS 32-bit SUCCESS

sync
sw $18,20($2)
sync
sync
lw $5,24($2)
sync

sync
sw $18,24($2)
sync
sync
lw $5,20($2)
sync

x86 now requires a full memory fence (mfence) to prevent the possible Store/Load reordering. It’s inserted after the store in each thread.

ARMv7 and MIPS surround each sequentially consistent load and store with full memory fences. ARMv7 probably has an unnecessary fence before its loads, and both have duplicate adjacent fences that could have been safely collapsed into one.

ARMv8 has the exact same implementation as in the previous example. This is one case where using the weaker acquire-release memory order has no benefit.

Conclusion

Programming with atomics is hard. Even the default sequential consistency memory order leaves plenty of ways to make mistakes. Keep in mind that atomics themselves should only be used as a last resort, when every other mechanism is insufficient.

Before using atomics, consider high-level concurrency primitives like queues, channels and tasks. Failing that, try lower-level locks and condition variables. Only if performance is still an issue should you reach for sequentially consistent atomics. Only the most extraordinary scenarios warrant the use of weaker memory orders; well-researched algorithms and operating systems code are examples.

It’s probably for the best if weak memory orders are used as a tool to deepen one’s understanding of CPU architectures, rather than a practical instrument of everyday programming.

Further reading

A Primer on Memory Consistency and Cache Coherence is a very good introduction to memory consistency of CPU architectures. It also covers coherency protocols in detail.

The well-known book C++ Concurrency in Action includes a chapter on memory consistency and explains all the C++ memory orders. I recommend reading it before consulting the C++ standard itself. The explanation of relaxed atomics uses a somewhat complex metaphor, though, and I didn’t find it very intuitive.

The Linux source tree contains a very useful guide that covers atomic operations from a different perspective than C++. It has many practical examples. It also covers the ridiculously weak DEC Alpha architecture. A delightful read overall.

LLVM Atomic Instructions and Concurrency Guide is a great overview of atomics from the point of view of a compiler writer.

The C++ standard is obviously the definitive place to learn the fine details of the C++ memory model. As usual, it’s not readable at all. The section [intro.races] is a good starting point. I prefer compiling a draft version from the official GitHub repository.

C++ reference is a good companion to the C++ standard.

There are also many good white papers available.

Memory Models for C/C++ Programmers is a good place to start with a more academic treatment of the subject.

Memory Barriers: a Hardware View for Software Hackers demonstrates how the observed memory orders follow from hardware implementations. It covers cache coherence protocols, store buffers, invalidate queues and memory barriers. There’s also an overview of popular CPU architectures.

Foundations of the C++ Concurrency Memory Model is the paper that introduced the C++ memory model back in 2008. It explains some rationale behind the model.

The Java Memory Model documents the first formal memory model for a widespread high-level programming language. It predates the C++ memory model formalization efforts. A notable part of that model is the fact that in managed environments (JVM, .NET) the runtime can’t allow for any undefined behavior, and the memory model has to meet this requirement.

Many more papers are available at Relaxed-Memory Concurrency page.

CppMem: Interactive C/C++ memory model (its help page) is an online tool for exploring the behavior of small C++ programs that use atomics. It’s really neat, but requires a somewhat deep understanding of the C++ memory model.

Hardware manuals are also a great resource. They are usually light on theory and instead use plenty of examples. They all explain the memory consistency guarantees, available barriers etc.


  1. The source code of all the tests in this blog post is available here

  2. More precisely, a program might have several possible executions (the observed transformations might change from run to run). Only some of the executions might be sequentially consistent. 

  3. Intel Itanium requires stronger versions of loads and stores to provide cache coherence. This leads to additional overhead even for relaxed atomics. See here and here for details. 

  4. See the section [intro.races] in the C++ standard for formal definitions. 

  5. Intel Itanium being a notable exception. Both relaxed and acquire-release memory orders are implemented using the same special load and store instructions.