sync/atomic.And and Or
Implementing bitwise atomic operations in Go
12 Jan 2025
Mauri de Souza Meneguzzo
Go's sync/atomic package has always had Add, Swap, CompareAndSwap, and Load/Store.
But no bitwise operations.
Setting a flag used to look like this:
for {
	old := atomic.LoadUint32(&flags)
	if atomic.CompareAndSwapUint32(&flags, old, old|0x4) {
		break
	}
}
This is a CAS loop. Under contention it spins, wastes CPU cycles, and generates unnecessary retries.
CAS loops are not equivalent to a hardware atomic bitwise instruction.
LOCK OR completes in one bus transaction.
Real packages in the wild use this pattern: sync, net/http, runtime internals, os.
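As a rough sketch of the pattern those packages had to hand-roll, the CAS loop can be wrapped in a helper that also reports the previous value. The name orCAS is hypothetical and is not part of sync/atomic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// orCAS is a hypothetical helper (not part of sync/atomic). It ORs mask into
// *addr with a compare-and-swap loop and returns the value observed before
// the update, retrying whenever another goroutine changes *addr in between.
func orCAS(addr *uint32, mask uint32) (old uint32) {
	for {
		old = atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old|mask) {
			return old
		}
	}
}

func main() {
	var flags uint32 = 0x1
	old := orCAS(&flags, 0x4)
	fmt.Printf("old=%#x new=%#x\n", old, atomic.LoadUint32(&flags)) // old=0x1 new=0x5
}
```

Every failed CompareAndSwap costs another load and another attempt, which is the retry overhead described above.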
Issue #61395: add And and Or to sync/atomic
New API surface:
func AndInt32(addr *int32, mask int32) (old int32)
func AndInt64(addr *int64, mask int64) (old int64)
func AndUint32(addr *uint32, mask uint32) (old uint32)
func AndUint64(addr *uint64, mask uint64) (old uint64)
func AndUintptr(addr *uintptr, mask uintptr) (old uintptr)
func OrInt32(addr *int32, mask int32) (old int32)
func OrInt64(addr *int64, mask int64) (old int64)
func OrUint32(addr *uint32, mask uint32) (old uint32)
func OrUint64(addr *uint64, mask uint64) (old uint64)
func OrUintptr(addr *uintptr, mask uintptr) (old uintptr)
Also available as methods on atomic.Int32, atomic.Int64, atomic.Uint32, atomic.Uint64, atomic.Uintptr.
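A minimal sketch of the method form, assuming Go 1.23 or newer. The function name demoMethods and the bit values are illustrative only.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// demoMethods sets and then clears a bit on an atomic.Uint32. It returns the
// old values reported by Or and And, plus the final stored value.
func demoMethods() (oldOr, oldAnd, final uint32) {
	var flags atomic.Uint32
	flags.Store(0b0001)
	oldOr = flags.Or(0b0100)            // 0b0001 -> 0b0101, returns 0b0001
	oldAnd = flags.And(^uint32(0b0001)) // 0b0101 -> 0b0100, returns 0b0101
	return oldOr, oldAnd, flags.Load()
}

func main() {
	a, b, c := demoMethods()
	fmt.Printf("%04b %04b %04b\n", a, b, c) // 0001 0101 0100
}
```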
// Before: CAS loop, spins under contention
for {
	old := atomic.LoadUint32(&flags)
	if atomic.CompareAndSwapUint32(&flags, old, old|0x4) {
		break
	}
}
// After: one instruction, no spinning
atomic.OrUint32(&flags, 0x4)
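The returned old value is useful in its own right. One possible use, sketched here under the assumption of Go 1.23 or newer, is telling whether this call was the one that set the flag. The names flagClosed, markClosed, and clearClosed are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// flagClosed is a hypothetical name for the 0x4 bit used in the slides.
const flagClosed uint32 = 0x4

// markClosed sets flagClosed and reports whether this call was the one
// that flipped it from 0 to 1.
func markClosed(flags *uint32) bool {
	old := atomic.OrUint32(flags, flagClosed)
	return old&flagClosed == 0
}

// clearClosed clears flagClosed and leaves every other bit untouched.
func clearClosed(flags *uint32) {
	atomic.AndUint32(flags, ^flagClosed)
}

func main() {
	flags := uint32(0x1)
	fmt.Println(markClosed(&flags)) // true: the bit was clear before
	fmt.Println(markClosed(&flags)) // false: already set
	clearClosed(&flags)
	fmt.Printf("%#x\n", atomic.LoadUint32(&flags)) // 0x1
}
```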
Go supports many platforms. Every new operation needs assembly on all of them.
Architectures covered:
LOCK CMPXCHGL
LDADD, STLR with load-exclusive/store-exclusive
STREX loop (no hardware atomic bitwise on ARMv7)
LDARX/STDCX. reservation loop
LAO, LAN (Load And Or/And)
LL/SC (Load Linked/Store Conditional) loop
Multiple architectures. Lots of assembly.
amd64 has native:
// func And32(addr *uint32, v uint32) old uint32
TEXT ·And32(SB), NOSPLIT, $0-20
	MOVQ ptr+0(FP), BX
	MOVL val+8(FP), CX
casloop:
	MOVL CX, DX
	MOVL (BX), AX
	ANDL AX, DX
	LOCK
	CMPXCHGL DX, (BX)
	JNZ casloop
	MOVL AX, ret+16(FP)
	RET
Go's race detector wraps every memory access with calls into TSAN (ThreadSanitizer).
Atomic operations need to be reported as tsan_atomic calls, not plain read/write, otherwise the detector generates false positives.
Each new operation needed a corresponding entry in race_amd64.s and a wrapper in race.go that routes through runtime.racecall.
// And (MOVD/BL are arm64-style mnemonics; an amd64 version would use MOVQ/CALL)
TEXT sync∕atomic·AndInt32(SB), NOSPLIT, $0-20
	GO_ARGS
	MOVD $__tsan_go_atomic32_fetch_and(SB), R9
	BL racecallatomic<>(SB)
	RET
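From the user's side, the race support can be exercised with a small program like the sketch below, assuming Go 1.23 or newer. Run under go run -race, it should produce no reports, because every access to flags goes through sync/atomic. The function name setAllBits is illustrative only.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// setAllBits starts 32 goroutines, each ORing a distinct bit into a shared
// word, and returns the final value once all of them have finished.
func setAllBits() uint32 {
	var flags uint32
	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		wg.Add(1)
		go func(bit uint) {
			defer wg.Done()
			atomic.OrUint32(&flags, 1<<bit)
		}(uint(i))
	}
	wg.Wait()
	return atomic.LoadUint32(&flags)
}

func main() {
	fmt.Printf("%#x\n", setAllBits()) // 0xffffffff
}
```

Replacing the atomic.OrUint32 call with a plain flags |= 1<<bit would be a data race, and the detector would be expected to flag it.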
The compiler can lower sync/atomic.OrUint32 to a single atomic instruction.
This required adding intrinsic entries to the SSA backend.
With the intrinsic in place, the compiler emits the instruction inline with no call overhead, no stack frame.
50–90% faster than CAS loops under contention.
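A comparison of this kind can be approximated with the sketch below, assuming Go 1.23 or newer. It is not the benchmark behind the figure above, and results will vary with hardware and core count. The helpers casOr, casAnd, and runBoth are hypothetical. The bit is toggled rather than only set, so that the stored value keeps changing and the CAS loops actually hit retries.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

const bit uint32 = 0x4

// casOr is a hypothetical CAS-loop baseline for OR.
func casOr(addr *uint32, mask uint32) {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old|mask) {
			return
		}
	}
}

// casAnd is a hypothetical CAS-loop baseline for AND.
func casAnd(addr *uint32, mask uint32) {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old&mask) {
			return
		}
	}
}

// benchCAS toggles the bit from many goroutines using the CAS loops.
func benchCAS(b *testing.B) {
	var flags uint32
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			casOr(&flags, bit)
			casAnd(&flags, ^bit)
		}
	})
}

// benchAtomic does the same toggling with atomic.OrUint32 and AndUint32.
func benchAtomic(b *testing.B) {
	var flags uint32
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.OrUint32(&flags, bit)
			atomic.AndUint32(&flags, ^bit)
		}
	})
}

// runBoth runs both benchmarks and returns their results.
func runBoth() (cas, direct testing.BenchmarkResult) {
	testing.Init() // needed when calling Benchmark outside "go test"
	return testing.Benchmark(benchCAS), testing.Benchmark(benchAtomic)
}

func main() {
	cas, direct := runBoth()
	fmt.Println("CAS loop:     ", cas)
	fmt.Println("atomic Or/And:", direct)
}
```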
runtime/internal/atomic: And/Or for amd64/386
sync/atomic: public API, race support
Shipped in Go 1.23 and 1.24.
Nominated for the Google Open Source Peer Award in 2024 because of this work.
Now code owner for runtime and atomic packages in the stdlib https://dev.golang.org/owners
My talks are written with golang.org/x/tools/present
Find this talk at talks.mauri870.com