sync/atomic.And and Or

Implementing bitwise atomic operations in Go

12 Jan 2025

Mauri de Souza Meneguzzo

The problem

Go's sync/atomic package has always had Add, Swap, CompareAndSwap, and Load/Store.

But no bitwise operations.

Setting a flag used to look like this:

for {
    old := atomic.LoadUint32(&flags)
    if atomic.CompareAndSwapUint32(&flags, old, old|0x4) {
        break
    }
}

This is a CAS loop. Under contention, every failed CompareAndSwap forces another iteration, so the loop spins and wastes CPU cycles on retries.
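
To make the retry cost visible, here is a minimal sketch that counts failed CompareAndSwap attempts while several goroutines each set a different bit. The names orCAS and setBitsConcurrently are hypothetical helpers for illustration, and the retry count varies from run to run.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// orCAS (hypothetical helper) sets the bits in mask on *addr with a CAS loop
// and reports how many CompareAndSwap attempts failed and were retried.
func orCAS(addr *uint32, mask uint32) (retries int) {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old|mask) {
			return retries
		}
		retries++
	}
}

// setBitsConcurrently (hypothetical helper) starts n goroutines, each setting
// one distinct bit through orCAS, and returns the final flags plus the total
// number of retries across all goroutines.
func setBitsConcurrently(n int) (uint32, int64) {
	var flags uint32
	var total atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(bit uint) {
			defer wg.Done()
			total.Add(int64(orCAS(&flags, 1<<bit)))
		}(uint(i))
	}
	wg.Wait()
	return atomic.LoadUint32(&flags), total.Load()
}

func main() {
	flags, retries := setBitsConcurrently(8)
	fmt.Printf("flags=%#x retries=%d\n", flags, retries)
}
```

The final flags value is always the same, but the amount of work needed to get there depends on how often goroutines collide.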


Why does this matter?

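A CAS loop is not equivalent to a hardware atomic bitwise instruction: the loop can fail and retry, while a native instruction, where the hardware has one, completes in a single step.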

Real packages in the wild use this pattern: sync, net/http, runtime internals, os.


The proposal

Issue #61395: add And and Or to sync/atomic

New API surface:

func AndInt32(addr *int32, mask int32) (old int32)
func AndInt64(addr *int64, mask int64) (old int64)
func AndUint32(addr *uint32, mask uint32) (old uint32)
func AndUint64(addr *uint64, mask uint64) (old uint64)
func AndUintptr(addr *uintptr, mask uintptr) (old uintptr)

func OrInt32(addr *int32, mask int32) (old int32)
func OrInt64(addr *int64, mask int64) (old int64)
func OrUint32(addr *uint32, mask uint32) (old uint32)
func OrUint64(addr *uint64, mask uint64) (old uint64)
func OrUintptr(addr *uintptr, mask uintptr) (old uintptr)

Also available as methods on atomic.Int32, atomic.Int64, atomic.Uint32, atomic.Uint64, atomic.Uintptr.


After

// Before: CAS loop, spins under contention
for {
    old := atomic.LoadUint32(&flags)
    if atomic.CompareAndSwapUint32(&flags, old, old|0x4) {
        break
    }
}

// After: one instruction, no spinning
atomic.OrUint32(&flags, 0x4)

Architecture coverage

Go supports many platforms. Every new operation needs assembly on all of them.

Architectures covered:

Multiple architectures. Lots of assembly.


amd64 example

amd64 has native:

// func And32(addr *uint32, v uint32) old uint32
TEXT ·And32(SB), NOSPLIT, $0-20
    MOVQ	ptr+0(FP), BX
    MOVL	val+8(FP), CX
casloop:
    MOVL 	CX, DX
    MOVL	(BX), AX
    ANDL	AX, DX
    LOCK
    CMPXCHGL	DX, (BX)
    JNZ casloop
    MOVL 	AX, ret+16(FP)
    RET
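
For readers less used to Go assembly, the excerpt above can be read as a compare-and-swap retry loop that returns the previous value. The following is a rough Go rendering for illustration only, not the runtime's code; and32 is a hypothetical name, and the comments map each step to the instructions shown.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// and32 (hypothetical) mirrors the assembly loop: load the current value,
// compute the masked value, and retry until CompareAndSwap succeeds.
func and32(addr *uint32, v uint32) (old uint32) {
	for {
		old = atomic.LoadUint32(addr) // MOVL (BX), AX
		masked := old & v             // MOVL CX, DX; ANDL AX, DX
		// LOCK; CMPXCHGL DX, (BX)
		if atomic.CompareAndSwapUint32(addr, old, masked) {
			return old // MOVL AX, ret+16(FP)
		}
		// JNZ casloop: another writer changed *addr, try again.
	}
}

func main() {
	flags := uint32(0b1110)
	old := and32(&flags, 0b0110)
	fmt.Printf("old=%#x new=%#x\n", old, flags) // old=0xe new=0x6
}
```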

Race detector integration

Go's race detector wraps every memory access with calls into TSAN (ThreadSanitizer).

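Atomic operations need to be reported as tsan_atomic calls rather than plain reads and writes; otherwise the detector generates false positives.

Each new operation needed a corresponding entry in the architecture-specific race assembly (race_amd64.s, race_arm64.s, and so on) and a wrapper in race.go that routes through runtime.racecall. The entry below uses arm64 mnemonics (MOVD, BL):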

// And
TEXT	sync∕atomic·AndInt32(SB), NOSPLIT, $0-20
    GO_ARGS
    MOVD	$__tsan_go_atomic32_fetch_and(SB), R9
    BL	racecallatomic<>(SB)
    RET

Compiler intrinsics

The compiler can lower sync/atomic.OrUint32 to a single atomic instruction.

This required adding intrinsic entries to the SSA backend.

With the intrinsic in place, the compiler emits the instruction inline, with no call overhead and no stack frame.


Performance

50–90% faster than CAS loops under contention.
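
A comparison of this kind can be sketched with testing.Benchmark and RunParallel. This is an assumed setup rather than the benchmark behind the figure above: the helpers orCAS and andCAS are hypothetical, each iteration sets and then clears a bit so the value keeps changing and CAS attempts can actually fail, and results depend on the machine, core count, and Go version. It needs Go 1.23 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// orCAS and andCAS (hypothetical helpers) are the older retry-loop pattern.
func orCAS(addr *uint32, mask uint32) {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old|mask) {
			return
		}
	}
}

func andCAS(addr *uint32, mask uint32) {
	for {
		old := atomic.LoadUint32(addr)
		if atomic.CompareAndSwapUint32(addr, old, old&mask) {
			return
		}
	}
}

// benchCAS sets and clears a bit from many goroutines using CAS loops.
func benchCAS(b *testing.B) {
	var flags uint32
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			orCAS(&flags, 0x4)
			andCAS(&flags, ^uint32(0x4))
		}
	})
}

// benchAndOr does the same work with atomic.OrUint32 and atomic.AndUint32.
func benchAndOr(b *testing.B) {
	var flags uint32
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			atomic.OrUint32(&flags, 0x4)
			atomic.AndUint32(&flags, ^uint32(0x4))
		}
	})
}

func main() {
	fmt.Println("CAS loops:", testing.Benchmark(benchCAS))
	fmt.Println("And/Or:   ", testing.Benchmark(benchAndOr))
}
```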


Merged

Shipped in Go 1.23 and 1.24.

Nominated for the Google Open Source Peer Award in 2024 because of this work.

Now code owner for runtime and atomic packages in the stdlib: https://dev.golang.org/owners


Useful links

Proposal #61395

My CLs on Gerrit

sync/atomic package docs

My talks are written with golang.org/x/tools/present

Find this talk at talks.mauri870.com


Thank you
