Archive for January, 2010

Barriers to understanding Memory Barriers

Friday, January 29th, 2010

I find lots of conflicting, or at least apparently conflicting information ‘out there’ about memory barriers.  So I’m gathering bits and pieces to see if it is possible to make sense of it all.

LoadStore, LoadLoad, StoreLoad, LoadStore

The sparc way, which may form the basis of the understanding.  Barriers are placed *between* memory operations.  Given the following:

  memOpA
  memOpB
  ...
  memOpP
---barrier---
  memOpQ
  memOpR
  ...

There are 2 main types of operations: read (load) and write (store).  This leads to 4 combinations of what you want to NOT reorder:

  • LoadLoad – Loads before the barrier can’t be reordered past Loads that are after the barrier
  • LoadStore – Loads before the barrier can’t be reordered past Stores that are after the barrier
  • StoreStore – Stores before the barrier can’t be reordered past Stores that are after the barrier
  • StoreLoad – Stores before the barrier can’t be reordered past Loads that are after the barrier

These can be combined.  ie a LoadLoad+LoadStore prevents loads from moving down, and *anything* from moving up.

(references: Sun Developer Network in particular  “The semantics of each #XY relationship is as follows: “All X operations that appear before the membar in program order complete before any Y operations that follow after the membar in program order.” More complex ordering requirements can be created by combining relationships using the bit-wise or operator.” )

Linux:

Linux tends to only use simple read/write barriers (or full barrier when needed) in the kernel, at least.

  • smp_wmb() – write barrier – only prevents writes from being reordered (ie StoreStore)
  • smp_rmb()  – read barrier – only prevents reads from being reordered( ie LoadLoad)

(refs: Paul McKenney. See also Kernel Docs)

Acquire/Release

The other way of looking at memory barriers, is by placing constraints *on* memory operations, instead of between them.  In particular:

  • read_acquire(x) – read x with ‘acquire’ semantics
  • write_release(x, 1) – write x = 1 with ‘release’ semantics

But, what does acquire/release mean?…

Microsoft:

Acquire memory semantics specify that the memory operation being performed by the current thread will be visible before any other memory operations are attempted. Release memory semantics specify that the memory operation being performed by the current thread will be visible after all other memory operations have been completed.

(ref MSDN)

Paul McKenney (IBM Distinguished Engineer & CTO for Linux at IBM)

  • acquire == LoadStore + StoreStore
  • release  == LoadStore + LoadLoad

ref: talk from Mel8ourne Linux conference, particularly page 22.

C++0x

  • std::memory_order_acquire
  • std::memory_order_release

Synchronizes-with

An operation A synchronizes-with an operation B if

  • A is a store to some atomic variable m, with an ordering of std::memory_order_release, or std::memory_order_seq_cst,
  • B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and
  • B reads the value stored by A.
  • Happens-before

    An operation A happens-before an operation B if:

    • A is performed on the same thread as B, and A is before B in program order, or
    • A synchronizes-with B, or
    • A happens-before some other operation C, and C happens-before B.

    ref: Anthony Williams

    Sun <sys/atomic.h>

    (‘acquire’ becomes ‘enter’, ‘release’ becomes ‘exit’):

    /*
     * Generic memory barrier used during lock entry, placed after the
     * memory operation that acquires the lock to guarantee that the lock
     * protects its data.  No stores from after the memory barrier will
     * reach visibility, and no loads from after the barrier will be
     * resolved, before the lock acquisition reaches global visibility.
     */
    extern void membar_enter(void); 
    
    /*
     * Generic memory barrier used during lock exit, placed before the
     * memory operation that releases the lock to guarantee that the lock
     * protects its data.  All loads and stores issued before the barrier
     * will be resolved before the subsequent lock update reaches visibility.
     */
    extern void membar_exit(void); 
    
    /*
     * Arrange that all stores issued before this point in the code reach
     * global visibility before any stores that follow; useful in producer
     * modules that update a data item, then set a flag that it is available.
     * The memory barrier guarantees that the available flag is not visible
     * earlier than the updated data, i.e. it imposes store ordering.
     */
    extern void membar_producer(void); 
    
    /*
     * Arrange that all loads issued before this point in the code are
     * completed before any subsequent loads; useful in consumer modules
     * that check to see if data is available and read the data.
     * The memory barrier guarantees that the data is not sampled until
     * after the available flag has been seen, i.e. it imposes load ordering.
     */
    extern void membar_consumer(void);

    Conclusions

    • acquire == LoadLoad + LoadStore
    • release  == LoadStore + StoreStore

    In particular, for a memory locations referenced as …,x,y,…, and registers r1, r2, …:

    r1 = v;
    w = r2;
    r3 = read_acquire(x);
    r4 = y;
    z = r5;
    write_release(x, 0);
    r6 = t;
    u = r7;

    becomes

    r1 = v;
    w = r2;
    r3 = x; // ***
    ---LoadLoad+LoadStore--- (ie keep the following below the read of x)
    r4 = y;
    z = r5;
    ---LoadStore+StoreStore--- (ie keep the previous above the write of x)
    x = 0; // ***
    r6 = t;
    u = r7;

    So note that:

    • Memory Operations before read_acquire(x) (ie r1,r2 in the example) can be moved after the acquire because the r3=x is before the barrier, and thus free to reorder with the r1 and r2 instructions;
    • Memory Operations after write_release(x,0) (ie r6, r7 in the example) can be moved before the release, as, again, the x=0 is outside (this time after) the barrier, and thus free to reorder with the r6 and r7 instructions.
    • Memory Operations between the acquire and release must stay between them
    • StoreLoad is NOT used. This is a benefit as it is typically the slowest barrier.

    ie the following is a possible reordering. In fact, in some sense the ‘maximal’ reordering. ie I have tried to reorder instructions as far away from where they started (relative to other instructions) as possible:

    r3 = x; // ***
    r1 = v;
    ---LoadLoad+LoadStore--- (ie keep the following below the read of x)
    r6 = t;
    z = r5;
    r4 = y;
    w = r2;
    ---StoreStore+LoadStore--- (ie keep the previous above the write of x)
    u = r7;
    x = 0; // ***

    Ta da!