Archive for the ‘programming’ Category

A non-allocating std::future/promise

Wednesday, May 23rd, 2012

std::future and std::promise are two ends of a single-item queue. A typical implementation has the future and the promise sharing a 3rd struct that is the queue. Typically via a shared_ptr. Is this 3rd wheel necessary? Wouldn’t it be nice if the allocation could be avoided, and instead, all the bookkeeping was in the future and promise?
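For reference, a rough sketch of that conventional layout (shared_state and its members are names I’ve made up for illustration, not from any particular standard library):

#include <memory>
#include <mutex>
#include <condition_variable>

// the shared 3rd struct – one heap allocation per future/promise pair
template <class R>
struct shared_state
{
   std::mutex              mtx;
   std::condition_variable cv;      // to wake a thread waiting on the result
   bool                    ready = false;
   R                       result;  // real implementations use raw storage + in-place new
};

template <class R>
class future
{
   std::shared_ptr<shared_state<R>> state;  // the allocation we would like to avoid
   // ...
};

template <class R>
class promise
{
   std::shared_ptr<shared_state<R>> state;  // shared with the future
   // ...
};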

Well, it would be nice, but is it possible? Note that each object may be moved or destroyed at any time – how can they communicate (for example, commit the result, or wake a thread waiting on the result, etc) when each may have moved or be moving? How can they track each other? If each has a pointer to the other, the pointers would become invalid (on move, delete), and it would not be safe for one side to update the other. The shared 3rd wheel seems safest.

If we had a lock (eg, a mutex) we could grab the lock before a move/destruction and update the other’s state. ie in the future‘s implementation of a move-constructor, it could obtain the lock, update the promise‘s pointer to the future, complete the move, then release the lock. Similarly, the promise would also grab the lock whenever it needed to read/write the future. But where is the lock? If it was, for example, in the future, then the promise would need to traverse its pointer before grabbing the lock, defeating the purpose. Also, std::mutex is not movable, so it can’t be placed inside either the future or the promise, as both are movable.

One answer, that is mostly non-allocating, would be to have a cache of mutexes available for the future/promise pairs to use. You would need a way to ensure they both get the same mutex out of the cache whenever a mutex was needed. ie if the promise wants to write the result into the future, it goes to the mutex-cache, hands it an ID (possibly the address of the promise), and locks the returned mutex before writing the result. Meanwhile, if the future wants to move, it also goes to the mutex-cache, gives it the same ID, gets the same mutex, and finds that it needs to wait for the mutex to become available.
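A minimal sketch of such a cache, assuming a fixed pool indexed by hashing the ID (note that with a fixed pool, two unrelated pairs can collide onto the same mutex – which is exactly the re-use case discussed next):

#include <mutex>
#include <functional>

// hypothetical mutex-cache: both sides hand in the same ID
// (eg the promise's address) and get back the same mutex
std::mutex & mutex_for(void const * id)
{
   static std::mutex pool[64];   // fixed pool – no per-pair allocation
   return pool[std::hash<void const *>()(id) % 64];
}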

The only problem with this answer is: what happens when we run out of mutexes in the cache? Two options – either allocate some more mutexes, or re-use currently in-use mutexes. Allocating more works fine, but does mean our future/promise is no longer non-allocating, just mostly non-allocating. Re-using an in-use mutex is an interesting idea. Since the mutex is only locked for ‘small’ actions, like moving the future (which in our scenario is only a few pointers), there would seem to be no problem. HOWEVER, we need to lock the mutex while we move the result into the future. And the result is of type T – ie anything. Does T’s move operator also grab a lock? Does it depend on another future/promise? Does it grab a lock that leads to a chain of dependencies that leads to another promise/future that is colliding/deadlocking on the same from-the-cache mutex? Highly unlikely, but possible. Which is the worst kind of bug – almost impossible to track down, particularly for a user who doesn’t even know that promise/futures could deadlock with each other via an implementation detail. So sharing mutexes cannot be allowed…

Do we need the mutex-cache at all? Short answer: no. Read on…

[diagram: structure of the non-allocating future/promise – the future and the promise each point directly at each other]

So the future on the left has a pointer to the promise on the right, and vice-versa.
But we need to be careful before following a pointer, as the other side may be moving or in mid-destruction. How do we synchronize?
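In code, the two halves might look something like this (a sketch; member names are chosen to match the snippets below, and the state members are atomic ints holding 0 for idle or a character code for ‘busy doing X’):

#include <atomic>

template <class R> class promise;   // forward declaration

template <class R>
class future
{
public:
   std::atomic<int>   state;    // 0 == idle; 'N', 'R', 'M', etc as below
   class promise<R> * promise;  // points across at our partner
                                // (the 'class' keyword sidesteps the member/template name clash)
   R                  result;   // really raw storage, constructed in-place
   // constructors (not shown) would zero the states and point the partners at each other
};

template <class R>
class promise
{
public:
   std::atomic<int>  state;     // 0 == idle
   class future<R> * future;    // points across at our partner
   // ...
};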

The basic pattern is built upon 2 rules, which form a bit of a love story:
1) State “I need you” before accessing your partner
2) “Don’t leave without saying goodbye”

#1 is an acceptance and admittance of personal state, so it is proclaimed and manifested on the side doing the claiming/needing.
ie if the promise needs to talk to the future (for example, to set the result) it first sets its own state to N for “I need you”.
It can then (if set successfully) venture over to the future side, and set the result. It knows the future is not going to go away, because the future won’t leave without saying goodbye – which includes checking in on its partner’s state…

// CAS is like compare_exchange_strong for when you don't want the old val back in 'comp'
bool CAS(std::atomic<int> & target, int comp, int val,
         std::memory_order order = std::memory_order_seq_cst)
{
   return target.compare_exchange_strong(comp, val, order);
}

void promise::set_value(R && r)
{
   // if nothing is going on, then our state should be 0,
   // and we can set it to N signifying that we Need to talk to our partner
   if (CAS(state, 0, 'N'))
   {
      // now check/set our partner's state:
      if (CAS(future->state, 0, 'R')) // we are setting 'R'esult
      {
         // somehow set the result into the future
         // (this would more likely be an in-place new
         //  as the result would not have been previously initialized)
         future->result = std::move(r);
      }
      else
      {
         // this gets tricky.  for now imagine looping/waiting and trying again
      }
   }
   else
   {
      // well, various scenarios could be here
      // but we probably need to loop around and try again.
      // for now, let's leave this blank to focus on other things...
   }
}

Now, setting your own state first is only useful if your partner abides by it.
For example, on move, a future needs to check with the promise, and let it know it is moving.
ie “don’t leave without saying goodbye – don’t even change addresses without telling me”

void future::do_move(future & dest)
{
   // signal we need to talk:
   if (CAS(state, 0, 'N'))
   {
      // if successful, go talk
      if (CAS(promise->state, 0, 'M')) // I'm 'M'oving
      {
         move_to_dest(dest);      // move our members (including the promise pointer) into dest
         promise->future = &dest; // inform partner of new address
         promise->state = 0;      // done
      }
      // else partner is busy, wait (somehow)...
   }
   // else loop, try again, etc...
}
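The move constructor itself would then be a thin wrapper over this negotiation (again just a sketch; move_to_dest is assumed to move our members, including the promise pointer, into dest):

future::future(future && other)
{
   other.do_move(*this);   // 'other' handshakes with the promise,
                           // which ends up pointing at *this
}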

So the rules are that you first check/set your own state – you either see that your partner is in the middle of using you (and thus you need to wait) or you successfully declare your need for your partner.
The second step is to check/set your partner’s state.

There are lots of problems still, but you can hopefully see that with these two rules we can safely make our intent known, and then read/write our partner without fear of our partner going away.

Now, onto some of the problems. The devil is always in the details…

Barriers to understanding Memory Barriers

Friday, January 29th, 2010

I find lots of conflicting, or at least apparently conflicting information ‘out there’ about memory barriers.  So I’m gathering bits and pieces to see if it is possible to make sense of it all.

LoadLoad, LoadStore, StoreStore, StoreLoad

The SPARC way, which may form the basis of our understanding.  Barriers are placed *between* memory operations.  Given the following:

  memOpA
  memOpB
  ...
  memOpP
---barrier---
  memOpQ
  memOpR
  ...

There are 2 main types of operations: read (load) and write (store).  This leads to 4 combinations of what you want to NOT reorder:

  • LoadLoad – Loads before the barrier can’t be reordered past Loads that are after the barrier
  • LoadStore – Loads before the barrier can’t be reordered past Stores that are after the barrier
  • StoreStore – Stores before the barrier can’t be reordered past Stores that are after the barrier
  • StoreLoad – Stores before the barrier can’t be reordered past Loads that are after the barrier

These can be combined.  ie a LoadLoad+LoadStore prevents earlier loads from moving down past the barrier, and *anything* later from moving up past it.

(references: Sun Developer Network; in particular: “The semantics of each #XY relationship is as follows: ‘All X operations that appear before the membar in program order complete before any Y operations that follow after the membar in program order.’ More complex ordering requirements can be created by combining relationships using the bit-wise or operator.”)

Linux:

Linux tends to use only simple read/write barriers (or a full barrier when needed), at least in the kernel.

  • smp_wmb() – write barrier – only prevents writes from being reordered (ie StoreStore)
  • smp_rmb() – read barrier – only prevents reads from being reordered (ie LoadLoad)

(refs: Paul McKenney. See also Kernel Docs)

Acquire/Release

The other way of looking at memory barriers is by placing constraints *on* memory operations, instead of between them.  In particular:

  • read_acquire(x) – read x with ‘acquire’ semantics
  • write_release(x, 1) – write x = 1 with ‘release’ semantics

But, what does acquire/release mean?…

Microsoft:

Acquire memory semantics specify that the memory operation being performed by the current thread will be visible before any other memory operations are attempted. Release memory semantics specify that the memory operation being performed by the current thread will be visible after all other memory operations have been completed.

(ref MSDN)

Paul McKenney (IBM Distinguished Engineer & CTO for Linux at IBM)

  • acquire == LoadLoad + LoadStore
  • release == LoadStore + StoreStore

ref: talk from Mel8ourne Linux conference, particularly page 22.

C++0x

  • std::memory_order_acquire
  • std::memory_order_release
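In C++0x the ordering is attached to the atomic operation itself, rather than standing alone as a barrier; the read_acquire/write_release pseudo-ops from above come out roughly as:

#include <atomic>

std::atomic<int> x;

void example()
{
   int r3 = x.load(std::memory_order_acquire);   // r3 = read_acquire(x)
   x.store(0, std::memory_order_release);        // write_release(x, 0)
   (void)r3;
}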

Synchronizes-with

An operation A synchronizes-with an operation B if

  • A is a store to some atomic variable m, with an ordering of std::memory_order_release, or std::memory_order_seq_cst,
  • B is a load from the same variable m, with an ordering of std::memory_order_acquire or std::memory_order_seq_cst, and
  • B reads the value stored by A.
Happens-before

An operation A happens-before an operation B if:

  • A is performed on the same thread as B, and A is before B in program order, or
  • A synchronizes-with B, or
  • A happens-before some other operation C, and C happens-before B.

ref: Anthony Williams
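Putting synchronizes-with and happens-before together, a small sketch: once the consumer’s acquire-load reads true, it synchronizes-with the producer’s release-store, and so the write to data happens-before the read of data:

#include <atomic>
#include <thread>
#include <cassert>

int data = 0;                    // plain, non-atomic data
std::atomic<bool> ready(false);  // the atomic variable ‘m’

void producer()
{
   data = 42;                                      // (1) ordinary write
   ready.store(true, std::memory_order_release);   // (2) the store A
}

void consumer()
{
   while (!ready.load(std::memory_order_acquire))  // the load B
      ;                                            // spin until B reads A's value
   assert(data == 42);   // guaranteed: (1) happens-before this read
}

int main()
{
   std::thread t1(producer), t2(consumer);
   t1.join();
   t2.join();
}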

Sun <sys/atomic.h>

(‘acquire’ becomes ‘enter’, ‘release’ becomes ‘exit’):

/*
 * Generic memory barrier used during lock entry, placed after the
 * memory operation that acquires the lock to guarantee that the lock
 * protects its data.  No stores from after the memory barrier will
 * reach visibility, and no loads from after the barrier will be
 * resolved, before the lock acquisition reaches global visibility.
 */
extern void membar_enter(void);

/*
 * Generic memory barrier used during lock exit, placed before the
 * memory operation that releases the lock to guarantee that the lock
 * protects its data.  All loads and stores issued before the barrier
 * will be resolved before the subsequent lock update reaches visibility.
 */
extern void membar_exit(void);

/*
 * Arrange that all stores issued before this point in the code reach
 * global visibility before any stores that follow; useful in producer
 * modules that update a data item, then set a flag that it is available.
 * The memory barrier guarantees that the available flag is not visible
 * earlier than the updated data, i.e. it imposes store ordering.
 */
extern void membar_producer(void);

/*
 * Arrange that all loads issued before this point in the code are
 * completed before any subsequent loads; useful in consumer modules
 * that check to see if data is available and read the data.
 * The memory barrier guarantees that the data is not sampled until
 * after the available flag has been seen, i.e. it imposes load ordering.
 */
extern void membar_consumer(void);

Conclusions

  • acquire == LoadLoad + LoadStore
  • release == LoadStore + StoreStore

In particular, for memory locations referenced as …,x,y,…, and registers r1, r2, …:

r1 = v;
w = r2;
r3 = read_acquire(x);
r4 = y;
z = r5;
write_release(x, 0);
r6 = t;
u = r7;

becomes

r1 = v;
w = r2;
r3 = x; // ***
---LoadLoad+LoadStore--- (ie keep the following below the read of x)
r4 = y;
z = r5;
---LoadStore+StoreStore--- (ie keep the previous above the write of x)
x = 0; // ***
r6 = t;
u = r7;

So note that:

  • Memory Operations before read_acquire(x) (ie r1, r2 in the example) can be moved after the acquire, because the r3=x is before the barrier, and thus free to reorder with the r1 and r2 instructions;
  • Memory Operations after write_release(x,0) (ie r6, r7 in the example) can be moved before the release, as, again, the x=0 is outside (this time after) the barrier, and thus free to reorder with the r6 and r7 instructions;
  • Memory Operations between the acquire and release must stay between them;
  • StoreLoad is NOT used. This is a benefit, as it is typically the slowest barrier.

ie the following is a possible reordering. In fact, in some sense the ‘maximal’ reordering. ie I have tried to reorder instructions as far away from where they started (relative to other instructions) as possible:

r3 = x; // ***
r1 = v;
---LoadLoad+LoadStore--- (ie keep the following below the read of x)
r6 = t;
z = r5;
r4 = y;
w = r2;
---StoreStore+LoadStore--- (ie keep the previous above the write of x)
u = r7;
x = 0; // ***

Ta da!

An Overly Simple fixed-length FIFO/Queue

Tuesday, September 15th, 2009

I’ve been thinking about a lockfree queue lately.  Well, on and off for a while now, but even more so lately.  Most lockfree queues (ie Michael and Scott’s, etc) tend to be built from nodes, linked-list style.  I’ve been thinking about going that way, but I have another idea stuck in my head, and it just won’t seem to leave.  So I think I better write it down and see where it ends up.  Possibly a hybrid of some kind – we’ll see. But to get there, I think we’d better start at the beginning, and build up from there, one step at a time.

(more…)