C++11 CLH Lock Implementation

As a hobbyist programmer I have been reading "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit and converting its Java spin locks into C++; I have enjoyed this and learned a lot. I have implemented the CLH queue spin lock (due to Travis Craig, Erik Hagersten, and Anders Landin), but I am unhappy about the way my raw pointers interact with smart pointers and would be grateful for code review advice.



As I understand it, I must use a smart pointer, in this case std::unique_ptr, for my thread-local node pointer, or the nodes will not be deallocated properly and the clh_lock will leak. But...



I also have no choice but to use a raw pointer for the atomic tail of the queue, in order to access the atomic exchange member function.



Therefore, I have to pick either a raw or a smart pointer for the prior node and shuffle it between the raw tail and the smart local_node pointers. (I settled on a raw prior_node pointer, as this gave the least ugly, though still ugly, code in the unlock member function.)



Or am I missing something?



I have verified the lock on a refurbished HP Z600 workstation with 24 logical cores and measured its performance. It has proved correct and also efficient compared to std::mutex. Further, according to the MSVC profiler, it has no memory leaks.



The implementation uses two general purpose header files:



  1. A cross-platform PAUSE function to avoid the undesirable effects of an empty spin-wait loop, such as branch mis-prediction, thread starvation on hyper-threaded processors, and excessive power consumption.


  2. A (naive) cache line information file.


Here is the cpu_relax header file:



namespace sync {
inline static void cpu_relax() {
#if (COMPILER == GCC) || (COMPILER == LLVM)
    asm volatile("pause\n" : : : "memory");
#endif
}
}



Here is the cache_info header file:



namespace sync {

#define CACHELINE_SIZE 64
// almost uniformly 64 bytes on Intel and AMD processors at all cache levels
// it will do for now

}




Here is the header file:



namespace sync {

/**
* @brief The Craig-Landin-Hagersten (CLH) Lock is due to Travis Craig, Erik Hagersten, and Anders Landin [2,3] and, like the Anderson lock, it uses the queue-lock strategy. However, unlike Anderson's bounded queue, it uses an unbounded and, interestingly, implicit linked list of synchronisation variables. Being queue-based, it also offers the guarantees of first-come-first-served fairness and no thread starvation.
*
* [2] T. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical Report TR 93-02-02, University of Washington, Department of Computer Science, February 1993.
* [3] P. Magnussen, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In Proc. of the Eighth International Symposium on Parallel Processing (IPPS), pp. 165-171, Vancouver, April 1994. IEEE Computer Society.
*/
class clh_lock {

    // neat cacheline padding trick by David Geier geidav.wordpress.com
    // cacheline padding is essential to avoid false sharing
    using node_t = std::pair<bool, uint8_t[CACHELINE_SIZE - sizeof(bool)]>;

    static_assert(sizeof(node_t) == CACHELINE_SIZE, "");

public:

    clh_lock() : tail(new node_t) {}

    void lock() {
        local_node->first = true;
        prior_node = tail.exchange(local_node.get());
        while (prior_node->first)
            cpu_relax();
    }

    void unlock() {
        local_node->first = false;
        node_t* tmp = prior_node;
        prior_node = local_node.release();
        local_node.reset(tmp);
    }

    ~clh_lock() {
        delete tail;
    }

private:

    std::atomic<node_t*> tail;

    static thread_local std::unique_ptr<node_t> local_node;
    static thread_local node_t* prior_node;
};

}




Here is the implementation file:



namespace sync {

thread_local std::unique_ptr<clh_lock::node_t> clh_lock::local_node(new clh_lock::node_t);
thread_local clh_lock::node_t* clh_lock::prior_node = nullptr;

}




Here is (part of) the verification and timing code, which performs a number of thread-thrashing timing runs and saves the data as a CSV file for RStudio:



// thread thrashing tests
clocks::stopwatch<> timer;
int samples{100};
int nthreads{25};
int ftor{1000};
std::ofstream df{"clh_lock.csv"};
using lock_t = sync::clh_lock;
lock_t l;

std::vector<int> N;
std::atomic_bool go{false};

auto f = [&](int n) -> void {
    while (!go);
    for (int i{0}; i < ftor; ++i) {
        l.lock();
        N.push_back(n);
        l.unlock();
    }
};

for (int j{2}; j < nthreads; ++j)
    df << j << ((j < nthreads - 1) ? "," : "\n");

for (int n{0}; n < samples; ++n) {
    std::cout << n << " ";
    for (int j{2}; j < nthreads; ++j) {
        std::vector<std::thread> v;
        int thds{j};
        std::cout << thds << ",";
        N.clear();
        go = false; // make the new threads wait for the timer to start

        for (int i{0}; i < thds; ++i) v.emplace_back(f, i + 1);

        timer.start();
        go = true;
        for (auto& t : v) t.join();
        timer.stop();

        int s{0};
        for (auto& n : N) s += n;
        int sum = ((thds * (thds + 1)) / 2) * ftor;
        assert(s == sum);

        df << timer.elapsed() << (j < nthreads - 1 ? "," : "\n");
    }
    std::cout << "\n";
}
std::cout << "\n";






asked Jan 5 at 23:48 by headwedge; edited Jan 5 at 23:53 by Jamal♦
























