C++11 CLH Lock Implementation

As a hobbyist programmer I have been reading "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit and converting its Java spin locks into C++; I have enjoyed this and learned a lot. I have implemented the CLH queue spin lock (due to Travis Craig, Erik Hagersten, and Anders Landin), but I am unhappy about the way my raw pointers interact with smart pointers and would be grateful for code review advice.



As I understand it, I must use a smart pointer, in this case std::unique_ptr, for my thread-local node pointer, or the nodes will not be deallocated properly and the clh_lock will leak. But...



I also have no choice but to use a raw pointer for the atomic tail of the queue, in order to access the atomic exchange member function.



Therefore, I have to pick either a raw or a smart pointer for the prior node and shuffle it between the raw tail and the smart local_node pointers. (I settled on a raw prior_node pointer, as this gave the least ugly, though still ugly, code in the unlock member function.)



Or am I missing something?



I have verified the lock on a refurbished HP Z600 workstation with 24 logical cores and measured its performance. It has proved correct and also efficient compared to std::mutex. Further, according to the MSVC profiler, it has no memory leaks.



The implementation uses two general purpose header files:



  1. A cross-platform PAUSE function to avoid the undesirable effects of an empty spin-wait loop, such as branch mis-prediction, thread starvation on hyper-threaded processors, and excessive power consumption.


  2. A (naive) cache line information file.


Here is the cpu_relax header file:



namespace sync {
inline static void cpu_relax() {
#if (COMPILER == GCC) || (COMPILER == LLVM)
    asm volatile("pause\n" : : : "memory");
#endif
}
}



Here is the cache_info header file:



namespace sync {

#define CACHELINE_SIZE 64
// almost uniformly 64 bytes on Intel and AMD processors at all cache levels
// it will do for now

}




Here is the header file:



namespace sync {

/**
* @brief The Craig-Landin-Hagersten (CLH) Lock is due to Travis Craig, Erik Hagersten, and Anders Landin [2,3] and, like the Anderson lock, it uses the queue-lock strategy. However, unlike Anderson's bounded queue, it uses an unbounded and, interestingly, implicit linked list of synchronisation variables. Being queue-based, it also offers the guarantees of first-come-first-served fairness and no thread starvation.
*
* [2] T. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical Report TR 93-02-02, University of Washington, Department of Computer Science, February 1993.
* [3] P. Magnussen, A. Landin, and E. Hagersten. Queue locks on cache coherent multiprocessors. In Proc. of the Eighth International Symposium on Parallel Processing (IPPS), pp. 165-171, Vancouver, April 1994. IEEE Computer Society.
*/
class clh_lock {

    // neat cacheline padding trick by David Geier geidav.wordpress.com
    // cacheline padding is essential to avoid false sharing
    using node_t = std::pair<bool, uint8_t[CACHELINE_SIZE - sizeof(bool)]>;

    static_assert(sizeof(node_t) == CACHELINE_SIZE, "");

public:

    clh_lock() : tail(new node_t) {}

    void lock() {
        local_node->first = true;
        prior_node = tail.exchange(local_node.get());
        while (prior_node->first)
            cpu_relax();
    }

    void unlock() {
        local_node->first = false;
        node_t* tmp = prior_node;
        prior_node = local_node.release();
        local_node.reset(tmp);
    }

    ~clh_lock() {
        delete tail;
    }

private:

    std::atomic<node_t*> tail;

    static thread_local std::unique_ptr<node_t> local_node;
    static thread_local node_t* prior_node;
};

}




Here is the implementation file:



namespace sync {

thread_local std::unique_ptr<clh_lock::node_t> clh_lock::local_node(new clh_lock::node_t);
thread_local clh_lock::node_t* clh_lock::prior_node = nullptr;

}




Here is (part of) the verification and timing code, which performs a number of thread-thrashing timing runs and saves the data as a CSV file for RStudio:



// thread thrashing tests
clocks::stopwatch<> timer;
int samples{100};
int nthreads{25};
int ftor{1000};
std::ofstream df{"clh_lock.csv"};
using lock_t = sync::clh_lock;
lock_t l;

std::vector<int> N;
std::atomic_bool go{false};

auto f = [&](int n) -> void {
    while (!go);
    for (int i{0}; i < ftor; ++i) {
        l.lock();
        N.push_back(n);
        l.unlock();
    }
};

for (int j{2}; j < nthreads; ++j)
    df << j << ((j < nthreads - 1) ? "," : "\n");

for (int n{0}; n < samples; ++n) {
    std::cout << n << " ";
    for (int j{2}; j < nthreads; ++j) {
        std::vector<std::thread> v;
        int thds{j};
        std::cout << thds << ",";
        N.clear();
        go = false; // make the new threads wait for the timer to start

        for (int i{0}; i < thds; ++i) v.emplace_back(f, i + 1);

        timer.start();
        go = true;
        for (auto& t : v) t.join();
        timer.stop();

        int s{0};
        for (auto& n : N) s += n;
        int sum = ((thds * (thds + 1)) / 2) * ftor;
        assert(s == sum);

        df << timer.elapsed() << (j < nthreads - 1 ? "," : "\n");
    }
    std::cout << "\n";
}
std::cout << "\n";






asked Jan 5 at 23:48 by headwedge; edited Jan 5 at 23:53 by Jamal♦
























