Implementing and benchmarking Radix Sort on unsigned integers

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP





.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty margin-bottom:0;







up vote
8
down vote

favorite












Introduction



I implemented a restricted version of the sorting algorithm, the one that works only with unsigned integers. I wanted to try using std::stable_partition with bitmask, as I chose base 2. The reason partition works here is because there are only two "bins", thus it is somewhat like partitioning. I mandated customizing the maximum power of two that is encountered in the sequence.




Code



#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;






Weird results



The algorithm has a lottle of wiggle regarding the constant factor, though it is linear. The picture below is for highest power of 16, x-axis is element count, and y-axis is nanoseconds (sorry I'm not on good terms with gnuplot). So it takes ~1.6 milliseconds to sort 30'000 elements, which is I think quite a lot for linear sorting algorithm.



Performance of radix sort



I believe those spikes are context switches, but I couldn't verify that. It seems like there is no way to reliably track them and correspond to the spike.




Full code



#include <vector>
#include <algorithm>
#include <random>

std::vector<unsigned> generate_vector(std::size_t size, unsigned power_limit)

auto generator = [power_limit]

static std::mt19937 twister;
static std::uniform_int_distribution<unsigned > dist0u, (1u << power_limit);
return dist(twister);
;

std::vector<unsigned> v(size);
std::generate(v.begin(), v.end(), generator);

return v;


#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;




#include <chrono>
#include <atomic>

namespace shino

template <typename Clock = std::chrono::high_resolution_clock>
class stopwatch

const typename Clock::time_point start_point;
public:
stopwatch() :
start_point(Clock::now())


template <typename Rep = typename Clock::duration::rep, typename Units = typename Clock::duration>
Rep elapsed_time() const

std::atomic_thread_fence(std::memory_order_relaxed);
auto counted_time = std::chrono::duration_cast<Units>(Clock::now() - start_point).count();
std::atomic_thread_fence(std::memory_order_relaxed);
return static_cast<Rep>(counted_time);

;

using precise_stopwatch = stopwatch<>;
using system_stopwatch = stopwatch<std::chrono::system_clock>;
using monotonic_stopwatch = stopwatch<std::chrono::steady_clock>;


#include <stdexcept>
#include <fstream>

int main()

std::ofstream measurings("measurings.csv");
if (!measurings)
throw std::runtime_error("measurings file opening failed");

for (unsigned power = 4; power <= 16; ++power)

for (auto size = 0ull; size <= 20'000; ++size)

std::cout << "radix sorting vector of size " << size << 'n';
auto input = generate_vector(size, power);
shino::precise_stopwatch stopwatch;
shino::radix_sort(input.begin(), input.end(), power);
auto elapsed_time = stopwatch.elapsed_time();
measurings << power << ',' << size << ',' << elapsed_time << 'n';
if (not std::is_sorted(input.begin(), input.end()))
throw std::logic_error("radix sort doesn't work correctly on size " + std::to_string(size));





The code records performance and checks if sort actually worked correctly. So if you make any changes, just run the code to test it. stopwatch code is taken from here. In fact, I believe it deserves a review on its own.



I couldn't get gnuplot to plot 3d graphs, sorry no pics for the changes related to maximum power.




Concerns



  • Is algorithm implemented well?

Are there points to improve performance? Make a better use of standard library?



  • Will the inner lambda cause code bloat?

Lambdas are anonymous classes, so will be instantiated for every version of radix sort itself. I'm unsure if using lambda was a wise choice.



  • Anything else that comes to mind






share|improve this question

















  • 1




    One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
    – 1201ProgramAlarm
    Jun 23 at 1:42










  • @1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
    – Incomputable
    Jun 23 at 7:36










  • Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
    – JDługosz
    Jun 23 at 9:23










  • @JDługosz, I believe it does. I’m not sure though.
    – Incomputable
    Jun 23 at 10:34










  • Could I get a copy of your cvs file please? :)
    – Rakete1111
    Jun 26 at 15:55
















up vote
8
down vote

favorite












Introduction



I implemented a restricted version of the sorting algorithm, the one that works only with unsigned integers. I wanted to try using std::stable_partition with bitmask, as I chose base 2. The reason partition works here is because there are only two "bins", thus it is somewhat like partitioning. I mandated customizing the maximum power of two that is encountered in the sequence.




Code



#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;






Weird results



The algorithm has a lottle of wiggle regarding the constant factor, though it is linear. The picture below is for highest power of 16, x-axis is element count, and y-axis is nanoseconds (sorry I'm not on good terms with gnuplot). So it takes ~1.6 milliseconds to sort 30'000 elements, which is I think quite a lot for linear sorting algorithm.



Performance of radix sort



I believe those spikes are context switches, but I couldn't verify that. It seems like there is no way to reliably track them and correspond to the spike.




Full code



#include <vector>
#include <algorithm>
#include <random>

std::vector<unsigned> generate_vector(std::size_t size, unsigned power_limit)

auto generator = [power_limit]

static std::mt19937 twister;
static std::uniform_int_distribution<unsigned > dist0u, (1u << power_limit);
return dist(twister);
;

std::vector<unsigned> v(size);
std::generate(v.begin(), v.end(), generator);

return v;


#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;




#include <chrono>
#include <atomic>

namespace shino

template <typename Clock = std::chrono::high_resolution_clock>
class stopwatch

const typename Clock::time_point start_point;
public:
stopwatch() :
start_point(Clock::now())


template <typename Rep = typename Clock::duration::rep, typename Units = typename Clock::duration>
Rep elapsed_time() const

std::atomic_thread_fence(std::memory_order_relaxed);
auto counted_time = std::chrono::duration_cast<Units>(Clock::now() - start_point).count();
std::atomic_thread_fence(std::memory_order_relaxed);
return static_cast<Rep>(counted_time);

;

using precise_stopwatch = stopwatch<>;
using system_stopwatch = stopwatch<std::chrono::system_clock>;
using monotonic_stopwatch = stopwatch<std::chrono::steady_clock>;


#include <stdexcept>
#include <fstream>

int main()

std::ofstream measurings("measurings.csv");
if (!measurings)
throw std::runtime_error("measurings file opening failed");

for (unsigned power = 4; power <= 16; ++power)

for (auto size = 0ull; size <= 20'000; ++size)

std::cout << "radix sorting vector of size " << size << 'n';
auto input = generate_vector(size, power);
shino::precise_stopwatch stopwatch;
shino::radix_sort(input.begin(), input.end(), power);
auto elapsed_time = stopwatch.elapsed_time();
measurings << power << ',' << size << ',' << elapsed_time << 'n';
if (not std::is_sorted(input.begin(), input.end()))
throw std::logic_error("radix sort doesn't work correctly on size " + std::to_string(size));





The code records performance and checks if sort actually worked correctly. So if you make any changes, just run the code to test it. stopwatch code is taken from here. In fact, I believe it deserves a review on its own.



I couldn't get gnuplot to plot 3d graphs, sorry no pics for the changes related to maximum power.




Concerns



  • Is algorithm implemented well?

Are there points to improve performance? Make a better use of standard library?



  • Will the inner lambda cause code bloat?

Lambdas are anonymous classes, so will be instantiated for every version of radix sort itself. I'm unsure if using lambda was a wise choice.



  • Anything else that comes to mind






share|improve this question

















  • 1




    One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
    – 1201ProgramAlarm
    Jun 23 at 1:42










  • @1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
    – Incomputable
    Jun 23 at 7:36










  • Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
    – JDługosz
    Jun 23 at 9:23










  • @JDługosz, I believe it does. I’m not sure though.
    – Incomputable
    Jun 23 at 10:34










  • Could I get a copy of your cvs file please? :)
    – Rakete1111
    Jun 26 at 15:55












up vote
8
down vote

favorite









up vote
8
down vote

favorite











Introduction



I implemented a restricted version of the sorting algorithm, the one that works only with unsigned integers. I wanted to try using std::stable_partition with bitmask, as I chose base 2. The reason partition works here is because there are only two "bins", thus it is somewhat like partitioning. I mandated customizing the maximum power of two that is encountered in the sequence.




Code



#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;






Weird results



The algorithm has a lottle of wiggle regarding the constant factor, though it is linear. The picture below is for highest power of 16, x-axis is element count, and y-axis is nanoseconds (sorry I'm not on good terms with gnuplot). So it takes ~1.6 milliseconds to sort 30'000 elements, which is I think quite a lot for linear sorting algorithm.



Performance of radix sort



I believe those spikes are context switches, but I couldn't verify that. It seems like there is no way to reliably track them and correspond to the spike.




Full code



#include <vector>
#include <algorithm>
#include <random>

std::vector<unsigned> generate_vector(std::size_t size, unsigned power_limit)

auto generator = [power_limit]

static std::mt19937 twister;
static std::uniform_int_distribution<unsigned > dist0u, (1u << power_limit);
return dist(twister);
;

std::vector<unsigned> v(size);
std::generate(v.begin(), v.end(), generator);

return v;


#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;




#include <chrono>
#include <atomic>

namespace shino

template <typename Clock = std::chrono::high_resolution_clock>
class stopwatch

const typename Clock::time_point start_point;
public:
stopwatch() :
start_point(Clock::now())


template <typename Rep = typename Clock::duration::rep, typename Units = typename Clock::duration>
Rep elapsed_time() const

std::atomic_thread_fence(std::memory_order_relaxed);
auto counted_time = std::chrono::duration_cast<Units>(Clock::now() - start_point).count();
std::atomic_thread_fence(std::memory_order_relaxed);
return static_cast<Rep>(counted_time);

;

using precise_stopwatch = stopwatch<>;
using system_stopwatch = stopwatch<std::chrono::system_clock>;
using monotonic_stopwatch = stopwatch<std::chrono::steady_clock>;


#include <stdexcept>
#include <fstream>

int main()

std::ofstream measurings("measurings.csv");
if (!measurings)
throw std::runtime_error("measurings file opening failed");

for (unsigned power = 4; power <= 16; ++power)

for (auto size = 0ull; size <= 20'000; ++size)

std::cout << "radix sorting vector of size " << size << 'n';
auto input = generate_vector(size, power);
shino::precise_stopwatch stopwatch;
shino::radix_sort(input.begin(), input.end(), power);
auto elapsed_time = stopwatch.elapsed_time();
measurings << power << ',' << size << ',' << elapsed_time << 'n';
if (not std::is_sorted(input.begin(), input.end()))
throw std::logic_error("radix sort doesn't work correctly on size " + std::to_string(size));





The code records performance and checks if sort actually worked correctly. So if you make any changes, just run the code to test it. stopwatch code is taken from here. In fact, I believe it deserves a review on its own.



I couldn't get gnuplot to plot 3d graphs, sorry no pics for the changes related to maximum power.




Concerns



  • Is algorithm implemented well?

Are there points to improve performance? Make a better use of standard library?



  • Will the inner lambda cause code bloat?

Lambdas are anonymous classes, so will be instantiated for every version of radix sort itself. I'm unsure if using lambda was a wise choice.



  • Anything else that comes to mind






share|improve this question













Introduction



I implemented a restricted version of the sorting algorithm, the one that works only with unsigned integers. I wanted to try using std::stable_partition with bitmask, as I chose base 2. The reason partition works here is because there are only two "bins", thus it is somewhat like partitioning. I mandated customizing the maximum power of two that is encountered in the sequence.




Code



#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;






Weird results



The algorithm has a lottle of wiggle regarding the constant factor, though it is linear. The picture below is for highest power of 16, x-axis is element count, and y-axis is nanoseconds (sorry I'm not on good terms with gnuplot). So it takes ~1.6 milliseconds to sort 30'000 elements, which is I think quite a lot for linear sorting algorithm.



Performance of radix sort



I believe those spikes are context switches, but I couldn't verify that. It seems like there is no way to reliably track them and correspond to the spike.




Full code



#include <vector>
#include <algorithm>
#include <random>

std::vector<unsigned> generate_vector(std::size_t size, unsigned power_limit)

auto generator = [power_limit]

static std::mt19937 twister;
static std::uniform_int_distribution<unsigned > dist0u, (1u << power_limit);
return dist(twister);
;

std::vector<unsigned> v(size);
std::generate(v.begin(), v.end(), generator);

return v;


#include <iterator>
#include <algorithm>

namespace shino

template <typename BidirIterator>
void radix_sort(BidirIterator first, BidirIterator last, unsigned upper_power_bound)

using category = typename std::iterator_traits<BidirIterator>::iterator_category;
using value_type = typename std::iterator_traits<BidirIterator>::value_type;
static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");
value_type mask = 1;
for (auto power = 0u; power <= upper_power_bound; ++power)

auto predicate = [mask](const auto& value)

return !(value & mask);
;
std::stable_partition(first, last, predicate);
mask <<= 1;




#include <chrono>
#include <atomic>

namespace shino

template <typename Clock = std::chrono::high_resolution_clock>
class stopwatch

const typename Clock::time_point start_point;
public:
stopwatch() :
start_point(Clock::now())


template <typename Rep = typename Clock::duration::rep, typename Units = typename Clock::duration>
Rep elapsed_time() const

std::atomic_thread_fence(std::memory_order_relaxed);
auto counted_time = std::chrono::duration_cast<Units>(Clock::now() - start_point).count();
std::atomic_thread_fence(std::memory_order_relaxed);
return static_cast<Rep>(counted_time);

;

using precise_stopwatch = stopwatch<>;
using system_stopwatch = stopwatch<std::chrono::system_clock>;
using monotonic_stopwatch = stopwatch<std::chrono::steady_clock>;


#include <stdexcept>
#include <fstream>

int main()

std::ofstream measurings("measurings.csv");
if (!measurings)
throw std::runtime_error("measurings file opening failed");

for (unsigned power = 4; power <= 16; ++power)

for (auto size = 0ull; size <= 20'000; ++size)

std::cout << "radix sorting vector of size " << size << 'n';
auto input = generate_vector(size, power);
shino::precise_stopwatch stopwatch;
shino::radix_sort(input.begin(), input.end(), power);
auto elapsed_time = stopwatch.elapsed_time();
measurings << power << ',' << size << ',' << elapsed_time << 'n';
if (not std::is_sorted(input.begin(), input.end()))
throw std::logic_error("radix sort doesn't work correctly on size " + std::to_string(size));





The code records performance and checks if sort actually worked correctly. So if you make any changes, just run the code to test it. stopwatch code is taken from here. In fact, I believe it deserves a review on its own.



I couldn't get gnuplot to plot 3d graphs, sorry no pics for the changes related to maximum power.




Concerns



  • Is algorithm implemented well?

Are there points to improve performance? Make a better use of standard library?



  • Will the inner lambda cause code bloat?

Lambdas are anonymous classes, so will be instantiated for every version of radix sort itself. I'm unsure if using lambda was a wise choice.



  • Anything else that comes to mind








share|improve this question












share|improve this question




share|improve this question








edited Jun 23 at 7:35
























asked Jun 22 at 22:53









Incomputable

5,99721144




5,99721144







  • 1




    One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
    – 1201ProgramAlarm
    Jun 23 at 1:42










  • @1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
    – Incomputable
    Jun 23 at 7:36










  • Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
    – JDługosz
    Jun 23 at 9:23










  • @JDługosz, I believe it does. I’m not sure though.
    – Incomputable
    Jun 23 at 10:34










  • Could I get a copy of your cvs file please? :)
    – Rakete1111
    Jun 26 at 15:55












  • 1




    One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
    – 1201ProgramAlarm
    Jun 23 at 1:42










  • @1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
    – Incomputable
    Jun 23 at 7:36










  • Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
    – JDługosz
    Jun 23 at 9:23










  • @JDługosz, I believe it does. I’m not sure though.
    – Incomputable
    Jun 23 at 10:34










  • Could I get a copy of your cvs file please? :)
    – Rakete1111
    Jun 26 at 15:55







1




1




One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
– 1201ProgramAlarm
Jun 23 at 1:42




One issue is when you calculate the elapsed time. This should be done right after the sorting is done, before you start doing any output. That output will have variable amounts of additional time, depending on the internals of ofstream and in particular when it needs to stop and flush the data buffer to disk.
– 1201ProgramAlarm
Jun 23 at 1:42












@1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
– Incomputable
Jun 23 at 7:36




@1201ProgramAlarm, thanks, fixed. I guess I shouldn't write algorithms after 20 hours of no sleep...
– Incomputable
Jun 23 at 7:36












Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
– JDługosz
Jun 23 at 9:23




Does the stopwatch class take efforts to prevent code movement during optimization and re-ordering during execution? It could be inaccurate on the scale of a few hundred instructions.
– JDługosz
Jun 23 at 9:23












@JDługosz, I believe it does. I’m not sure though.
– Incomputable
Jun 23 at 10:34




@JDługosz, I believe it does. I’m not sure though.
– Incomputable
Jun 23 at 10:34












Could I get a copy of your cvs file please? :)
– Rakete1111
Jun 26 at 15:55




Could I get a copy of your cvs file please? :)
– Rakete1111
Jun 26 at 15:55










1 Answer
1






active

oldest

votes

















up vote
5
down vote



accepted










I'll focus just on the algorithm (not the timing code), and the questions.



static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
"At least bidirectional iterators are required");
static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
"Only unsigned integers are supported for radix sort");


It seems a bit superfluous to say "static_assert(is_foo, "must be foo")". The assertion predicates are clear enough on their own to not require additional explanation. Also, I wouldn't recommend combining requirements. It makes it harder to diagnose exactly which assertion failed:



static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>);
static_assert(std::is_integral_v<value_type>);
static_assert(std::is_unsigned_v<value_type>);


Down to the lambda:



auto predicate = [mask](const auto& value)

return !(value & mask);
;


Unless you're planning to support non-fundamental integer types, taking the argument by const& is probably an unnecessary pessimization. Even if it's not a built-in type, if the binary & operator and conversion to bool are implemented sanely, the lambda's frame can be basically optimized away. It's nice to be able to do everything in-line:



std::stable_partition(first, last, [mask](auto v) return !(v & mask); );


As for the questions:



Is it implemented well? I'd say so. I can't think of many improvements/extensions I'd make. Maybe a check to make sure upper_power_bound is less than the number of bits in value_type so you're not just spinning your wheels... but that could be tricky to pull off generally if you're supporting non-builtin integer types (maybe check to see if mask is zero, assuming the 1 bit drops off the end into the ether if you shift it too far?). Maybe a range version, but that's a pretty trivial addition once the iterator version's working.



Will the inner lambda cause code bloat? No. On any compiler worth mentioning and on any optimization level worth caring about, it will be inlined - likely to a single instruction, maybe two (assuming a built-in unsigned type). Modern compilers are terrifyingly good at optimizing lambdas (or anything where all the code is right there in front of them, really).



Even if it weren't inlined to effectively nothing, the lambda would be dwarfed by std::stable_partition()'s code. Like by two or three orders of magnitude, easy. It's totally not something to worry about.






share|improve this answer





















    Your Answer




    StackExchange.ifUsing("editor", function ()
    return StackExchange.using("mathjaxEditing", function ()
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
    );
    );
    , "mathjax-editing");

    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "196"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    convertImagesToLinks: false,
    noModals: false,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );








     

    draft saved


    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f197087%2fimplementing-and-benchmarking-radix-sort-on-unsigned-integers%23new-answer', 'question_page');

    );

    Post as a guest






























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    5
    down vote



    accepted










    I'll focus just on the algorithm (not the timing code), and the questions.



    static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
    "At least bidirectional iterators are required");
    static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
    "Only unsigned integers are supported for radix sort");


    It seems a bit superfluous to say "static_assert(is_foo, "must be foo")". The assertion predicates are clear enough on their own to not require additional explanation. Also, I wouldn't recommend combining requirements. It makes it harder to diagnose exactly which assertion failed:



    static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>);
    static_assert(std::is_integral_v<value_type>);
    static_assert(std::is_unsigned_v<value_type>);


    Down to the lambda:



    auto predicate = [mask](const auto& value)

    return !(value & mask);
    ;


    Unless you're planning to support non-fundamental integer types, taking the argument by const& is probably an unnecessary pessimization. Even if it's not a built-in type, if the binary & operator and conversion to bool are implemented sanely, the lambda's frame can be basically optimized away. It's nice to be able to do everything in-line:



    std::stable_partition(first, last, [mask](auto v) return !(v & mask); );


    As for the questions:



    Is it implemented well? I'd say so. I can't think of many improvements/extensions I'd make. Maybe a check to make sure upper_power_bound is less than the number of bits in value_type so you're not just spinning your wheels... but that could be tricky to pull off generally if you're supporting non-builtin integer types (maybe check to see if mask is zero, assuming the 1 bit drops off the end into the ether if you shift it too far?). Maybe a range version, but that's a pretty trivial addition once the iterator version's working.



    Will the inner lambda cause code bloat? No. On any compiler worth mentioning and on any optimization level worth caring about, it will be inlined - likely to a single instruction, maybe two (assuming a built-in unsigned type). Modern compilers are terrifyingly good at optimizing lambdas (or anything where all the code is right there in front of them, really).



    Even if it weren't inlined to effectively nothing, the lambda would be dwarfed by std::stable_partition()'s code. Like by two or three orders of magnitude, easy. It's totally not something to worry about.






    share|improve this answer

























      up vote
      5
      down vote



      accepted










      I'll focus just on the algorithm (not the timing code), and the questions.



      static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
      "At least bidirectional iterators are required");
      static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
      "Only unsigned integers are supported for radix sort");


      It seems a bit superfluous to say "static_assert(is_foo, "must be foo")". The assertion predicates are clear enough on their own to not require additional explanation. Also, I wouldn't recommend combining requirements. It makes it harder to diagnose exactly which assertion failed:



      static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>);
      static_assert(std::is_integral_v<value_type>);
      static_assert(std::is_unsigned_v<value_type>);


      Down to the lambda:



      auto predicate = [mask](const auto& value)

      return !(value & mask);
      ;


      Unless you're planning to support non-fundamental integer types, taking the argument by const& is probably an unnecessary pessimization. Even if it's not a built-in type, if the binary & operator and conversion to bool are implemented sanely, the lambda's frame can be basically optimized away. It's nice to be able to do everything in-line:



      std::stable_partition(first, last, [mask](auto v) return !(v & mask); );


      As for the questions:



      Is it implemented well? I'd say so. I can't think of many improvements/extensions I'd make. Maybe a check to make sure upper_power_bound is less than the number of bits in value_type so you're not just spinning your wheels... but that could be tricky to pull off generally if you're supporting non-builtin integer types (maybe check to see if mask is zero, assuming the 1 bit drops off the end into the ether if you shift it too far?). Maybe a range version, but that's a pretty trivial addition once the iterator version's working.



      Will the inner lambda cause code bloat? No. On any compiler worth mentioning and on any optimization level worth caring about, it will be inlined - likely to a single instruction, maybe two (assuming a built-in unsigned type). Modern compilers are terrifyingly good at optimizing lambdas (or anything where all the code is right there in front of them, really).



      Even if it weren't inlined to effectively nothing, the lambda would be dwarfed by std::stable_partition()'s code. Like by two or three orders of magnitude, easy. It's totally not something to worry about.






      share|improve this answer























        up vote
        5
        down vote



        accepted







        up vote
        5
        down vote



        accepted






        I'll focus just on the algorithm (not the timing code), and the questions.



        static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
        "At least bidirectional iterators are required");
        static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
        "Only unsigned integers are supported for radix sort");


        It seems a bit superfluous to say "static_assert(is_foo, "must be foo")". The assertion predicates are clear enough on their own to not require additional explanation. Also, I wouldn't recommend combining requirements. It makes it harder to diagnose exactly which assertion failed:



        static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>);
        static_assert(std::is_integral_v<value_type>);
        static_assert(std::is_unsigned_v<value_type>);


        Down to the lambda:



        auto predicate = [mask](const auto& value)

        return !(value & mask);
        ;


        Unless you're planning to support non-fundamental integer types, taking the argument by const& is probably an unnecessary pessimization. Even if it's not a built-in type, if the binary & operator and conversion to bool are implemented sanely, the lambda's frame can be basically optimized away. It's nice to be able to do everything in-line:



        std::stable_partition(first, last, [mask](auto v) return !(v & mask); );


        As for the questions:



        Is it implemented well? I'd say so. I can't think of many improvements/extensions I'd make. Maybe a check to make sure upper_power_bound is less than the number of bits in value_type so you're not just spinning your wheels... but that could be tricky to pull off generally if you're supporting non-builtin integer types (maybe check to see if mask is zero, assuming the 1 bit drops off the end into the ether if you shift it too far?). Maybe a range version, but that's a pretty trivial addition once the iterator version's working.



        Will the inner lambda cause code bloat? No. On any compiler worth mentioning and on any optimization level worth caring about, it will be inlined - likely to a single instruction, maybe two (assuming a built-in unsigned type). Modern compilers are terrifyingly good at optimizing lambdas (or anything where all the code is right there in front of them, really).



        Even if it weren't inlined to effectively nothing, the lambda would be dwarfed by std::stable_partition()'s code. Like by two or three orders of magnitude, easy. It's totally not something to worry about.






        share|improve this answer













        I'll focus just on the algorithm (not the timing code), and the questions.



        static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>,
        "At least bidirectional iterators are required");
        static_assert(std::is_integral_v<value_type> and std::is_unsigned_v<value_type>,
        "Only unsigned integers are supported for radix sort");


        It seems a bit superfluous to say "static_assert(is_foo, "must be foo")". The assertion predicates are clear enough on their own to not require additional explanation. Also, I wouldn't recommend combining requirements. It makes it harder to diagnose exactly which assertion failed:



        static_assert(std::is_base_of_v<std::bidirectional_iterator_tag, category>);
        static_assert(std::is_integral_v<value_type>);
        static_assert(std::is_unsigned_v<value_type>);


        Down to the lambda:



        auto predicate = [mask](const auto& value)

        return !(value & mask);
        ;


        Unless you're planning to support non-fundamental integer types, taking the argument by const& is probably an unnecessary pessimization. Even if it's not a built-in type, if the binary & operator and conversion to bool are implemented sanely, the lambda's frame can be basically optimized away. It's nice to be able to do everything in-line:



        std::stable_partition(first, last, [mask](auto v) return !(v & mask); );


        As for the questions:



        Is it implemented well? I'd say so. I can't think of many improvements/extensions I'd make. Maybe a check to make sure upper_power_bound is less than the number of bits in value_type so you're not just spinning your wheels... but that could be tricky to pull off generally if you're supporting non-builtin integer types (maybe check to see if mask is zero, assuming the 1 bit drops off the end into the ether if you shift it too far?). Maybe a range version, but that's a pretty trivial addition once the iterator version's working.



        Will the inner lambda cause code bloat? No. On any compiler worth mentioning and on any optimization level worth caring about, it will be inlined - likely to a single instruction, maybe two (assuming a built-in unsigned type). Modern compilers are terrifyingly good at optimizing lambdas (or anything where all the code is right there in front of them, really).



        Even if it weren't inlined to effectively nothing, the lambda would be dwarfed by std::stable_partition()'s code. Like by two or three orders of magnitude, easy. It's totally not something to worry about.







        share|improve this answer













        share|improve this answer



        share|improve this answer











        answered Jun 23 at 11:08









        indi

        3,021417




        3,021417






















             

            draft saved


            draft discarded


























             


            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f197087%2fimplementing-and-benchmarking-radix-sort-on-unsigned-integers%23new-answer', 'question_page');

            );

            Post as a guest













































































            Popular posts from this blog

            Greedy Best First Search implementation in Rust

            Function to Return a JSON Like Objects Using VBA Collections and Arrays

            C++11 CLH Lock Implementation