An overview of sorting algorithms and a pretty fast multithreaded Quicksort in C

The first examples are demonstrated in the simplified programming language kabas.online and are runnable programs. There is an interactive online tutorial. Alternatively, you can just view them as more formal pseudocode examples.

There is some textual explanation of the different sorting algorithms. For a better understanding it is very helpful to study pictorial representations of the algorithms located at Wikipedia or at other sites.

Test and initialization code

The test code calls the sorting routine with random integer values initialized arrays and measures the needed time. It’s the same code for all following sorting algorithms.

subr sort
  # sorting code
.
subr init
  for i = 0 to size data[] - 1
    data[i] = random 1000000
  .
.
subr out
  for i = 0 to size data[] - 1
    pr data[i]
  .
  pr ""
.
len = 500
while t# < 0.3
  size data[] len
  call init
  pr "Sorting " & len & " numbers ..."
  t# = time
  call sort
  t# = time - t#
  pr t# & "s"
  pr ""
  len = len * 10
.

Bubblesort - an easy entry to sorting

We have - say 10 - random numbers in an array. We want to sort it in ascending order. It works quite simple: go through the array comparing the adjacent items and swap them, if it is necessary. After this the largest number has been swapped to the right side. Repeat this process each time excluding the rightmost element, until there is only one item left.

subr sort
  for n = size data[] - 1 downto 1
    for i = 0 to n - 1
      if data[i] > data[i + 1]
        swap data[i] data[i + 1]
      .
    .
  .
.
Sorting 500 numbers ...
0.029s

Sorting 5000 numbers ...
2.455s

Selection sort - easy to understand

It works similar to Bubblesort, but is faster: search the maximum value and put it on the right side of the array. Repeat this process each time excluding the rightmost element, until there is only one item left.

subr sort
  for n = size data[] - 1 downto 1
    imax = n
    max = data[n]
    for i = 0 to n - 1
      if data[i] > max
        max = data[i]
        imax = i
      .
    .
    data[imax] = data[n]
    data[n] = max
  .
.
Sorting 500 numbers ...
0.015s

Sorting 5000 numbers ...
1.169s

Insertion sort - fast for small arrays

It is usually faster than Selection sort, but that also depends on the programming language and hardware. It works like this: go through the array and keep the left side sorted. Move the current element to the left until it is in the correct position.

subr sort
  for i = 1 to size data[] - 1
    h = data[i]
    j = i - 1
    while j >= 0 and h < data[j]
      data[j + 1] = data[j]
      j -= 1
    .
    data[j + 1] = h
  .
.
Sorting 500 numbers ...
0.016s

Sorting 5000 numbers ...
1.514s

Heapsort - fast for larger arrays

Heapsort is much faster than Insertion sort for larger arrays. It can be considered a much better variant of Selection sort: Each round, the maximum value is moved to the right side and the size of the array is reduced by one. The big improvement is that the maximum value is selected very quickly - this is achieved by keeping the left part in a heap structure.

In a heap, each node has two children smaller than the node. In an array where the node is at i, the left child is at 2i+1 and the right child at 2i+2. The first sorting step is: convert the array into a heap array. The maximum value in a heap is always at 0. If you move the maximum value into the sorted part of the array, the left side of the array must be converted to a heap again. This can be done quickly.

subr make_heap
  for i = 1 to n - 1
    if data[i] > data[(i - 1) / 2]
      j = i
      while data[j] > data[(j - 1) / 2]
        swap data[j] data[(j - 1) / 2]
        j = (j - 1) / 2
      .
    .
  .
.
subr sort
  n = size data[]
  call make_heap
  for i = n - 1 downto 1
    swap data[0] data[i]
    j = 0
    ind = 1
    while ind < i
      if ind + 1 < i and data[ind + 1] > data[ind]
        ind += 1
      .
      if data[j] < data[ind]
        swap data[j] data[ind]
      .
      j = ind
      ind = 2 * j + 1
    .
  .
.
Sorting 500 numbers ...
0.004s

Sorting 5000 numbers ...
0.042s

Sorting 50000 numbers ...
0.491s

Mergesort - fast but needs space

The speed of Mergesort is similar to Heapsort, but takes up much more space. This is how it works: imagine you have a stack of numbered cards to sort. The Mergesort method is: take two cards each and merge them into stacks of 2 cards. Then take two 2-card stacks and merge them into sorted 4-card stacks, and so on.

The merge subroutine merges two sorted subarrays into one sorted array. First the two subarrays are copied to a temporary location, then the temporary subarrays are scanned, with the lower element selected at each step and placed into the original array.

The sort subroutine turns many small arrays, starting with size one, into larger arrays by merging two of each into one until a single array is created.

subr merge
  mid = left + sz
  if mid > sz_data
    mid = sz_data
  .
  right = mid + sz
  if right > sz_data
    right = sz_data
  .
  for i = left to right - 1
    tmp[i] = data[i]
  .
  l = left
  r = mid
  for i = left to right - 1
    if r = right or l < mid and tmp[l] < tmp[r]
      data[i] = tmp[l]
      l += 1
    else
      data[i] = tmp[r]
      r += 1
    .
  .
.
subr sort
  sz_data = size data[]
  size tmp[] sz_data
  sz = 1
  while sz < sz_data
    left = 0
    while left < sz_data
      call merge
      left += sz + sz
    .
    sz += sz
  .
.
Sorting 500 numbers ...
0.003s

Sorting 5000 numbers ...
0.032s

Sorting 50000 numbers ...
0.370s

Quicksort - normally the fastest one

Quicksort is even quicker than Mergesort, and does not need much space. The way it works is not so complicated: imagine you have a stack of 200 numbered cards to sort, the Quicksort way is: split the cards by putting the cards from 1 to 100 to one stack and the others to a second stack. Then you have only to sort stacks with 100 cards, which is easier. For this you split each 100 card stack into two stacks with 50 cards and so on.

The partition subroutine splits the subarray using an element as a comparison element - this is called the pivot element - into two subarrays. Lower numbers go to the left side and higher numbers to the right side. The pivot element goes between them.

The sort subroutine calls the partition subroutine for all unsorted subarrays, these are listed in the stack to-do list. partition creates two subarrays. The larger subarray is added in the to-do list and sorting is continued with the smaller subarray. If the to-do list is empty, the job is done and the whole array is sorted.

subr partition
  swap data[(left + right) / 2] data[left]
  piv = data[left]
  mid = left
  for i = left + 1 to right
    if data[i] < piv
      mid += 1
      swap data[i] data[mid]
    .
  .
  swap data[left] data[mid]
.
subr sort
  size stack[] 64
  # this is sufficient to sort 4 billion numbers
  left = 0
  right = size data[] - 1
  while left <> -1
    if left < right
      call partition
      if mid < (right + left) / 2
        stack[sp] = mid + 1
        stack[sp + 1] = right
        right = mid - 1
      else
        stack[sp] = left
        stack[sp + 1] = mid - 1
        left = mid + 1
      .
      sp += 2
    elif sp > 0
      sp -= 2
      left = stack[sp]
      right = stack[sp + 1]
    else
      # finished
      left = -1
    .
  .
.
Sorting 500 numbers ...
0.001s

Sorting 5000 numbers ...
0.016s

Sorting 50000 numbers ...
0.199s

Sorting 500000 numbers ...
2.331s

Quicksort combined with Insertion sort - even faster

Quicksort is very fast - but for small arrays Insertion sort is faster. So lets combine these two sorting algorithms. Subarrays generated by the partiton subroutine which are smaller than 15 are sorted by Insertion sort. In addition, improve the selection of the pivot element by taking it randomly.

subr insert_sort
  for i = left + 1 to right
    h = data[i]
    j = i - 1
    while j >= left and h < data[j]
      data[j + 1] = data[j]
      j -= 1
    .
    data[j + 1] = h
  .
.
subr partition
  swap data[left + random (right - left)] data[left]
  piv = data[left]
  mid = left
  for i = left + 1 to right
    if data[i] < piv
      mid += 1
      swap data[i] data[mid]
    .
  .
  swap data[left] data[mid]
.
subr sort
  size stack[] 64
  left = 0
  right = size data[] - 1
  while left <> -1
    if right - left < 15
      call insert_sort
      if sp > 0
        sp -= 2
        left = stack[sp]
        right = stack[sp + 1]
      else
        left = -1
      .
    else
      call partition
      if mid < (right + left) / 2
        stack[sp] = mid + 1
        stack[sp + 1] = right
        right = mid - 1
      else
        stack[sp] = left
        stack[sp + 1] = mid - 1
        left = mid + 1
      .
      sp += 2
    .
  .
.
Sorting 500 numbers ...
0.001s

Sorting 5000 numbers ...
0.015s

Sorting 50000 numbers ...
0.192s

Sorting 500000 numbers ...
2.076s

Quicksort recursive - watch the stack

The following Quicksort variant is recursive. The code looks clean. But there is one dangerous thing: the stack and its limited size. The stack size is different on different operating systems. With embedded systems, it can be very small.

With the iterative variant and reserved space for the unfinished tasks, we were able to ensure that we could sort an array with a certain size. This is not possible with the recursive variant - with sorted lists and “first element is pivot” there is probably a stack overflow. With the iterative variant we get in this case “only” the worst-case time complexity.

func qsort left right . data[] .
  # 
  if left < right
    # partition 
    piv = data[left]
    mid = left
    for i = left + 1 to right
      if data[i] < piv
        mid += 1
        swap data[i] data[mid]
      .
    .
    swap data[left] data[mid]
    # 
    call qsort left mid - 1 data[]
    call qsort mid + 1 right data[]
  .
.
subr sort
  call qsort 0 size data[] - 1 data[]
.
Sorting 500 numbers ...
0.001s

Sorting 5000 numbers ...
0.013s

Sorting 50000 numbers ...
0.166s

Sorting 500000 numbers ...
1.982s

Comparison

We see that for Bubblesort, Selection sort and Insertion sort the needed time increases by quadratic order - 10 times more items leads to about 100 times the time. That makes these sorting algorithms unusable for large arrays.

With Heapsort and Mergesort the time increases only by n log(n). Heapsort and Mergesort need about the same time. Heapsort needs less memory, Mergesort can be parallelized better.

Quicksort has on average the same behaviour in time and is usually faster. It is even faster in the combination with Insertion sort. When selecting the pivot element randomly, the runtime worst-case - quadratic-order time growth - is very unlikely.


Quicksort in C

The following programs are compiled in Linux with the option -O3 and tested on an older laptop with an Intel i3 CPU.

Use the sorting function of the C library

The C library function qsort can sort any data with fixed sized and comparable elements. The caller only has to support a compare function. Although the name qsort stands for Quicksort, it can also be another sorting algorithm behind.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

int cmpfunc (const void* a, const void* b) {
    return *(int*)a - *(int*)b;
}

void sort(int* data, int len) {
    qsort(data, len, sizeof(int), cmpfunc);
}

void init(int* data, int len) {
    int i = 0;
    while (i < len) {
        data[i] = rand();
        i += 1;
    }
}

double t(void) {
    static double t0;
    struct timeval tv;
    gettimeofday(&tv, NULL);
    double h = t0;
    t0 = tv.tv_sec + tv.tv_usec / 1000000.0;
    return t0 - h;
}

void test(int* data, int len) {
    int i = 1;
    while (i < len) {
        if (data[i] < data[i - 1]) {
            printf("*** data not sorted ***\n");
            break;
        }
        i += 1;
    }
}

#define SIZE (50 * 1000 * 1000)
int data[SIZE];

int main(void) {
    int len = SIZE;
    init(data, len);
    printf("Sorting %d numbers with libc qsort ...\n", len);
    t();
    sort(data, len);
    printf("%.2f s\n", t());
    test(data, len);
    return 0;
}
Sorting 50000000 numbers with libc qsort ...
18.15 s

Use std::sort in C++

.
#include <algorithm>

void sort(int* data, int len) {
    std::sort(data, data + len);
}
.
Sorting 50000000 numbers with C++ std::sort ...
10.12 s

That’s much faster. With c++ sort the compiler can inline the comparisons being made.

Programmed Quicksort

.
void insert_sort(int* left, int* right) {

    // put minimum to left position, so we can save
    // one inner loop comparsion for insert sort
    int min = *left;
    int* pmin = left;
    int* pi = left + 1;
    while (pi <= right) {
        if (*pi < min) {
            pmin = pi;
            min = *pi;
        }
        pi += 1;
    }
    *pmin = *left;
    *left = min;

    pi = left + 2;
    while (pi <= right) {
        int h = *pi;
        int* pj = pi - 1;
        while (h < *pj) {
            *(pj + 1) = *pj;
            pj -= 1;
        }
        *(pj + 1) = h;
        pi += 1;
    }
}

#define swap(a, b) { int _h = a; a = b; b = _h; }

#define sort3fast(a, b, c)                             \
    if (b < a) {                                       \
        if (c < a) {                                   \
            if (c < b) { swap(a, c);}                  \
            else { int h = a; a = b; b = c; c = h;}    \
        }                                              \
        else { swap(a, b); }                           \
    }                                                  \
    else {                                             \
        if (c < b) {                                   \
            if (c < a) { int h = c; c = b; b = a; a = h;} \
            else { swap(b, c); }                       \
        }                                              \
    }                                                  \

int* partition(int* left, int* right) {

    int* left0 = left;
    int pivl = *left;
    left += 1;
    int* p = left0 + (right - left0) / 2;    
    int piv = *p;
    *p = *left;
    *left = piv;
    int pivr = *right;

    sort3fast(pivl, piv, pivr);

    *right = pivr;
    *left0 = pivl;
    *left = piv;

    while (1) {
        do left += 1; while(*left < piv);
        do right -= 1; while (*right > piv);
        if (left >= right) break;
        swap(*left, *right);
    }
    *(left0 + 1) = *right;
    *right = piv;
    return right;
}
void sort(int* data, int len) {

    int* stack[64];
    int sp = 0;
    int* left = data;
    int* right = data + len - 1;

    while (1) {
        // for arrays smaller than 50 use insert sort
        if (right - left < 50) {
            insert_sort(left, right);
            if (sp == 0) break;
            sp -= 2;
            left = stack[sp];
            right = stack[sp + 1];
        }
        else {
            int* mid = partition(left, right);
            if (mid < left + (right - left) / 2) {
                stack[sp] = mid + 1;
                stack[sp + 1] = right;
                right = mid - 1;
            }
            else {
                stack[sp] = left;
                stack[sp + 1] = mid - 1;
                left = mid + 1;
            }
            sp += 2;
        }
    }
}
.
Sorting 50000000 numbers with Quicksort ...
9.18 s

That’s a bit better than the C++ std::sort.

We use the median of three fixed elements for the pivot element. This prevents the O(N²) worst case from occurring with random or sorted data, but not with specially prepared input data. In this case, a good random number generator should be used for the selection of the pivot element.

Multithreaded Quicksort

An effective but too seldom used way to speed up programs is to use more than one core of the processor. This is not possible for every algorithm. Quicksort is a Divide and conquer algorithm. This kind of algorithms is usually well fitted for parallelisation. After the partition function, we have two independent arrays to sort, which could be done in parallel. Starting a thread needs time and anyway there are only a few processor cores. So we only start a new thread when the array to sort is big enough and when there are not too many threads running. When the last thread has finished its work, all the work is done.

.
#include <pthread.h>
// cc -O3 -lpthread
#define MAX_THREADS 8
int n_threads;

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void* sort_thr(void *arg) {

    int** stack = (int**)arg;
    int sp = 0;
    int* left = stack[sp];
    int* right = stack[sp + 1];

    while (1) {
        if (right - left < 50) {
            insert_sort(left, right);
            if (sp == 0) break;
            sp -= 2;
            left = stack[sp];
            right = stack[sp + 1];
        }
        else {
            int* l;
            int* r;
            int* mid = partition(left, right);
            if (mid < left + (right - left) / 2) {
                l = mid + 1;
                r = right;
                right = mid - 1;
            }
            else {
                l = left;
                r = mid - 1;
                left = mid + 1;
            }
            if (r - l > 1000000 && n_threads < MAX_THREADS) {
                // start a new thread - n_threads is a soft limit
                pthread_t thread;
                int** stack_new = malloc(64 * sizeof(int*));
                stack_new[0] = l;
                stack_new[1] = r;
                pthread_mutex_lock(&mutex);
                n_threads += 1;
                pthread_mutex_unlock(&mutex);
                pthread_create(&thread, NULL, sort_thr, stack_new);
            }
            else {
                stack[sp] = l;
                stack[sp + 1] = r;
                sp += 2;
            }
        }
    }
    free(stack);
    pthread_mutex_lock(&mutex);
    n_threads -= 1;
    if (n_threads == 0) pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);

    return NULL;
}

void sort(int* data, int len) {

    pthread_t thread;
    int** stack = malloc(64 * sizeof(int*));
    stack[0] = data;
    stack[1] = data + len - 1;
    n_threads = 1;
    pthread_create(&thread, NULL, sort_thr, stack);

    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&cond, &mutex);
    pthread_mutex_unlock(&mutex);
}

// todo: check if 'malloc' and 'pthread_create' are successful
.
Sorting 50000000 numbers with multithreaded Quicksort ...
3.38 s

That’s finally pretty fast - we can be happy with that.

For the sake of completeness: C++17 defines a parallel running std::sort. Since this is currently not available in g++, there are no test results for it.

Conclusion

Iterative Quicksort combined with Insertion sort is fast. Multithreading can dramatically improve application performance.


kabas.online - easy programming in the browser