Sorting with a random comparator

Most sorting algorithms rely on the correct implementation of a comparison function that returns the ordering between two elements – that is, a function that takes two elements and returns whether the first is less than, equal to, or greater than the second:

function compare(x, y) {
    if (x < y) {
        return LESS_THAN;
    } else if (x > y) {
        return GREATER_THAN;
    } else {
        return EQUAL;
    }
}

The comparison function is required to define a valid total ordering on the elements in the sequence in order for the sorting algorithm to make any sense.

It is tempting to ask whether using a random comparison function would offer a clever way to shuffle a sequence. The short answer is no, generally speaking. Even if it does somehow appear to work, it’s really not a good idea. The correct solution is to use the Fisher-Yates shuffling algorithm.
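For reference, here is a minimal sketch of the Fisher-Yates shuffle in C++ (the fisher_yates name is mine; in practice, std::shuffle from <algorithm> already provides a correctly implemented shuffle):

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// walk from the back of the array, swapping each element with a
// uniformly chosen element at or before its own position; this makes
// every permutation equally likely
void fisher_yates(std::vector<int>& v, std::mt19937& gen)
{
    for (std::size_t i = v.size(); i > 1; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(v[i - 1], v[pick(gen)]);
    }
}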

From an engineering perspective, sorting functions are designed to sort things. Abusing one to shuffle an array is just asking for trouble: you never know whether a change in the sorting algorithm down the road might break the “shuffling” algorithm. Worse, the breakage could be subtle, introducing biases in the output that are hard to detect unless the tests rigorously verify the statistics. Moreover, some sorting algorithms can run unusually slowly with a random comparator, or take a very long time if the randomness is not in their favor.

From a mathematical perspective, bounded-time deterministic comparison sorts simply can’t give you a high-quality distribution of random permutations! In fact, if you use a random comparator with an equal probability of returning LESS_THAN or GREATER_THAN (a “uniformly random comparator”), it’s provably impossible for such an algorithm to shuffle correctly.

Suppose you have an array of size N. The probability of any element (say, the first element) ending up at any position (say, the last position) is always 1/N in a perfect shuffle.

At every step, a deterministic comparison sort must make a binary decision based on the ordering of two selected elements, leading to a binary decision tree. The outcomes of these decisions determine the final sorted output.

Now if you feed it a uniformly random comparator, the decisions made are always random: every decision has a 1/2 probability of going one way or the other. This traces a path p down the decision tree that terminates after n[p] rounds, since we are assuming bounded time. Certain paths in this decision tree lead to the scenario where the first element ends up at the last position, so the overall probability of this scenario is the sum of the probabilities of all paths p that lead to it:

∑[p ∈ all_desired_paths] (1/2)^(n[p])

Adding up powers of 1/2 always yields a fraction whose denominator is a power of 2. But the correct probability is 1/N, so unless N is itself a power of two, there is simply no way for the result to be correct! For example, with N = 3 the required probability is 1/3, which cannot be written as a fraction with a power-of-two denominator. Hence, there’s no way to get the exact probability 1/N in general.

Edit: Chris P. pointed out that the algorithm must have bounded time for this proof to work. Thanks!

Tarball mix-up in the Haskell directory library

Release management can be a real pain.

I normally do all of these steps in a single pass with the aid of a script:

  • Tag the commit using git-tag.
  • Create a GitHub Release on this tag.
  • Upload the tarball generated by cabal sdist to Hackage.

However, on this one occasion I decided to do the Hackage release later, out of caution. This “cautionary measure” ended up causing more problems than it solved, because going off-script disrupted my usual tried-and-tested workflow.

Normally a GitHub Release includes:

  • A tarball automatically generated by GitHub from the Git tree
  • A tarball generated by running cabal sdist, attached as a “binary”

But somehow for version 1.2.6.1 my upload script failed to attach the sdist-generated tarball.

Later, when I wanted to upload to Hackage, I thought it’d be a good idea to use the original sdist-generated tarball, which contained the original timestamps on the files. So I habitually downloaded the first tarball link I saw on the GitHub Release page for 1.2.6.1. Of course, that was not the sdist-generated tarball; it was the one auto-generated by GitHub.

At first, Hackage complained that the tar format was not ustar. That itself should’ve aroused suspicion, since cabal sdist always generates them in ustar format, but I didn’t pay attention.* Instead, I extracted the tarball and re-made the archive in ustar format.

The end result was that I accidentally uploaded a GitHub-generated tarball instead of an sdist-generated tarball, which didn’t work because the Autoconf-generated configure script was missing.

* I thought maybe GitHub was altering (“improving”) tarballs behind our backs. In retrospect that was a very silly idea.

A basic introduction to unique_ptr

C++11 introduced the unique_ptr template in the <memory> header. It provides a convenient way of managing memory (and more generally, lifetimes of arbitrary objects). It’s a bit odd to use, so this article provides a very basic introduction to its semantics.

The article will not attempt to explain its features in detail. You can find more info in the API documentation.

(The astute reader will notice that I’m using the same ownership terminology as Rust.)

Scalar vs array

There are two kinds of unique_ptr, one for scalars (i.e. non-arrays) and one for arrays:

  • unique_ptr<double> can hold a scalar of type double;
  • unique_ptr<double[]> can hold an array of double values with an unknown number of elements.

Of course, one may substitute double with any arbitrary type.

Construction

A std::unique_ptr variable can be constructed using make_unique:
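std::unique_ptr<double> p1 = std::make_unique<double>(3.14);
std::unique_ptr<double[]> p2 = std::make_unique<double[]>(n); // n = number of elements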

  • The first line creates a unique_ptr to a double value 3.14 and saves it in the p1 variable.
  • The second line creates a unique_ptr to a double array of n elements and saves it in the p2 variable.

Note that make_unique requires C++14. If you want to stick to plain C++11 you can substitute those with:
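std::unique_ptr<double> p1(new double(3.14));
std::unique_ptr<double[]> p2(new double[n]()); // the trailing () value-initializes the array, matching make_unique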

Ownership and uniqueness

A unique_ptr variable is said to be the owner of the object it points to: if the owner dies without passing its ownership to another variable, the object is deleted (and hence the memory is deallocated), thereby invalidating all pointers and references to it.

There can only be one owner at any one time: one does not simply copy a unique_ptr. For example:
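std::unique_ptr<double> p1 = std::make_unique<double>(3.14);
std::unique_ptr<double> p2 = p1; // error! copy construction is forbidden
p2 = p1;                         // error! copy assignment is forbidden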

Typical errors when this happens:

(g++)
error: use of deleted function ‘std::unique_ptr<…>::unique_ptr(const std::unique_ptr<…>&)’
error: use of deleted function ‘… std::unique_ptr<…>::operator=(const std::unique_ptr<…>&)’

(clang++)
error: call to deleted constructor of 'std::unique_ptr<…>'
error: overload resolution selected deleted operator '='

If an owner wants to lend the contents of a unique_ptr to another function without relinquishing its ownership, it can give that function a raw pointer or reference (the borrow function below is purely illustrative):
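// a borrowing function receives a raw pointer; it may use the object
// but must not delete it
void borrow(double* x)
{
    *x += 1.;
}

std::unique_ptr<double> p = std::make_unique<double>(3.14);
borrow(p.get()); // p still owns the object afterwards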

If an owner wants to yield its ownership of a unique_ptr to another function or variable, it can perform a move (the seize function below is purely illustrative):
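// a seizing function takes the unique_ptr by value, becoming the new
// owner; the object is deleted when the parameter q dies
void seize(std::unique_ptr<double> q)
{
}

std::unique_ptr<double> p1 = std::make_unique<double>(3.14);
std::unique_ptr<double> p2 = std::move(p1); // p2 now owns the object; p1 is empty
seize(std::move(p2));                       // ownership passes into seize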

Resetting

Although normally a unique_ptr will automatically delete its object once it dies (or when it gains ownership of something else), one can hasten the process by manually calling .reset():
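std::unique_ptr<double> p = std::make_unique<double>(3.14);
p.reset(); // the object is deleted right here; p becomes empty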

This will immediately delete the object pointed to by p and cause p to become empty.

Emptiness

A unique_ptr can be empty, in which case it owns nothing – the analog of a nullptr. This is the default state if it is not explicitly initialized with something. It is also the state after it loses its ownership of something, or is forcibly emptied via .reset().
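An empty unique_ptr converts to false in a boolean context, which gives a simple way to test for emptiness:

std::unique_ptr<double> p; // default-constructed: empty
if (!p) {
    // p owns nothing here
}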

Releasing

Normally, a unique_ptr automatically loses its ownership when it is moved into another unique_ptr. Alternatively, one can force it to give up its ownership in the form of a raw pointer via .release():
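std::unique_ptr<double> p = std::make_unique<double>(3.14);
double* q = p.release(); // p is now empty, but the object is not deleted
delete q;                // manual deletion is now our responsibility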

After releasing, the object will no longer be deleted automatically, since the unique_ptr is now empty. The programmer is then responsible for deleting it manually.

Example

Here’s a more complete example demonstrating the use of unique_ptr:

// compile with: c++ -std=c++14 -lopenblas
#include <cmath>
#include <cstddef> // required for std::size_t
#include <cstdlib>
#include <iostream>
#include <memory> // required for std::unique_ptr and std::make_unique
#include <utility> // required for std::move
#include <cblas.h> // required for cblas_daxpy

// create an owned vector initialized to zero
std::unique_ptr<double[]> new_vector(std::size_t n)
{
    return std::make_unique<double[]>(n);
}

// borrow one vector (v1), seize another (v2), and return an owned vector
// equal to their sum
std::unique_ptr<double[]> destructively_add_vectors(
    std::size_t n,
    const double* v1,
    std::unique_ptr<double[]> v2)
{
    // sum <- v2
    std::unique_ptr<double[]> sum = std::move(v2);

    // sum <- v1 + sum
    cblas_daxpy(n, 1., v1, 1, sum.get(), 1);

    // for obscure reasons (C++11 §12.8/32), using an explicit std::move here
    // is actually optional, but we use it anyway for consistency and clarity;
    // see also: https://stackoverflow.com/a/14856553
    return std::move(sum);
}

// borrow two vectors and return an owned vector equal to their sum
std::unique_ptr<double[]> add_vectors(
    std::size_t n,
    const double* v1,
    const double* v2)
{
    // v2_copy <- 0
    std::unique_ptr<double[]> v2_copy = new_vector(n);

    // v2_copy <- v2
    cblas_dcopy(n, v2, 1, v2_copy.get(), 1);

    return destructively_add_vectors(n, v1, std::move(v2_copy));
}

double fibby(double i)
{
    using std::pow;
    using std::sqrt;
    const double a = (1. + sqrt(5.)) / 2.;
    const double b = (1. - sqrt(5.)) / 2.;
    return (pow(a, i) - pow(b, i)) / sqrt(5);
}

// create an owned vector initialized to something fun
std::unique_ptr<double[]> example_vector(std::size_t n)
{
    std::unique_ptr<double[]> v = new_vector(n);
    for (std::size_t i = 0; i != n; ++i) {
        v[i] = fibby(i);
    }
    return v;
}

// borrow a vector and check that the result is correct
void check_result(std::size_t n, const double* v)
{
    for (std::size_t i = 0; i != n; ++i) {
        // always use !(difference < epsilon)
        // rather than (difference >= epsilon)
        // so NaN can't sneak past the check
        if (!(std::abs(v[i] - fibby(i) * 2.) < 1e-8)) {
            std::cerr << "what a cruel, cruel world" << std::endl;
            std::abort();
        }
    }
}

int main()
{
    const std::size_t n = 1024;

    std::unique_ptr<double[]> v1 = example_vector(n);
    std::unique_ptr<double[]> v2 = example_vector(n);

    // lend v1 and v2 to add_vectors
    std::unique_ptr<double[]> sum = add_vectors(n, v1.get(), v2.get());

    // lend sum to check_result
    check_result(n, sum.get());

    // reset sum (actually optional, since the following line will
    // automatically cause this to happen anyway)
    sum.reset();

    // lend v1 and relinquish v2 to destructively_add_vectors
    sum = destructively_add_vectors(n, v1.get(), std::move(v2));

    // lend sum to check_result
    check_result(n, sum.get());

    // v1 and sum are deleted automatically;
    // yay no memory leaks \o/
}
