Rufflewind

A notebook of random ideas.

The need to associate function pointers with environments

A friend once asked me a question like this:

void register_event_handler(void *f_ctx, void (*f)(void *ctx));

I'm a little confused about the purpose of f_ctx. Here, f is a handler function that gets called when the event is triggered, and f_ctx is -- according to the documentation -- some pointer argument that gets passed to f whenever it gets called. Why do we need f_ctx? Wouldn't f alone suffice?

This is a trick for low-level languages like C, where functions are represented as raw function pointers that do not carry an enclosing environment (sometimes called a context). It is not needed in higher-level languages with first-class functions, such as Python: those languages allow functions to be nested inside other functions and automatically bundle the enclosing environment with the function object, forming what is called a closure.

The need for an environment pointer f_ctx arises when you want to write a function that depends on external parameters not known at compile time. The f_ctx parameter allows you to smuggle these external parameters into f however you like.
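To make the question concrete, here is one way the API above might be used. This is only a sketch: the click counter, on_click, and set_up are invented for illustration; only the register_event_handler declaration comes from the question.

/* the API from the question */
void register_event_handler(void *f_ctx, void (*f)(void *ctx));

/* state the handler needs, owned by the caller */
struct click_counter {
    int clicks;
};

/* the handler receives the same ctx pointer on every event */
void on_click(void *ctx)
{
    struct click_counter *counter = ctx;
    counter->clicks++;
}

void set_up(void)
{
    /* must outlive the registration, hence static (or heap-allocated) */
    static struct click_counter counter;
    register_event_handler(&counter, &on_click);
}

The library simply stores both pointers; when the event fires it calls f with f_ctx, so the handler gets back exactly the state it was registered with.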

It might be best to illustrate this with a fuller example. Consider a 1-dimensional numerical integrator like this:

double integrate_1(
    double (*f)(double x), /* function to be integrated */
    double x1,
    double x2
);

This works fine if you know the complete form of the function f ahead of time. But what if this is not the case -- what if the function requires parameters? Say we want to calculate the gamma function using an integral:

double integrand(double x)
{
    double t = /* where do we get "t" from?? */;
    return pow(x, t - 1.0) * exp(-x);
}

double gamma_function(double t)
{
    /* how do we send the value of "t" into "integrand"? */
    return integrate_1(&integrand, 0.0, INFINITY);
}

Using integrate_1, there are only three ways to do this:

  1. Store t into a global variable, sacrificing thread safety (a sketch of this workaround follows the list). It would be bad to call gamma_function simultaneously from different threads, as they would both attempt to use the same global variable.

  2. Use a thread-local variable, a feature not available until C11. At least it is thread-safe now, but it is still not reentrant.

  3. Write raw machine code to create an integrand on the fly. This can be made thread-safe and reentrant, but it is inefficient and unportable, and it inhibits compiler optimizations.
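For comparison, here is roughly what workaround 1 looks like. This is a sketch only: current_t is an invented name, it reuses integrate_1 and the math functions from above, and marking the variable _Thread_local (C11) would turn it into workaround 2.

/* workaround 1: smuggle "t" through a global variable
   (not thread-safe, not reentrant) */
static double current_t;

double integrand(double x)
{
    return pow(x, current_t - 1.0) * exp(-x);
}

double gamma_function(double t)
{
    current_t = t; /* clobbered if another thread does the same */
    return integrate_1(&integrand, 0.0, INFINITY);
}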

However, if the numerical integrator were to be re-designed like this:

double integrate_2(
    double (*f)(void *f_ctx, double x),
    void *f_ctx, /* passed into every invocation of "f" */
    double x1,
    double x2
);

Then there is a much simpler solution that avoids all of these problems:

double integrand(void *ctx, double x)
{
    double t = *(double *)ctx;
    return pow(x, t - 1.0) * exp(-x);
}

double gamma_function(double t)
{
    return integrate_2(&integrand, &t, 0.0, INFINITY);
}

This is thread-safe, reentrant, efficient, and portable.
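To see why this works, it helps to imagine what an implementation of integrate_2 might do with f_ctx. The sketch below uses a crude fixed-step midpoint rule over a finite interval (so it could not actually handle the infinite upper limit used above); the point is only that the integrator never inspects f_ctx -- it just hands it back to f on every call:

double integrate_2(
    double (*f)(void *f_ctx, double x),
    void *f_ctx,
    double x1,
    double x2
)
{
    const int n = 1000; /* arbitrary number of steps */
    double h = (x2 - x1) / n;
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        sum += f(f_ctx, x1 + (i + 0.5) * h); /* f_ctx forwarded untouched */
    }
    return sum * h;
}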

As mentioned earlier, this problem does not exist in languages like Python where functions can be nested inside other functions (or rather, it is automatically taken care of by the language itself):

import math

def gamma_function(t):
    def integrand(x):
        return x ** (t - 1.0) * math.exp(-x)
    return integrate(integrand, 0.0, math.inf)


Sorting with a random comparator

Most sorting algorithms rely on the correct implementation of a comparison function that returns the ordering between two elements -- that is, a function that takes two elements and returns whether the first is less than, equal to, or greater than the second:

function compare(x, y) {
    if (x < y) {
        return LESS_THAN;
    } else if (x > y) {
        return GREATER_THAN;
    } else {
        return EQUAL;
    }
}

The comparison function is required to define a valid total ordering on the elements in the sequence in order for the sorting algorithm to make any sense.

It is tempting to ask whether using a random comparison function would offer a clever way to shuffle a sequence. The short answer is no, generally speaking. It's really not a good idea to do that even if it does somehow magically work. The correct solution is to use the Fisher-Yates shuffling algorithm.
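For reference, a minimal Fisher-Yates sketch in C might look like the following. It uses rand() purely for brevity; a serious implementation would want a better random source and an unbiased way of picking j.

#include <stdlib.h>

/* shuffle a[0..n-1] in place */
void shuffle(int *a, size_t n)
{
    size_t i;
    if (n < 2) {
        return;
    }
    for (i = n - 1; i > 0; i--) {
        /* pick j from 0..i (rand() % ... has a slight modulo bias) */
        size_t j = (size_t)rand() % (i + 1);
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}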

From an engineering perspective, sorting functions are designed to sort things. Abusing one to shuffle an array is just asking for trouble, because you never know whether a change in the sorting algorithm down the road might break the "shuffling" algorithm. Worse, the breakage could be subtle, creating biases in the output that might be hard to detect unless the tests rigorously verify the statistics. Moreover, some sorting algorithms can run unusually slowly with a random comparator, or take a very long time to finish if the randomness is not in their favor.

From a mathematical perspective, deterministic comparison sorts just can't give you a high quality distribution of random permutations! In fact, if you use a random comparator with an equal probability of returning LESS_THAN or GREATER_THAN (a "uniformly random comparator"), it's provably impossible for a deterministic comparison sort to shuffle correctly.

Suppose you have an array of size N. The probability of any element (say, the first element) ending up at any position (say, the last position) is always 1/N in a perfect shuffle.

At every step, a deterministic comparison sort must make a binary decision based on the ordering between two selected elements, leading to a binary decision tree. The outcomes of these decisions determine the final sorted output.

Now if you feed it a uniformly random comparator, the decisions made are always random. This means every decision has a 1/2 probability of going either way. This leads to a path p down the decision tree that eventually terminates after n[p] rounds. Certain paths in this decision tree lead to the scenario where the first element ends up at the last position. Thus, the overall probability of this scenario is the sum of the probabilities of every path p that leads to the desired scenario:

∑[p ∈ all_desired_paths] (1/2)^(n[p])

Adding up finitely many powers of 1/2 will always yield a fraction whose denominator is a power of 2. Since the correct probability is 1/N, there is simply no way for the result to be correct unless N happens to be a power of two -- so in general there is no way to get the exact probability 1/N.
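As a concrete check, take N = 3: a perfect shuffle needs each element to land in each position with probability 1/3, but the sum above can only produce fractions of the form k/2^m (such as 1/2, 3/8, or 5/16), and no such fraction equals 1/3 exactly.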


Tarball mix-up in the Haskell directory library

Release management can be a real pain.

I normally do all of these steps in a single pass with the aid of a script:

  • Tag the commit using git-tag.
  • Create a GitHub Release on this tag.
  • Upload the tarball generated by cabal sdist to Hackage.

However, on this one occasion I decided to do the Hackage release later, out of caution. But this "cautionary measure" ended up causing more problems than it solved, because I went off-script and disrupted my usual tried-and-tested workflow.

Normally a GitHub Release includes:

  • A tarball automatically generated by GitHub from the Git tree
  • A tarball generated by running cabal sdist, attached as a "binary"

But somehow for version 1.2.6.1 my upload script failed to attach the sdist-generated tarball.

Later, when I wanted to upload to Hackage, I thought it'd be a good idea to use the original sdist-generated tarball, which contained the original timestamps on the files. So, out of habit, I downloaded the first tarball link I saw on the GitHub Release page for 1.2.6.1. Obviously that was not the sdist-generated tarball -- it was the one auto-generated by GitHub.

At first, Hackage complained that the tar format was not ustar. That alone should've aroused suspicion, since cabal sdist always generates tarballs in ustar format, but I didn't pay attention.* Instead, I extracted the tarball and re-made the archive in ustar format.

The end result was that I accidentally uploaded a GitHub-generated tarball instead of an sdist-generated tarball, which didn't work because the Autoconf-generated configure script was missing.

* I thought maybe GitHub was altering ("improving") tarballs behind our backs. In retrospect that was a very silly idea.
