# An algorithm for comparing large rational numbers in scientific notation

Say you're given two numbers in the following representation:

x / y × 10 ^ a
u / v × 10 ^ b

with all the variables being integers and, in particular, y and v being always positive. How would you compare these two numbers efficiently in software? For sufficiently small numbers, it's not a problem since we can rely on floating-point hardware. But what about exact, arbitrarily large numbers?

You could convert them both into fractions with the same denominator and then compare the integers, but that can be rather inefficient if a and b are wildly different. (Similarly, addition and subtraction can be quite expensive if a and b differ greatly, but sadly there's no way to get around that.)

## The algorithm

We can write this in the form of an inequality:

x / y × 10 ^ a  ~  u / v × 10 ^ b

where the "unknown variable" to be solved is ~ and can be >, =, or <. We can manipulate this just like an equation, but with one caveat: we mustn't multiply or divide by a number unless we are absolutely sure that it's positive.

Since y and v are both positive, we can safely multiply by those to obtain:

x × v × 10 ^ a  ~  u × y × 10 ^ b

Now let's define some variables to clean up the inequality:

m ≡ x × v
n ≡ u × y

It's helpful to split the variables m and n into their signs and moduli:

σ ≡ sgn m
τ ≡ sgn n

M ≡ |m|
N ≡ |n|

The inequality then becomes:

σ × M × 10 ^ a  ~  τ × N × 10 ^ b

The signs σ and τ can already be used to determine the ordering in 7 out of the 9 cases:

 σ   τ  ~
---------
−1  −1  ?
−1   0  <
−1  +1  <
0  −1  >
0   0  =
0  +1  <
+1  −1  >
+1   0  >
+1  +1  ?

That leaves us with the two nontrivial cases where both signs are positive or both signs are negative. In these cases:

σ = τ = +1   ⟹   M × 10 ^ a  ~  N × 10 ^ b
σ = τ = −1   ⟹   N × 10 ^ b  ~  M × 10 ^ a

We'll consider only the first case. The second case is identical, just reversed.

The trick to making the comparisons more efficient is to use logarithms: since the logarithmic function is monotonic, they preserve the ordering and can be safely applied to both sides of the inequality:

log M + a  ~  log N + b

The problem is that logarithms aren't that easy to compute. But we can substitute this by computing floored logarithms (ilog) and its fractional part (rlog), which satisfy the following for any positive z:

log z ≡ ilog z + rlog z

0 ≤ rlog z < 1

ilog z ∈ ℕ

Floored logarithms are easy to calculate: simply combine trial division with exponential search. Here is an example of such in Haskell, based on the implementation of integerLogBase in section 14.4 of The Haskell 98 Language Report:

-- precondition: base is greater than one and num is positive
ilogBase :: (Integral a, Integral b) => a -> a -> b
ilogBase base num = fst (ilogr base num)
where ilogr b n | n  < b    =      (0, n)
| n' < b    =     (l2, n')
| otherwise = (1 + l2, n' div b)
where (l, n') = ilogr (b * b) n
l2      = l * 2

Applying ilog and rlog to the inequality, we obtain:

ilog M + rlog M + a  ~  ilog N + rlog N + b

Notice that the fractional parts of the inequality satisfies:

|rlog N − rlog M| < 1

thus, as long as the integral parts of the inequality differ, the comparison can be done without even worrying about the the fractional parts:

ilog M + a  ~  ilog N + b

Lastly, what about the situation where the integral parts are equal? We could use the fractional part of the logarithm somehow but that would introduce unnecessary complexities here. Instead, we can consult the previous equation:

M × 10 ^ a  ~  N × 10 ^ b

To perform this comparison without using rational arithmetic, we can divide both sides by 10 ^ min (a, b) to make the exponent nonnegative:

M × 10 ^ (a − min(a, b))  ~  N × 10 ^ (b − min(a, b))

Now, this comparison can be computed directly using only integer arithmetic.

## Algorithmic complexity

The algorithm involves performing some integer arithmetic and comparisons, as well as handling several subcases. Let

X = max(|x|, |u|)
Y = max(y, v)
A = max(|a|, |b|)

The complexity of this algorithm has no "unexpected surprises": in the worst case, one would need to calculate the floored logarithm, which is roughly O(max (log X, log Y)). Therefore, the overall complexity is roughly:

O(max(log X, log Y, log A))

Compare this to the naive technique that involves converting both numbers into fractions, which in the worst case could be as bad as O(A).

# Chart: hierarchy of the numeric type classes in Haskell

Here's a simple chart that shows the hierarchy of the various number-related type classes in Haskell. Also shown are the methods required for a minimal complete definition. (Click to zoom in!)

You can also get the SVG version.

# Pasting Unicode characters in Emacs (Windows)

Previously, I ran into an issue with pasting non-ASCII characters into Emacs on my Windows system. For example, copying a Greek small letter rho (ρ) would result in a question mark (?) appearing in the Emacs editor. Oddly enough, copying Unicode characters from Emacs generally worked fine.

Originally, I assumed it was a bug in Emacs. Yet, I couldn't find any bug reports about it so that seemed unlikely. Eventually, I came to realize that one of the settings in my Emacs profile (.emacs or init.el) caused the problem:

(set-selection-coding-system 'utf-8)              ; wrong

As it turns out, the correct value is 'utf-16-le because the Windows API is built entirely on top of UTF-16:

(set-selection-coding-system 'utf-16-le)          ; correct

The problem was introduced a while back: it was part of a hack to force Emacs to use UTF-8 as the default buffer encoding. At the time, I copied the code from this StackOverflow answer (fixed now) without too much thought, not realizing it would cause a bug discovered several months later. It turns out that set-selection-coding-system and set-clipboard-coding-system actually do the same thing, but the connection between selections and clipboards is not exactly obvious until one becomes familiar with the Emacs jargon.

Fortunately, the choice of the selection-coding-system doesn't affect the encoding of the buffer itself, so it doesn't invalidate the hack I used to enforce UTF-8 as the default buffer encoding.