Interpolation Search from Any Distribution

In my last blog post, I introduced interpolation search, a searching algorithm that can beat binary search asymptotically. To introduce the concept of interpolation search, I made a simplifying assumption that the underlying data came from a uniform distribution. It is still possible to use interpolation search on non-uniform distributions, as long as you have the CDF for that distribution.

Background Info
A cumulative distribution function is a function that tells you the probability that a random item from the distribution is less than or equal to a specified value.
$$ F(k) = P(X \leq k) $$ Thus, $F(k)$ can take on values in the range $[0,1]$ and it is non-descending. For a discrete uniform distribution of integers in the range $[1,10]$, the CDF looks like:




And for a continuous normal distribution, the CDF looks like:


How to get the CDF

There are a number of options to get a handle on the CDF. You can find information online about your distribution, do a one-pass through your data to calculate it, or take a random sample from your data set to approximate it. If you have an array of integers in the range $[1,10]$ as above, you can calculate the CDF and store it in an array as follows:
static double[] cdf(int[] data) {
    double[] freq = new double[11], cdf = new double[11];
    for(int d : data)
        freq[d]++;
    cdf[0] = freq[0] / data.length;
    for(int i=1; i < cdf.length; i++)
        cdf[i] = cdf[i-1] + freq[i] / data.length;
    return cdf;
}
Here, $ F(k) = P(X \leq k) = cdf[k] $, where $ cdf = ecdf(data) $. You only need to compute the cdf once - once you have it you never have to recalculate it.

Distribution of Words in a Dictionary

Here is the distribution of the first letter of words in the English language:

As you can see, the distribution of letters is not uniform. In order to compute the CDF, we need to systematically order the letters by letting $a=1,b=2,c=3,...,z=26$. Here is the corresponding CDF for this distribution:

This is the distribution that we can use in the interpolation search. We can use this CDF to answer a question such as: What is probability that the first letter of a random word comes before 't'. Equivalently, it can tell you the expected location of the word in the sorted array. For example, if I were to search for the word "interpolation", I would compute $F(′I′)=F(9)=0.431$. Then, I can check $index=⌊n \cdot 0.431⌋$, where $n$ is the length of the array. Depending on on the word at that index, I can either check the left sub-array or the right sub-array and refine the CDF accordingly.

Implementation 1

In this section, I will describe how the CDF can be used in the standardized version of interpolation search. In the previous section, I used only the first letter to build a distribution, but you can improve the expected time complexity at the cost of additional space by using the first two letters instead.

The first thing we need is a way to map two letter Strings to integers that can index into the array. The most natural definition is given by:

The above code snippet makes two simplifying assumptions that you may need to account for in your own implementation:
  • All strings have at least two characters.
  • Only lower case letters will be used in the first two characters of every string.
We will use this index to index into the CDF array. This algorithm is going to be very similar to the standard binary search algorithm with one important difference (and some other minor details as well). In binary search, you calculate the element to compare to the key by the formula
$$ mid = \frac{low+high}{2} $$
but in interpolation search, we must change this formula to something along the lines of
$$ mid = low + \frac{cdf[key] - cdf[s_{min}]}{cdf[s_{max}] - cdf[s_{min}]} \cdot (high-low) $$
where \( s_{min} \) and \( s_{max} \) are the elements at position \( low \) and \( high \) respectively.  In the initial guess for \( mid \), we can use a simpler formula by noting that \( cdf[s_{min}] = 0 \) and \( cdf[s_{max}] = 1 \)
$$ mid = cdf[key] \cdot a.length $$ While in theory this formula should predict a good starting index, I have found it difficult to implement in practice. Additionally, the overhead of computing this relatively complex formula can hurt the performance gain of predicting a good index, especially for smaller array sizes.

Implementation 2

The second implementation of interpolation search can accommodate for the change in distribution with virtually no change in the initial algorithm. We can compute the range that we expect to find the key in using the CDF as follows:
$$ \bar{X} = n \cdot cdf[key] \\
\sigma = \sqrt{cdf[key] \cdot (1-cdf[key]) \cdot n} \\
Range = [\bar{X} - 3 \sigma, \bar{X} + 3 \sigma] $$
where n is the size of the array. The size of the range is proportional to $\sqrt{n}$, and the range contains the key $99.7\%$ of the time. Thus, we will achieve $O(log(\sqrt{n}))=O(0.5 \cdot log(n))$ performance with high probability. While this is asymptotically the same performance as binary search, it reduces the search space enough in practice to make a noticeable difference. Here is a java implementation of this search, assuming we have a two character CDF as described above:
static int index(String s) {
  return 26*(s.charAt(0)-'a')+s.charAt(1)-'a';
}

static int interpSearch(String[] a, double[] cdf, String key) {
  double p = cdf[index(key)];
  double mean = p*a.length;
  double sd = Math.sqrt(p * (1-p) * a.length);

  int low = Math.max(0, (int) (mean - 3*sd));
  int high = Math.min(a.length-1, (int) (mean + 3*sd));

  if(a[low] > key)
    return Arrays.binarySearch(a, 0, low, key);
  else if(a[high] < key)
    return Arrays.binarySearch(a, high, a.length, key);
  return Arrays.binarySearch(a, low, high, key);
}
This is a trivial translation from the formula's stated above, and is much easier to implement than the other version of interpolation search. Further, since the expensive computations are only done once, there is very little overhead. Thus, this version of interpolation search outperforms binary search for small ($ \leq 2^{17} $) and large arrays.


Conclusion

Once again, the second interpolation search has proven to be a favorable option for efficient searches. Even though it doesn't achieve the same theoretical performance as the first implementation, it has fewer expensive calculations, and thus performs better in practice. The first implementation might still be useful for use cases where array lookups are much more computationally expensive than the index calculation, however. However, the ease of implementation for the second version makes it a very attractive option.

Comments

Post a Comment

Popular posts from this blog

Efficiently Remove Duplicate Rows from a 2D Numpy Array

Multi-Core Programming with Java

Beat the Streak: Day Three