quantile algorithm

An algorithm to compute the specified quantile of a sorted array of real numbers.

The n% quantile of a continuous real-valued distribution is the value below which n% of the values is expected to lie. If we are given an array of real numbers that we want to interpret as having been drawn from a distribution, we can estimate the quantiles of the underlying distribution.

1. The median

The median is a special case of a quantile: it is the 50% quantile. It is usually estimated as follows: from an odd number of values, take the middle value; form an even number, take the average of the two midmost values. For instance, if our values are 15, 20, and 32, the median is 20; if our values are 15, 20, 32, and 60, the median is 26.

This estimate is direction-independent: if we multiply all values by -1 (i.e., they become -60, -32, -20, and -15), the median is also multiplied by -1 (it becomes -26).

2. Percentiles?

The nth percentile of a set of values is usually defined as the highest attested value for which at most n% of all attested values are less or equal. For instance, if our values are 15, 20, 32, and 60, the 30th percentile is 15. Here is an extensive list:

Percentile numberValue
0-
10-
20-
3015
4015
5020
6020
7020
8032
9032
10060

However, this procedure does not yield an estimate of the quantiles of the underlying distribution. For instance, the estimate is direction-dependent: if we multiply all values by -1, the 50th percentile becomes -32 instead of -20, and the 70th percentile becomes -32 instead of the expected -15, which is minus the 30th percentile of the original data set.

3. Unbiased quantiles

To get a better estimate of the quantiles of the underlying distribution, the interpolation that we used to determine the median, is generalized to any quantile.

We assume that the attested values 15, 20, 32, and 60 each take up one quarter of the "quantile space". These four values are in the middles of those quarters, so they are at the 0.125, 0.375, 0.625, and 0.875 quantiles.

Quantiles in between 0.125 and 0.875 are evaluated by linear interpolation: the 0.25, 0.50, and 0.75 quantiles are 17.5, 26, and 46, respectively. Note that the 0.50 quantile is the median. The 0.40 quantile, for example, is estimated as 20 + (32 - 20)·(0.40 - 0.375)/(0.625 - 0.375) = 21.2.

Quantiles between 0 and 0.125 or between 0.875 and 1 are evaluated by linear extrapolation from the lowest or highest pair of values: the 0% quantile is estimated as 15 - 1/2 (20 - 15) = 12.5, and the 100% quantile is estimated as 60 + 1/2 (60 – 32) = 74. The 0.10 quantile is estimated as 12.5 + (15 – 12.5)·(0.10 – 0.0)/(0.125 – 0.0) = 14.5.

Note that the estimated values for the very low or high quantiles can lie outside the range of attested values. In fact, the computed 0% and 100% quantiles are thought to be estimates of the minimum and maximum values of the distribution. For uniform distributions, these estimates are reasonable; for a normal distribution, of course, the 0% and 100% quantiles are meaningless.

Links to this page


© ppgb 19980101