For a while I’ve had a cheesy way of estimating percentile values for measurements taken on streaming data. That is, as data streams in, I want to be able to periodically get estimates of the percentile values for values coming in on the stream.

For example, what is the 90th percentile value of byte sizes for messages coming into my service?

Or another example is, what is the 99th percentile of elapsed times to process these messages?

It’s easy capturing min, max, and average for a set of values over some time range. But percentiles, including the median, or 50th percentile, are much harder. To be correct, you have to keep all the individual values, sort them, and then find the values 50%, 90% or 99% of the way through the sorted list.

It might be impractical to set aside memory or disk space to hold all the values coming in — there could be thousands or tens of thousands per second. And even a fast sort could be prohibitive as well. Who has that kind of time on a loaded system? That’s why some sort of estimation technique for percentiles is useful.

So far, the implementation I’ve used (suggested by one of my favorite managers), is to set aside a set of buckets, as for a histogram. As you capture values coming in, increment a counter for the right bucket. When you’re ready to estimate the percentiles, sum up all the buckets, then walk through the buckets adding the values, and when you’ve reached 50% of the total, use the value associated with that bucket as the 50% percentile. Keep going until you hit all the percentiles that you care about.

There are some problems with that, but for quick, broad estimates of percentile values, it works OK. You just have to decide how many buckets you need and carry that many longs or ints in memory — depending on how worried you are about overflow.

My next enhancement was going to provide for larger buckets at the bottom than at the top, since usually we don’t care about percentile values below the median. Also, combine the buckets with some sort of circular array to hold the highest set of values, since the 99.99th or 99.999th value is probably going represent a small number of measurements. Rather than carrying buckets to go that high, it’s probably cheaper to just keep the top handful of values.

In my Google prep, I did a heap implementation, and that looked like a very promising possibility. Rather than have fixed buckets, I thought I might have buckets that live in a heap. Adding a value means either adding a new bucket if the heap isn’t full, or else incrementing an existing bucket if it is. The buckets lower in the heap could be bigger, and I might even have some logic to merge buckets if one gets pushed past the middle of the heap.

All that said….

This paper looks like it addresses that problem directly. Now I just need to understand it enough to convert it into code 😉

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.4993

Do tell us whenever you understand the solution and implement it.

The problem statement is a nice once .