Priority sampling for estimation of arbitrary subset sums (Academic Article)

abstract

  • From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is later determined. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes. Priority sampling is simple to define and implement: we consider a stream of items i = 0, …, n − 1 with weights w_i. For each item i, we generate a random number α_i ∈ (0, 1] and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k + 1)th highest priority. Each sampled item i in S gets a weight estimate ŵ_i = max{w_i, τ}, while nonsampled items get weight estimate ŵ_i = 0. Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset. Also, we can estimate the variance of the estimates and find, surprisingly, that the covariance between estimates ŵ_i and ŵ_j of different weights is zero. Finally, we conjecture an extremely strong near-optimality; namely, that for any weight sequence, there exists no specialized scheme for sampling k items with unbiased weight estimators that gets a smaller variance sum than priority sampling with k + 1 items. Szegedy settled this conjecture at STOC'06.
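
    As a minimal illustration of the scheme described in the abstract, here is a Python sketch (our own; the function name priority_sample and the offline, sort-based formulation are illustrative assumptions, since a true streaming implementation would instead maintain a min-heap of the k + 1 highest priorities seen so far):

    ```python
    import random

    def priority_sample(weights, k):
        """Illustrative sketch of priority sampling: returns {item index:
        weight estimate} for the k sampled items; unsampled items have
        the implicit estimate 0."""
        # Each item i gets priority q_i = w_i / alpha_i, with alpha_i
        # uniform on (0, 1]; 1 - random.random() avoids alpha_i = 0.
        priorities = [(w / (1.0 - random.random()), i, w)
                      for i, w in enumerate(weights)]
        priorities.sort(reverse=True)  # highest priority first

        # tau is the (k + 1)th highest priority (0 if there are at most k items).
        tau = priorities[k][0] if len(priorities) > k else 0.0

        # Each sampled item i in S gets the unbiased estimate max{w_i, tau}.
        return {i: max(w, tau) for _, i, w in priorities[:k]}

    # Estimating an arbitrary subset sum from the sample: sum the estimates
    # of the subset's members, e.g. sum(est.get(i, 0.0) for i in subset).
    ```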

published proceedings

  • Journal of the ACM

altmetric score

  • 7.58

author list (cited authors)

  • Duffield, N., Lund, C., & Thorup, M.

citation count

  • 67

complete list of authors

  • Duffield, Nick; Lund, Carsten; Thorup, Mikkel

publication date

  • December 2007