Scalable, and Accurate Weight Summarization of Unaggregated Data Sets Composable

Many data sets occur as unaggregated data sets , where multiple data points are associated with each key. In the aggregate view of the data, the weight of a key is the sum of the weights of data points associated with the key. Examples are measurements of IP packet header streams, distributed data streams produced by events registered by sensor networks, and Web page or multimedia requests to context distribution servers. We aim to combine sampling and aggregation to provide accurate and efficient summaries of the aggregate view. However, data points are scattered in time or across multiple servers and hence aggregation is subject to resource constraints on the size of summaries that can be stored or transmitted. We develop a summarization framework for unaggregated data where summarization is a scalable and composable operator, and as such, can be tailored to meet resource constraints. Our summaries support unbiased estimates of the weight of subpopulations of keys specified using arbitrary selection predicates. While we prove that under such scenarios there is no variance optimal scheme, our estimators have the desirable properties that the variance is progressively closer to the minimum possible when applied to a "more" aggregated data set. An extensive evaluation using synthetic and real data sets shows that our summarization framework outperforms all existing schemes for this fundamental problem, even for the special and well-studied case of data streams.

Composable, Scalable, and Accurate Weight Summarization of Unaggregated Data Sets Academic Article