Method for summarizing data in unaggregated data streams
Patent

Overview

Overview

abstract

A method for producing a summary A of data points in an unaggregated data stream wherein the data points are in the form of weighted keys (a, w) where a is a key and w is a weight, and the summary is a sample of k keys a with adjusted weights wa. A first reservoir L includes keys having adjusted weights which are additions of weights of individual data points of included keys and a second reservoir T includes keys having adjusted weights which are each equal to a threshold value whose value is adjusted based upon tests of new data points arriving in the data stream. The summary combines the keys and adjusted weights of the first reservoir L with the keys and adjusted weights of the second reservoir T to form the sample representing the data stream upon which further analysis may be performed. The method proceeds by first merging new data points in the stream into the reservoir L until the reservoir contains k different keys and thereafter applying a series of tests to new arriving data points to determine what keys and weights are to be added to or removed the reservoirs L and T to provide a summary with a variance that approaches the minimum possible for aggregated data sets. The method is composable, can be applied to high speed data streams such as those found on the Internet, and can be implemented efficiently.