EAGER: Adaptive Sampling of Massive Graph Streams
The electronic services that underpin daily life---such as the internet, social networks, banking systems, and online retailers---continuously generate vast amounts of data concerning their operation and usage. A key analytical task for service providers is to understand the commonality between different data points, e.g., to provide friend recommendations based on patterns of social interaction, or to understand how a set of related transactions may collectively signify bank fraud. The volume at which such data is generated presents challenges for its storage and processing. On the other hand, sampling and other data reduction techniques hinder the ability to discern patterns for forensics and other retrospective analysis. While previous work has proposed solutions in specific instances, there is currently little general understanding of how to optimize the choice of sampling algorithm to best match the requirements of data analysis under operational resource constraints. Furthermore, while existing solutions have been tuned by their expert designers, in the absence of a general framework it is challenging for researchers in application domains to adapt these solutions to their own problems. To fill these gaps, this project establishes a new general framework for optimal sampling of streaming graph data. It will use this framework to implement a mapping between problem specifications and sampling algorithms in order to streamline the generation of working code for applications. The project will also develop educational materials on data stream sampling and conduct outreach to researchers in application domains to promote the applicability and usage of the results developed.

This work will develop a transformative framework for graph stream sampling and its applications that addresses outstanding gaps in knowledge and practice.
First, while current methods frequently focus on computation of global graph properties, applications often require a representative sample for rapid or retrospective analysis without having to reprocess the entire stream, even if it remains available. Second, current methods are typically optimized for specific subgraph targets and problems, without a general ability to optimize for analysis accuracy and resource requirements. Third, it is challenging for domain researchers to generalize these results beyond the original problem because of the expert tuning required for their creation. In response, this project will develop a framework for graph data stream sampling that is applicable to a wide class of application problems and settings, allows a tunable trade-off between accuracy and space and time computational resources, and is implementable as a mapping between problem specification and working application code. Specifically, the work will use a novel form of weighted reservoir priority sampling to adapt the sampling and retention of edges to their evolving role in the sampled graph and with respect to analysis goals. The framework will enable time-decayed clustered subgraph sampling, as well as multigraph sampling to support estimation from streams of repeated edges. Analysis and evaluation in a representative variety of applications will be used to optimize computational strategies, allowing a flexible trade-off between accuracy and resource consumption. These results will transform the usage of sampling methods by realizing the framework as a mapping between problem specification and working application code, enabling domain researchers to rapidly incorporate the results of this work within their own applications.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
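The abstract does not specify the project's exact sampling scheme, but the weighted reservoir priority sampling it mentions can be illustrated with a standard textbook variant: each streamed edge with weight w receives a random key u^(1/w) for u drawn uniformly from (0, 1), and the k edges with the largest keys are retained. The function and edge-stream names below are illustrative, not from the project.

```python
import heapq
import random

def priority_sample_stream(edges, k, weight_fn):
    """Maintain a fixed-size weighted sample of a (possibly unbounded) edge stream.

    Each edge with weight w gets key u**(1/w) for u ~ Uniform(0, 1); the k edges
    with the largest keys form a weighted sample without replacement. Memory is
    O(k) regardless of stream length, so the sample is available at any time
    without reprocessing the stream.
    """
    reservoir = []  # min-heap of (key, edge); the smallest key is evicted first
    for edge in edges:
        w = weight_fn(edge)
        if w <= 0:
            continue  # non-positive weights can never be sampled
        key = random.random() ** (1.0 / w)
        if len(reservoir) < k:
            heapq.heappush(reservoir, (key, edge))
        elif key > reservoir[0][0]:
            # New edge outranks the current minimum: evict and insert in one step.
            heapq.heapreplace(reservoir, (key, edge))
    return [edge for _, edge in reservoir]

# Illustrative stream of (u, v, count) edges; repeated-edge counts act as weights,
# loosely mirroring the multigraph setting described above.
stream = [(u, v, 1 + (u + v) % 3) for u in range(10) for v in range(u)]
sample = priority_sample_stream(stream, 5, weight_fn=lambda e: e[2])
```

In this sketch the heap makes each stream item O(log k) to process; a time-decayed variant, as targeted by the project, would additionally discount keys of older edges so that retention tracks their evolving role in the sampled graph.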