Optimization-driven sampling for analyzing big data streams
Real-time processing over data streams has become a popular trend in data analysis. As more business applications rely on real-time data analysis to make decisions, traditional batch processing has become insufficient. As demand for streaming analysis grows, analyzing big data streams quickly and accurately remains a major challenge.
Sampling is a good approach to providing quick analysis over big data streams. Analyzing the sample yields an approximation of the exact answer obtained by analyzing the original data. By avoiding processing the entire stream, the processing time can be greatly reduced. However, sampling over data streams raises the following challenges: (1) given a limited budget size, how do we build a sample so that the approximation computed over it is accurate? And (2) recent data are usually more valuable to some streaming analysis applications; e.g., a real-time intrusion detection system focuses on recent event logs. How to build a sample that weighs recent data more heavily and expires old data from the sample is another challenge.
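As a point of reference for the fixed-budget challenge above, the classic uniform baseline is reservoir sampling (Algorithm R), which maintains a size-k uniform sample over a stream of unknown length. This is a minimal illustrative sketch, not the ODS method proposed here; the function name `reservoir_sample` is our own.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Algorithm R: maintain a uniform random sample of size k
    from a stream of unknown length, using O(k) memory."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep item i with probability k / (i + 1) by replacing
            # a uniformly chosen slot in the reservoir.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# Example: a budget of 100 items drawn from a stream of 10,000.
sample = reservoir_sample(range(10_000), k=100)
```

Note that the resulting sample is uniform over the whole stream: it gives no extra weight to recent items, which is exactly the second challenge that recency-aware (e.g., sliding-window) schemes address.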
In this research, we propose an optimization-driven sampling (ODS) framework as a solution that aims at (1) providing more accurate analysis over streaming data and (2) eliminating older data using the sliding-window model. Based on how the sample will be analyzed, we formulate the sampling process as an optimization problem and derive an optimal sampling algorithm for constructing and maintaining the sample over the data stream. We study ODS under different sample usages over data streams and discuss how to construct an optimal sample in those settings. We also study lower bounds on the accuracy of an ODS sample collected from data streams. Experiments and evaluations show that our optimal sample yields better analysis estimates than existing streaming sampling methods.