Stratified Random Sampling from Streaming and Stored Data

Thumbnail Image
Date
2019-01-01
Authors
Nguyen, Trong
Shih, Ming-Hung
Srivastava, Divesh
Tirthapura, Srikanta
Xu, Bojian
Major Professor
Advisor
Committee Member
Journal Title
Journal ISSN
Volume Title
Publisher
Authors
Person
Research Projects
Organizational Units
Organizational Unit
Organizational Unit
Journal Issue
Is Version Of
Versions
Series
Department
Computer ScienceElectrical and Computer Engineering
Abstract

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams, and make the following contributions. We present a lower bound that shows that any streaming algorithm for SRS must have (in the worst case) a variance that is Ω(r ) factor away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to an optimal offline algorithm, which was given the entire input beforehand. We also present a variance-optimal offline algorithm VOILA for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e. has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.

Comments

This proceeding is published as Nguyen, Trong Duc, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, and Bojian Xu. "Stratified Random Sampling from Streaming and Stored Data." Proceedings of the 22nd International Conference on Extending Database Technology (EDBT). Lisbon, Portugal: EDBT/ICDT 2019 Joint Conference. March 26-29, 2019. Posted with permission.

Description
Keywords
Citation
DOI
Source
Copyright
Tue Jan 01 00:00:00 UTC 2019