Stratified Random Sampling from Streaming and Stored Data
Nguyen, Trong; Shih, Ming-Hung; Srivastava, Divesh; Tirthapura, Srikanta; Xu, Bojian
dc.contributor.department Computer Science
dc.contributor.department Electrical and Computer Engineering
2019-02-14T14:48:28.000 2020-06-30T02:01:45Z 2020-06-30T02:01:45Z Tue Jan 01 00:00:00 UTC 2019 2019-02-01 2019-01-01
dc.description.abstract <p>Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must, in the worst case, have a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Experiments on real and synthetic data show that S-VOILA yields a variance that is typically close to that of an optimal offline algorithm, which is given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.</p>
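The abstract's remark that Neyman allocation is optimal only when every stratum is abundant can be illustrated with a small sketch. This is the textbook Neyman rule (sample size proportional to stratum size times stratum standard deviation), not the paper's VOILA or S-VOILA algorithm. The function names `neyman_allocation` and `oversubscribed_strata`, and the example numbers, are hypothetical and chosen only for illustration.

```python
def neyman_allocation(sizes, stddevs, budget):
    """Textbook Neyman allocation: give stratum i a share of the sample
    budget proportional to n_i * sigma_i. Returns fractional sample sizes
    and does NOT cap them at the stratum sizes."""
    weights = [n * sd for n, sd in zip(sizes, stddevs)]
    total = sum(weights)
    if total == 0:
        # All strata have zero spread; fall back to an even split (an assumption).
        return [budget / len(sizes)] * len(sizes)
    return [budget * w / total for w in weights]


def oversubscribed_strata(sizes, allocation):
    """Indices of strata asked to supply more samples than they contain."""
    return [i for i, (n, s) in enumerate(zip(sizes, allocation)) if s > n]


# Hypothetical data: one large, low-variance stratum and one tiny,
# high-variance stratum, with a total budget of 60 samples.
sizes = [1000, 10]
stddevs = [1.0, 500.0]
allocation = neyman_allocation(sizes, stddevs, budget=60)
bounded = oversubscribed_strata(sizes, allocation)
print(allocation, bounded)
```

In this made-up input the second stratum holds only 10 elements yet is assigned 50 samples, so the plain Neyman rule cannot be followed as stated. That is the situation in which an allocation that accounts for bounded strata is needed.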
dc.description.comments <p>This proceeding is published as Nguyen, Trong Duc, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, and Bojian Xu. "Stratified Random Sampling from Streaming and Stored Data." <em>Proceedings of the 22nd International Conference on Extending Database Technology (EDBT)</em>. Lisbon, Portugal: EDBT/ICDT 2019 Joint Conference. March 26-29, 2019. Posted with permission.</p>
dc.format.mimetype application/pdf
dc.identifier archive/
dc.identifier.articleid 1065
dc.identifier.contextkey 13734903
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath ece_conf/65
dc.language.iso en
dc.source.bitstream archive/|||Sat Jan 15 01:23:53 UTC 2022
dc.subject.disciplines Electrical and Computer Engineering
dc.subject.disciplines Software Engineering
dc.title Stratified Random Sampling from Streaming and Stored Data
dc.type article
dc.type.genre conference
dspace.entity.type Publication
relation.isAuthorOfPublication b0235db2-0a72-4dd1-8d5f-08e5e2e2bf7d
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff