Stratified Random Sampling from Streaming and Stored Data

dc.contributor.author Nguyen, Trong
dc.contributor.author Shih, Ming-Hung
dc.contributor.author Srivastava, Divesh
dc.contributor.author Tirthapura, Srikanta
dc.contributor.author Xu, Bojian
dc.contributor.department Computer Science
dc.contributor.department Electrical and Computer Engineering
dc.date 2019-02-14T14:48:28.000
dc.date.accessioned 2020-06-30T02:01:45Z
dc.date.available 2020-06-30T02:01:45Z
dc.date.copyright Tue Jan 01 00:00:00 UTC 2019
dc.date.embargo 2019-02-01
dc.date.issued 2019-01-01
dc.description.abstract <p>Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must, in the worst case, have a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA yields a variance that is typically close to that of an optimal offline algorithm, which is given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.</p>
dc.description.comments <p>This proceeding is published as Nguyen, Trong Duc, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, and Bojian Xu. "Stratified Random Sampling from Streaming and Stored Data." <em>Proceedings of the 22nd International Conference on Extending Database Technology (EDBT)</em>. Lisbon, Portugal: EDBT/ICDT 2019 Joint Conference. March 26-29, 2019. Posted with permission.</p>
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/ece_conf/65/
dc.identifier.articleid 1065
dc.identifier.contextkey 13734903
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath ece_conf/65
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/20889
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/ece_conf/65/2019_Tirthapura_StratifiedRandom.pdf|||Sat Jan 15 01:23:53 UTC 2022
dc.subject.disciplines Electrical and Computer Engineering
dc.subject.disciplines Software Engineering
dc.title Stratified Random Sampling from Streaming and Stored Data
dc.type article
dc.type.genre conference
dspace.entity.type Publication
relation.isAuthorOfPublication b0235db2-0a72-4dd1-8d5f-08e5e2e2bf7d
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
File
Original bundle
Name: 2019_Tirthapura_StratifiedRandom.pdf
Size: 2.04 MB
Format: Adobe Portable Document Format