Stratified Random Sampling from Streaming and Stored Data
dc.contributor.author | Nguyen, Trong | |
dc.contributor.author | Shih, Ming-Hung | |
dc.contributor.author | Srivastava, Divesh | |
dc.contributor.author | Tirthapura, Srikanta | |
dc.contributor.author | Xu, Bojian | |
dc.contributor.department | Department of Computer Science | |
dc.contributor.department | Department of Electrical and Computer Engineering | |
dc.date | 2019-02-14T14:48:28.000 | |
dc.date.accessioned | 2020-06-30T02:01:45Z | |
dc.date.available | 2020-06-30T02:01:45Z | |
dc.date.copyright | Tue Jan 01 00:00:00 UTC 2019 | |
dc.date.embargo | 2019-02-01 | |
dc.date.issued | 2019-01-01 | |
dc.description.abstract | Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and make the following contributions. We present a lower bound showing that any streaming algorithm for SRS must, in the worst case, have a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for SRS that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA yields a variance that is typically close to that of an optimal offline algorithm, which is given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e., has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data. | |
dc.description.comments | This proceeding is published as Nguyen, Trong Duc, Ming-Hung Shih, Divesh Srivastava, Srikanta Tirthapura, and Bojian Xu. "Stratified Random Sampling from Streaming and Stored Data." Proceedings of the 22nd International Conference on Extending Database Technology (EDBT). Lisbon, Portugal: EDBT/ICDT 2019 Joint Conference. March 26-29, 2019. Posted with permission. | |
dc.format.mimetype | application/pdf | |
dc.identifier | archive/lib.dr.iastate.edu/ece_conf/65/ | |
dc.identifier.articleid | 1065 | |
dc.identifier.contextkey | 13734903 | |
dc.identifier.s3bucket | isulib-bepress-aws-west | |
dc.identifier.submissionpath | ece_conf/65 | |
dc.identifier.uri | https://dr.lib.iastate.edu/handle/20.500.12876/20889 | |
dc.language.iso | en | |
dc.source.bitstream | archive/lib.dr.iastate.edu/ece_conf/65/2019_Tirthapura_StratifiedRandom.pdf|||Sat Jan 15 01:23:53 UTC 2022 | |
dc.subject.disciplines | Electrical and Computer Engineering | |
dc.subject.disciplines | Software Engineering | |
dc.title | Stratified Random Sampling from Streaming and Stored Data | |
dc.type | article | |
dc.type.genre | conference | |
dspace.entity.type | Publication | |
relation.isAuthorOfPublication | b0235db2-0a72-4dd1-8d5f-08e5e2e2bf7d | |
relation.isOrgUnitOfPublication | f7be4eb9-d1d0-4081-859b-b15cee251456 | |
relation.isOrgUnitOfPublication | a75a044c-d11e-44cd-af4f-dab1d83339ff |
File
Original bundle
- Name: 2019_Tirthapura_StratifiedRandom.pdf
- Size: 2.04 MB
- Format: Adobe Portable Document Format