Approximate query processing in a data warehouse using random sampling

Nguyen, Trong

dc.contributor.advisor	Srikanta Tirthapura
dc.contributor.author	Nguyen, Trong
dc.contributor.department	Department of Electrical and Computer Engineering
dc.date	2020-09-23T19:12:56.000
dc.date.accessioned	2021-02-25T21:35:42Z
dc.date.available	2021-02-25T21:35:42Z
dc.date.copyright	Sat Aug 01 00:00:00 UTC 2020
dc.date.embargo	2020-09-10
dc.date.issued	2020-01-01
dc.description.abstract	<p>Data analysis consumes a large volume of data on a routine basis. With the rapid increase in both the volume of data and the complexity of analytic tasks, data processing becomes more complicated and expensive. Cost efficiency is a key factor in the design and deployment of data warehouse systems. Among the different methods for making big data processing more efficient, approximate query processing is a well-known approach to handling massive data, in which a small sample is used to answer the query. For many applications, a small error is justified by the savings in resources consumed to answer the query, as well as by the reduced latency.</p> <p>We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling that is optimized for population aggregate queries. Next, as the query evolves, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries. 
Thus, aggregates are computed both from sampled input and from approximate input.</p> <p>We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and calculates the confidence interval for the approximate results it produces. The key insight of the algorithm is to enrich the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without having to perform any global rewrites, i.e., no extra query or table is added to the pipeline. Compared to the bootstrap method, the approach described in this paper provides the confidence interval while computing aggregation estimates only once and avoids the need to maintain intermediary aggregation distributions.</p> <p>Our empirical study on public and private datasets shows that our sampling algorithm for computing an optimal sample for population aggregate queries can have significantly (1.4 to 50.0 times) smaller variance than the Neyman algorithm. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state of the art on sample quality and estimation accuracy. The optimal sample yields relative errors that are 5x smaller than those of competing approaches, under the same budget. The experiments for approximate pipeline processing show the high accuracy of the computed estimates, with an average error as low as 2%, using only a 1% sample. They also show the usefulness of the confidence interval. At a confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI bounds 70.49% to 95.15% of the time.</p>
dc.format.mimetype	application/pdf
dc.identifier	archive/lib.dr.iastate.edu/etd/18195/
dc.identifier.articleid	9202
dc.identifier.contextkey	19236770
dc.identifier.doi	https://doi.org/10.31274/etd-20200902-114
dc.identifier.s3bucket	isulib-bepress-aws-west
dc.identifier.submissionpath	etd/18195
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/94347
dc.language.iso	en
dc.source.bitstream	archive/lib.dr.iastate.edu/etd/18195/Nguyen_iastate_0097E_18644.pdf\|\|\|Fri Jan 14 21:38:10 UTC 2022
dc.subject.keywords	Approximate query processing
dc.subject.keywords	Data warehouse
dc.subject.keywords	Query pipeline
dc.subject.keywords	Sampling
dc.title	Approximate query processing in a data warehouse using random sampling
dc.type	dissertation
dc.type.genre	dissertation
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.discipline	Computer Engineering (Software Systems)
thesis.degree.level	dissertation
thesis.degree.name	Doctor of Philosophy

File

Original bundle

Name: Nguyen_iastate_0097E_18644.pdf
Size: 3.61 MB
Format: Adobe Portable Document Format

Collections

Theses and Dissertations