Random sampling in Apache Hive

Parvathaneni, Sai Sree

Data generated by humans and machines is growing at a rapid pace. Analyzing this data reveals trends, patterns, and useful insights that help organizations make important decisions. Traditional database systems have stored and analyzed large amounts of data for many decades, but in such systems handling and analyzing ever-growing data demands substantial resources and time. Reading and writing large data sets from a single disk is significantly slow, and storing data across multiple disks and then combining it for analysis on a single CPU is also impractical for huge amounts of data. The problem of storing and analyzing large amounts of data is addressed by Apache Hadoop. Apache Hadoop is a collection of open-source big data software that stores large amounts of data efficiently by dividing the data into small blocks and replicating those blocks to tolerate system failures; the data is then analyzed using parallel computation. Hive is data warehousing software that runs on top of the Hadoop Distributed File System. It provides a HiveQL interface for executing queries, which are automatically converted into MapReduce, Tez, or Spark jobs. For aggregate queries such as AVG, SUM, and COUNT, and for analyzing trends in data, sampling gives a good approximation of the overall data, and analyzing a sample of the population can be done with a limited amount of resources. There are different techniques for drawing a sample from a population, and the choice of sampling technique depends on the type of analysis needed to achieve the goal. In this thesis, we investigate three techniques for performing random sampling in Hive: simple random sampling using sorting, Bernoulli sampling, and our algorithm, random sampling using bucketing. Simple random sampling using sorting and Bernoulli sampling both scan the whole table to draw a sample in Hive, which slows down performance when the data is huge.
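The two baseline techniques can be sketched in HiveQL. This is an illustrative sketch only; the table name `events` and the sample sizes are hypothetical and are not taken from the thesis experiments.

```sql
-- Simple random sampling using sorting: assign every row a random key,
-- sort the ENTIRE table by it, and keep the first n rows.
-- (Hypothetical table name; full table scan plus a global sort.)
SELECT *
FROM events
ORDER BY rand()
LIMIT 1000;

-- Bernoulli sampling: include each row independently with probability p
-- (here p = 0.01). No sort is needed, but every row is still read.
SELECT *
FROM events
WHERE rand() <= 0.01;
```

Both queries touch every row of the table, which is why their cost grows with the full data size rather than the sample size.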
To avoid a whole-table scan while performing simple random sampling, our algorithm uses bucketing in the Hive architecture to organize the data stored on the Hadoop Distributed File System. Bucketing divides the data into a specified number of small blocks (buckets) based on a specified column of the table, and it allows any bucket of the required size to be selected without scanning the whole table. Limiting the data scanned and sorting fewer elements decreases the time taken to perform simple random sampling using bucketing. Our experiments show that random sampling using bucketing performs much faster than random sampling using sorting and Bernoulli sampling when the data sizes or sample sizes are large.
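The bucketing approach can be sketched with Hive's `CLUSTERED BY` and bucket-based `TABLESAMPLE` syntax. The table and column names below (`events_bucketed`, `id`, `payload`) and the bucket count are hypothetical, chosen only to illustrate the idea, assuming a numeric column suitable for hashing.

```sql
-- Declare a bucketed table: Hive hashes rows on the chosen column
-- into a fixed number of files on HDFS.
CREATE TABLE events_bucketed (id BIGINT, payload STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Populate the buckets from an existing (hypothetical) source table.
INSERT OVERWRITE TABLE events_bucketed
SELECT id, payload FROM events;

-- Read a single bucket -- roughly 1/32 of the table -- without a full
-- table scan, then sort only those rows to draw the final sample.
SELECT *
FROM events_bucketed TABLESAMPLE (BUCKET 3 OUT OF 32 ON id)
ORDER BY rand()
LIMIT 1000;
```

Because the `TABLESAMPLE` clause reads only the selected bucket's files, the scan and the subsequent sort operate on a small fraction of the data, which is where the speedup over the sorting and Bernoulli baselines comes from.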

Hive, Random Sampling