What are some Hive Optimization Techniques?

In this tutorial, we will go through Hive optimization techniques that can yield significant time and cost savings, with the benefit growing in proportion to the amount of data.

What are some Hive Optimization Techniques?

Apache Hive is a data warehouse system built on top of Hadoop that lets users run SQL-like queries, written in its HiveQL language, quickly and efficiently. It also gives users access to query and analytical capabilities beyond those found in traditional SQL.

With Apache Hive, we can use HiveQL or classic MapReduce programs, depending on our needs and preferences. Hive is especially well suited to analyzing huge datasets (petabytes in size) and comes with a variety of storage options.

While Hadoop/Hive can process practically any quantity of data, optimization can lead to significant time and cost savings that grow with the amount of data. Numerous optimizations can be applied in Hive; the most important ones are described below.

1) Partitioning and Bucketing
The organization of data is critical to query performance; whether you employ a star schema or a de-normalized (often preferred) design depends on your use case. Partitioning and bucketing both help optimize query performance.

By default, Hive reads the entire table directory and then applies the query filters to it. Because all of the data must be read, this is a slow and costly process, even though users typically filter on particular column values. Partitioning addresses this: it separates the data and stores the rows for each distinct value of the partitioning column(s) in their corresponding partitions, so a query only reads the partitions it actually needs, which reduces the time taken to return a result. To use partitioning effectively, users must first understand the domain of the data they are analyzing.

Sometimes, even after partitioning on a particular field or fields, the partitioned files remain far larger than expected and still need to be divided further. For this, Hive offers bucketing, which breaks large data sets down into smaller, more manageable chunks (buckets) and lets the user control the size of those buckets.
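As a rough illustration (the table and column names below are hypothetical, not from this article), a partitioned and bucketed table in HiveQL might be declared like this:

-- Hypothetical example: orders partitioned by date and bucketed by customer_id
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  amount       DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- A query that filters on the partition column only reads the matching partitions
SELECT SUM(amount) FROM orders WHERE order_date = '2023-01-15';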

2) Using appropriate file formats
If we utilize an appropriate file format based on the data, it will significantly improve query speed. ORC (Optimized Row Columnar) is a good choice for boosting query performance because it stores data more efficiently than other file formats; ORC can reduce the size of the original data by up to 75%, which in turn speeds up data processing. ORC performs better than the Text, Sequence, and RC file formats. An ORC file is essentially a collection of rows organized into groups called stripes, together with a file footer. As a result, the ORC format improves performance whenever Hive processes the data.
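As a minimal sketch (the table name and columns are made up for illustration), choosing ORC is usually just a matter of the STORED AS clause:

-- Hypothetical example: store the table in ORC format with Zlib compression
CREATE TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');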

3) Compression
Internally, compression techniques considerably reduce the volume of intermediate data, and hence the amount of data transferred between mappers and reducers, most of which travels over the network. The mapper and reducer outputs can each be compressed separately. Remember that gzip-compressed files cannot be split, so they must be used carefully: a compressed file should not exceed a few hundred megabytes, otherwise it can lead to unbalanced tasks. Snappy, LZO, BZip2, and other compression codecs are possible alternatives; a configuration sketch follows the list of scenarios below.

The following are some of the most common scenarios in which I/O operations are performed and compression can help reduce cost:

• Reading data from a local DFS directory
• Reading data from a non-local DFS directory
• Moving data from the reducers to the next-level mappers/reducers
• Writing the final output back to DFS
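As a hedged sketch of the commonly documented settings (verify the exact property names and codec classes against your Hadoop/Hive version), intermediate and final output compression can be enabled like this:

-- Compress intermediate map output shuffled between mappers and reducers
SET hive.exec.compress.intermediate = true;
SET mapreduce.map.output.compress = true;
SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final query output written back to DFS
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;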


4) De-normalizing data
Normalization is a standard procedure for modeling data tables according to certain rules in order to deal with redundancy and anomalies. In other words, normalizing your data sets produces numerous relational tables that must be joined at run time to produce the desired results. Joins are expensive and difficult to tune, and they are a common cause of performance problems. It is therefore best to avoid highly normalized table designs that require join queries to derive the desired metrics.
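As an illustrative sketch (the table names here are hypothetical), a frequently joined dimension can be folded into a denormalized table once, so that downstream queries avoid the repeated join:

-- Hypothetical example: pre-join orders with customers into one wide table
CREATE TABLE orders_denormalized
STORED AS ORC
AS
SELECT o.order_id,
       o.amount,
       c.customer_name,
       c.customer_region
FROM   orders o
JOIN   customers c ON o.customer_id = c.customer_id;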

5) Parallel execution
Hadoop can run MapReduce jobs in parallel, and many Hive queries automatically take advantage of this parallelism. However, a single complex Hive query is often translated into several MapReduce jobs that, by default, run sequentially. Some of these MapReduce stages are not interdependent and can be run in parallel, taking advantage of spare capacity on the cluster to improve utilization and reduce overall query execution time. Changing this behavior in Hive is as simple as toggling a single flag - SET hive.exec.parallel=true.
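A minimal sketch of enabling it (the thread-number value shown is the commonly cited default; treat the exact number as an assumption for your cluster):

-- Allow independent MapReduce stages of a query to run concurrently
SET hive.exec.parallel = true;
-- Optional: cap how many stages may run at once (default is typically 8)
SET hive.exec.parallel.thread.number = 8;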

6) Vectorization
Hive uses vectorization to process a batch of rows at once rather than one row at a time. Each batch contains a column vector, which is usually an array of a primitive type. Operations are performed on batches of 1,024 rows at a time, and working on an entire column vector improves instruction pipeline and cache utilization. Vectorization dramatically reduces query execution time, and it is simple to set up with just two parameters, shown below.
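The article does not reproduce the two parameters at this point; the flags usually cited in Hive's documentation for vectorized execution are the following (treat them as the assumed intent):

-- Enable vectorized execution for the map side
SET hive.vectorized.execution.enabled = true;
-- Enable vectorized execution for the reduce side as well
SET hive.vectorized.execution.reduce.enabled = true;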


7) Optimizing Join
Map joins are efficient when the table on one side of the join is small enough to fit in memory. Hive offers a parameter, hive.auto.convert.join, which, when set to "true", tells Hive to convert eligible joins to map joins automatically. Make sure this auto-convert setting is turned on in the Hive environment when you want to rely on it.
To make use of bucketing in the join procedure, always make sure the bucketing flag is set - SET hive.enforce.bucketing=true - before writing data to the bucketed table. The SET hive.optimize.bucketmapjoin=true parameter instructs Hive to perform a bucket-level join during the map-stage join. Because bucketing guarantees that a key is present in a specific bucket, it also minimizes the number of scan cycles needed to locate a particular key.
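Putting the flags from this section together, a hedged configuration sketch looks like this:

-- Let Hive convert eligible joins to map joins automatically
SET hive.auto.convert.join = true;
-- Enforce bucketing when writing to bucketed tables (needed on older Hive versions)
SET hive.enforce.bucketing = true;
-- Perform bucket-level joins during the map stage when both tables are bucketed on the join key
SET hive.optimize.bucketmapjoin = true;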

8) Cost-based Optimization
Cost-Based Optimization (CBO), a more recent addition to Hive, performs further optimizations based on query cost, which can influence decisions such as how to order joins, which type of join to perform, the degree of parallelism, and so on. CBO examines the query, turns it into an operator tree, assigns a cost to each operator, and then chooses the most cost-effective execution plan. To benefit from it, you should generate column statistics for your tables.

Set the following options at the start of your query to use CBO:
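The original article does not reproduce the options at this point; the flags commonly documented for enabling CBO are the following (verify them against your Hive version):

SET hive.cbo.enable = true;
SET hive.compute.query.using.stats = true;
SET hive.stats.fetch.column.stats = true;
SET hive.stats.fetch.partition.stats = true;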


Then, use Hive's ANALYZE TABLE command to collect statistics on the tables for which we wish to use CBO.
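For example (the table and column names are hypothetical), statistics can be gathered like this:

-- Gather table- and column-level statistics so CBO can cost the plan
ANALYZE TABLE orders COMPUTE STATISTICS;
ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id, amount;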
