What are some Hive Optimization Techniques?

In this tutorial, we will go through Hive optimization techniques that can result in significant time and cost reductions proportional to the amount of data.

Apache Hive is a data warehouse system built on top of Hadoop that lets users run SQL-like queries, written in its HiveQL language, rapidly and efficiently. It also gives users access to query and analytical capabilities not found in typical SQL systems.

We can use HiveQL or classic MapReduce jobs with Apache Hive, depending on our needs and preferences. Hive is especially well suited to analyzing huge datasets (petabytes) and comes with a variety of storage options.

While Hadoop/Hive can process practically any quantity of data, optimizations can result in significant time and cost reductions proportional to the amount of data. Numerous such optimizations can be implemented in Hive.


1) Partitioning and Bucketing
The organization of data is critical to query performance. Whether you employ a star schema or a de-normalized layout (often preferred in Hive) is a design choice; either way, partitioning and bucketing help optimize query performance.
By default, Hive reads a table's entire directory and only then applies the query's filters. Because all of the data must be read, this is time-consuming and costly, even though users typically only need rows matching certain column values. To use Hive partitioning effectively, users must first understand the domain of the data they are analyzing. Partitioning stores the rows for each distinct value of the partition column(s) in a corresponding partition directory. As a result, only the required partitions of the table are read when a query filters on those values, and the query returns its result faster.

Sometimes, even after partitioning on a specific field or fields, the partitioned files do not match the expected size and remain enormous, so the partitions need to be subdivided further. Hive offers the bucketing approach to address this. Bucketing breaks large data sets down into smaller, more manageable chunks (buckets), and lets the user control the number, and therefore the size, of those buckets.
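As an illustration, a table can be declared with both partitions and buckets at creation time. The table and column names below are hypothetical; the syntax is standard HiveQL:

```sql
-- Partition sales data by country, and bucket each partition by customer_id
-- so that rows with the same key land in the same bucket file.
CREATE TABLE sales (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;

-- A query that filters on the partition column reads only that partition:
SELECT SUM(amount) FROM sales WHERE country = 'US';
```

As a rule of thumb, partition on low-cardinality columns that appear in filters, and bucket on high-cardinality columns such as join keys.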

2) Using appropriate file formats
Using an appropriate file format for the data significantly improves query speed. ORC (Optimized Row Columnar) is a good choice for boosting query performance, as it stores data more efficiently than other file formats. More precisely, ORC can reduce the size of the original data by up to 75%, which in turn improves processing speed, and it performs better than the Text, Sequence, and RC file formats. Internally, ORC organizes rows of data into groups called stripes, along with a file footer holding metadata. As a result, we can say that when Hive processes data, the ORC format enhances performance.
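For example, a table can be stored as ORC at creation time (the table name and ZLIB codec choice here are illustrative):

```sql
-- Store the table in ORC format, with ZLIB compression applied to the stripes.
CREATE TABLE logs_orc (
  ts      TIMESTAMP,
  level   STRING,
  message STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Existing text-format data can be converted by inserting it into the ORC table:
-- INSERT INTO TABLE logs_orc SELECT * FROM logs_text;
```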

3) Compression
Compression techniques considerably reduce the volume of intermediate data, which limits the amount of data transferred between mappers and reducers, usually over the network. The mapper and reducer outputs can each be compressed separately. Remember that gzip-compressed files cannot be split, so use them wisely: a compressed file should not exceed a few hundred megabytes, or it may result in unbalanced tasks. Snappy, LZO, BZip2, and other compression codecs are possible alternatives.

The following are some of the most common scenarios in which I/O operations are performed and compression can save time and money:

• Data is being read from a local DFS directory.
• Data from a non-local DFS directory is being read.
• Data is moved from the reducers to the next level of mappers/reducers.
• The final output is written back to DFS.
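One common way to enable compression of both intermediate data and final output in a Hive session is shown below; Snappy is picked here as an example codec:

```sql
-- Compress the map output exchanged between mappers and reducers.
SET hive.exec.compress.intermediate = true;
SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final query output written back to DFS.
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;
```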


4) De-normalizing data
Normalization is a standard procedure for modeling data tables according to certain rules in order to deal with redundancy and anomalies. Put another way, normalizing your data sets produces numerous relational tables that must be joined at run time to produce the desired results. Joins are expensive and difficult to execute, and they are a common cause of performance problems. It is therefore best to avoid highly normalized table designs, which require join queries to obtain the desired metrics.

5) Parallel execution
Hadoop can run MapReduce jobs in parallel, and many Hive queries leverage this parallelism automatically. However, a single, complex Hive query is frequently translated into several MapReduce jobs that are executed sequentially by default. Often, some of a query's MapReduce stages are not interdependent and can be run in parallel. They can then take advantage of spare capacity on the cluster, increasing utilization while lowering overall query execution time. Changing this behavior in Hive is as simple as toggling a single flag: SET hive.exec.parallel=true.
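Parallel execution can be enabled per session, and the number of concurrent stages can be capped if needed:

```sql
-- Allow independent MapReduce stages of a single query to run concurrently.
SET hive.exec.parallel = true;
-- Optional: limit how many stages may run at once (the default is 8).
SET hive.exec.parallel.thread.number = 8;
```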

6) Vectorization
Hive uses vectorization to handle a batch of rows at once rather than one row at a time. Each batch consists of column vectors, which are commonly arrays of primitive types, holding 1024 rows at a time. Operations are applied to entire column vectors, which improves the instruction pipelines and cache usage. Vectorization dramatically reduces query execution time, and it is simple to set up with just two parameters:
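The two settings that enable vectorized execution are:

```sql
-- Vectorized execution on the map side.
SET hive.vectorized.execution.enabled = true;
-- Vectorized execution on the reduce side as well.
SET hive.vectorized.execution.reduce.enabled = true;
```

Note that in older Hive versions, vectorized execution only applies to data stored in the ORC format.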


7) Optimizing Join
Map joins are efficient when the table on one side of the join is small enough to fit in memory. Hive provides an option called hive.auto.convert.join; when set to "true," Hive tries to convert joins into map joins automatically. When relying on this behavior, make sure auto-conversion is enabled in your Hive environment.
To make use of bucketing in a join, always set the bucketing flag, SET hive.enforce.bucketing=true, before writing data to the bucketed table. The SET hive.optimize.bucketmapjoin=true parameter instructs Hive to perform a bucket-level join during the map-stage join. Because bucketing guarantees that a key is present in a specific bucket, it also minimizes the number of scan cycles required to locate a particular key.
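The join-related flags mentioned above can be set together at the start of a session:

```sql
-- Convert joins to map joins automatically when one side fits in memory.
SET hive.auto.convert.join = true;
-- Enforce bucketing when writing to bucketed tables.
SET hive.enforce.bucketing = true;
-- Use bucket-level map joins for tables bucketed on the join key.
SET hive.optimize.bucketmapjoin = true;
```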

8) Cost-based Optimization
Cost-Based Optimization (CBO), a more recent enhancement to Hive, performs further optimizations based on query cost. This can drive a variety of decisions, such as how to order joins, which type of join to perform, the degree of parallelism, and so on. CBO generates an efficient execution plan by parsing the query into an operator tree, assigning costs to the operators, and then choosing the cheapest plan. For CBO to work well, you should generate column statistics for your tables.

Set the following options at the start of your query to use CBO:
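The commonly documented settings for enabling CBO and letting it fetch the statistics it needs are:

```sql
-- Turn on the cost-based optimizer.
SET hive.cbo.enable = true;
-- Answer simple queries (e.g. count(*)) directly from stored statistics.
SET hive.compute.query.using.stats = true;
-- Let the optimizer fetch column- and partition-level statistics.
SET hive.stats.fetch.column.stats = true;
SET hive.stats.fetch.partition.stats = true;
```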


Then, use Hive's ANALYZE command to collect statistics on the tables for which we wish to use CBO.
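For example, to gather table- and column-level statistics on a hypothetical customers table:

```sql
-- Basic table-level statistics (row count, file sizes, and so on).
ANALYZE TABLE customers COMPUTE STATISTICS;
-- Column-level statistics, which CBO uses for its cost estimates.
ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS;
```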

