How to archive file partitions for file count reduction in Hive


Recipe Objective: How to Archive file partitions for file count reduction in Hive?

Hive has built-in support for converting the files in existing partitions to a Hadoop Archive (HAR), which is one approach to reducing the number of files in a partition; the number of files in the filesystem directly affects memory consumption in the NameNode. In this recipe, we learn how to archive file partitions for file count reduction in Hive.

Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the link below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP mentioned in the EC2 instance. Log in via PuTTY/terminal and check whether HDFS and Hive are installed. If not, please use the links provided above for installation.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If HDFS and Hive are not visible in the Cloudera cluster, you can add them by clicking “Add Services” in the cluster to add the required services to your local instance.

Archiving for file count reduction in Hive:

In this recipe, we use the “user_info_part” table, created by partitioning the data based on the “profession” of the user. Let us first describe the partitioned table and check the data present in it, as shown below.

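A minimal sketch of this step, assuming the “user_info_part” table is partitioned on a string “profession” column (the column type and the exact inspection queries are assumptions, not from the source):

hive> describe formatted user_info_part;
hive> show partitions user_info_part;
hive> select * from user_info_part limit 5;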

Firstly, make sure the following settings are configured before using the archive. hive.archive.enabled permits archiving operations, hive.archive.har.parentdir.settable tells Hive that the parent directory of the HAR can be set while creating it, and har.partfile.size controls the size of the files that make up the archive (the archive contains roughly the partition size divided by har.partfile.size files).

hive> set hive.archive.enabled=true;
hive> set hive.archive.har.parentdir.settable=true;
hive> set har.partfile.size=1099511627776;

A partition can be archived using the “ARCHIVE” keyword in the “alter table” command. Once the command is issued, a MapReduce job performs the archiving. Unlike Hive queries, there is no output on the CLI to indicate the progress of the process. The syntax for archiving is given below:

alter table <table name> ARCHIVE partition (<partition spec>);
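For instance, with the “user_info_part” table above and a hypothetical partition value of profession='educator' (the value is an assumption for illustration):

hive> alter table user_info_part ARCHIVE partition (profession='educator');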

If necessary, the partition can be reverted to its original files with the “UNARCHIVE” keyword, as shown below.

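A sketch of the reverse operation, using the same table and the same hypothetical partition value:

hive> alter table user_info_part UNARCHIVE partition (profession='educator');

Once unarchived, the partition is served from its original files again, and queries against it run exactly as before.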

