How to work with virtual columns in Hive


Recipe Objective: How to work with virtual columns in Hive?

Like many other databases, Hive provides virtual columns that are not stored in the data files but can be referenced in queries. They typically expose metadata that can be very handy. One is INPUT__FILE__NAME, the name of the input file for a mapper task; the other is BLOCK__OFFSET__INSIDE__FILE, the current global file position. For a block-compressed file, it is instead the file offset of the current block, i.e., the offset of the current block's first byte within the file. In this recipe, we work with these virtual columns in Hive.
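Because these virtual columns behave like ordinary columns, they can also appear in a WHERE clause, for example to find the rows that came from one specific data file. A minimal sketch (the employee table is from this recipe; the file-name pattern is hypothetical):

```sql
-- Return only rows read from a particular backing file
-- (the path fragment 'employee_part1' is an illustrative assumption)
SELECT *
FROM employee
WHERE INPUT__FILE__NAME LIKE '%employee_part1%';
```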

Prerequisites:

Before proceeding with the recipe, make sure single-node Hadoop and Hive are installed on your local EC2 instance. If they are not already installed, follow the link below to do the same.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP mentioned in the EC2 instance. Log in via PuTTY/terminal and check whether HDFS and Hive are installed. If they are not installed, please use the links provided above for installation.
  • Type "<your public IP>:7180" in the web browser and log in to Cloudera Manager, where you can check whether Hadoop is installed.
  • If they are not visible in the Cloudera cluster, you may add them by clicking "Add Services" in the cluster to add the required services to your local instance.

Working with Virtual columns in Hive:

Step 1: Switch to a database using the use <database name> command. Here, "demo." The command show tables; lists all the tables present in that database.

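The commands for this step look like the following; the database name demo is from the recipe:

```sql
-- Switch to the target database
USE demo;

-- List all tables in the current database
SHOW TABLES;
```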

Step 2: Look at the table schema by issuing the describe <table name> command. Here, "employee."

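For the employee table used in this recipe, the command is:

```sql
-- Show the column names and types of the employee table
DESCRIBE employee;
```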

Step 3: Check the data present in the table using the select * from <table name> command.

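Concretely, for the employee table:

```sql
-- Dump every row of the employee table
SELECT * FROM employee;
```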

Step 4: We can retrieve the input file name for the mapper task by selecting the virtual column INPUT__FILE__NAME, and the current global file position by selecting BLOCK__OFFSET__INSIDE__FILE. Let us now pass these two virtual columns along with emp_id from the employee table and observe the output. The output contains three values for each record: the mapper task's input file name, the current global file position, and emp_id.

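The query for this step can be sketched as follows; the employee table and emp_id column are from the recipe. Each result row shows the HDFS path of the file the record was read from, the byte offset of the record within that file, and the id:

```sql
-- Virtual columns are selected like ordinary columns;
-- note the double underscores in their names
SELECT INPUT__FILE__NAME,
       BLOCK__OFFSET__INSIDE__FILE,
       emp_id
FROM employee;
```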

