Explain the use of explode and lateral view in hive?

This recipe explains the use of explode and lateral view in Hive.


Recipe Objective

In big data scenarios, Hive tables often contain complex column types such as arrays and maps. Hive provides the EXPLODE() table-generating function and the LATERAL VIEW clause to flatten such collection columns into rows so they can be queried with ordinary SQL. This recipe walks through creating a Hive table and views, and then demonstrates explode and lateral view on an array column.


Step 1 : Prepare the dataset

Here we use an employee-related comma-separated values (CSV) dataset to create a Hive table locally.


Before creating a table, open the Hive shell and create a database using the queries below:

hive> create database dezyre_db;
hive> use dezyre_db;


Step 2 : Create a Hive Table and Load the Data into the Table and verify the Data

Here we create a Hive table into which we will load the data. Use the query below to create the table:

CREATE TABLE employee (
  employee_id int,
  company_id int,
  seniority int,
  salary int,
  join_date string,
  quit_date string,
  dept string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="1");


Load the data into the Hive table and verify it:

load data local inpath '/home/bigdata/Downloads/empdata.csv' into table employee;

Verify the data by running a select query as follows.
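For instance, a quick spot-check of the first few rows and a row count (the LIMIT value here is an arbitrary choice for illustration):

```sql
-- Inspect a handful of rows to confirm the load succeeded
select * from employee limit 5;

-- Compare the row count against the source CSV (minus the skipped header)
select count(*) from employee;
```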

Explanation of views in hive

Views are logical data structures that can be used to simplify queries by either hiding the complexities such as joins, subqueries, and filters or by flattening the data. Hive views do not store data. Once the Hive view is created, its schema is frozen immediately. Subsequent changes to the underlying tables will not be reflected in the view’s schema.
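As an illustration of hiding query complexity, a view can wrap an aggregation so callers query it like a plain table. The view and column names below are our own choices for this sketch, not part of the recipe:

```sql
-- A view that hides a GROUP BY from its callers
-- (dept_avg_salary and avg_salary are illustrative names)
CREATE VIEW dept_avg_salary AS
SELECT dept, avg(salary) AS avg_salary
FROM employee
GROUP BY dept;

-- Callers need not know an aggregation is involved
SELECT * FROM dept_avg_salary WHERE avg_salary > 90000;
```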

Step 3 : Creating a view

Here we create a view, using a select statement on the employee table, that contains only the employees whose salary is greater than 1 lakh (100,000):

hive> CREATE VIEW emp_100000 AS SELECT * FROM employee WHERE salary>100000;


We can verify whether the view was created, and inspect its definition and format, using the queries below:

hive> show create table emp_100000;


hive> desc formatted emp_100000;


Step 4 : Alter view properties using an alter statement

hive> ALTER VIEW emp_100000 SET TBLPROPERTIES ('comment' = 'This is my view table');


Step 5 : Drop a view using a drop statement

To drop the view, run the query below:

hive> DROP VIEW emp_100000;


Step 6 : Usage of Explode and Lateral View

The LATERAL VIEW statement is used with user-defined table-generating functions such as EXPLODE() to flatten the map or array type of a column. The explode function can be used on both ARRAY and MAP types with LATERAL VIEW.
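When explode is applied to a MAP, it returns two columns (key and value) rather than one. A minimal sketch, assuming a hypothetical table phone_book with a map&lt;string,string&gt; column named contacts (this table is not part of the recipe's dataset):

```sql
-- Exploding a map yields one row per entry, with separate key and value columns
SELECT name, contact_type, contact_value
FROM phone_book
LATERAL VIEW explode(contacts) c AS contact_type, contact_value;
```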

Before demonstrating the explode and lateral view functions, we create a table and insert data into it. To create a table in Hive with an array-type column:

hive> create table std_course_details(
        std_id int,
        stud_name string,
        location string,
        course array&lt;string&gt;
      );


Insert data into the table using an insert statement:

hive> INSERT INTO TABLE std_course_details VALUES
        (1,'vamshi','hyd',array('hive','hadoop','spark')),
        (2,'chandana','bangalore',array('reactjs','javascript')),
        (3,'Divya','pune',array('python','pyspark','airflow','spark')),
        (4,'srikanth','pune',array('java','spring boot')),
        (5,'sreethan','pune',array('c','c++'));

To verify the inserted data:
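A simple select shows every row, including the array column (as a sketch; Hive renders arrays in bracketed form):

```sql
-- Show the inserted rows together with their array values
select * from std_course_details;
```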

  • Explode function: The explode function explodes an array into multiple rows. It returns a row-set with a single column (col), containing one row for each element of the array.

    Here we split the array column values into rows by running the query below:

    hive> select explode(course) from std_course_details;

    The query returns one row per array element (reconstructed from the data inserted above):

    hive
    hadoop
    spark
    reactjs
    javascript
    python
    pyspark
    airflow
    spark
    java
    spring boot
    c
    c++

  • Lateral View: A lateral view, used together with explode, expands the array into multiple rows while preserving the other columns of the source row alongside each generated element.

    When you use a lateral view along with the explode function, you get a result like the one below.

    hive> select std_id,stud_name,location,courses from std_course_details LATERAL VIEW explode(course) courses_list as courses;

    Output of the above query, one row per student per course (reconstructed from the inserted data):

    1  vamshi    hyd        hive
    1  vamshi    hyd        hadoop
    1  vamshi    hyd        spark
    2  chandana  bangalore  reactjs
    2  chandana  bangalore  javascript
    3  Divya     pune       python
    3  Divya     pune       pyspark
    3  Divya     pune       airflow
    3  Divya     pune       spark
    4  srikanth  pune       java
    4  srikanth  pune       spring boot
    5  sreethan  pune       c
    5  sreethan  pune       c++
