What are the different file formats supported in Sqoop?

This recipe explains the different file formats supported in Sqoop.

Recipe Objective: What are the different file formats supported in Sqoop?

In this recipe, we see the different file formats supported in Sqoop. Sqoop can import data into HDFS in several file formats: plain text (the default), Avro data files, sequence files, and Parquet files. Irrespective of the data format in the RDBMS tables, once you specify the required file format in the sqoop import command, the Hadoop MapReduce job running at the backend automatically takes care of the conversion.
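
Each format has its own switch on sqoop import. As a quick reference (a minimal sketch; the ellipsis stands for the connection options shown in the steps below):

sqoop import ... --as-textfile      # plain text, the default
sqoop import ... --as-sequencefile  # binary key-value sequence files
sqoop import ... --as-avrodatafile  # Avro data files
sqoop import ... --as-parquetfile   # columnar Parquet files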

Prerequisites:

Before proceeding with the recipe, make sure Hadoop, Sqoop, and MySQL are installed on your local EC2 instance.

Steps to set up the environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager with the public IP mentioned in the EC2 instance.
  • To do this, type "<your public IP>:7180" in the web browser and log in to Cloudera Manager, where you can check whether Hadoop and Sqoop are installed (you can also verify them from the shell, as shown below).
  • If they are not visible in the Cloudera cluster, you may add them by clicking "Add Services" in the cluster to add the required services to your local instance.
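
A quick way to confirm both installations from the instance's shell (a minimal sketch; assumes you are logged in to the EC2 instance):

hadoop version
sqoop version

Both commands print version and build information if the services are installed correctly.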

Importing data in different file formats:

Step 1: Log in to MySQL using

mysql -u root -p
use <database name>;
show tables;

Enter the required credentials when prompted, then check the tables in the database. We use the "flights_info" table from the "test" database to demonstrate this recipe.

Step 2: View the data in the "flights_info" table, which we use to demonstrate importing data in different file formats.

select * from <table name>;

Step 3: Before running the "sqoop import" command, make sure that the target directory does not already exist in HDFS; otherwise, the import command throws an error. To ensure this, delete the directory that we wish to use as our target directory.

hadoop fs -rm -r <target directory>

Here, we import into two different file formats and check how the data is stored in HDFS, using two target directories: "flight_parquet_command" and "flights_sequence_command," one for each file format. The corresponding cleanup commands are sketched below.
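
For example (a minimal sketch; the relative paths assume both directories sit under the current HDFS user's home directory):

hadoop fs -rm -r flight_parquet_command
hadoop fs -rm -r flights_sequence_command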

Step 4: Importing the data into HDFS in “parquet file” format:

Parquet is a popular column-oriented file format in the Hadoop ecosystem, including Apache Hive. Because queries read only the columns they need, it can significantly outperform standard text files for analytical workloads.

sqoop import \
--connect jdbc:mysql://localhost/<database name> \
--table <table name> \
--username <username> --password <password> \
--target-dir <target location in HDFS> \
-m <no. of mapper jobs you wish to create> \
--as-parquetfile

The above command stores the data in Parquet format in the target directory.
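
To verify the import (a minimal sketch; assumes the target directory used in the command above):

hadoop fs -ls <target location in HDFS>

Parquet is a binary, columnar format, so the part files are not human-readable; a tool such as parquet-tools can inspect their contents if it is installed.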

Step 5: Importing the data into HDFS in "sequence file" format:
The sequence file format is a binary key-value format commonly used with Hadoop MapReduce jobs.

sqoop import \
--connect jdbc:mysql://localhost/<database name> \
--table <table name> \
--username <username> --password <password> \
--target-dir <target location in HDFS> \
-m <no. of mapper jobs you wish to create> \
--as-sequencefile

The above command stores the data in sequence file format in the target directory.
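
To spot-check the result (a minimal sketch; assumes the target directory used in the command above, with part-m-00000 as the conventional name of the first mapper's output file), hadoop fs -text decodes sequence files into readable text:

hadoop fs -ls <target location in HDFS>
hadoop fs -text <target location in HDFS>/part-m-00000 | head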
