How to select the Files to Import using Sqoop

Recipe Objective: How to select the Files to Import using Sqoop?

This recipe shows how to select a file (table) present in an RDBMS and insert its data into HDFS using the Sqoop tool. Sqoop is designed to transfer data between HDFS and relational databases, connecting through a JDBC driver.

Prerequisites:

Before proceeding with the recipe, make sure Hadoop and Sqoop are installed on your local EC2 instance.

Steps to set up the environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of that instance.
  • To do this, type “&lt;your public IP&gt;:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop and Sqoop are installed.
  • If they are not visible in the Cloudera cluster, you may add them by clicking “Add Services” in the cluster. You can also verify the installations from the shell, as shown below.
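As a quick sanity check (assuming Hadoop and Sqoop are on the instance’s PATH), you can confirm both installations from the shell:

hadoop version

sqoop version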

Transferring files from MySQL to HDFS:

The following steps are involved in pulling the data present in a MySQL table and inserting it into HDFS using Sqoop. In this example, we use the “retailinfo” table, created beforehand in our “test” database.

Step 1: Log in to MySQL using

mysql -u root -p

Enter the required credentials.
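Once logged in, you can list the available databases to confirm the connection (the databases shown will depend on your setup):

show databases;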


Step 2: Before transferring the selected file, check how many records are present in it. To do this, switch to the required database:

use &lt;database name&gt;;

Now, check the number of records present in the table, i.e., the file whose data we plan to move into HDFS, using count(*):

select count(*) from &lt;table name&gt;;
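For instance, with the “test” database and “retailinfo” table used in this recipe, the commands would be (the actual count depends on your data):

use test;

select count(*) from retailinfo;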


Step 3: Before running the “sqoop import” command, ensure that the target directory does not already exist in HDFS; otherwise, the import command throws an error. To make sure, delete the directory that we wish to use as our target directory:

hadoop fs -rm -r &lt;target directory&gt;
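For the target directory used later in this recipe, the command would be (if the directory does not exist, the command simply reports that and nothing is deleted):

hadoop fs -rm -r /user/root/online_basic_command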


Step 4: Let us now run the “sqoop import” job to pull the data of the selected file from MySQL and insert it into HDFS using the command:

sqoop import \
--connect jdbc:mysql://localhost/&lt;database name&gt; \
--table &lt;table name&gt; \
--username &lt;username&gt; --password &lt;password&gt; \
--target-dir &lt;target location in HDFS&gt; \
-m &lt;no. of mapper jobs you wish to create&gt;

In our example, the database is “test,” the selected file is the table “retailinfo,” and our target location in HDFS is in the directory “/user/root/online_basic_command.” We proceeded with only one mapper job.
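Filled in with the values from this example (assuming the same root credentials used to log in to MySQL; supply your own password in place of the placeholder), the command would be:

sqoop import \
--connect jdbc:mysql://localhost/test \
--table retailinfo \
--username root --password &lt;password&gt; \
--target-dir /user/root/online_basic_command \
-m 1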


Upon successful data transfer, Sqoop prints the MapReduce job summary, ending with a line reporting the number of records retrieved.

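To verify the import, list the target directory in HDFS and inspect its contents; with a single mapper, Sqoop writes the data to one part file, part-m-00000:

hadoop fs -ls /user/root/online_basic_command

hadoop fs -cat /user/root/online_basic_command/part-m-00000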
