Apache Sqoop

Sqoop - Structured Data Import tool for Hadoop HDFS

Apache Sqoop is used to efficiently transfer bulk data between relational databases and Hadoop HDFS. Sqoop is a straightforward command-line application that can transfer individual tables or entire databases to HDFS, and can import SQL databases or mainframe data directly into the Hive data warehouse. Sqoop is also used to export data out of Hadoop into structured databases. Sqoop works with MySQL, Teradata, Oracle, and Netezza, among others. Sqoop's name is derived from SQL + Hadoop, reflecting its "SQL to Hadoop; Hadoop to SQL" capability.

What is Sqoop used for?

  • Hadoop's HDFS is a great store for voluminous and varied big data, but in most companies the data lives in disparate sources; a single uniform storage facility is rare in a business. Bringing all this data into HDFS would be great for analysis, since MapReduce can then run tasks on top of it, but importing data from different databases into HDFS is a cumbersome task. Doing it with hand-written MapReduce jobs is not only time consuming; more alarmingly, it increases the load on the database server. Alternatively, you can periodically transfer data to HDFS and let your MapReduce jobs perform analysis there, or take an in-HDFS copy and load it into Hive to run ad-hoc queries. All of this sounds, and is, very complicated. Sqoop solves the problem of importing data into HDFS with a single command line.
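The "single command line" above can be sketched as follows. This is a minimal, illustrative example; the hostname, database, table, user, and target directory are assumptions, not values from this page.

```shell
# Minimal sketch: import one RDBMS table into HDFS with a single command.
# Connection details (dbserver, sales, orders, analyst) are hypothetical.
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/sales/orders
```

The `-P` flag prompts for the password interactively instead of exposing it on the command line.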

How Does Sqoop Work?

The workflow below describes how data gets imported through Sqoop into HDFS.
  • Sqoop allows you to transfer data from a relational database or a mainframe dataset. For relational databases, Sqoop reads the table row by row and imports it into HDFS, storing each row as a record. For mainframe datasets, Sqoop reads each record and imports it into HDFS. Records are stored in HDFS either as delimited text files, with a user-specified delimiter such as a comma or a tab, or as binary data serialized as Avro or SequenceFiles. The Sqoop import tool is used to transfer data into Hadoop HDFS: when invoked, Sqoop retrieves the table's properties and generates a MapReduce job that imports the entire table body.
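The output-format choices described above (delimited text with a chosen separator, or binary Avro serialization) map onto standard import flags. A hedged sketch, with hypothetical connection and table names:

```shell
# Tab-delimited text import (connection and table names are illustrative):
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username analyst -P \
  --table orders \
  --fields-terminated-by '\t'

# The same import serialized as Avro records instead of delimited text:
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username analyst -P \
  --table orders \
  --as-avrodatafile
```

Substituting `--as-sequencefile` produces SequenceFile output instead of Avro.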

Advantages of Sqoop

  1. Reduces the time to import data into HDFS
    Importing data into HDFS from sequential files is time consuming, and writing a custom MapReduce job to transfer the data is inefficient. Sqoop uses alternate methods of importing data that cut the import time to a fraction.
  2. Aids fast data transfers
    Sqoop allows bulk copying of data from disparate sources.
  3. Supports multiple DBMSs
    Sqoop supports multiple DBMSs such as Oracle, MySQL, DB2, and PostgreSQL, and also supports data import into Hive and HBase.
  4. Balances the load on the database server
    Sqoop balances the load on the database server by limiting excessive storage and processing demands, offloading the work to other systems in the cluster.

Sqoop Blogs

Sqoop vs. Flume - Battle of the Hadoop ETL tools
Apache Sqoop (SQL-to-Hadoop) is a lifesaver for anyone experiencing difficulties moving data from a data warehouse into the Hadoop environment. Apache Sqoop is an effective Hadoop tool for importing data from RDBMSs like MySQL, Oracle, etc. into HBase, Hive or HDFS. Click to read more.
Hadoop Components and Architecture: Big Data and Hadoop Training
The Sqoop component is used for importing data from external sources into related Hadoop components like HDFS, HBase or Hive. It can also be used for exporting data from Hadoop to other external structured data stores. Sqoop parallelizes data transfer, mitigates excessive loads, allows efficient data imports and analysis, and copies data quickly. Click to read more.

Sqoop Tutorials

Fundamentals of Apache Sqoop
Sqoop is used to import data from external datastores into the Hadoop Distributed File System or related Hadoop ecosystems like Hive and HBase. Similarly, Sqoop can extract data from Hadoop or its ecosystems and export it to external datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc. Click to read more.
Hadoop Sqoop Tutorial: Example of Data Export
Let us assume we have a business application that uses a Netezza database for data storage, and that we have imported the data from Netezza tables and processed it in Hadoop to benefit from distributed processing. Once the data is processed in Hadoop, we need to load it back into Netezza to make it available to downstream jobs. Click to read more.
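The export step described in this tutorial teaser can be sketched with the sqoop export tool. The JDBC URL, port, table, and HDFS path below are illustrative assumptions, not values from the tutorial:

```shell
# Hedged sketch: push processed results from HDFS back into Netezza.
# Host, database, table, and directory are hypothetical placeholders.
sqoop export \
  --connect jdbc:netezza://nzhost:5480/analytics \
  --username analyst -P \
  --table daily_summary \
  --export-dir /data/processed/daily_summary
```

The target table must already exist in the database; Sqoop export inserts the HDFS records into it.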
Hadoop Sqoop Tutorial: Example of Data Aggregation
Let us suppose we have an online application that uses a MySQL database to store user information and activity. As the number of visitors to the site increases, the data grows proportionally. Processing very large volumes of data in an RDBMS environment becomes a bottleneck, and when the data is huge enough, an RDBMS is no longer feasible. That is where distributed systems help. Click to read more.

Sqoop Interview Questions

  1. Explain about some important Sqoop commands other than import and export.

    • Create Job (--create): here we create a job with the name my job, which can import table data from an RDBMS table to HDFS. The following command is used to create a job that imports data from the employee table in the db database to an HDFS file. Read more
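A saved-job definition along these lines can be sketched as follows; the job name and connection details are assumptions chosen to match the employee/db example in the answer:

```shell
# Define a saved import job (note the required "--" separator before
# the tool name; connection details are hypothetical):
sqoop job --create myjob \
  -- import \
  --connect jdbc:mysql://dbserver/db \
  --username analyst -P \
  --table employee

# Inspect and run saved jobs later:
sqoop job --list
sqoop job --exec myjob
```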
  2. How Sqoop can be used in a Java program?

    • The Sqoop jar must be included in the classpath of the Java code. After that, the Sqoop.runTool() method must be invoked, passing the necessary parameters to Sqoop programmatically, just as on the command line. Read more.
  3. What is the process to perform an incremental data load in Sqoop?

    • Performing an incremental data load in Sqoop means synchronizing only the modified or updated data (often referred to as delta data) from the RDBMS to Hadoop. The delta data can be fetched through the incremental load command in Sqoop. Read more.
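The incremental load described in this answer can be sketched with Sqoop's incremental import flags. Column names and the last-value below are illustrative assumptions:

```shell
# Append-mode incremental import: fetch only rows whose check column
# exceeds the recorded --last-value (names and values are hypothetical).
sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username analyst -P \
  --table orders \
  --incremental append \
  --check-column order_id \
  --last-value 100000

# The alternative "lastmodified" mode tracks a timestamp column instead:
#   --incremental lastmodified --check-column updated_at \
#   --last-value "2015-01-01 00:00:00"
```

Combining incremental imports with a saved job lets Sqoop record the new last-value automatically after each run.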

Sqoop Slides

Sqoop Videos

Sqoop Q&A

  1. Where is MySQL jdbc connector for Sqoop

    • I believe that I need the following file: mysql-connector-java-5.1.32.tar.gz, which I think I can get from Oracle, if I create an account with them. But I think this course has provided a copy but I cannot find it. Is that correct? Click to read answer
  2. unable to load hbase using sqoop

    • executed the below command sqoop import --connect dbc:mysql://\retail_db --username root --password cloudera --table emp --hbase-table hemp --column-family pers --hbase-row-key id --hbase-create-table. Click to read answer
  3. Sqoop Connect

    • I'm not able to connect to mysql db or ping to the local windows machine. tested mysql url with heidi, attached screen shot- which shows error when tried to run import. granted user mysql privileges,moved connector jar to /usr/lib/sqoop/lib. tried changing vm settings Network Adapter->Bridged:didnt work. Any suggestions? Click to read answer.

Sqoop Assignments

You will need to import data from MySQL to HDFS using Sqoop.
