Hadoop Sqoop Tutorial: Example of Data Export
Let us assume we have a business application that uses a Netezza database for data storage. Assume we have imported the data from Netezza tables and processed it in Hadoop to benefit from distributed processing. Once the data is processed in Hadoop, we need to load it back to Netezza to make it available to downstream jobs.
In this case too, we would face the same problems we faced in use case 1 if we try to load the data without any tool. A few of them are:
- Data cannot be loaded in parallel.
- Development time and code maintenance overhead.
- Special connectors need to be maintained for different databases.
- Data must first be copied to the local file system and then uploaded to the database system.
Here, we are interested in exporting the data in an easier and more optimal way.
For loading data back to database systems without any of the overheads mentioned above, Sqoop works perfectly. Sqoop exports data from the distributed file system to the database system very efficiently. It provides a simple command-line interface through which we can export data to different database systems by writing a simple Sqoop command.
Technical prerequisites for this use case are:
- Need to have Hadoop set up on your machine. Either install and configure it using the following link and then install Sqoop on Hadoop, or download a Cloudera or Hortonworks distribution VM image for a quick start; Sqoop comes with the distribution itself.
- Need to have the database installed. Based on the database system required, install the corresponding emulator. For Netezza, download it from the site below and install the database.
We have Control-A (\001) delimited data in an HDFS directory, and we need to load it into a Netezza table using Sqoop.
The record count of the data available in HDFS is as below.
$ hadoop fs -cat /raw/hadoop_output/* | wc -l
1742809
Record count of the table before the Sqoop export:
$ SELECT COUNT(*) FROM netezza_data_load;
COUNT
---------------------
1
$ sqoop export --connect jdbc:netezza://10.18.22.10:5480/BIG_DATA --username 'USER1' --password passwd --export-dir /raw/hadoop_output --table netezza_data_load -m 10 --input-fields-terminated-by '\001' --direct
The above command exports the data from the HDFS directory /raw/hadoop_output to the Netezza table netezza_data_load.
- --connect: Takes a JDBC URL and connects to the database specified in the URL
- --username: Username to connect to the Netezza database
- --password: Password for the username
- --export-dir: HDFS directory from which the data is copied
- --table: Destination table into which the data is loaded
- -m: Number of mappers to be launched
- --input-fields-terminated-by: Input data field delimiter
- --direct: Exports the data very fast using the Netezza connector
Sqoop provides connectors for the databases to load the data optimally. You can provide the --driver option to specify the JDBC driver class to use and --connection-manager to specify the connection manager class to use.
If we do not provide any connection manager, Sqoop checks the JDBC drivers on the Sqoop classpath for a database-specific JDBC connector (for example, based on the JDBC URL specified, it checks for a Netezza JDBC jar on the classpath). If Sqoop cannot find a matching connector, it falls back to its generic JDBC connection manager.
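To override that automatic selection, the driver class and connection manager can be pinned explicitly. A sketch, assuming the standard Netezza JDBC driver class (org.netezza.Driver) and Sqoop's generic connection manager; adjust both to your environment:

```shell
# Force a specific JDBC driver class and the generic connection manager
# instead of letting Sqoop auto-detect a connector from the JDBC URL.
sqoop export \
  --connect jdbc:netezza://10.18.22.10:5480/BIG_DATA \
  --driver org.netezza.Driver \
  --connection-manager org.apache.sqoop.manager.GenericJdbcManager \
  --username 'USER1' --password passwd \
  --export-dir /raw/hadoop_output \
  --table netezza_data_load \
  --input-fields-terminated-by '\001'
```

Note that explicitly choosing the generic manager gives up connector-specific optimizations such as --direct, so this is mainly useful for troubleshooting connector selection.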
If we use the --direct option, data gets loaded very fast even for huge volumes, because the Netezza connector uses the database's bulk-load (external table) mechanism. Otherwise, Sqoop prepares INSERT statements and hits the database for each one, which takes much longer.
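When --direct is not an option, JDBC-mode exports can still be sped up by batching the INSERT statements. A sketch using Sqoop's standard export tuning knobs (the values here are illustrative, not tuned):

```shell
# Batch the JDBC-mode export: group 100 rows per INSERT statement and
# 100 statements per transaction, and enable JDBC batch execution.
# This cuts down the number of round trips to the database.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:netezza://10.18.22.10:5480/BIG_DATA \
  --username 'USER1' --password passwd \
  --export-dir /raw/hadoop_output \
  --table netezza_data_load \
  -m 10 \
  --input-fields-terminated-by '\001' \
  --batch
```

The -D generic arguments must appear immediately after the tool name (export), before the tool-specific options.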
Sqoop has a bug when using the --direct option specifically with Netezza. If you use --direct while exporting data to Netezza, keep a few points in mind:
- The user name should be UPPER CASE.
- The owner of the Netezza table and the user inserting the data into the table should be the same.
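Because of the upper-case requirement, it is safest to normalize the username in the shell before building the export command. A small portable sketch (the username here is hypothetical):

```shell
# Normalize a (hypothetical) username to upper case before passing it to
# sqoop export --username; Netezza direct mode expects an upper-case user.
nz_user='user1'
nz_user=$(printf '%s' "$nz_user" | tr '[:lower:]' '[:upper:]')
echo "$nz_user"
# prints: USER1
```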
Record count after the Sqoop export:
$ SELECT COUNT(*) FROM netezza_data_load;
COUNT
---------------------
1742810
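The same post-export check can be run from the Hadoop side without logging in to Netezza, using Sqoop's eval tool. A sketch reusing the connection details from the export command above:

```shell
# Run an ad-hoc verification query against Netezza through Sqoop
# and print the result on the edge node.
sqoop eval \
  --connect jdbc:netezza://10.18.22.10:5480/BIG_DATA \
  --username 'USER1' --password passwd \
  --query 'SELECT COUNT(*) FROM netezza_data_load'
```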
To conclude, Sqoop transfers bulk data between RDBMSs and distributed systems very efficiently. It saves developers the unnecessary effort of writing and maintaining custom transfer code, and because it transfers data in parallel, the transfer is also very fast. With its active contributors and community support, Sqoop is a valuable tool in the Hadoop ecosystem.