Hadoop Sqoop Tutorial: Example of Data Aggregation
Let us suppose we have an online application that uses a MySQL database to store user information and user activity. As the number of visitors to the site grows, the data grows proportionally, and processing very large datasets in an RDBMS becomes a bottleneck. When the data outgrows what an RDBMS can handle feasibly, distributed systems help. For this, we first need to bring the data into the distributed system and then process it there, and the data transfer itself should be fast.
Bringing data into a distributed system is not that tedious a job if we only have to pull from one database. In practice, though, business applications use several different kinds of databases. The following are some of the problems we would face if we wrote the data transfer code ourselves.
- Data cannot be fetched in parallel.
- Multiple connectors must be maintained for different databases.
- Code maintenance overhead.
- Time spent on code development.
- Data must first be imported locally and then copied to the distributed file system (a two-step process).
- For incremental fetches, the last fetched value must be stored and maintained manually.
- Loading data into Hive or HBase requires writing and maintaining separate programs.
- Target tables must be created beforehand.
- Data cleansing has to be done as a separate step.
- Converting data into specialized file formats requires yet another step.
Here, we are interested only in the problem of fetching data from different databases into a distributed system such as Hadoop.
All of the problems mentioned above typically require manual intervention, so we need a solution that takes care of most of them and makes data fetching easy. Sqoop is a tool that bulk-transfers data between RDBMS systems and distributed systems, and it addresses all of the problems listed above. It provides a simple command-line interface through which we can fetch data from different database systems with a single Sqoop command.
The technical prerequisites for this use case are:
- A working Hadoop setup on your machine, with Sqoop installed on top of it. You can either install and configure Hadoop and Sqoop yourself, or download a Cloudera or Hortonworks distribution VM image for a quick start; Sqoop ships with those distributions.
- A database installation. Install whichever database system your use case requires; for MySQL, download the installer and set up the database.
We have a table customer_info in a MySQL database, and we would like to import the table into Hadoop.
The table definition is as below:
CREATE TABLE customer_info (
  id            BIGINT,
  first_name    VARCHAR(50),
  last_name     VARCHAR(50),
  address       VARCHAR(150),
  gender        VARCHAR(10),
  mobile        VARCHAR(12),
  email_address VARCHAR(30)
);
To list all the databases that exist in MySQL, we can use a Sqoop command directly:

$ sqoop-list-databases --connect jdbc:mysql://10.200.29.27/test --username user1 --password passwd
test

The above command prints all databases that exist in MySQL; here "test" is the only database available. Similarly, to list all the tables in a specified database, we can use Sqoop as below:

$ sqoop-list-tables --connect jdbc:mysql://10.200.29.27/test --username user1 --password passwd
customer_events
customer_info
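A side note on the --password flag used above: passing the password on the command line exposes it in shell history and in `ps` output. Sqoop also accepts -P (prompt interactively) and --password-file; a minimal sketch, assuming the file path below is free to use:

```shell
# Store the password in a file only the owner can read, then point Sqoop at it.
# The path ~/.sqoop.pwd is an arbitrary choice for this example; --password-file
# expects the file to contain the password with no trailing newline.
printf '%s' 'passwd' > "$HOME/.sqoop.pwd"
chmod 400 "$HOME/.sqoop.pwd"
# sqoop-list-tables --connect jdbc:mysql://10.200.29.27/test \
#     --username user1 --password-file "file://$HOME/.sqoop.pwd"
```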
We have test data of 8 records in the customer_info table in MySQL.
Data in the database is as below:

mysql> select * from customer_info;
+------------+------------+-----------+---------+--------+-------------+--------------------------------+
| id         | first_name | last_name | address | gender | mobile      | email_address                  |
+------------+------------+-----------+---------+--------+-------------+--------------------------------+
| 5483453457 | Alex       | Madd      | St 1,NY | Male   | 7656545677  | firstname.lastname@example.org |
| 5483453457 | Alex       | Madd      | St 1,NY | Male   | 7656545679  | email@example.com              |
| 5234453455 | Alen       | Wail      | St 1,NY | Male   | 7653643677  | firstname.lastname@example.org |
| 5456753457 | Maddy      | car       | St 1,NY | Male   | 7656346377  | email@example.com              |
| 5483643456 | Josh       | Eddie     | St 1,NY | Male   | 7656523457  | firstname.lastname@example.org |
| 5233445457 | Bill       | Karls     | St 1,NY | Male   | 8326545677  | email@example.com              |
| 6583453437 | Max        | Rodger    | St 1,NY | Male   | 76565455657 | firstname.lastname@example.org |
| 7433453354 | Rodger     | fur       | St 1,NY | Male   | 7656545677  | email@example.com              |
+------------+------------+-----------+---------+--------+-------------+--------------------------------+
8 rows in set (0.00 sec)

The following Sqoop command fetches the data to HDFS:

$ sqoop import --connect jdbc:mysql://10.200.99.27/test --username user1 --password passwd --table customer_info -m 2 --split-by id --target-dir /tmp/customer_info --fields-terminated-by '\001'

The above command imports the data from the customer_info table into the HDFS directory /tmp/customer_info.
The options used are:
- --connect Takes a JDBC URL and connects to the database specified in the URL
- --username Username to connect to the database
- --password Password for the user
- --table Source table from which data is to be imported
- -m Number of mappers to launch in parallel
- --split-by Column used to divide the data between the mappers
- --target-dir HDFS directory into which the imported data is stored
- --fields-terminated-by Delimiter placed between fields in the imported data
So, with a simple Sqoop command, we have imported data from a database to HDFS; we could similarly load directly into Hive or HBase. The above command launched two mappers because we set the -m option to 2: Sqoop takes the column specified in --split-by and divides its value range between the mappers. That is how it achieves parallelism.
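The split computation can be sketched in plain shell using the id range from the sample data above (min 5233445457, max 7433453354). This mimics the boundary query Sqoop issues (SELECT MIN(id), MAX(id)) and how it carves the range into one interval per mapper; Sqoop's exact internal boundary handling may differ slightly:

```shell
# Roughly what Sqoop does with --split-by id and -m 2: read MIN(id) and
# MAX(id), then divide the range evenly so each mapper gets its own slice.
MIN=5233445457   # SELECT MIN(id) FROM customer_info
MAX=7433453354   # SELECT MAX(id) FROM customer_info
MAPPERS=2
STEP=$(( (MAX - MIN) / MAPPERS ))
SPLIT=$(( MIN + STEP ))
echo "mapper 1: WHERE id >= $MIN AND id < $SPLIT"
echo "mapper 2: WHERE id >= $SPLIT AND id <= $MAX"
```

With these boundaries, the first mapper would pull six of the eight sample rows and the second mapper the remaining two (Max and Rodger), which also shows why a skewed split column can leave mappers unevenly loaded.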
We could use the --direct option to speed up the import; for MySQL, this bypasses JDBC and uses the mysqldump utility instead.
We need to use the --incremental option if the data import is incremental. The Sqoop job then fetches only the records whose value in the column given by --check-column is greater than the value specified in --last-value.
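The effect of --incremental append can be sketched without a cluster: Sqoop in effect applies a WHERE check-column > last-value filter, which we can mimic over a few of the sample ids with awk (the id/name pairs below are taken from the sample table):

```shell
# Mimic `--incremental append --check-column id --last-value 6583453437`:
# only rows whose id exceeds the stored last-value are treated as "new".
LAST_VALUE=6583453437
NEW_ROWS=$(printf '%s\n' \
    '5233445457 Bill'   \
    '6583453437 Max'    \
    '7433453354 Rodger' \
  | awk -v last="$LAST_VALUE" '$1 + 0 > last + 0')
echo "$NEW_ROWS"   # only the Rodger row qualifies
```

On a real run, Sqoop prints the new --last-value at the end of the job; storing and reusing it (or using a saved Sqoop job, which stores it automatically) is what removes the manual bookkeeping listed among the problems earlier.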
$ hadoop fs -cat /tmp/customer_info/* | cat -v
5483453457^AAlex^AMadd^ASt 1,NY^AMale^A7656545677^Aalex.firstname.lastname@example.org
5483453457^AAlex^AMadd^ASt 1,NY^AMale^A7656545679^Aalex.email@example.com
5234453455^AAlen^AWail^ASt 1,NY^AMale^A7653643677^Aalen.firstname.lastname@example.org
5456753457^AMaddy^Acar^ASt 1,NY^AMale^A7656346377^Amaddy.email@example.com
5483643456^AJosh^AEddie^ASt 1,NY^AMale^A7656523457^Ajosh.firstname.lastname@example.org
5233445457^ABill^AKarls^ASt 1,NY^AMale^A8326545677^Abill.email@example.com
6583453437^AMax^ARodger^ASt 1,NY^AMale^A76565455657^Amax.firstname.lastname@example.org
7433453354^ARodger^Afur^ASt 1,NY^AMale^A7656545677^Arodger.email@example.com
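The ^A sequences shown by cat -v are the \001 (Ctrl-A) bytes we chose with --fields-terminated-by; downstream tools simply set that byte as their field separator. A quick sketch on one of the imported lines:

```shell
# Rebuild one imported record with \001 delimiters, then split it the way a
# consumer of this HDFS data would, using \001 as the awk field separator.
LINE=$(printf '5483453457\001Alex\001Madd\001St 1,NY\001Male\0017656545677\001alex.firstname.lastname@example.org')
NAME=$(printf '%s\n' "$LINE" | awk -F'\001' '{print $2, $3}')
echo "$NAME"   # -> Alex Madd
```

Choosing \001 over a comma matters here: the address field ("St 1,NY") contains a comma, so a comma-delimited export would split it into two fields.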
To conclude, Sqoop transfers bulk data between RDBMS systems and distributed systems efficiently, sparing developers the unnecessary effort of writing and maintaining their own transfer code. Because Sqoop moves the data in parallel, the transfer is also very fast. With its active contributors and support, Sqoop is a valuable tool in the Hadoop world.