ProjectPro wants you to be prepared with Hadoop Interview Questions around all the Hadoop components in the big data ecosystem so that you can answer any Hadoop interview question that gets thrown your way. This blog post covers commonly asked Hadoop Sqoop Interview Questions and Answers that will help you get through your next Hadoop job interview.
You are sitting in the lobby waiting to go in for your Hadoop job interview; mentally, you have prepared dozens of Hadoop Interview Questions and Answers by referring to these blogs –
Needless to say, you are confident that you are going to nail this Hadoop job interview. But then, the interviewer instead of beginning the Hadoop interview with questions around Hadoop MapReduce, HDFS, Pig, Hive, throws a curveball at you by asking “What are the possible file formats for importing data using Sqoop in Hadoop?”
Before we dive into Apache Sqoop Hadoop interview questions and answers, let’s take a look at why Sqoop was developed and what its significance is in the Hadoop ecosystem-
Suppose you want to process legacy data or lookup tables present in an RDBMS using Hadoop MapReduce. The straightforward solution is to read the data from the RDBMS inside the mappers and process it, but this could effectively amount to a distributed denial of service on the database (i.e., the bandwidth and resources of the database would be flooded). This solution is therefore not practically recommended, and this is where Apache Sqoop comes to the rescue, allowing users to import data onto HDFS.
Apache Sqoop is a lifesaver for people facing challenges with moving data out of a data warehouse into the Hadoop environment. Sqoop is a SQL-to-Hadoop tool for efficiently importing data from an RDBMS like MySQL, Oracle, etc. directly into HDFS, Hive, or HBase. It can also be used to export data from HDFS back to the RDBMS. Users can import one or more tables, an entire database, or selected columns from a table using Apache Sqoop. Sqoop is compatible with all JDBC-compatible databases.
Apache Sqoop uses Hadoop MapReduce to get data from relational databases and stores it on HDFS. When importing data, Sqoop controls the number of mappers accessing the RDBMS to avoid distributed denial of service. By default, 4 mappers are used at a time; however, this value can be configured. It is suggested not to use a value greater than 4, as this might occupy the entire spool space of the database.
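As a quick illustration (the connection string, credentials, table, and target directory below are hypothetical), the degree of parallelism is controlled with the -m / --num-mappers argument to the import command:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --target-dir /user/projectpro/orders --num-mappers 4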
Apache Sqoop is an open-source tool available in the Hadoop ecosystem. Sqoop is designed for the efficient transfer of bulk data between the Hadoop ecosystem and external non-Hadoop structured datastores. Structured datastores indicate that Sqoop only works with Relational Database Management Systems (RDBMS). Apache Sqoop is used to provide bidirectional data transfer between Hadoop and RDBMS. Sqoop allows importing structured data from external sources into Hadoop and the export of data from Hadoop into an external non-Hadoop database table. In Hadoop, the data can be imported into HDFS (Hadoop Distributed File System), Hive, or HBase. External stores can be either relational databases or enterprise warehouses. Sqoop works with several relational databases, including Oracle, MySQL, Netezza, HSQLDB, Postgres, and Teradata.
Apache Sqoop works on a connector-based architecture, meaning that Sqoop comes with plugins that allow connectivity to external data sources. Sqoop is primarily used for cases where the data transfer has to be quick since Sqoop performs parallel data transfer. Sqoop is equipped with import tools that allow the import of tables or entire databases from an external database into the Hadoop environment and export tools that enable the export of directories from the Hadoop ecosystem onto external non-Hadoop databases. In Sqoop, once the data transfer is complete, the import or export processes terminate.
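For instance, a basic table import and a matching export might look like the following sketch (the connection string, table names, and directories are hypothetical):

# import a single RDBMS table into HDFS
sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table customers --target-dir /user/projectpro/customers

# export the contents of an HDFS directory back into an RDBMS table
sqoop export --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table customers_copy --export-dir /user/projectpro/customers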
ETL is short for Extract, Transform, Load. The purpose of ETL tools is to move data across different systems. Data is collected from various sources and moved into a destination in a different form or context compared to the data present at the source.
Apache Sqoop is one such ETL tool provided in the Hadoop environment. Using Sqoop, data can be imported into Hadoop from external relational databases.
Sqoop provides support for loading the entire database, loading only some of the tables in the database, and incremental loading of the database. Sqoop also provides functionalities to specify specific rows and columns that are to be imported.
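A hedged sketch of such a selective import (the column names, WHERE condition, and connection details are hypothetical) uses the --columns and --where arguments:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --columns "order_id,order_date,amount" --where "order_date >= '2023-01-01'" --target-dir /user/projectpro/orders_subset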
|Sqoop|Flume|
|---|---|
|Used for importing data from structured data sources like RDBMS.|Used for moving bulk streaming data into HDFS.|
|It has a connector-based architecture.|It has an agent-based architecture.|
|Data import in Sqoop is not event driven.|Data load in Flume is event driven.|
|HDFS is the destination for importing data.|Data flows into HDFS through one or more channels.|
Read more on Sqoop vs Flume
Sqoop allows data to be imported using two file formats:

i) Delimited Text File Format - This is the default file format for importing data using Sqoop. This file format can be explicitly specified using the --as-textfile argument to the import command in Sqoop. Passing this argument produces a string-based representation of all the records in the output files, with delimiter characters between rows and columns.

ii) Sequence File Format - It is a binary file format where records are stored in custom record-specific data types which are exposed as Java classes. Sqoop automatically creates these data types and manifests them as Java classes.
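As an illustrative sketch (connection details and directories are hypothetical), the two formats can be requested explicitly with --as-textfile and --as-sequencefile:

# delimited text output (the default, shown explicitly here)
sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --as-textfile --target-dir /user/projectpro/orders_text

# binary SequenceFile output
sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --as-sequencefile --target-dir /user/projectpro/orders_seq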
This can be accomplished using the import-all-tables command in Sqoop and by specifying the --exclude-tables option with it as follows-

sqoop import-all-tables --connect <jdbc-uri> --username <username> --password <password> --exclude-tables Table298,Table123,Table299
Yes, MySQL is the default database.
The Apache Sqoop import command does not support direct import of BLOB and CLOB large objects. To import large objects in Sqoop, JDBC-based imports have to be used without the direct argument to the import utility.
This can be accomplished using the -m 1 option in the Sqoop import command. It will create only one MapReduce task, which will then import the rows serially.
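For example (connection details and table name are hypothetical), a serial import with a single mapper looks like this:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders -m 1 --target-dir /user/projectpro/orders_serial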
Unlike sqoop-list-tables and sqoop-list-databases, there is no direct command like sqoop-list-columns to list all the columns. The indirect way of achieving this is to retrieve the columns of the desired tables and redirect them to a file, which can then be viewed manually to see the column names of the particular table.
sqoop import -m 1 --connect 'jdbc:sqlserver://nameofmyserver;database=nameofmydatabase;username=ProjectPro;password=mypassword' --query "SELECT column_name, DATA_TYPE FROM INFORMATION_SCHEMA.Columns WHERE table_name='mytableofinterest' AND \$CONDITIONS" --target-dir 'mytableofinterest_column_name'
Both distCP (Distributed Copy in Hadoop) and Sqoop transfer data in parallel but the only difference is that distCP command can transfer any kind of data from one Hadoop cluster to another whereas Sqoop transfers data between RDBMS and other components in the Hadoop ecosystem like HBase, Hive, HDFS, etc.
The Sqoop metastore is a shared metadata repository that lets remote users define and execute saved jobs (created using the sqoop job command) stored in the metastore. The sqoop-site.xml file should be configured to connect to the metastore.
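A hedged sketch of working with a shared metastore (the metastore host, job name, and connection details are hypothetical) is to define a saved job with --meta-connect and then execute it:

# define a saved import job in the shared metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --create daily_orders_import -- import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --target-dir /user/projectpro/orders

# run the saved job from any client that can reach the metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec daily_orders_import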
The --split-by clause is used to specify the columns of the table that are used to generate splits for data imports. This clause specifies the columns that will be used for splitting when importing the data into the Hadoop cluster. The --split-by clause helps achieve improved performance through greater parallelism. Apache Sqoop will create splits based on the values present in the columns specified in the --split-by clause of the import command. If the --split-by clause is not specified, then the primary key of the table is used to create the splits during data import. At times the primary key of the table might not have evenly distributed values between the minimum and maximum range. Under such circumstances, the --split-by clause can be used to specify some other column that has an even distribution of data to create splits, so that the data import is efficient.
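For example (the column and connection details are hypothetical), an import split on an evenly distributed column might look like this:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --split-by customer_id -m 4 --target-dir /user/projectpro/orders_split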
Using the --boundary-query clause. Generally, Sqoop uses the SQL query SELECT MIN(split-column), MAX(split-column) FROM table to find out the boundary values for creating splits. However, if this query is not optimal, any arbitrary query that returns two numeric values can be supplied through the --boundary-query argument instead.
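A hedged example (the query, columns, and connection details are hypothetical) of supplying custom boundary values:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --split-by order_id --boundary-query "SELECT MIN(order_id), MAX(order_id) FROM orders WHERE order_date >= '2023-01-01'" --target-dir /user/projectpro/orders_bounded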
If the Hadoop MapReduce cluster is configured to run a maximum of 4 parallel MapReduce tasks, then the Sqoop import can be configured with a number of parallel tasks less than or equal to 4, but not more than 4.
If the imported records have rows that contain null values for all the columns, then those records might have been dropped during import, because HBase does not allow null values in all the columns of a record.
Using the --input-null-string parameter, a default value can be specified so that the row gets inserted with the default value for the column that has a NULL value in HDFS.
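For reference, the related null-handling arguments can be sketched as follows (the table, path, and substitution values are hypothetical): on import, --null-string and --null-non-string control what is written for NULL columns, while on export, --input-null-string and --input-null-non-string tell Sqoop which strings in the HDFS files should be treated as NULL:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table customers --null-string 'N/A' --null-non-string '0' --target-dir /user/projectpro/customers

sqoop export --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table customers_copy --export-dir /user/projectpro/customers --input-null-string 'N/A' --input-null-non-string '0'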
Data can be synchronized using an incremental import of the data.

The --incremental parameter can be used with one of two options-
i) append - If the table is getting updated continuously with new rows and increasing row id values, then incremental import with the append option should be used. The column to be checked is specified using --check-column, and only rows with values greater than the last imported value for that column will be inserted as new rows.
ii) lastmodified - In this kind of incremental import, the source has a date column which is checked. Any records that have been updated after the last import, based on the lastmodified column in the source, are imported with their updated values (see the example commands after this list).
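A hedged example of both modes (the check columns, last values, and connection details are hypothetical):

# append mode: import only rows whose order_id is greater than the last imported value
sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --incremental append --check-column order_id --last-value 20000 --target-dir /user/projectpro/orders

# lastmodified mode: import rows updated after the given timestamp and merge them on the key column
sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --incremental lastmodified --check-column updated_at --last-value "2023-01-01 00:00:00" --merge-key order_id --target-dir /user/projectpro/orders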
16) The command below is used to specify the connect string, which contains the hostname for connecting to MySQL on the local host and the database name test_db –
--connect jdbc:mysql://localhost/test_db
When using Sqoop with a distributed Hadoop cluster, the URL should not be specified with localhost in the connect string because the connect string will be applied on all the DataNodes in the Hadoop cluster. So, if the literal name localhost is mentioned instead of the IP address or the complete hostname, then each node will connect to a different database on its own localhost. It is always suggested to specify a hostname that can be seen by all remote nodes.
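For example (the hostname below is hypothetical), the connect string should use a name that every node can resolve:

--connect jdbc:mysql://db01.internal.mycompany.com/test_db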
1) I have 20,000 records in a table. I want to copy them to two separate files (records equally distributed) in HDFS using Sqoop.
How do we achieve this if the table does not have a primary key or unique key?
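One possible approach, sketched here with hypothetical column and connection names: since there is no primary key, choose a column whose values are fairly evenly distributed as the split column and request two mappers, so that Sqoop produces two output files:

sqoop import --connect jdbc:mysql://dbhost/sales_db --username projectpro -P --table orders --split-by customer_id -m 2 --target-dir /user/projectpro/orders_two_files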