Last Update Made On January 3, 2017.
This article gives you a sneak peek into the HBase interview questions commonly asked during Hadoop job interviews, along with their answers.
“What is the difference between Full Shutdown backup and Live Cluster backup in HBase?”
You stare at the interviewer asking you that question and think, "I should know this!" But at that moment you cannot remember, and you mentally blame yourself for not preparing thoroughly for your Hadoop job interview.
This is just a hypothetical case, and if you prepare well with the DeZyre Hadoop Interview Questions blogs, you will be able to answer any HBase interview question during your next Hadoop job interview. Any company looking to hire a Hadoop Developer wants Hadoopers who can code well, beyond the basic Hadoop MapReduce concepts. Hadoop interviewers ask questions to validate an interviewee's deep knowledge of the entire Hadoop ecosystem. While we suggest that you learn Hadoop through comprehensive, hands-on, project-based Hadoop training before facing a Hadoop job interview, we also help you prepare for your next Hadoop job interview through a series of blogs.
Before we dive into the HBase interview questions, here is an overview of what HBase is and its features.
HBase, commonly referred to as the "Hadoop Database", is a column-oriented database based on the principles of Google Bigtable. HBase does not directly use the capabilities of Hadoop MapReduce, but it can integrate with Hadoop to act as a source or destination for Hadoop MapReduce jobs. HBase provides real-time read and write access to data in HDFS. Data can be stored in HDFS directly or through HBase. Just as HDFS has a NameNode and DataNodes, and Hadoop MapReduce has a JobTracker and TaskTrackers, HBase has a Master node and Region Servers. The Master node manages the cluster, while the Region Servers store portions of the HBase tables and perform data model operations.
An HBase system consists of tables with rows and columns, much like a traditional RDBMS. Every table must have a row key (the HBase equivalent of a primary key), which is used to access the data in the table. HBase columns define the attributes of an object. For instance, if your HBase table stores web server logs, then each row in the table will be a log record, and the columns can be the name of the server where the log originated, the time when the log was written, and so on. Several attributes can be grouped together in HBase to form column families. All the elements of a single column family are stored together. Column families must be specified when defining the table schema, but HBase is flexible enough that new columns can be added to a column family at any time based on application requirements.
1) Compare RDBMS with HBase
|Feature||RDBMS||HBase|
|Schema||Has a fixed schema||No fixed schema|
|Query Language||Supports a powerful structured query language (SQL)||Simple query language|
|Transaction Processing||Supports ACID transactions||Strongly consistent with atomic single-row operations, but does not support multi-row ACID transactions|
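2) What do you understand by the CAP theorem, and which features of the CAP theorem does HBase follow?
CAP stands for Consistency, Availability and Partition Tolerance. The theorem states that a distributed system can guarantee at most two of these three properties at the same time. HBase is generally classified as a CP system: it favors consistency and partition tolerance over availability.
3) Name a few other popular column-oriented databases like HBase.
Cassandra, Apache Accumulo and Hypertable. CouchDB and MongoDB are often mentioned alongside these, but they are document-oriented NoSQL databases rather than column-oriented ones.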
4) What do you understand by Filters in HBase?
HBase filters make working with large tables more efficient by letting users add limiting selectors to a query, so that data that is not required is eliminated. Filters have access to the complete row to which they are applied. HBase has 18 filters.
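The idea of a filter as a limiting selector can be illustrated without a cluster. The sketch below is a plain-Python toy model, not the HBase client API: `scan`, `row_prefix_filter` and `column_value_filter` are hypothetical names, and the table is an in-memory dict holding web server log rows. It assumes, as stated above, that each filter sees the complete row.

```python
# Toy model of filtering during a scan; hypothetical names, not the HBase API.

def row_prefix_filter(prefix):
    """Keep rows whose key starts with `prefix`."""
    return lambda row_key, columns: row_key.startswith(prefix)

def column_value_filter(column, value):
    """Keep rows where `column` exists and equals `value`."""
    return lambda row_key, columns: columns.get(column) == value

def scan(table, *filters):
    """Yield (row_key, columns) in key order for rows that pass every filter."""
    for row_key in sorted(table):
        columns = table[row_key]
        if all(f(row_key, columns) for f in filters):
            yield row_key, columns

# Assumed sample data: row key is "<server>-<sequence>".
logs = {
    "web01-0001": {"log:status": "200", "log:path": "/index.html"},
    "web01-0002": {"log:status": "500", "log:path": "/cart"},
    "web02-0001": {"log:status": "500", "log:path": "/login"},
}

errors_on_web01 = list(scan(logs,
                            row_prefix_filter("web01-"),
                            column_value_filter("log:status", "500")))
print(errors_on_web01)
```

In HBase itself the comparable built-in filters (for example `PrefixFilter` and `SingleColumnValueFilter`) are evaluated on the Region Servers, so rows that do not match are never sent to the client.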
5) Explain the data model operations in HBase.
Put Method – To store data in HBase.
Get Method – To retrieve data stored in HBase.
Delete Method – To delete data from HBase tables.
Scan Method – To iterate over the data across larger key ranges or the entire table.
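The four operations can be sketched with a small in-memory stand-in. `ToyTable` below is a hypothetical class written for illustration, not the HBase client API; it only assumes that rows are kept ordered by row key and that a scan covers a key range with an exclusive stop row.

```python
import bisect

class ToyTable:
    """In-memory stand-in for an HBase table; hypothetical, not the client API."""

    def __init__(self):
        self._rows = {}   # row key -> {"family:qualifier": value}
        self._keys = []   # row keys kept sorted, mirroring ordered row storage

    def put(self, row, column, value):
        """Store a value, creating the row if it does not exist yet."""
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row):
        """Return a copy of the row's columns, or an empty dict if absent."""
        return dict(self._rows.get(row, {}))

    def delete(self, row):
        """Remove the whole row if it exists."""
        if row in self._rows:
            del self._rows[row]
            self._keys.remove(row)

    def scan(self, start_row="", stop_row=None):
        """Yield rows with start_row <= key < stop_row, in key order."""
        i = bisect.bisect_left(self._keys, start_row)
        while i < len(self._keys):
            key = self._keys[i]
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self._rows[key])
            i += 1

table = ToyTable()
table.put("student-002", "info:name", "Ravi")
table.put("student-001", "info:name", "Asha")
table.put("student-003", "info:name", "Lena")
table.delete("student-003")

print(table.get("student-001"))
print([key for key, _ in table.scan("student-001", "student-003")])
```

Because the keys are kept sorted, a scan only has to find the start position once and then walk forward, which is why range scans over adjacent row keys are cheap compared with many individual gets.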
6) How will you back up an HBase cluster?
HBase cluster backups are performed in 2 ways-
Live Cluster Backup
Full Shutdown Backup
In the live cluster backup strategy, the CopyTable utility is used to copy the data from one table to another on the same cluster or on another cluster. The Export utility can also be used to dump the contents of a table onto HDFS on the same cluster.
In the full shutdown backup approach, the HBase cluster is periodically shut down completely so that the Master and Region Servers go down and there is hardly any chance of missing in-flight changes to metadata or StoreFiles. The HBase directory in HDFS can then be copied, for example with distcp. However, this approach can be used only for back-end analytic workloads and not for applications that serve front-end web pages.
7) Does HBase support SQL-like syntax?
HBase does not natively support SQL-like syntax. However, with Apache Phoenix, users can retrieve data from HBase through SQL queries.
8) Is it possible to iterate through the rows of an HBase table in reverse order?
Column values are put on disk with the length of the value written first, followed by the actual value. To iterate through these values in reverse order, the length would have to be stored twice (once more after the value), which the on-disk format does not do. Since HBase 0.98, however, the client API supports reversed scans through Scan.setReversed(true).
9) Should the region server be located on all DataNodes?
Yes. Region Servers run on the same servers as DataNodes.
10) Suppose that your data is stored in collections, for instance some binary data, message data or metadata is all keyed on the same value. Will you use HBase for this?
Yes, it is ideal to use HBase whenever key based access to data is required for storing and retrieving.
11) Assume that an HBase table Student is disabled. How can the Student table be accessed using the Scan command once it is disabled?
An HBase table that is disabled cannot be accessed using the Scan command. It has to be enabled again first.
12) What do you understand by compaction?
During periods of heavy incoming writes, it is not possible to keep a single file per store: each store accumulates several HFiles, and every read may have to consult all of them. HBase therefore combines these HFiles into fewer, larger files to reduce the number of disk seeks needed for every read. This process is referred to as compaction in HBase.
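The merging step can be sketched as a k-way merge of sorted files. The code below is a simplified illustration, not HBase's actual compaction code: `compact` is a hypothetical function, each "HFile" is assumed to be a list of `(row_key, timestamp, value)` tuples sorted by row key and then newest first, and only the newest version of each row is kept. Delete markers, multiple retained versions and TTLs are ignored.

```python
import heapq

def compact(hfiles):
    """Merge sorted HFile-like lists into one sorted list, newest version wins.

    Each input must be sorted by row key, then by descending timestamp.
    """
    # Merge key: row ascending, timestamp descending, so the newest cell
    # for a row is always the first one seen for that row.
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result, last_row = [], None
    for row_key, timestamp, value in merged:
        if row_key != last_row:
            result.append((row_key, timestamp, value))
            last_row = row_key
    return result

# Assumed sample data: three small files produced by successive flushes.
hfile_a = [("row1", 1, "a"), ("row3", 1, "c")]
hfile_b = [("row1", 2, "a2"), ("row2", 2, "b")]
hfile_c = [("row3", 3, "c3")]

compacted = compact([hfile_a, hfile_b, hfile_c])
print(compacted)
```

Before the merge, a read for `row3` would have to look in up to three files. After it, a single sorted file holds the answer.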
13) Explain the various table design approaches in HBase.
Tall-narrow and flat-wide are the two HBase table design approaches. Which one to use depends on what you want to achieve and how you want to use the data. The performance of HBase depends heavily on the RowKey and hence directly on how the data is accessed.
At a high level, the major difference between the flat-wide and tall-narrow approaches is similar to the difference between get and scan. Full table scans are costly in HBase, but because rows are stored in RowKey order, a scan over a contiguous range of keys is cheap. The tall-narrow approach can be used when there is a complex RowKey, so that focused scans can be performed on a logical group of entries.
Ideally, the tall-narrow approach is used when there are a large number of rows and a small number of columns, whereas the flat-wide approach is used when there are a small number of rows and a large number of columns.
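The contrast is easier to see with the same data laid out both ways. The sketch below is a toy model using Python dicts, taking flat-wide to mean few rows with many columns and tall-narrow to mean many rows with few columns. The message data, the `user-msgid` key format and the `prefix_scan` helper are all assumptions made for illustration.

```python
# Hypothetical data: (user, message id, body). Key formats are illustrative only.
messages = [("alice", 1, "hi"), ("alice", 2, "lunch?"), ("bob", 1, "report due")]

# Flat-wide: one row per user, one column per message.
flat_wide = {}
for user, msg_id, body in messages:
    flat_wide.setdefault(user, {})[f"msg:{msg_id:05d}"] = body

# Tall-narrow: one row per message, with a composite row key "user-msgid".
tall_narrow = {f"{user}-{msg_id:05d}": {"msg:body": body}
               for user, msg_id, body in messages}

def prefix_scan(table, prefix):
    """Focused scan: return rows whose key starts with `prefix`, in key order."""
    return [(key, table[key]) for key in sorted(table) if key.startswith(prefix)]

print(len(flat_wide), "rows in flat-wide;", len(tall_narrow), "rows in tall-narrow")
print(flat_wide["alice"])                   # one get returns every message
print(prefix_scan(tall_narrow, "alice-"))   # one focused scan returns the same group
```

In the flat-wide layout a single get fetches everything for a user, while in the tall-narrow layout the composite key lets a scan over the `alice-` prefix retrieve the same logical group row by row.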
14) Which HBase table design approach would you recommend: tall-narrow or flat-wide?
There are several factors to be considered when deciding between flat-wide (millions of columns and limited keys) and tall-narrow (millions of keys with limited columns), however, a tall-narrow approach is often recommended because of the following reasons –
15) What is the best practice for deciding the number of column families for an HBase table?
Ideally, an HBase table should not have more than 15 column families, and fewer is better. Every column family is stored separately in its own set of files, so with a large number of column families many files have to be read and merged.
16) How will you implement joins in HBase?
HBase does not support joins directly, but join queries that retrieve data from several HBase tables can be implemented using MapReduce jobs.
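One common way to do this is a reduce-side join: mappers tag each record with its source table and emit it under the join key, and reducers combine the records that share a key. The sketch below simulates the three phases in plain Python rather than on a real cluster. The employee and department rows, the column names and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows as they might be read from two HBase tables.
employees = [("e1", {"name": "Asha", "dept_id": "d1"}),
             ("e2", {"name": "Ravi", "dept_id": "d2"}),
             ("e3", {"name": "Lena", "dept_id": "d1"})]
departments = [("d1", {"dept_name": "Engineering"}),
               ("d2", {"dept_name": "Finance"})]

def map_phase():
    """Emit (join_key, (source_tag, value)) pairs from both tables."""
    for _row_key, cols in employees:
        yield cols["dept_id"], ("emp", cols["name"])
    for row_key, cols in departments:
        yield row_key, ("dept", cols["dept_name"])

def shuffle(pairs):
    """Group mapper output by join key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Pair every employee with the department that shares its join key."""
    for key in sorted(groups):
        dept_names = [v for tag, v in groups[key] if tag == "dept"]
        emp_names = [v for tag, v in groups[key] if tag == "emp"]
        for dept in dept_names:
            for emp in emp_names:
                yield emp, dept

joined = list(reduce_phase(shuffle(map_phase())))
print(joined)
```

The source tag is what lets the reducer tell the two tables apart once their records have been mixed together under the same key.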
17) What is the difference between HBase and HDFS?
HDFS is the distributed file system in Hadoop for storing large files, but it does not provide a tabular form of storage. Within a cluster, HDFS plays a role similar to that of a local file system (NTFS or FAT) on a single machine. Data in HDFS is typically accessed through MapReduce jobs, and HDFS is well suited for high-latency batch processing operations.
HBase is a column-oriented database that runs on top of HDFS and stores data in a tabular format. HBase is like a database management system that communicates with HDFS to write logical tabular data to the physical file system. With HBase, one can access single rows out of billions of records, which makes it well suited for low-latency operations. HBase puts data in indexed StoreFiles on HDFS for high-speed lookups.
1) How will you design the HBase Schema for Twitter data?
2) You want to fetch data from HBase to create a REST API. Which is the better way to read the HBase data: a Spark job or a Java program?
3) Design an HBase table for a many-to-many relationship between two entities, for example employee and department.
4) Explain an example that demonstrates good de-normalization in HBase with consistency.
5) Should HBase and MapReduce run on the same cluster, or should they run on separate clusters?
If you have been asked any other HBase interview questions in your Hadoop job interview, feel free to share them in the comments below.