HBase - The Database for Hadoop

HBase is a distributed, column-oriented key-value database that runs on top of HDFS and provides real-time random read/write access to large datasets. Modelled on Google's Bigtable, it can host large, sparsely populated tables on Hadoop clusters built from commodity hardware. HBase is not a relational database, nor does it support SQL.

Understanding HBase

  • In HBase, data is stored in rows, and each row has a RowKey. The RowKey resembles the primary key in a traditional RDBMS and is the pointer to the actual data. Row keys in HBase are byte arrays, so compound row keys can be created easily by merging different criteria into a single key, which helps optimize data access. A query on the RowKey returns the contents of the row, or all of its columns in a columnar view. HBase uses the RowKey to determine the sort order of a table's rows and to guide data sharding (the process by which data is distributed across the cluster).
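Because row keys are raw byte arrays sorted lexicographically, a compound key can be built by simply concatenating its parts. The sketch below is plain Python rather than the HBase API, and the `user_id`/`timestamp` fields are invented; it shows why a fixed-width, big-endian encoding keeps the byte-wise sort order meaningful:

```python
import struct

def compound_row_key(user_id: str, timestamp: int) -> bytes:
    """Build a compound row key: fixed-width user id + big-endian timestamp.

    Big-endian encoding makes the byte-wise (lexicographic) order of the
    key match the numeric order of the timestamp, which is exactly how
    HBase sorts and scans rows.
    """
    return user_id.encode("utf-8").ljust(8, b"\x00") + struct.pack(">Q", timestamp)

keys = [
    compound_row_key("alice", 1700000002),
    compound_row_key("alice", 1700000001),
    compound_row_key("bob", 1600000000),
]

# Sorting the raw bytes groups all of alice's rows together, ordered by
# timestamp -- the scan order HBase would use for these keys.
for key in sorted(keys):
    print(key[:8].rstrip(b"\x00").decode(), struct.unpack(">Q", key[8:])[0])
# alice 1700000001
# alice 1700000002
# bob 1600000000
```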

How HBase Works

Applications store data in labelled tables made of rows and columns. Each cell (the intersection of a row and a column) is versioned, and version timestamps are assigned automatically when a cell is written. Row columns are grouped into column families that share a common prefix: for example, the columns details:email and details:password both belong to the details column family. Column families must be defined upfront as part of the table schema, but new columns within a family are easy to add as and when required. HBase partitions tables horizontally into regions, where each region holds a subset of the table's rows: a region's first row is included and its last row is excluded, so that excluded row becomes the first row of the next region, and so on. All of a table's data initially lives in a single region, but once the region grows past a configurable size threshold it is split in two, and regions are spread across the cluster's region servers.
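The half-open row ranges described above can be sketched as a sorted list of region start keys. This is plain Python illustrating the partitioning idea, not HBase internals, and the split points are made up:

```python
import bisect

# Each region covers [start_key, next_start_key): a region's first row is
# included and the next region's start row is excluded.
region_starts = [b"", b"g", b"p"]  # hypothetical split points; b"" marks the first region

def region_for(row_key: bytes) -> int:
    """Return the index of the region whose [start, end) range holds row_key."""
    return bisect.bisect_right(region_starts, row_key) - 1

print(region_for(b"apple"))  # 0 -> range [b"", b"g")
print(region_for(b"g"))      # 1 -> a region's start row is inclusive
print(region_for(b"zebra"))  # 2 -> range [b"p", end of table)
```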


HBase provides strong consistency for reads and writes, which sets it apart from many other NoSQL databases. HBase uses a master node to manage the region servers that distribute and serve different parts of the data tables: the master orchestrates a cluster of one or more region servers as workers. HBase depends on ZooKeeper for coordination, and by default it manages a ZooKeeper instance itself.
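In a pseudo-distributed deployment, these dependencies come together in hbase-site.xml. A minimal sketch, where the hostname, port, and path are placeholders rather than values from the text above:

```xml
<configuration>
  <!-- Where HBase persists its data on HDFS (placeholder namenode address). -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
  <!-- Run the daemons as a distributed cluster rather than standalone. -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- The ZooKeeper ensemble HBase coordinates through. -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
```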

Features of HBase

  • Tables and column families are defined in advance, but new columns are easy to add on the fly.
  • Provides high scalability, reliability and schema flexibility.
  • HBase has built-in versioning, plus coprocessors that act like the triggers and stored procedures of SQL databases.
  • HBase has no notion of data types; all data is stored as byte arrays.
  • Offers strong row-level consistency.
  • For non-Java front ends, HBase supports Thrift and REST APIs, and for programmatic access it provides an easy-to-use Java API.
  • Speeds up high-volume queries through Bloom filters and a block cache.
When Should You Use HBase?

It is ideal to use HBase only when there are millions or billions of rows and columns in a table.
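The built-in versioning listed above can be modelled as a map from a (row, column) coordinate to a timestamp-ordered list of values. This is a toy Python sketch, not the HBase API, with invented row and column names:

```python
# Toy model of HBase cell versioning: each (row, column) coordinate keeps
# several timestamped values, and a read returns the newest by default.
cells = {}  # (row_key, column) -> list of (timestamp, value), newest first

def put(row, column, value, timestamp):
    versions = cells.setdefault((row, column), [])
    versions.append((timestamp, value))
    versions.sort(reverse=True)  # keep the newest version first

def get(row, column, max_versions=1):
    """Return up to max_versions values, newest first (HBase defaults to 1)."""
    return [value for _, value in cells.get((row, column), [])[:max_versions]]

put("row1", "details:email", "old@example.com", timestamp=1)
put("row1", "details:email", "new@example.com", timestamp=2)
print(get("row1", "details:email"))                  # ['new@example.com']
print(get("row1", "details:email", max_versions=2))  # ['new@example.com', 'old@example.com']
```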

HBase Blogs

NoSQL vs SQL - 4 Reasons Why NoSQL is Better for Big Data Applications
HBase, a popular NoSQL database for Hadoop, is used extensively by Facebook for its messaging infrastructure. Twitter uses HBase for generating, storing, logging, and monitoring data around people search, and the discovery engine StumbleUpon uses it for data analytics and storage. Click to read more.
Innovation in Big Data Technologies aides Hadoop Adoption
To provide timely search results across the Internet, Google has to cache the web. This paved the way for a novel technology that could search the huge cache quickly: a distributed storage system for managing structured data that could scale to petabytes across thousands of commodity servers. Mike Cafarella later released open-source code for a Bigtable implementation, which became popularly known as the Hadoop Database (HBase). Click to read more.
Hadoop Components and Architecture: Big Data and Hadoop Training
HBase is a column-oriented database that uses HDFS for underlying storage. HBase supports random reads as well as batch computations using MapReduce. With the HBase NoSQL database, enterprises can create large tables with millions of rows and columns on commodity hardware. Click to read more.

HBase Tutorials

Hadoop HBase Tutorial
HBase runs on top of HDFS to give Hadoop Bigtable-like capabilities: a fault-tolerant way of storing massive, sparse datasets for big data use cases. HBase is sensitive to the loss of its master node, and unlike relational data stores it does not support a structured query language. Click to read more.

HBase Interview Questions

  1. When should you use HBase and what are the key components of HBase?

    • HBase should be used when the big data application has -
      1. A variable schema
      2. Data stored in the form of collections
      3. A need for key-based access to data during retrieval.
      Read more
  2. What are the different operational commands in HBase at record level and table level?

    • Record-level operational commands in HBase are put, get, increment, scan and delete. Table-level operational commands are describe, list, drop, disable and enable. Read more.
  3. What is Row Key?

    • Every row in an HBase table has a unique identifier known as the RowKey. It groups cells logically and ensures that all cells with the same RowKey are co-located on the same server. Internally, a RowKey is just a byte array. Read more.
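Because the RowKey fixes the physical sort order, range scans between a start key and a stop key are cheap. Below is a toy sketch of scan semantics in plain Python; the row keys are invented, and real code would use the HBase scan command or client API:

```python
# Toy model of an HBase scan: rows are kept in RowKey order, and a scan
# returns the rows in [start_row, stop_row) -- the stop row is excluded.
table = {
    b"user#001": {"details:name": "A"},
    b"user#002": {"details:name": "B"},
    b"user#003": {"details:name": "C"},
}

def scan(start_row: bytes, stop_row: bytes):
    """Yield (row_key, columns) pairs for rows in [start_row, stop_row)."""
    for row_key in sorted(table):
        if start_row <= row_key < stop_row:
            yield row_key, table[row_key]

print([k for k, _ in scan(b"user#001", b"user#003")])  # [b'user#001', b'user#002']
```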

HBase Q&A

  1. How to run HBase on Amazon AWS?

    • How can I get to the HBase shell after connecting as ec2-user?
    • Is there a GUI available for HBase? Click to read answer
  2. How can we fetch the data for a given condition in HBase?

    • In HBase I have created data for a table called 'student':
    • put 'student', 'first', 'details:name', 'prashanth'
      put 'student', 'first', 'details:age', '39'
      put 'student', 'first', 'details:gender', 'M'
      put 'student', 'second', 'details:name', 'Mary'
      put 'student', 'second', 'details:age', '35'
    • How can I fetch the data for 'Mary' only, or for anybody whose age is less than '38'? Click to read answer
  3. HBase not able to start

    • Already set the environment variables in the hbase-env.sh file under /hbase/conf/
    • Please uncomment the line:
    • #export JAVA_HOME=/usr/lib/-----------------------------------------
    • Enter the new line as:
    • export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386 Click to read answer
  4. HBase and Hue not opening

    • I get this error when I try to open Hue or HBase on Cloudera; does anyone know why this is? Click to read answer
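The student-table question above comes down to filtering on a column value; in real HBase this is done server-side with a filter such as SingleColumnValueFilter. The toy Python sketch below only illustrates the logic, assuming the second row's '35' was meant for details:age, and shows why the stored bytes must be decoded to numbers before an age comparison:

```python
# Sample rows from the question, with every value stored as bytes, as
# HBase does (assuming '35' was meant for details:age, not details:name).
student = {
    "first":  {"details:name": b"prashanth", "details:age": b"39", "details:gender": b"M"},
    "second": {"details:name": b"Mary", "details:age": b"35"},
}

def rows_with_age_below(limit: int):
    """Decode the stored bytes to an integer before comparing: comparing
    the raw bytes would be lexicographic, not numeric."""
    return [row for row, cols in student.items()
            if "details:age" in cols and int(cols["details:age"].decode()) < limit]

print(rows_with_age_below(38))  # ['second']
```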

HBase Assignments

Install HBase on a single-node cluster.
