Pinterest runs 38 different HBase clusters with some of them doing up to 5 million operations every second. Goibibo uses HBase for customer profiling. Facebook Messenger uses HBase architecture and many other companies like Flurry, Adobe Explorys use HBase in production. In spite of a few rough edges, HBase has become a shining sensation within the white hot Hadoop market. The NOSQL column oriented database has experienced incredible popularity in the last few years. You might have come across several resources that explain HBase architecture and guide you through HBase installation process. However, this blog post focuses on the need for HBase, which data structure is used in HBase, data model and the high level functioning of the components in the apache HBase architecture.
Need for HBase
Apache Hadoop has gained popularity in the big data space for storing, managing and processing big data as it can handle high volume of multi-structured data. However, Hadoop cannot handle high velocity of random writes and reads and also cannot change a file without completely rewriting it. HBase is a NoSQL, column oriented database built on top of hadoop to overcome the drawbacks of HDFS as it allows fast random writes and reads in an optimized way. Also, with exponentially growing data, relational databases cannot handle the variety of data to render better performance. HBase provides scalability and partitioning for efficient storage and retrieval.
“Anybody who wants to keep data within an HDFS environment and wants to do anything other than brute-force reading of the entire file system [with MapReduce] needs to look at HBase. If you need random access, you have to have HBase."- said Gartner analyst Merv Adrian.
If you would like more information about Big Data careers, please click the orange "Request Info" button on top of this page.
HBase –Understanding the Basics
HBase is a data model similar to Google’s big table that is designed to provide random access to high volume of structured or unstructured data. HBase is an important component of the Hadoop ecosystem that leverages the fault tolerance feature of HDFS. HBase provides real-time read or write access to data in HDFS. HBase can be referred to as a data store instead of a database as it misses out on some important features of traditional RDBMs like typed columns, triggers, advanced query languages and secondary indexes.
For the complete list of big data companies and their salaries- CLICK HERE
HBase Data Model
HBase data model stores semi-structured data having different data types, varying column size and field size. The layout of HBase data model eases data partitioning and distribution across the cluster. HBase data model consists of several logical components- row key, column family, table name, timestamp, etc. Row Key is used to uniquely identify the rows in HBase tables. Column families in HBase are static whereas the columns, by themselves, are dynamic.
- HBase Tables – Logical collection of rows stored in individual partitions known as Regions.
- HBase Row – Instance of data in a table.
- RowKey -Every entry in an HBase table is identified and indexed by a RowKey.
- Columns - For every RowKey an unlimited number of attributes can be stored.
- Column Family – Data in rows is grouped together as column families and all columns are stored together in a low level storage file known as HFile.
HBase Use Cases- When to use HBase
HBase is an ideal platform with ACID compliance properties making it a perfect choice for high-scale, real-time applications. It does not require a fixed schema, so developers have the provision to add new data as and when required without having to conform to a predefined model.
It provides users with database like access to Hadoop-scale storage, so developers can perform read or write on subset of data efficiently, without having to scan through the complete dataset. HBase is the best choice as a NoSQL database, when your application already has a hadoop cluster running with huge amount of data. HBase helps perform fast read/writes.
HBase Architecture Explained
HBase provides low-latency random reads and writes on top of HDFS. In HBase, tables are dynamically distributed by the system whenever they become too large to handle (Auto Sharding). The simplest and foundational unit of horizontal scalability in HBase is a Region. A continuous, sorted set of rows that are stored together is referred to as a region (subset of table data). HBase architecture has a single HBase master node (HMaster) and several slaves i.e. region servers. Each region server (slave) serves a set of regions, and a region can be served only by a single region server. Whenever a client sends a write request, HMaster receives the request and forwards it to the corresponding region server.
Image Credit : Cloudera
HBase can be run in a multiple master setup, wherein there is only single active master at a time. HBase tables are partitioned into multiple regions with every region storing multiple table’s rows.
Components of Apache HBase Architecture
HBase architecture has 3 important components- HMaster, Region Server and ZooKeeper.
HBase HMaster is a lightweight process that assigns regions to region servers in the Hadoop cluster for load balancing. Responsibilities of HMaster –
- Manages and Monitors the Hadoop Cluster
- Performs Administration (Interface for creating, updating and deleting tables.)
- Controlling the failover
- DDL operations are handled by the HMaster
- Whenever a client wants to change the schema and change any of the metadata operations, HMaster is responsible for all these operations.
These are the worker nodes which handle read, write, update, and delete requests from clients. Region Server process, runs on every node in the hadoop cluster. Region Server runs on HDFS DataNode and consists of the following components –
- Block Cache – This is the read cache. Most frequently read data is stored in the read cache and whenever the block cache is full, recently used data is evicted.
- MemStore- This is the write cache and stores new data that is not yet written to the disk. Every column family in a region has a MemStore.
- Write Ahead Log (WAL) is a file that stores new data that is not persisted to permanent storage.
- HFile is the actual storage file that stores the rows as sorted key values on a disk.
HBase uses ZooKeeper as a distributed coordination service for region assignments and to recover any region server crashes by loading them onto other region servers that are functioning. ZooKeeper is a centralized monitoring server that maintains configuration information and provides distributed synchronization. Whenever a client wants to communicate with regions, they have to approach Zookeeper first. HMaster and Region servers are registered with ZooKeeper service, client needs to access ZooKeeper quorum in order to connect with region servers and HMaster. In case of node failure within an HBase cluster, ZKquoram will trigger error messages and start repairing failed nodes.
ZooKeeper service keeps track of all the region servers that are there in an HBase cluster- tracking information about how many region servers are there and which region servers are holding which DataNode. HMaster contacts ZooKeeper to get the details of region servers. Various services that Zookeeper provides include –
- Establishing client communication with region servers.
- Tracking server failure and network partitions.
- Maintain Configuration Information
- Provides ephemeral nodes, which represent different region servers.
Understanding the fundamental of HBase architecture is easy but running HBase on top of HDFS in production is challenging when it comes to monitoring compactions, row key designs manual splitting, etc. If you would like to learn how to design a proper schema, derive query patterns and achieve high throughput with low latency then enrol now for comprehensive hands-on Hadoop Training.