Hadoop Zookeeper Tutorial for Beginners

Apache Zookeeper Tutorial

What is Apache Zookeeper?

Apache ZooKeeper is a coordination service for distributed applications that enables synchronization across a cluster. In Hadoop, ZooKeeper can be viewed as a centralized repository where distributed applications can store and retrieve data. It keeps a distributed system functioning as a single unit by providing synchronization, serialization, and coordination. For simplicity's sake, ZooKeeper can be thought of as a file system in which znodes, rather than files and directories, store the data. Within the Hadoop ecosystem, administrators use ZooKeeper to coordinate the services running in the cluster.

Introduction to Apache Zookeeper

Formally, Apache ZooKeeper is a distributed, open-source configuration and synchronization service that also provides a naming registry for distributed applications. It is used to manage and coordinate large clusters of machines. For example, Apache Storm, which has been used at Twitter, relies on ZooKeeper to hold machine state data and to coordinate the machines in its cluster.

Why do we need ZooKeeper in Hadoop?

Distributed applications are difficult to coordinate and work with, and they are more error prone because of the large number of machines attached to the network. With many machines involved, race conditions and deadlocks are common problems. A race condition occurs when two or more machines try to operate on the same shared resource at the same time, so that the outcome depends on the order in which the operations happen to run. ZooKeeper's serialization property takes care of this by applying updates in a single, well-defined order. A deadlock occurs when two or more machines each hold a resource that another one needs: none of them releases its own resource, each waits for the others, and the system locks up. Synchronization in ZooKeeper helps to avoid deadlock. Another major issue with distributed applications is partial failure of a process, which can leave data inconsistent. ZooKeeper handles this through atomicity, which means that either the whole operation completes or nothing persists after a failure. ZooKeeper thus takes care of these small but important issues so that developers can focus on the functionality of the application.

Zookeeper Services

How Does ZooKeeper in Hadoop Work?

ZooKeeper is a distributed application that follows a simple client-server model, in which clients are nodes that make use of the service and servers are nodes that provide it. The server nodes are collectively called the ZooKeeper ensemble. At any given time, each ZooKeeper client is connected to one ZooKeeper server. A master node (the leader, in ZooKeeper's own terminology) is chosen dynamically by consensus within the ensemble, which is why an ensemble usually has an odd number of servers: a clear majority vote is then always possible. If the master node fails, another master is elected quickly and takes over from the previous one. Besides the master and the slaves (followers), ZooKeeper also has observers, which were introduced to address scaling. Adding more slaves hurts write performance because the voting process is expensive. Observers are therefore slaves that do not take part in voting but otherwise have similar duties to the other slaves.
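
The majority rule behind the preference for odd ensemble sizes can be sketched with a little arithmetic. The function names below are made up for this illustration and are not part of ZooKeeper; the sketch only assumes that a strict majority of voting servers must be available.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of voting servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must contain at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many voting servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print the majority size and failure tolerance for small ensembles.
    for n in range(1, 8):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule, growing an ensemble from three servers to four does not let it survive any additional failure, which is one reason odd sizes are the usual choice.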

Writes in Zookeeper

All writes in ZooKeeper go through the master node, so they are guaranteed to be applied sequentially. When a client performs a write, the master passes it on to the other servers, which persist the data along with the master, and the write is committed once a majority of the servers have persisted it. This keeps all the servers up to date, but it also means that writes cannot be made concurrently. The guarantee of linear writes can become a problem if ZooKeeper is used for a write-dominant workload. ZooKeeper in Hadoop is ideally used for coordinating message exchanges between clients, which involves few writes and many reads. It is helpful as long as the data is shared and mostly read, but if an application writes data concurrently, ZooKeeper can get in its way by imposing a strict ordering of operations.
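
The effect of routing every write through one node can be illustrated with a toy model. This is not ZooKeeper's protocol or client API: ToyLeader and run_concurrent_writers are hypothetical names, and a lock stands in for the master's ordering of requests.

```python
import threading


class ToyLeader:
    """Toy stand-in for the master: it alone orders all writes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next_id = 1
        self.log = []  # (sequence id, path, value) in commit order

    def write(self, path, value):
        # The lock makes concurrent callers take turns, so every write
        # receives a unique, strictly increasing sequence id.
        with self._lock:
            seq = self._next_id
            self._next_id += 1
            self.log.append((seq, path, value))
            return seq


def run_concurrent_writers(leader, writers=4, writes_each=50):
    """Start several client threads that all write through one leader."""
    def client(name):
        for i in range(writes_each):
            leader.write(f"/demo/{name}", i)

    threads = [threading.Thread(target=client, args=(f"client-{n}",))
               for n in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    leader = ToyLeader()
    run_concurrent_writers(leader)
    ids = [seq for seq, _, _ in leader.log]
    # The log holds one gap-free sequence, however the threads interleaved.
    print(ids == list(range(1, len(ids) + 1)))
```

The clients run in parallel, but their writes end up in one totally ordered log, which is the trade-off described above: ordering is simple to reason about, while write throughput is limited by the single ordering point.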

Reads in Zookeeper

ZooKeeper is at its best with reads, because reads can be concurrent. Each client is attached to a server, possibly a different one from other clients, and all clients can read from their servers simultaneously. Because the master is not involved in reads, however, concurrent reads are only eventually consistent: a client may occasionally see an outdated view, which is brought up to date after a short delay.
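
A stale read of this kind can be reproduced in a small simulation. ToyServer and its methods are hypothetical names for this sketch, and replication is reduced to an explicit catch_up call; a real ensemble replicates continuously.

```python
class ToyServer:
    """Toy replica that applies the leader's updates only when told to."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of leader log entries applied so far

    def catch_up(self, log):
        # Apply any log entries this replica has not seen yet, in order.
        for path, value in log[self.applied:]:
            self.data[path] = value
        self.applied = len(log)

    def read(self, path):
        # Reads are answered locally, without contacting the leader.
        return self.data.get(path)


if __name__ == "__main__":
    log = []                  # the leader's ordered write log
    follower = ToyServer()

    log.append(("/config/mode", "v1"))
    follower.catch_up(log)

    log.append(("/config/mode", "v2"))    # written, not yet replicated here
    print(follower.read("/config/mode"))  # the replica still has the old value
    follower.catch_up(log)
    print(follower.read("/config/mode"))  # the replica has caught up
```

The read between the second write and the second catch_up returns the older value, which corresponds to the outdated view a client may briefly see.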

How to Use Apache ZooKeeper to Build Distributed Apps?

ZooKeeper takes care of all the details described above, and the user does not have to do anything: the master is elected, the observers are set up, and the stage is ready for the user to start using ZooKeeper.

As noted earlier, ZooKeeper can be used like a file system in which directories are created and data is stored inside them. These directories can have children and grandchildren, as in any other file system. The file system is stored centrally, so it can be accessed from anywhere. This hierarchy is ZooKeeper's data model, and each directory in it is called a znode. Znodes are containers for data and for other znodes. Each znode stores stat data, such as version details, and up to 1 MB of user data. This small limit makes it clear that ZooKeeper is not meant to store data the way a database does. Instead, it is meant for small amounts of data, such as configuration data, that need to be shared.
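
The data model can be pictured with a small in-memory stand-in. ToyZnodeTree is a hypothetical class written for this illustration, not a ZooKeeper client; it only mirrors the points made above: a hierarchy of paths, a version kept with each node, and a size limit of about 1 MB.

```python
MAX_DATA_BYTES = 1024 * 1024  # the roughly 1 MB per-znode limit noted above


class ToyZnodeTree:
    """In-memory stand-in for the znode hierarchy (not a ZooKeeper client)."""

    def __init__(self):
        # Each entry maps a path to [data bytes, version counter].
        self._nodes = {"/": [b"", 0]}

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self._nodes:
            raise KeyError(f"parent {parent} does not exist")
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._check_size(data)
        self._nodes[path] = [data, 0]

    def set(self, path, data):
        self._check_size(data)
        node = self._nodes[path]
        node[0] = data
        node[1] += 1  # every update bumps the version kept with the node

    def get(self, path):
        data, version = self._nodes[path]
        return data, version

    def children(self, path):
        # Direct children only: names under the prefix with no further "/".
        prefix = path.rstrip("/") + "/"
        return sorted(p[len(prefix):] for p in self._nodes
                      if p != path and p.startswith(prefix)
                      and "/" not in p[len(prefix):])

    @staticmethod
    def _check_size(data):
        if len(data) > MAX_DATA_BYTES:
            raise ValueError("znode data is limited to about 1 MB")


if __name__ == "__main__":
    tree = ToyZnodeTree()
    tree.create("/app")
    tree.create("/app/config", b"mode=fast")
    tree.set("/app/config", b"mode=safe")
    print(tree.get("/app/config"), tree.children("/app"))
```

A parent must exist before a child can be created, as with directories, and an attempt to store more than the limit is rejected rather than silently truncated.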

There are two types of znodes:

Persistent: This is the default type of znode. Persistent znodes stay in place until they are explicitly deleted, and they hold the important configuration details. When a new node joins the cluster, it reads its configuration information from a persistent znode.
Ephemeral: These are session znodes, which are created when an application starts up and deleted when the application finishes. They are mainly useful for keeping track of client applications in case of failure: when the application fails, its znode disappears.
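
The difference between the two types can be sketched as follows. ToyRegistry, the session ids, and the paths are all hypothetical; the sketch only models the rule that an ephemeral znode lives as long as the session that created it.

```python
class ToyRegistry:
    """Toy model of persistent and ephemeral znodes tied to client sessions."""

    def __init__(self):
        self._nodes = {}  # path -> owning session id, or None if persistent

    def create(self, path, session_id=None):
        # Passing a session id marks the node as ephemeral for that session.
        self._nodes[path] = session_id

    def close_session(self, session_id):
        if session_id is None:
            raise ValueError("a session id is required")
        # When a session ends (cleanly or through failure), its ephemeral
        # nodes disappear; persistent nodes are left alone.
        self._nodes = {path: owner for path, owner in self._nodes.items()
                       if owner != session_id}

    def exists(self, path):
        return path in self._nodes


if __name__ == "__main__":
    registry = ToyRegistry()
    registry.create("/config/db-url")                    # persistent
    registry.create("/workers/worker-1", session_id=42)  # ephemeral
    registry.close_session(42)                           # the worker goes away
    print(registry.exists("/config/db-url"),
          registry.exists("/workers/worker-1"))
```

Other clients can treat the presence of /workers/worker-1 as a sign that the worker is alive, which is how ephemeral znodes help keep track of applications that fail.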

Installing Apache ZooKeeper

Steps for downloading and installing ZooKeeper 3.4.6 and configuring a three-node ZooKeeper ensemble:

Download and install the JDK from http://www.oracle.com/technetwork/java/javase/downloads/index.html or by following http://www.guru99.com/install-java.html, if it is not already installed. The Apache ZooKeeper server runs on the JVM, so this is an important prerequisite.
Go to http://zookeeper.apache.org/ and download ZooKeeper from the releases page.
Choose to download from mirrors and select the first mirror.
Go to the stable folder and download zookeeper-3.4.6.tar.gz
Unpack the tarball with tar -zxvf zookeeper-3.4.6.tar.gz. The remaining steps assume that the unpacked directory is placed at /usr/local/zookeeper.
Make a data directory using mkdir /usr/local/zookeeper/data. You can make this directory as root and then change the owner to whichever user is needed.
Create a ZooKeeper configuration file using sudo vi /usr/local/zookeeper/conf/zoo.cfg and place the following settings in it:

tickTime=2000
# initLimit must be set when more than one server is listed
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper/data
clientPort=2181
server.1=Master:2888:3888
server.2=Slave1:2888:3888
server.3=Slave2:2888:3888

Create a file called myid in the data folder using sudo vi /usr/local/zookeeper/data/myid, write “1” in it (without the quotes), and save it.
Repeat all of the steps above on the other two servers, but write 2 in myid on server 2 and 3 in myid on server 3.
Use the command zkServer.sh start to start ZooKeeper on all servers.
To confirm that ZooKeeper has started, type jps and check for QuorumPeerMain.
To start a client, use the command zkCli.sh -server Slave1:2181
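
Because the same zoo.cfg goes on every server and only the myid file differs, the files can also be generated rather than typed by hand. The helper below is a sketch: render_zoo_cfg and myid_for are hypothetical names, the hostnames are the ones used in the steps above, and the initLimit value of 10 is an assumption.

```python
def render_zoo_cfg(hosts, data_dir="/usr/local/zookeeper/data",
                   client_port=2181, peer_port=2888, election_port=3888):
    """Return zoo.cfg text for an ensemble; server ids start at 1."""
    lines = [
        "tickTime=2000",
        "initLimit=10",  # assumed value for this sketch
        "syncLimit=5",
        f"dataDir={data_dir}",
        f"clientPort={client_port}",
    ]
    for server_id, host in enumerate(hosts, start=1):
        lines.append(f"server.{server_id}={host}:{peer_port}:{election_port}")
    return "\n".join(lines) + "\n"


def myid_for(hosts, host):
    """Return the text that belongs in the myid file on the given host."""
    return str(hosts.index(host) + 1)


if __name__ == "__main__":
    ensemble = ["Master", "Slave1", "Slave2"]
    print(render_zoo_cfg(ensemble))
    for name in ensemble:
        print(f"{name}: myid = {myid_for(ensemble, name)}")
```

The number in each server.N line has to match the myid value on that machine, so deriving both from one list of hosts keeps them in step.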

Using Zookeeper

Configuration store: configuration data and settings can be kept in a centralized repository so that they can be accessed from anywhere.
Message queue: ZooKeeper can back a queue for asynchronous communication. For example, when a user clicks the place order button on a website and the order takes some time to generate, the order can be placed in a message queue so that the user can continue with the purchase instead of waiting.
Watchdog: ZooKeeper can keep watch over data, so that when the data in one node changes, other nodes are informed of the change through notifications.
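
The notification idea can be sketched with a toy store. ToyWatchedStore is a hypothetical class, not the ZooKeeper client API; it assumes one-shot notifications, where a watcher hears about the next change only and has to register again to hear about later ones.

```python
class ToyWatchedStore:
    """Toy key-value store with one-shot change notifications."""

    def __init__(self):
        self._data = {}
        self._watchers = {}  # path -> callbacks waiting for the next change

    def get(self, path, watch=None):
        # Reading with a callback registers interest in the next change only.
        if watch is not None:
            self._watchers.setdefault(path, []).append(watch)
        return self._data.get(path)

    def set(self, path, value):
        self._data[path] = value
        # Fire and clear the callbacks: each is told which path changed and
        # must read again (and re-register) to see the new value.
        for callback in self._watchers.pop(path, []):
            callback(path)


if __name__ == "__main__":
    store = ToyWatchedStore()
    store.set("/config/mode", "v1")

    def on_change(path):
        print("changed:", path, "->", store.get(path))

    store.get("/config/mode", watch=on_change)
    store.set("/config/mode", "v2")  # notifies on_change once
    store.set("/config/mode", "v3")  # no notification: the watch was used up
```

The callback receives only the path, so the client reads the value again after being notified; this keeps notifications small and leaves it to the client to decide whether it still cares.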