Installing a Hadoop cluster in production is only half the battle. It is extremely important for a Hadoop admin to tune the cluster setup to gain maximum performance. During Hadoop installation, the cluster is configured with default settings that match only a minimal hardware configuration. A Hadoop admin must therefore be familiar with the hardware specifications of the cluster: the number of disks mounted on the datanodes, RAM capacity, the number of physical and virtual CPU cores, NIC cards, etc.
There is no single performance tuning technique that fits all Hadoop jobs, because it is very difficult to balance the various resources while solving a big data problem. The right tips and tricks vary with the amount of data being moved and with the type of Hadoop job being run in production. This blog highlights some of the best performance tuning tips for Hadoop jobs to achieve maximum performance.
The biggest selling point driving enterprise adoption of Apache Hadoop as a big data processing framework is the cost effectiveness of setting up data centers to process huge amounts of structured and unstructured data. However, the major roadblock to obtaining maximum performance from a Hadoop cluster is its core hardware stack. Because that stack is commodity hardware, it is essential for a Hadoop admin to make the best use of the cluster's capability to get the best performance out of it. Since Hadoop is horizontally scalable, most administrators simply keep adding nodes to a cluster to enhance performance. Doing so, however, leads to massive clusters that do not run on optimal configurations, and in turn to huge operational costs.
Let’s explore some of the most effective performance tuning techniques for setting up Hadoop clusters in production on commodity hardware, to enhance performance at minimal operational cost:
The foremost step to ensure maximum performance for a Hadoop job is to tune the configuration parameters for memory by monitoring memory usage on the server. Apache Hadoop has various options for memory, disk, CPU and network that help optimize the performance of the cluster. Every Hadoop MapReduce job collects counters about the input records read, the number of records pipelined for further execution, the number of reducer records, the heap size set, swap memory, etc. Hadoop tasks are generally not CPU-bound, so the prime concern should be optimizing memory usage and disk spills.
A rule of thumb when tuning memory is to ensure that jobs don’t trigger swapping. Swap usage can be monitored with tools like Ganglia, Nagios or Cloudera Manager. Whenever there is excess swap utilization, memory usage should be brought down by reducing the amount of RAM allotted to each task through the mapred.child.java.opts property in mapred-site.xml, for example by setting it to -Xmx2048M.
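A minimal mapred-site.xml fragment for this setting might look like the following; the 2 GB heap value is only an example, and should be sized against the RAM actually available per task slot on your nodes:

```xml
<!-- mapred-site.xml: cap the JVM heap for each map/reduce task -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048M</value>
</property>
```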
Here are some key points to follow to optimize MapReduce performance by keeping the Hadoop cluster configuration well tuned:
Disk IO is one of the major performance bottlenecks, and here are two ways to help minimize disk spilling:
When the map output is very large, the intermediate data size should be reduced using a compression technique such as LZO, BZip2 or Snappy. Map output is not compressed by default; to enable compression, mapreduce.map.output.compress should be set to true, and mapreduce.map.output.compress.codec should be set to the codec class for whichever compression technique is used (LZO, BZip2 or Snappy).
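As a sketch, a mapred-site.xml fragment enabling map-output compression with the Snappy codec could look like this (the codec class shown is the standard one from the org.apache.hadoop.io.compress package; substitute the LZO or BZip2 codec class as appropriate):

```xml
<!-- Compress intermediate map output to cut shuffle-phase disk IO -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```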
Every MapReduce job that produces large map output is likely to benefit from intermediate data compression with LZO. The CPU cost might initially seem like overhead, but the much smaller number of disk IO operations during the shuffle phase considerably improves performance. With LZO compression, every 1 GB of output data saves approximately 3 GB of disk writes.
As a rule of thumb, a map or reduce task should take at least 40 seconds to execute; when tasks finish faster than that, startup and scheduling overhead dominates the runtime. A large job should also make use of all the slots available in the cluster. Thus, it is extremely important to tune the number of map and reduce tasks.
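The number of map tasks is usually steered indirectly through the input split size, while the number of reduce tasks can be set directly. The fragment below is an illustrative sketch only; the values shown are examples, not recommendations, and should be derived from your data volume and slot count:

```xml
<!-- Larger input splits => fewer, longer-running map tasks -->
<property>
  <name>mapreduce.input.fileinputformat.split.minsize</name>
  <value>268435456</value> <!-- 256 MB, example value -->
</property>
<!-- Size the reduce phase to fill the cluster's available slots -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>16</value> <!-- example value -->
</property>
```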
Depending on the Hadoop environment, in addition to data compression, writing a combiner to reduce the amount of data transferred can also prove beneficial. If the job performs a large shuffle where the map output is several GB per node, or if the job performs an aggregation sort, writing a combiner can help optimize performance. The combiner acts as an optimizer for the MapReduce job: it runs on the output of the map phase to reduce the number of intermediate key-value pairs passed to the reducers, which in turn reduces the load on the reduce tasks.
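To make the effect concrete, here is a small local Python simulation (not Hadoop API code) of how a word-count combiner shrinks the intermediate data before the shuffle; all function names here are hypothetical:

```python
from collections import Counter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate pairs locally before the shuffle,
    so far fewer key-value pairs cross the network to the reducers."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

lines = ["to be or not to be", "to see or not to see"]
mapped = map_phase(lines)    # 12 intermediate pairs
combined = combine(mapped)   # only 5 pairs reach the shuffle
print(len(mapped), len(combined))  # → 12 5
```

The reducers then merge these partial counts exactly as they would have merged the raw pairs, which is why a combiner is only valid for associative, commutative aggregations such as sums and counts.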
Using standard joins in the transformation logic with tools like Pig or Hive can at times result in erratic MapReduce performance, because the data being processed may be skewed, meaning, for example, that 80% of the data goes to a single reducer. If there is a huge amount of data for a single key, one reducer gets held up processing the majority of the data; this is where a skewed join comes to the rescue. A skewed join first computes a histogram to find out which keys are dominant, and then splits the data for those keys across multiple reducers to achieve optimal performance.
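The core idea can be sketched in a few lines of Python (a local simulation, not Pig internals; the function and parameter names are hypothetical): histogram the join keys, then fan hot keys out over several reducers instead of hashing each key to exactly one:

```python
from collections import Counter

def plan_reducers(keys, num_reducers, skew_threshold):
    """Histogram the keys; a dominant (skewed) key is spread over
    all reducers, while normal keys hash to a single reducer."""
    histogram = Counter(keys)
    plan = {}
    for key, count in histogram.items():
        if count >= skew_threshold:
            plan[key] = list(range(num_reducers))  # hot key fans out
        else:
            plan[key] = [hash(key) % num_reducers]
    return plan

keys = ["us"] * 80 + ["uk"] * 12 + ["de"] * 8
plan = plan_reducers(keys, num_reducers=4, skew_threshold=40)
print(len(plan["us"]), len(plan["uk"]))  # → 4 1
```

In Pig itself this is requested declaratively, e.g. `JOIN big BY key, small BY key USING 'skewed';`, and Pig performs the sampling and splitting for you.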
The performance of MapReduce jobs is seriously impacted when tasks take a long time to finish execution. Speculative execution is a common approach to this problem: slow tasks are backed up on alternate machines, and whichever attempt finishes first wins. Setting the configuration parameters mapreduce.map.speculative and mapreduce.reduce.speculative to true enables speculative execution, so that job execution time is reduced when task progress is slow due to memory unavailability.
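A mapred-site.xml fragment enabling this might look as follows (Hadoop 2.x property names); note that speculation trades extra cluster capacity for latency, so it can hurt on a busy cluster where slowness comes from contention rather than a bad node:

```xml
<!-- Launch backup attempts for straggler tasks -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```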
There are many more performance tuning tips and tricks for a Hadoop cluster, and we have highlighted some of the important ones. We invite the Hadoop community to share the best performance tuning tips they have experimented with, to help developers and Hadoop admins get maximum performance from Hadoop clusters in production.