Establishing effective configuration management is an important step in building a distributed system. It is a complex process that helps in planning, identifying, tracking and verifying changes to the software. Configuration integrity must be maintained throughout the life cycle of the system, and a good configuration management system makes that possible.
Apache ZooKeeper is a coordination service for distributed applications. It is designed to let users focus on the functionality of the distributed application rather than worrying about its architecture. Its centralized infrastructure and services provide synchronization across a Hadoop cluster.
The name Sqoop is a contraction of “SQL-to-Hadoop”. Sqoop is a command line data transfer utility designed for efficiently importing and exporting data between an RDBMS and HDFS. Data can be imported into HDFS from relational databases such as Oracle and MySQL.
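To make the import direction concrete, here is a minimal sketch of what such a transfer does conceptually: read every row of a relational table and render it as comma-delimited text, the layout Sqoop's plain-text imports use by default. This is not Sqoop itself. An in-memory SQLite database stands in for the RDBMS, a Python list stands in for the file that would land in HDFS, and the function name `import_table_as_text` and the `employees` table are hypothetical.

```python
import sqlite3


def import_table_as_text(conn, table, delimiter=","):
    """Render every row of `table` as one delimited text line.

    Illustration only: a real import would run in parallel and write to HDFS.
    """
    # Guard the interpolated table name, since it cannot be a bound parameter.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    # ORDER BY 1 keeps the output order deterministic for this small example.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    return [delimiter.join(str(value) for value in row) for row in cursor]


# Stand-in for the source RDBMS (assumption: SQLite instead of Oracle/MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", "eng"), (2, "Grace", "ops")],
)

lines = import_table_as_text(conn, "employees")
for line in lines:
    print(line)
```

Each relational row becomes one line of text, which is the form later MapReduce, Pig or Hive jobs would read.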
Apache Pig is designed to handle any kind of data. It is a platform built around Pig Latin, a high-level, extensible language designed to reduce the complexity of coding MapReduce applications. Pig was developed at Yahoo to let people using Hadoop concentrate on analysing large unstructured data sets by minimizing the time spent writing Mapper and Reducer functions.
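To show the kind of code Pig is meant to spare you, here is a rough sketch of a word count written as explicit mapper and reducer functions. It is plain Python run in a single process, not Hadoop code, and the `shuffle` helper only imitates the grouping step the framework normally performs. All function names are illustrative.

```python
from collections import defaultdict


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, imitating the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Collapse all counts for one word into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run the map, shuffle and reduce phases in sequence."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())


totals = word_count(["big data", "big clusters"])
print(totals)
```

In Pig Latin the same job is expressed as a few relational statements (load, group, count), and Pig generates the map and reduce phases for you.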
Oozie is a server-based job coordination system and workflow engine that runs in a Java servlet container. It is designed for executing workflow jobs whose actions trigger Pig jobs or MapReduce jobs. Oozie helps you string together a workflow of coordinated jobs such as a Pig job, a MapReduce job and a Hive query.
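The idea of stringing jobs together can be sketched as a small dependency graph whose actions run only after the actions they depend on. The code below is not Oozie and does not use its XML workflow format. It is a minimal in-process illustration using Python's standard `graphlib` module (Python 3.9 or later), and the action names and the `run_workflow` function are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(actions, depends_on):
    """Run each action after all of its prerequisites have finished.

    actions:    maps an action name to a zero-argument callable.
    depends_on: maps an action name to the set of names it must wait for.
    """
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        results[name] = actions[name]()
    return order, results


# Hypothetical stand-ins for a Pig job, a MapReduce job and a Hive query.
actions = {
    "pig_clean": lambda: "cleaned",
    "mapreduce_aggregate": lambda: "aggregated",
    "hive_report": lambda: "reported",
}
depends_on = {
    "mapreduce_aggregate": {"pig_clean"},
    "hive_report": {"mapreduce_aggregate"},
}

order, results = run_workflow(actions, depends_on)
print(order)
```

A real workflow engine also handles retries, failure transitions and scheduling, none of which this sketch attempts.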
A database that models data by means other than the traditional tabular relations is generally referred to as a NoSQL database. A NoSQL database organizes large distributed data sets into structures such as tuples, key-value pairs and objects.
Apache Hive is a Hadoop runtime component developed at Facebook. It is a data warehouse infrastructure built on top of the Hadoop stack to help users with querying, analysis and summarization. Its query language, HiveQL, is a subset of SQL-92 plus Hive-specific extensions.
Hadoop HDFS is a Java-based distributed file system for storing large unstructured data sets. It is designed to provide high-performance access to data across large Hadoop clusters of commodity servers. It is referred to as the “Secret Sauce” of the Apache Hadoop components because data can be stored in blocks on the file system until the organization wants to leverage it for big data analytics.
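Block storage can be illustrated with a small sketch that cuts a byte string into fixed-size blocks and assigns copies of each block to nodes. This is a simplification under stated assumptions: the round-robin placement below is invented for the example (real HDFS placement is rack-aware), the block size is a few bytes rather than the 128 MB that is common in practice, and both function names are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut `data` into consecutive blocks; only the last one may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumption: this simple rotation replaces HDFS's real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["n1", "n2", "n3", "n4"])
print(blocks)
print(placement)
```

Because every block lives on more than one server, losing a single commodity machine does not lose the data.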
Apache HBase is a real-time, open source, column-oriented, distributed database written in Java. HBase is modelled after Google’s Bigtable and is a key-value store organized into column families. It is built on top of Apache Hadoop and ZooKeeper.
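The key-value, column-family layout can be pictured as nested maps: a row key leads to column families, and each family holds qualifier and value pairs. The class below is a toy in-memory sketch of that shape, not the HBase client API. It leaves out timestamps and versions, and the class name `TinyColumnFamilyStore` and the sample data are hypothetical.

```python
class TinyColumnFamilyStore:
    """Toy model of the layout: row key -> column family -> qualifier -> value."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers are created on write.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value):
        """Store a value under a column written as 'family:qualifier'."""
        family, qualifier = column.split(":", 1)
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family):
        """Return all qualifier/value pairs of one family for one row."""
        return self.rows.get(row_key, {}).get(family, {})

    def scan(self, start, stop):
        """Yield rows in sorted key order, start inclusive and stop exclusive."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]


store = TinyColumnFamilyStore(["info", "stats"])
store.put("user#001", "info:name", "Ada")
store.put("user#001", "stats:logins", 7)
store.put("user#002", "info:name", "Grace")

print(store.get("user#001", "info"))
print([key for key, _ in store.scan("user#001", "user#002")])
```

Rows need not share the same qualifiers, which is what lets this layout hold sparse data without wasting space on empty columns.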
Apache Flume is an agent for data collection and is generally used for log data. Flume takes data from several sources, such as Avro, syslog and files, and delivers it to destinations such as Hadoop HDFS or HBase.
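The flow from sources to a destination can be sketched as a small pipeline in which events are buffered between arrival and delivery. Flume agents are organized as source, channel and sink, and the toy class below mirrors that shape in memory. It is not Flume's API or configuration format: the class name `TinyAgent`, the event dictionary layout and the list standing in for HDFS are all assumptions made for illustration.

```python
from collections import deque


class TinyAgent:
    """Toy agent: sources push events into a channel, a sink drains it."""

    def __init__(self, sink):
        self.channel = deque()  # in-memory buffer between sources and sink
        self.sink = sink        # callable that delivers one event

    def receive(self, source_name, bodies):
        """Accept raw messages from a named source and queue them as events."""
        for body in bodies:
            self.channel.append({"source": source_name, "body": body})

    def drain(self):
        """Deliver queued events to the sink in arrival order; return the count."""
        delivered = 0
        while self.channel:
            self.sink(self.channel.popleft())
            delivered += 1
        return delivered


hdfs_stand_in = []  # assumption: a list plays the role of the destination
agent = TinyAgent(sink=hdfs_stand_in.append)
agent.receive("syslog", ["disk full", "login ok"])
agent.receive("file", ["job done"])

delivered = agent.drain()
print(delivered, [event["body"] for event in hdfs_stand_in])
```

The buffer in the middle is what lets sources keep accepting log data even when the destination is briefly slower than the incoming stream.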