Hadoop Wiki
Apache Hadoop
Hadoop is an open-source distributed processing framework, written in Java, for storing and processing large volumes of structured and unstructured data on clusters of commodity hardware. It is a big data platform that offers substantial processing power and can run a very large number of concurrent jobs.
Hadoop Wiki References
Apache Pig
Apache Pig is a Hadoop component that provides an abstraction over MapReduce so that programmers can analyse large volumes of data using the procedural language Pig Latin. By default, the Pig engine converts Pig Latin scripts into Hadoop MapReduce jobs internally. Apache Pig can also execute jobs on Apache Spark or Apache Tez.
Apache Hive
Apache Hive is a data-warehouse-like infrastructure built on top of Hadoop for data querying, summarization and analysis. It provides an SQL-like interface, the Hive Query Language (HiveQL), for running MapReduce jobs. The Hive service splits each query into simple MapReduce jobs, which are then executed across the Hadoop cluster.
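The MapReduce jobs that Pig and Hive generate follow a map, shuffle and reduce pattern. As a rough illustration, the sketch below shows how an aggregation such as `SELECT dept, SUM(salary) FROM employees GROUP BY dept` could decompose into those three phases. It is a single-process toy in plain Python, not what Hive actually emits; the `employees` table, the sample records and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input records: (department, salary)
RECORDS = [
    ("sales", 100),
    ("eng", 200),
    ("sales", 50),
    ("eng", 300),
    ("ops", 75),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for dept, salary in records:
        yield dept, salary


def shuffle_phase(pairs):
    """Group all values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single aggregate (here, a sum)."""
    return {key: sum(values) for key, values in grouped.items()}


def total_salary_by_dept(records):
    """Chain the three phases, mimicking one MapReduce job."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(total_salary_by_dept(RECORDS))
```

On a real cluster the map and reduce steps run in parallel on many nodes, and the shuffle moves data between them over the network; the structure of the computation is the same.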
Apache HBase
HBase is an open-source, column-oriented, distributed NoSQL database built on top of HDFS that provides real-time read/write access to large datasets. It scales horizontally and offers low-latency access, so lookups stay fast even on very large tables. HBase works well for sparse datasets and brings Google Bigtable-like features to Hadoop.
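To make the idea of a sparse, column-oriented table more concrete, here is a minimal in-memory sketch. Cells are addressed by a row key and a `family:qualifier` column name, and only cells that were actually written take up space. This is not the HBase client API: the `SparseTable` class, its methods and the sample data are all hypothetical, and versioning, persistence and distribution are left out.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, column-oriented table (hypothetical names)."""

    def __init__(self):
        # row key -> {column name -> value}; only written cells are stored
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get on the outer dict so reads never create empty rows
        return self._rows.get(row_key, {}).get(column, default)

    def scan(self, start_row, stop_row):
        """Yield (row key, cells) for start_row <= key < stop_row, in key order."""
        for key in sorted(self._rows):
            if start_row <= key < stop_row:
                yield key, dict(self._rows[key])

    def cell_count(self):
        """Number of stored cells; missing cells cost nothing."""
        return sum(len(cells) for cells in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user#001", "info:name", "Ada")
    table.put("user#001", "info:email", "ada@example.com")
    table.put("user#002", "info:name", "Grace")
    table.put("user#002", "stats:logins", 7)
    print(table.get("user#002", "info:email"))  # never written, so None
    print(table.cell_count())
```

Each row carries only the columns it needs, which is why this layout suits datasets where most cells would otherwise be empty.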
Apache HBase Wiki References
Apache Sqoop
Sqoop takes its name from two well-known technologies, SQL and Hadoop: "Sq" from SQL and "oop" from Hadoop. It is a tool used primarily for bulk data transfer, so that data can be imported from, or exported to, relational databases, data warehouses and even NoSQL data stores. Because Sqoop is built on a connector-based architecture, it can be plugged into other tools, and other tools can be plugged into it, with little effort. For example, Sqoop can be connected to Apache Oozie, a workflow management tool, to automate import/export tasks.
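The basic shape of a bulk transfer out of a relational database can be sketched in a few lines. The example below copies a table from an in-memory SQLite database into delimited text, fetching rows in batches. It does not use Sqoop, and it writes to a string buffer rather than HDFS; the `orders` table, the `export_table` function and the batch size are assumptions chosen for illustration.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out, batch_size=2):
    """Copy every row of `table` to `out` as CSV, fetching in batches.

    Returns the number of data rows written. `table` is interpolated into
    the SQL text, so it must come from trusted code, not user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    written = 0
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        written += len(rows)
    return written


def demo():
    """Build a small hypothetical table and export it to a string buffer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, qty INTEGER)"
    )
    conn.executemany(
        "INSERT INTO orders (item, qty) VALUES (?, ?)",
        [("widget", 3), ("gadget", 1), ("gizmo", 5)],
    )
    out = io.StringIO()
    count = export_table(conn, "orders", out)
    conn.close()
    return count, out.getvalue()


if __name__ == "__main__":
    count, text = demo()
    print(count)
    print(text)
```

Fetching in batches keeps memory use bounded regardless of table size, which matters once the table no longer fits in memory.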
Apache Sqoop Wiki References
Apache Flume
Flume is a data ingestion tool used to send streaming data, such as log files and events, from different sources to HDFS. It is an efficient, reliable, distributed tool for collecting, aggregating and transporting data from multiple web servers to a centralized data store.
Apache Flume Wiki References
Apache Oozie
Oozie is a Java-based web application used for scheduling Hadoop jobs. Hadoop developers can run a series of jobs on a given schedule by arranging them into an ordered pipeline in the distributed environment. Oozie is tightly integrated with other Hadoop components such as Pig, Hive and Sqoop, and can therefore run many kinds of Hadoop jobs.
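Arranging jobs into an ordered pipeline amounts to running each job only after the jobs it depends on have finished. The sketch below illustrates that ordering with Python's standard `graphlib` module (Python 3.9 or later). It is not how Oozie workflows are defined or executed; the job names, the `WORKFLOW` mapping and the `run_workflow` function are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "sqoop_export": {"hive_aggregate"},
}


def run_workflow(workflow, run_job):
    """Run every job after all of its dependencies; return the run order."""
    order = []
    for job in TopologicalSorter(workflow).static_order():
        run_job(job)
        order.append(job)
    return order


if __name__ == "__main__":
    # A stand-in for job submission: just report which job would run.
    run_workflow(WORKFLOW, lambda job: print(f"running {job}"))
```

A scheduler would additionally trigger such a pipeline at set times and handle job failures, which this sketch leaves out.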
Apache Oozie Wiki References
Big Data
Big Data refers to large and complex datasets (structured and unstructured) that cannot be processed using traditional applications. Big data is characterized by three important V's: Volume, Variety and Velocity.
- Volume - Big data can be measured in terms of megabytes, gigabytes, terabytes or petabytes.
- Variety - Big data can exist in different file formats, SQL database stores, sensor data, social media data or data in any other form.
- Velocity - The speed at which big data arrives and can be analysed to produce meaningful business gains.