Apache Flume

Flume - Service for streaming event log data to Hadoop

Flume is a distributed, reliable service for collecting and aggregating event log data from various sources into a central data store such as HDFS. It was primarily designed to transfer streaming log data from web servers to Hadoop and is mostly used to move unstructured data. Flume is not restricted to log data, however: since its data sources are customizable, Flume is also used to load event data from social media platforms, emails, images, videos, and other such sources. Because Flume supports multiple data sources and scales horizontally, ecommerce companies such as Amazon and eBay, and social media giants like Facebook and Twitter, rely on Flume to transfer data to Hadoop.

How Does Flume Work?

The data stored in web servers is captured either as a log or as an event. A log is a record of any event or action that takes place in the operating system, and an event is a byte payload with optional string attributes. Take a look at the data flow in Flume.
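
For instance, an event as rendered by Flume's logger sink pairs optional string headers with a byte-array body. In the illustrative sketch below, the header value and body text are assumptions, not output from a real deployment:

    Event: { headers:{host=web01} body: 48 65 6C 6C 6F  Hello }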

[Figure: Apache Flume data flow. Image source: flume.apache.org]

A Flume source consumes events delivered to it by an outside source, such as a web server log or a social media feed, in a format that the Flume source understands. Flume has 'collectors', known as agents, which gather data from the different sources and push it towards centralized storage such as HDFS or HBase. Within an agent, data flows from the source into a channel, which stores the events until they are consumed by a Flume sink, which in turn deposits the data into a final repository such as HDFS. Flume improves reliability by supporting multi-hop flows, where data is stored in multiple channels before reaching HDFS. Flume also uses a transactional approach to increase reliability: data stored in the channels carries very little risk of being lost before it is transferred to its final destination.
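
To make this source-channel-sink pipeline concrete, here is a minimal single-agent configuration sketch in Flume's standard properties format. The agent name (agent1), component names, port, and HDFS path are illustrative assumptions, not values from any particular deployment:

    # Name the components of this agent (agent1 is an assumed name)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: listen for newline-terminated events on a TCP port
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink
    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 1000

    # Sink: write events to HDFS (the path is an assumed example)
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
    agent1.sinks.sink1.channel = ch1

Such an agent is typically started with the flume-ng command, for example: flume-ng agent --conf conf --conf-file example.conf --name agent1.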

Advantages of Flume

  1. Acts as a facilitator of data flows between the source and the destination.
    When data is generated faster than it can be written to the destination, Flume steps in to maintain a steady flow by buffering the data in its channels.
  2. Data is easily collected from multiple sources.
    Flume's collectors can connect to various data sources, collect data in different formats, and store it in a centralized location.
  3. Flume is reliable.
    Its channels are transactional, so data is not lost while it is being transferred from source to destination.
  4. Flume is robust in recoverability.
    Flume's file channel is backed by the local file system, so events can be recovered after an agent failure. There is also a faster in-memory channel that stores events in a queue, though any events still in memory are lost if the agent fails; see the configuration sketch after this list.
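
As a sketch of how these two channel types are declared in an agent's properties file (the agent name agent1, the channel names, and the directory paths are assumptions for illustration):

    # Durable file channel: events are checkpointed to local disk and
    # can be replayed after an agent restart
    agent1.channels.fileCh.type = file
    agent1.channels.fileCh.checkpointDir = /var/flume/checkpoint
    agent1.channels.fileCh.dataDirs = /var/flume/data

    # Faster in-memory channel: events still queued are lost on failure
    agent1.channels.memCh.type = memory
    agent1.channels.memCh.capacity = 10000
    agent1.channels.memCh.transactionCapacity = 1000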

Flume Blogs

Sqoop vs. Flume - Battle of the Hadoop ETL tools
Apache Flume is a service designed for streaming logs into a Hadoop environment. Flume is a distributed and reliable service for collecting and aggregating huge amounts of log data. With a simple, easy-to-use architecture based on streaming data flows, it also offers tunable reliability mechanisms and several recovery and failover mechanisms. Click to read more.
Hadoop Components and Architecture: Big Data and Hadoop Training
The Flume component is used to gather and aggregate large amounts of data. Apache Flume collects data from its origin and sends it to its resting location (HDFS). Flume accomplishes this by outlining data flows that consist of three primary structures: channels, sources, and sinks. Click to read more.

Flume Tutorials

Fundamentals of Apache Flume
Apache Flume is an agent for data collection. It is generally used for log data. Flume takes data from several sources, such as Avro, syslog, and files, and delivers it to various destinations such as Hadoop HDFS or HBase. Click to read more.
Flume Case Study: Twitter Data Extraction
In this case study, a Flume agent is configured to retrieve data from Twitter. Twitter is a huge source of data on people's opinions and preferences, which can be used to analyze public opinion or reviews on a specific topic or product. Various types of analysis can be done based on the tweet data and location. Click to read more.
Flume Case Study: Website Log Aggregation
This case study focuses on a multi-hop Flume agent that aggregates log reports from various web servers so they can be analyzed with the help of Hadoop. Consider a scenario with multiple servers in various locations, serving from different data centers. The objective is to distribute the log files based on the device type and store a backup of all logs. Click to read more.

Flume Interview Questions

  1. Explain about the core components of Flume.

    The core components of Flume are:

    • Event - The single log entry or unit of data that is transported.
    • Source - This is the component through which data enters Flume workflows. Read more.
  2. Does Flume provide 100% reliability to the data flow?

    • Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow. Read more.
  3. How can Flume be used with HBase?

    Apache Flume can be used with HBase using one of the two HBase sinks (a configuration sketch follows the list below):

    • HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the new HBase IPC introduced in HBase 0.96.
    • AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it makes non-blocking calls to HBase. Read more.
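
    As a configuration sketch for these two sinks (the agent name, channel, table, and column family names are assumptions for illustration):

        # Synchronous sink: org.apache.flume.sink.hbase.HBaseSink
        agent1.sinks.hbaseSink.type = hbase
        agent1.sinks.hbaseSink.table = flume_events
        agent1.sinks.hbaseSink.columnFamily = cf
        agent1.sinks.hbaseSink.channel = ch1

        # Non-blocking sink: org.apache.flume.sink.hbase.AsyncHBaseSink
        agent1.sinks.asyncSink.type = asynchbase
        agent1.sinks.asyncSink.table = flume_events
        agent1.sinks.asyncSink.columnFamily = cf
        agent1.sinks.asyncSink.channel = ch1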

Flume Questions & Answers

  1. Is Flume installed in the Cloudera VM?

    • Is Flume installed in the Cloudera VM? If yes, please let me know how to use it. Click to read answer.
  2. Flume is not able to start event

    • I'm running the below command in Flume, but getting the error mentioned below. Please check and help to resolve this issue. Click to read answer.
  3. Unable to copy flume-conf.properties file

    • I was trying to upload the flume-conf.properties file that is in the Google Drive to the /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf folder from my local machine using FileZilla, but it gives the following error: "Error: /usr/lib/flume-ng/apache-flume-1.4.0-bin/conf/flume-conf.properties: open for write: permission denied". Click to read answer.

Flume Assignments

Installing Flume.
