How Does Flume Work?
The data stored on web servers takes the form of logs or events. A log is a record of any event or action that takes place in the operating system, and an event is a byte payload with optional string attributes. Take a look at the data flow in Flume.
Image Source: flume.apache.org
A Flume source consumes events delivered to it by an external source, such as a web server log or a social media feed, in a format that the Flume source understands. Flume uses 'collectors', known as agents, which gather data from different sources and push it towards centralized storage such as HDFS or HBase. Data flows from the web server logs into channels, which store it until it is consumed by Flume sinks, which then deposit it into a final repository such as HDFS. Flume improves reliability by allowing multi-hop flows, in which data passes through multiple channels before reaching HDFS. Flume also uses a transactional approach to increase reliability: data stored in a channel runs very little risk of being lost before it reaches its final destination.
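To make the source-channel-sink flow concrete, here is a minimal sketch of a single-agent configuration in Flume's properties format. The agent name `a1`, the component names `r1`, `c1`, and `k1`, and all paths are hypothetical placeholders:

```properties
# One source, one channel, one sink for agent "a1" (names are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a web server access log (exec is a simple, best-effort source)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: buffer events until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deposit events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs
a1.sinks.k1.channel = c1
```

The source writes events into the channel and the sink drains the channel into HDFS; pointing one agent's sink at another agent's source is what produces the multi-hop flows described above.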
Advantages of Flume
- Acts as a facilitator of data flows between the source and the destination.
- When the rate at which data can be ingested falls behind the rate at which it is generated, Flume steps in to maintain a steady flow of data through its channels.
- Data is easily collected from multiple sources.
- Flume's collectors can connect to various data sources, collect data in different formats, and store it in a centralized location.
- Flume is reliable.
- Its channels ensure that no data is lost while it is in transit from source to destination.
- Flume is robust and recoverable.
- Flume's file channel is backed by the local file system, so events survive an agent restart and can be recovered after a failure. There is also an in-memory channel that stores events in a queue; it is faster, but events still in memory are lost if the agent fails.
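The two channel types described above can be sketched in configuration; the agent and channel names, paths, and capacities are illustrative:

```properties
# File channel: durable, backed by the local file system
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# Memory channel: faster, but events are lost if the agent dies
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
```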
- Sqoop vs. Flume - Battle of the Hadoop ETL tools
- Apache Flume is a service designed for streaming logs into the Hadoop environment. Flume is a distributed and reliable service for collecting and aggregating huge amounts of log data. It has a simple, easy-to-use architecture based on streaming data flows, with tunable reliability mechanisms and several recovery and failover mechanisms.
- Hadoop Components and Architecture: Big Data and Hadoop Training
- The Flume component is used to gather and aggregate large amounts of data. Apache Flume collects data from its origin and delivers it to its resting location (HDFS). Flume accomplishes this by outlining data flows that consist of three primary structures: sources, channels, and sinks.
- Fundamentals of Apache Flume
- Apache Flume is an agent for data collection, generally used for log data. Flume takes data from several sources, such as Avro, syslog, and files, and delivers it to various destinations such as Hadoop HDFS or HBase.
- Flume Case Study: Twitter Data Extraction
- In this case study, a Flume agent is configured to retrieve data from Twitter. Twitter is a huge source of data on people's opinions and preferences, which can be used to analyze public opinion or reviews on a specific topic or product. Various types of analysis can be done based on tweet content and location.
- Flume Case Study: Website Log Aggregation
- This case study focuses on a multi-hop Flume agent that aggregates log reports from various web servers, which then have to be analyzed with the help of Hadoop. Consider a scenario where we have multiple servers located in various places, serving from different data centers. The objective is to distribute the log files based on device type and to store a backup of all logs.
Flume Interview Questions
Explain about the core components of Flume.
The core components of Flume are:
Event - The single log entry or unit of data that is transported.
Source - The component through which data enters Flume workflows.
Channel - The conduit between the source and the sink; it buffers events until a sink consumes them.
Sink - The component that delivers the data to its destination, such as HDFS or HBase.
Agent - The JVM process that hosts the sources, channels, and sinks.
Client - The component that transmits events to the source.
Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end-to-end reliability because of its transactional approach to data flow.
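A sketch of how this transactional behavior surfaces in configuration (agent and channel names are illustrative): `transactionCapacity` caps how many events a source or sink may put into, or take from, the channel in a single transaction, and the batch commits only once it is safely in the channel.

```properties
# Illustrative durable channel for agent "a1"
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
# Maximum number of events per put/take transaction
a1.channels.c1.transactionCapacity = 1000
```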
How can Flume be used with HBase?
Apache Flume can be used with HBase using one of the two HBase sinks:
- HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters and the new HBase IPC introduced in HBase 0.96.
- AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink because it makes non-blocking calls to HBase.
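A sketch of how the two sinks are declared; the agent, sink, channel, and table names are hypothetical placeholders, while `hbase` and `asynchbase` are the standard Flume type aliases:

```properties
# HBaseSink: synchronous; supports secure HBase clusters
a1.sinks.k1.type = hbase
a1.sinks.k1.table = weblogs
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a1.sinks.k1.channel = c1

# AsyncHBaseSink: non-blocking calls, higher throughput
a1.sinks.k2.type = asynchbase
a1.sinks.k2.table = weblogs
a1.sinks.k2.columnFamily = cf
a1.sinks.k2.channel = c1
```

Note that, per the Flume documentation, AsyncHBaseSink does not support secure (Kerberized) HBase clusters, which is one reason to prefer HBaseSink despite its lower throughput.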