Fundamentals of Apache Flume
Apache Flume is an agent-based system for data collection, most commonly used for log data. Flume takes data from several kinds of sources, such as Avro clients, syslog, and files, and delivers it to destinations like Hadoop HDFS or HBase.
Apache Flume is composed of six important components:
- Events: the units of data transferred over a channel from source to sink. Events are usually small, typically around 4 KB.
- Sources: they accept data from a server or an application. Sources listen for events and write those events to a channel.
- Sinks: they retrieve event data from a channel and store it in a repository such as HDFS, or transmit it onward to another source. A sink basically writes event data to a target and then removes the event from the channel.
- Channels: they connect sources and sinks by queuing event data for transactions.
- Interceptors: they drop or transform events as data flows into the system.
- Agents: the JVM processes that run the sources, channels, and sinks in Flume.
What is Apache Flume?
Apache Flume is a distributed system used for aggregating files into a single location. In simple terms, it moves data from one location to another in a reliable and efficient manner. It has built-in reliability, failover, and recovery mechanisms and is fault tolerant. It is an important component of the Hadoop ecosystem.
In Flume, each unit of data is considered an event, and a data source is considered a producer of a stream of events. For example, each log entry saved by a web server can be considered an event. A Flume system is configured to listen to a specific data source; as soon as data is written by that source, Flume consumes the event and transfers it to the configured destination. Flume can thus be considered a distributed, near-real-time data transfer system designed to prevent data loss.
Let’s consider a scenario where the logs of several web servers have to be analyzed. The logs are stored on each server as configured, and the files have to be transferred to HDFS for analysis using Hadoop. Transferring such huge files over the network in bulk is not reliable and can lead to loss of data, so such classic methods cannot be used for huge volumes of data. The alternative is to transfer log records as soon as they are created and store them in the required file system. This is where Apache Flume comes into play: it takes the data from the source and writes it to the destination as configured. Flume can also be used to transfer files. It is useful when data has to be written to multiple locations, or when data from multiple sources has to be written to a single destination. We also have the option of storing the data in various locations based on various parameters.
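As a sketch of this scenario, the configuration below shows one possible agent that tails a web server log and ships each line to HDFS. The agent name, log path, and HDFS URL are placeholder assumptions for illustration, not values from the original text:

```properties
# Hypothetical agent 'agent1': tail a web server log and write events to HDFS.
agent1.sources = tail1
agent1.channels = ch1
agent1.sinks = hdfs1

# 'exec' source runs a shell command and turns each output line into an event
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /var/log/httpd/access_log
agent1.sources.tail1.channels = ch1

# File channel buffers events on disk so they survive an agent restart
agent1.channels.ch1.type = file

# HDFS sink writes events under a date-partitioned directory
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
agent1.sinks.hdfs1.hdfs.fileType = DataStream
```

With this in place, new log lines flow to HDFS as they are written, instead of being copied in bulk later.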
Apache Flume Architecture
Flume basically consists of three main components: source, channel, and sink. A Flume agent is a JVM process that hosts these three components and through which data flow occurs. Given below is a representation of a simple Flume agent listening to a web server and writing the data to HDFS.
The source component receives data from an external data source or from another Flume agent's sink. Data can arrive in different formats, so the source has to be configured for the format of the input data. A source can also be configured to listen to several inputs, which helps in aggregating data from various sources and storing it in a single location. The channel is temporary storage for events, acting like a queue before events are passed to the sink. The storage can be of two types: memory-based or disk-based. Memory-based storage achieves high throughput, but in case of a failure all buffered data is lost; disk-based storage offers lower throughput but more reliability.
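The two channel types above map directly onto two channel configurations. The snippet below is a sketch with assumed agent and component names; `checkpointDir` and `dataDirs` are standard file-channel properties:

```properties
# Memory channel: fast, but buffered events are lost if the agent crashes
agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 10000

# File channel: events survive a restart, at the cost of throughput
agent.channels.file1.type = file
agent.channels.file1.checkpointDir = /var/flume/checkpoint
agent.channels.file1.dataDirs = /var/flume/data
```

Choosing between the two is a throughput-versus-durability trade-off per flow.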
The sink retrieves events from the channel and either writes them to a file system or passes them to the next agent. The source and sink work asynchronously, which is what makes the channel necessary. The architecture of a Flume agent is very flexible: the source component can be designed to listen to various inputs, and the same goes for the channel and sink. Various sinks can be configured to send data to various destinations. We can also develop a multi-hop architecture for data transfer, a combination of Flume agents linked either to aggregate data from various sources or to distribute data to various destinations.
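A multi-hop chain is typically wired by pairing an Avro sink on the upstream agent with an Avro source on the downstream agent. The hostnames, ports, and component names below are illustrative assumptions:

```properties
# Upstream agent: avro sink forwards events to the next hop
agentA.sinks.avroOut.type = avro
agentA.sinks.avroOut.hostname = collector-host
agentA.sinks.avroOut.port = 4545
agentA.sinks.avroOut.channel = ch1

# Downstream agent: avro source listens on the matching port
agentB.sources.avroIn.type = avro
agentB.sources.avroIn.bind = 0.0.0.0
agentB.sources.avroIn.port = 4545
agentB.sources.avroIn.channels = ch1
```

Each hop keeps its own channel, so the transactional hand-off described below applies at every link in the chain.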
The Flume system follows a transactional approach to data transfer. The sink requests an event from the channel, the channel hands the event to the sink, and the event is removed from the channel only after the channel receives confirmation from the sink. This makes the transfer reliable and prevents data loss. The same protocol is followed between sink and source in a multi-hop system. Flume supports various data formats and mechanisms that can be passed through its agents, including Avro, Thrift, syslog, and netcat.
Apache Flume Setup
To start a Flume agent we need to set the configuration of its components. The configuration consists of the component name, type, data format, source address, and other properties. Since data transfer takes place as a flow, the component names are used as references to define which components are connected.
The basic format to set a certain property of a component is:
<agent_name>.<component_type>.<component_name>.<property> = value
agent1.sources.reader1.bind = localhost
Here the agent name is set to ‘agent1’, and the source component ‘reader1’ is configured to read data from localhost. Let’s look at the code to develop a simple Flume agent. First we define the names of the various components in the agent; as mentioned, a simple agent contains one source, one channel, and one sink.
agent.sources = input1
agent.sinks = output1
agent.channels = memory1
The source component ‘input1’ is configured to listen on the localhost address and port 44444. The source type is defined as netcat, a very common format in which each line of input is taken as one event.
agent.sources.input1.type = netcat
agent.sources.input1.bind = localhost
agent.sources.input1.port = 44444
The sink component ‘output1’ is configured to use the logger type, which writes the output events to the log at INFO level. This is mainly used in development and debugging.
agent.sinks.output1.type = logger
The channel component ‘memory1’ is defined to use memory-based storage, which as we know has high throughput. The ‘capacity’ property defines the number of events that can be stored in the channel, and the ‘transactionCapacity’ property defines the number of events passed in each transaction.
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 1000
agent.channels.memory1.transactionCapacity = 100
In a Flume system, data transfer follows a specific path from one component to another, so it is essential to define that path, i.e. which component is attached to which. Here we define ‘memory1’ as the channel for the source ‘input1’ and, in the same way, ‘memory1’ as the channel for the sink ‘output1’.
agent.sources.input1.channels = memory1
agent.sinks.output1.channel = memory1
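Putting the fragments above together, the complete configuration file (saved, for example, as conf/example.conf — the filename is an assumption) reads:

```properties
# One source, one memory channel, one logger sink
agent.sources = input1
agent.sinks = output1
agent.channels = memory1

# netcat source: each line received on localhost:44444 becomes an event
agent.sources.input1.type = netcat
agent.sources.input1.bind = localhost
agent.sources.input1.port = 44444

# logger sink: events are written to the log at INFO level
agent.sinks.output1.type = logger

# memory channel: fast, volatile buffering between source and sink
agent.channels.memory1.type = memory
agent.channels.memory1.capacity = 1000
agent.channels.memory1.transactionCapacity = 100

# wire the flow: source -> memory1 -> sink
agent.sources.input1.channels = memory1
agent.sinks.output1.channel = memory1
```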
To start the Flume agent, the following command is used:
$ bin/flume-ng agent -c <conf-dir> -f <conf-file> -n <agent-name>
For example:
$ bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n agent
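With the agent running, the netcat source can be exercised from another terminal. This is a sketch that assumes Flume is up and that nc (netcat) is installed; each line sent becomes one event, which the logger sink prints at INFO level:

```
$ echo "hello flume" | nc localhost 44444
```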
A Flume agent with multiple components can be started in the same way, by stating the properties of each component and defining the path. Flume also has additional features to alter and modify events. Interceptors are one such feature: they modify or drop events based on given criteria, and they are applied while the data is being transferred from the source to the channel. As discussed, we can design a Flume agent with multiple sinks, each with its own channel. We then have two options: send every event to all the channels, or send events only to specific channels. This is defined using a channel selector: you can either replicate all the data across all channels or divide it based on header values. A custom channel selector can also be written based on the requirement.
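Interceptors and channel selectors are also declared in the configuration. The snippet below is a sketch reusing the ‘input1’ source from above; the header name and channel names are illustrative assumptions:

```properties
# Timestamp interceptor: stamps each event's headers with its receive time
agent.sources.input1.interceptors = ts1
agent.sources.input1.interceptors.ts1.type = timestamp

# Multiplexing channel selector: route events by the value of a header
agent.sources.input1.selector.type = multiplexing
agent.sources.input1.selector.header = logType
agent.sources.input1.selector.mapping.error = channelA
agent.sources.input1.selector.default = channelB
```

Setting `selector.type = replicating` instead (the default) would copy every event to all of the source's channels.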
We can also design an agent with multiple sinks grouped together to achieve better throughput and reliability. Such a group of sinks is treated as a single entity and is used for failover and load balancing, thus increasing the reliability of the system.
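Sink grouping is configured through a sink group with a processor. The sketch below, with assumed sink names, sets up failover: events go to the higher-priority sink until it fails, then fall back to the backup:

```properties
# Group two sinks and fail over from 'primary' to 'backup'
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = primary backup
agent.sinkgroups.g1.processor.type = failover
agent.sinkgroups.g1.processor.priority.primary = 10
agent.sinkgroups.g1.processor.priority.backup = 5
```

Using `processor.type = load_balance` instead would spread events across the sinks rather than treating one as a standby.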