Flume Hadoop Tutorial: Website Log Aggregation

Flume Case Study: Website Log Aggregation

Problem Statement

This case study uses a multi-hop Flume agent topology to aggregate logs from several web servers so that they can be analyzed with Hadoop. Consider a scenario in which multiple servers, located in different data centers, serve the website. The objective is to distribute the logs based on the data center they originate from and to store a backup of all logs. For example, logs from a US server have to be routed according to the data center in which the server is located, and a copy of every log has to be stored in the master database.

In this case, every server's Flume agent has a single source, two channels, and two sinks. One sink sends the data to the Flume agent for the main database, and the other sends it to the Flume agent that divides the data based on the data center recorded in each event's header.

Proposed Solution

Before the logs can be aggregated, each component in the architecture has to be configured. The Flume agent on each web server needs two sinks, as discussed, and two channels so that the same data can be delivered to both destinations. Begin by naming the components that will be used:

server_agent.sources = apache_server
server_agent.channels = storage1 storage2
server_agent.sinks = sink1 sink2
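
The listings in this case study name the channels but do not show their definitions or how the source is attached to them. A minimal sketch, assuming in-memory channels (the channel type and the capacity figures are illustrative assumptions, not part of the original setup):

```properties
# Assumption: memory channels; a file channel would survive agent restarts
server_agent.channels.storage1.type = memory
server_agent.channels.storage1.capacity = 10000
server_agent.channels.storage1.transactionCapacity = 100
server_agent.channels.storage2.type = memory
server_agent.channels.storage2.capacity = 10000
server_agent.channels.storage2.transactionCapacity = 100

# Attach the source to both channels so every event can reach both sinks
server_agent.sources.apache_server.channels = storage1 storage2
```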

The source is configured to run a shell command that reads the log data. The server is assumed to be Apache2 with the log location left at its default; change the path if your setup differs. A static interceptor then adds a header to each event identifying the data center from which it originated:

server_agent.sources.apache_server.type = exec
server_agent.sources.apache_server.command = tail -f /var/log/apache2/access.log
server_agent.sources.apache_server.batchSize = 1
server_agent.sources.apache_server.interceptors = int1
server_agent.sources.apache_server.interceptors.int1.type = static
server_agent.sources.apache_server.interceptors.int1.key = datacenter
server_agent.sources.apache_server.interceptors.int1.value = US
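
On a server in another data center, only the static interceptor value would differ. For instance, assuming a server whose events should carry the ASIA label that appears later in the multiplexing mapping:

```properties
# Same interceptor as above; only the value changes per data center
server_agent.sources.apache_server.interceptors.int1.type = static
server_agent.sources.apache_server.interceptors.int1.key = datacenter
server_agent.sources.apache_server.interceptors.int1.value = ASIA
```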

Each sink is an Avro sink. sink1 takes events from the channel ‘storage1’ and is connected to the source of the master database Flume agent, which must be an Avro source. sink2 is configured the same way with a different channel, IP address, and port number. The source's channel selector is set to replicate every event it receives to all of the channels attached to the source. Two separate channels are used because once an event has been successfully delivered to one sink, it is removed from that channel and cannot be sent to the other sink.

server_agent.sinks.sink1.type = avro
server_agent.sinks.sink1.channel = storage1
server_agent.sinks.sink1.hostname = 
server_agent.sinks.sink1.port = 
server_agent.sinks.sink2.type = avro
server_agent.sinks.sink2.channel = storage2
server_agent.sinks.sink2.hostname = 
server_agent.sinks.sink2.port = 
server_agent.sources.apache_server.selector.type = replicating
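
On the receiving side, each Avro sink needs a matching Avro source. A minimal sketch of the source on the master database agent follows; the agent name master_agent, the source name avro_in, the channel name ch1, and the port number are all hypothetical placeholders:

```properties
# Hypothetical names; the port must match the port set on the sending sink
master_agent.sources = avro_in
master_agent.channels = ch1
master_agent.sources.avro_in.type = avro
master_agent.sources.avro_in.bind = 0.0.0.0
master_agent.sources.avro_in.port = 4545
master_agent.sources.avro_in.channels = ch1
master_agent.channels.ch1.type = memory
```

This is only a fragment: the sink for this agent is left out because it depends on where the master copy is stored.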

Every server's Flume agent uses the same configuration; only the datacenter value changes. sink1 sends the data to be stored in the master database, while sink2 sends it to the agent that divides the data and stores it in different databases. The configuration of that dividing agent is similar to the server agent's, with the addition of multiplexing, because different events have to be sent to different channels based on a header value. Select the header ‘datacenter’ and divide the data between channels c1 and c2 based on its value:

database_agent.sources.r1.selector.type = multiplexing
database_agent.sources.r1.selector.header = datacenter
database_agent.sources.r1.selector.mapping.ASIA = c1
database_agent.sources.r1.selector.mapping.US = c2
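
The selector lines above presuppose that the source r1 and the channels c1 and c2 are already declared. A sketch of the surrounding declarations, assuming r1 is the Avro source that sink2 connects to; the port number, the memory channels, and the default mapping are illustrative assumptions:

```properties
database_agent.sources = r1
database_agent.channels = c1 c2

# Avro source receiving events from sink2 on the web servers (placeholder port)
database_agent.sources.r1.type = avro
database_agent.sources.r1.bind = 0.0.0.0
database_agent.sources.r1.port = 4546
database_agent.sources.r1.channels = c1 c2

# Assumption: events whose datacenter header matches no mapping go to c1
database_agent.sources.r1.selector.default = c1

database_agent.channels.c1.type = memory
database_agent.channels.c2.type = memory
```

Each channel would then be drained by its own sink, writing to the store for that data center.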

Executing Solution

Start individual agents on each server using the following command:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

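As log entries are written to the Apache log file, they are forwarded to the downstream agents as required: the master database agent receives a copy of every event, and the second agent routes events by data center. Because an event is removed from a channel only after its sink has delivered it successfully, the transfer between agents is reliable.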