Flume Case Study: Website Log Aggregation
This case study focuses on a multi-hop Flume topology that aggregates log reports from multiple web servers so they can be analyzed with Hadoop. Consider a scenario in which we have servers running in several locations, serving from different data centers. The objective is twofold: route the log files based on the data center they originated from, and keep a backup of all logs. For example, logs from a US server must be forwarded according to the data center where that server is located, while a copy of every log is also stored in the master database.
In this design, each web server's Flume agent has a single source, two channels, and two sinks: one sink sends the data to the master database Flume agent, and the other sends it to the Flume agent that divides the data based on the data-center header attached to each event.
Before aggregating the logs, each component in the architecture must be configured. For the web server Flume agent, as discussed, two sinks are needed, along with two channels that deliver the same data to both destinations. Begin by defining the names of the components that will be used:
server_agent.sources = apache_server
server_agent.channels = storage1 storage2
server_agent.sinks = sink1 sink2
The source is configured to execute a shell command that retrieves the data. It is assumed the web server is Apache2 and the default log location has not been changed; adjust the path as required. A static interceptor then adds a header to each event identifying the data center from which it originated:
server_agent.sources.apache_server.type = exec
server_agent.sources.apache_server.command = tail -f /var/log/apache2/access.log
server_agent.sources.apache_server.batchSize = 1
server_agent.sources.apache_server.interceptors = int1
server_agent.sources.apache_server.interceptors.int1.type = static
server_agent.sources.apache_server.interceptors.int1.key = datacenter
server_agent.sources.apache_server.interceptors.int1.value = US
Each sink is an Avro sink. sink1 retrieves events from the channel storage1 and connects to the Avro source of the master database Flume agent. The source's selector is set to replicating, so every event received by the source is copied to all of the agent's channels. sink2 is configured the same way, with a different channel, hostname, and port. Two separate channels are required because once an event is successfully delivered to one sink it is removed from its channel and can no longer be sent to the other sink.
server_agent.sinks.sink1.type = avro
server_agent.sinks.sink1.channel = storage1
server_agent.sinks.sink1.hostname =
server_agent.sinks.sink1.port =
server_agent.sinks.sink2.type = avro
server_agent.sinks.sink2.channel = storage2
server_agent.sinks.sink2.hostname =
server_agent.sinks.sink2.port =
server_agent.sources.apache_server.selector.type = replicating
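The configuration above names the channels but does not define their types or attach them to the source. A minimal sketch of the missing pieces, assuming in-memory channels (a file channel could be substituted for stronger durability guarantees), might look like:

# Channel definitions (memory channels are an assumption; capacity is an example value)
server_agent.channels.storage1.type = memory
server_agent.channels.storage1.capacity = 10000
server_agent.channels.storage2.type = memory
server_agent.channels.storage2.capacity = 10000
# Attach the source to both channels so the replicating selector can fan out
server_agent.sources.apache_server.channels = storage1 storage2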
The same configuration is used for all the server Flume agents, with only the datacenter value varying. sink1 sends data to be stored in the master database, while sink2 sends data to the agent that divides it and stores it in different databases. The configuration of that dividing agent is similar to the server agent's, with the additional feature of multiplexing, since different events must be routed to different channels based on a header value. Select the header datacenter and divide the events between channels c1 and c2 based on its value:
database_agent.sources.r1.selector.type = multiplexing
database_agent.sources.r1.selector.header = datacenter
database_agent.sources.r1.selector.mapping.ASIA = c1
database_agent.sources.r1.selector.mapping.US = c2
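The selector lines above assume an Avro source named r1 and channels c1 and c2, which are not defined elsewhere in the snippet. A sketch of those missing definitions, with the bind address and port as illustrative assumptions (the port must match the one the server agents' sinks point to), might be:

database_agent.sources = r1
database_agent.channels = c1 c2
# Avro source receiving events from the server agents' Avro sinks
database_agent.sources.r1.type = avro
database_agent.sources.r1.bind = 0.0.0.0
database_agent.sources.r1.port = 4141
database_agent.sources.r1.channels = c1 c2
# Channel types (memory channels are an assumption)
database_agent.channels.c1.type = memory
database_agent.channels.c2.type = memory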
Start an individual agent on each server using the following command, substituting the agent name defined in its configuration file:
$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template
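For example, assuming the web server agent's configuration is saved as conf/server_agent.properties (the filename is an assumption; use whatever file holds the properties shown above), the agent would be started as:

$ bin/flume-ng agent -n server_agent -c conf -f conf/server_agent.properties

The -n value must exactly match the agent name used as the prefix of the properties (server_agent or database_agent); otherwise Flume starts with no components.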
As log reports are generated in the Apache access log, they flow reliably to the required destinations: a complete backup accumulates in the master database, and per-data-center copies are stored in the individual databases.