Web Server Log Processing using Hadoop


In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Mohamed Yusef Ahmed

Software Developer at Taske

Recently I became interested in Hadoop as I think it's a great platform for storing and analyzing large structured and unstructured data sets. The experts did a great job not only explaining the...

Mike Vogt

Information Architect at Bank of America

I have had a very positive experience. The platform is very rich in resources, and the expert was thoroughly knowledgeable on the subject matter - real world hands-on experience. I wish I had this...

What will you learn

Understanding the problem statement
What log files are and the different types of log files
How to process log files and the importance of processing them
What referrers and user agents are
What the contents of a log file are and the uses of a log file
Why Flume, how Flume works, and the Flume agent and its role
Processing and ingestion of log data using Flume
Processing data with MapReduce, and using Spark for data processing
Downloading the dataset and installing Scala on the Cloudera QuickStart VM
What a DoS attack is, how to perform one, and how to detect and prevent it
Using Apache Kafka for complex event processing
What Oozie is, and using it to coordinate tasks and understand data flow
What the Lambda Architecture is and its use in batch and streaming processing
Dividing data into the Batch Layer, Speed Layer, and Serving Layer
Implementing and Troubleshooting Flume Agent
Using accumulators and executing Hive tables
Using Impala for the low-latency query of processed log data
Coordinating the data processing pipeline with Oozie
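Several of the points above come down to understanding the structure of a web server log line (host, timestamp, request, status, referrer, user agent). A minimal parsing sketch in Scala, using the Combined Log Format that most web servers emit; the sample line and field names here are illustrative assumptions, not taken from the course dataset:

```scala
// Sample Combined Log Format line (illustrative, not from the course dataset):
// host ident authuser [timestamp] "request" status bytes "referrer" "user agent"
val logLine = "127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] \"GET /index.html HTTP/1.1\" 200 2326 \"http://example.com/start\" \"Mozilla/5.0\""

// One capture group per field of the Combined Log Format.
val logPattern =
  """^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

// Extract the fields we care about; None signals an unparseable line.
val parsed: Option[(String, String, Int, String, String)] = logLine match {
  case logPattern(host, _, _, timestamp, request, status, bytes, referrer, agent) =>
    Some((host, request, status.toInt, referrer, agent))
  case _ => None
}
```

In the full pipeline this per-line parse would run inside a Spark transformation over the ingested log files, but the regex itself is the same.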

Project Description

Storing, processing and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped in ads targeting, improved search, application optimization and a general improvement in the application user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka, Oozie, and HDFS for storage; we will look at the pipeline both holistically and at each stage individually.
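The ingestion stage of such a pipeline is typically a Flume agent wired as source → channel → sink. A sketch of one possible agent configuration, tailing the application log into HDFS; the agent name, file paths, and capacities below are placeholder assumptions, not the course's actual configuration:

```properties
# Hypothetical Flume agent: tail the app log, buffer in memory, land in HDFS.
agent1.sources = applog
agent1.channels = mem
agent1.sinks = hdfsSink

# Source: follow the application server's access log (path is an assumption)
agent1.sources.applog.type = exec
agent1.sources.applog.command = tail -F /var/log/app/access.log
agent1.sources.applog.channels = mem

# Channel: in-memory buffer between source and sink
agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 10000

# Sink: write raw events into HDFS for downstream Spark/Hive processing
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.channel = mem
agent1.sinks.hdfsSink.hdfs.path = /user/flume/weblogs
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
```

An agent like this would be started with `flume-ng agent --name agent1 --conf-file <file>`, after which the processing stages pick the data up from HDFS.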


  1. It is expected that students have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera QuickStart VM is essential to get the best from this class. Instructions on how to set up a Scala SDK and runtime can be found here.


Similar Projects

Explore Hive usage efficiently in this Hadoop Hive project using various file formats such as JSON, CSV, ORC, and Avro, and compare their relative performance.

Spark Project - Real-time monitoring of taxis in a city. The real-time data stream is simulated using Flume, and ingestion is done using Spark Streaming.

This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Curriculum For This Mini Project

What are log files and types of log files
Contents of a log file
Uses of log files
Process log file using Flume
Ingest log data using Flume
Using Spark to process data
Downloads and Installations
DoS Attacks and log files
Using Apache Kafka for complex event processing
Using Oozie to coordinate tasks
Log file use-case
Clone GitHub repository and summary overview
Lambda Architecture for Data Infrastructure
Solution Architecture overview
Implement Flume Agent
Troubleshooting Flume
Spark Scala execution
Accumulators and executing Hive tables
Impala execution
Coordination tasks using Oozie
Hue Workflow
Running Oozie on command line
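The DoS-attack item in the curriculum boils down to an aggregation over the parsed log: count requests per client IP and flag IPs that exceed a threshold. In the project this would run as a Spark job; the sketch below uses plain Scala collections so the logic stands alone, and the sample IPs and threshold are made-up assumptions:

```scala
// Hypothetical request log reduced to client IPs (one entry per request).
val requestIps = Seq(
  "10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.1",
  "10.0.0.3", "10.0.0.1", "10.0.0.2", "10.0.0.1")

// Assumed cutoff; a real pipeline would use a rate per time window.
val threshold = 4

// Count requests per IP (the equivalent of a map + reduceByKey in Spark).
val hitsPerIp: Map[String, Int] =
  requestIps.groupBy(identity).map { case (ip, hits) => ip -> hits.size }

// Flag IPs at or above the threshold as potential DoS sources.
val suspects: Set[String] =
  hitsPerIp.collect { case (ip, n) if n >= threshold => ip }.toSet
```

With Spark the same shape would be `logs.map(ip => (ip, 1)).reduceByKey(_ + _).filter(_._2 >= threshold)`, and the flagged IPs could then be queried interactively through the Hive/Impala layer.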