Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the problem statement
What are log files and the different types of log files
How to process log files and the importance of processing them
What are a referrer and a user agent
What are the contents of a log file and the uses of a log file
Why flume and how does flume work, flume agent and its role
Processing and ingestion of log data using Flume
Processing data with MapReduce, and using Spark for data processing
Downloading the dataset and installing Scala on the Cloudera Quickstart VM
What is a DoS attack, and how to perform and prevent one
Using Apache Kafka for complex event processing
What is Oozie, and using it to coordinate tasks and understand data flow
What is the Lambda Architecture and its use in batch and streaming processing
Dividing data into the Batch Layer, Speed Layer, and Serving Layer
Implementing and Troubleshooting Flume Agent
Using accumulators and executing Hive tables
Using Impala for the low-latency query of processed log data
Coordinating the data processing pipeline with Oozie
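To make the Flume topics above concrete, here is a minimal sketch of a Flume agent that tails an application log into HDFS. The agent name, file paths, and channel sizing are illustrative assumptions, not the exact configuration used in the course:

```properties
# Agent "a1": exec source (tail the app log) -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow the application log file as new lines are appended
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into date-partitioned HDFS directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///data/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

A common troubleshooting step covered later in the course is checking that the channel's capacity is large enough for the source's event rate; an undersized channel causes the source to fail with ChannelException under load.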

Project Description

Storing, processing and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering and the accompanying science of machine learning to extract information that has helped with ad targeting, improved search, application optimization, and general improvements in the application's user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools like Flume, Pig, Spark, Hive/Impala, Kafka, Oozie, and HDFS for storage; we will look at the pipeline both holistically and at each stage specifically.
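Before any of these tools come into play, it helps to see what one server log line actually contains. As a hedged sketch (assuming the widely used Apache/NGINX "combined" log format; the sample log in the project may differ slightly), plain Python can pull out the fields the course discusses, including the referrer and user agent:

```python
import re

# Regex for the Apache/NGINX "combined" log format (an assumption --
# the project's sample application log may use a different layout).
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields for one log line, or None on no match."""
    m = COMBINED_LOG.match(line)
    return m.groupdict() if m else None

sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start.html" "Mozilla/5.0"')

rec = parse_line(sample)
print(rec["ip"], rec["status"], rec["referrer"], rec["user_agent"])
```

The same parsing logic, applied per line, is what a Spark job in the pipeline would run at scale over the ingested log data.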


  1. It is expected that students have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera Quickstart VM is essential to get the best from this class. Instructions on how to set up a Scala SDK and runtime can be found here.


Similar Projects

In this hive project, you will design a data warehouse for e-commerce environments.

In this Spark project, we discuss real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume, and ingestion will be done using Spark Streaming.

In this big data spark project, we will do Twitter sentiment analysis using spark streaming on the incoming streaming data.

Curriculum For This Mini Project

What are log files and the types of log files
Contents of a log file
Uses of log files
Process log file using Flume
Ingest log data using Flume
Using Spark to process data
Downloads and Installations
DoS Attacks and log files
Using Apache Kafka for complex event processing
Using Oozie to coordinate tasks
Log file use-case
Clone github repository and summary overview
Lambda Architecture for Data Infrastructure
Solution Architecture overview
Implement Flume Agent
Troubleshooting Flume
Spark Scala Execution
Accumulators and executing Hive tables
Impala execution
Coordination tasks using Oozie
Hue Workflow
Running Oozie on command line
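For the DoS lesson in the curriculum above, the simplest log-based signal is an IP address issuing far more requests than its peers within a log slice. A minimal sketch (the threshold and IP list below are illustrative assumptions, not values from the course dataset):

```python
from collections import Counter

def flag_possible_dos(ips, limit=3):
    """Return IPs whose request count in this log slice exceeds `limit`.

    A crude flood heuristic: real pipelines would window by time and
    tune the threshold to normal traffic levels.
    """
    counts = Counter(ips)
    return sorted(ip for ip, n in counts.items() if n > limit)

# Hypothetical per-line client IPs extracted from a log slice:
requests = ["10.0.0.5"] * 6 + ["203.0.113.7", "198.51.100.2", "10.0.0.5"]
print(flag_possible_dos(requests))  # the IP hammering the server stands out
```

In the pipeline, this kind of aggregation would run in Spark over the Flume-ingested logs, with the flagged IPs queryable through Hive or Impala.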