Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Shailesh Kurdekar

Solutions Architect at Capital One

I have worked for more than 15 years in Java and J2EE and have recently developed an interest in Big Data technologies and Machine learning due to a big need at my workspace. I was referred here by a...

Mohamed Yusef Ahmed

Software Developer at Taske

Recently I became interested in Hadoop as I think it's a great platform for storing and analyzing large structured and unstructured data sets. The experts did a great job not only explaining the...

What will you learn

Understanding the problem statement
What log files are and the different types of log files
How to process log files and why processing them matters
What a referrer and a user agent are
What a log file contains and what log files are used for
Why Flume, how Flume works, and the role of a Flume agent
Processing and ingestion of log data using Flume
Processing data with MapReduce, and using Spark for data processing
Downloading the dataset and installing Scala on the Cloudera QuickStart VM
What a DoS attack is, how one is performed, and how to prevent it
Using Apache Kafka for complex event processing
What Oozie is, using it to coordinate tasks, and understanding the data flow
What the Lambda Architecture is and how it is used in batch and stream processing
Dividing data into the batch layer, speed layer, and serving layer
Implementing and Troubleshooting Flume Agent
Using an accumulator and executing a Hive table
Using Impala for the low-latency query of processed log data
Coordinating the data processing pipeline with Oozie
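
To make the "contents of a log file", "referrer" and "user agent" items above concrete, here is a minimal sketch of parsing one web server log line. It assumes the Apache/NCSA "combined" log format, which the project description does not confirm, and it uses plain Python rather than the Flume/Spark tooling covered in the project. The names `COMBINED_LOG`, `parse_line` and `SAMPLE` are illustrative only.

```python
import re
from typing import Optional

# Assumption: lines follow the Apache/NCSA "combined" log format.
COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = COMBINED_LOG.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Illustrative sample line in combined log format.
SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    record = parse_line(SAMPLE)
    print(record["referrer"])    # the page that linked to this request
    print(record["user_agent"])  # the client software making the request
```

The referrer is the page the request came from, and the user agent identifies the client (browser, crawler, script). Returning `None` for lines that do not match lets a caller count or skip malformed lines instead of failing.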

Project Description

Storing, processing and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering and the accompanying science of machine learning to extract information that has helped in ad targeting, improved search, application optimization and general improvement of the application's user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka and Oozie, with HDFS for storage. We will look at these tools both holistically and specifically at each stage of the pipeline.
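
As one illustration of the kind of insight such a pipeline can produce, the sketch below counts requests per client IP per minute and flags unusually busy clients, a simple signal for a possible DoS attack. It is plain Python rather than Spark, and the threshold, the timestamp format and the names `THRESHOLD`, `minute_bucket` and `suspicious_ips` are all assumptions made for this example, not part of the project's code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical threshold: more than this many requests from one IP
# within one minute is treated as suspicious.
THRESHOLD = 3


def minute_bucket(timestamp: str) -> str:
    """Truncate a timestamp like '10/Oct/2000:13:55:36 -0700' to the minute."""
    parsed = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z")
    return parsed.strftime("%Y-%m-%d %H:%M")


def suspicious_ips(events, threshold=THRESHOLD):
    """Count requests per (ip, minute) and keep those above the threshold.

    `events` is an iterable of (ip, timestamp) pairs.
    """
    counts = Counter((ip, minute_bucket(ts)) for ip, ts in events)
    return {key: n for key, n in counts.items() if n > threshold}


if __name__ == "__main__":
    events = [("10.0.0.1", "10/Oct/2000:13:55:%02d -0700" % s) for s in range(5)]
    events.append(("10.0.0.2", "10/Oct/2000:13:55:01 -0700"))
    print(suspicious_ips(events))
```

The same group-and-count shape maps naturally onto a distributed job, where the `(ip, minute)` pair would serve as the grouping key.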


  1. Students are expected to have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera QuickStart VM is essential to get the most from this class. Instructions on how to set up a Scala SDK and runtime can be found here.


Similar Projects

The goal of this IoT project is to build an argument for a generalized streaming architecture for reactive data ingestion based on a microservice architecture.

In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.

In this project, we will show how to build an ETL pipeline on streaming datasets using Kafka.

Curriculum For This Mini Project

What are log files and types of log files
Contents of a log file
Uses of log files
Process log file using Flume
Ingest log data using Flume
Using Spark to process data
Downloads and Installations
DoS Attacks and log files
Using Apache Kafka for complex event processing
Using Oozie to coordinate tasks
Log file use-case
Clone GitHub repository and summary overview
Lambda Architecture for Data Infrastructure
Solution Architecture overview
Implement Flume Agent
Troubleshooting Flume
Spark Scala Execution
Accumulator and execute Hive table
Impala execution
Coordinating tasks using Oozie
Hue Workflow
Running Oozie on command line