Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Understanding the problem statement

  • What log files are and the different types of log files

  • How to process log files and why processing them matters

  • What a referrer and a user agent are

  • What are the contents of a log file and uses of a log file

  • Why Flume, how Flume works, and the role of a Flume agent

  • Processing and ingestion of log data using Flume

  • Processing data with MapReduce, and using Spark for data processing

  • Downloading the dataset and installing Scala on the Cloudera QuickStart VM

  • What a DoS attack is, and how to perform and prevent one

  • Using Apache Kafka for complex event processing

  • What Oozie is, and using it to coordinate tasks and understand data flow

  • What the Lambda Architecture is and its use in batch and streaming processing

  • Dividing data into the Batch Layer, Speed Layer, and Serving Layer

  • Implementing and Troubleshooting Flume Agent

  • Using accumulators and executing Hive tables

  • Using Impala for the low-latency query of processed log data

  • Coordinating the data processing pipeline with Oozie
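Several of the bullets above (log file contents, referrers, user agents) come down to parsing individual log lines. As a minimal sketch, the function below parses one line in Apache's standard "combined" log format; the field names and the sample format are illustrative assumptions, not taken from the project's own dataset.

```python
import re
from typing import Optional

# Hypothetical parser for one line in Apache "combined" log format:
#   %h %l %u [%t] "%r" %>s %b "%{Referer}i" "%{User-agent}i"
# This is a generic sketch, not the project's actual parsing code.
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"'
)

def parse_line(line: str) -> Optional[dict]:
    """Return the parsed fields of a combined-format log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ip, ts, request, status, size, referrer, user_agent = m.groups()
    return {
        "ip": ip,
        "timestamp": ts,
        "request": request,
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),
        "referrer": referrer,      # the page that linked to this request
        "user_agent": user_agent,  # the client software identification string
    }
```

The referrer and user agent are simply the last two quoted fields of each line, which is why a single regular expression is enough to pull them out.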

Project Description

Storing, processing, and mining data from web server logs has become mainstream for many companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped with ad targeting, improved search, application optimization, and general improvements in the application user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka, and Oozie, with HDFS for storage, and we will look at the pipeline both holistically and at each stage individually.
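Once lines are ingested and parsed, the core of the pipeline is aggregation. As a sketch of that step, the functions below count requests per client IP (a deliberately naive stand-in for the DoS-detection step) and tally HTTP status codes, assuming the logs have already been reduced to (ip, status_code) pairs. In the full project this aggregation would run as a Spark job over data in HDFS; plain Python is used here only to illustrate the logic.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

def suspicious_ips(requests: Iterable[Tuple[str, int]], threshold: int) -> Set[str]:
    """IPs issuing more requests than `threshold` -- a naive heuristic
    standing in for the project's DoS-detection step."""
    hits = Counter(ip for ip, _ in requests)
    return {ip for ip, n in hits.items() if n > threshold}

def status_counts(requests: Iterable[Tuple[str, int]]) -> Counter:
    """Distribution of HTTP status codes, e.g. to spot spikes in 4xx/5xx errors."""
    return Counter(code for _, code in requests)
```

In Spark the same logic would be a `groupBy`/count over the parsed records; the shape of the computation is identical, only the execution is distributed.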


  1. It is expected that students have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera QuickStart VM is essential to get the best from this class. Instructions on how to set up a Scala SDK and runtime can be found here.


Similar Projects

Big Data Project Real-time Auto Tracking with Spark-Redis
This Spark project discusses real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume, and the ingestion will be done using Spark Streaming.
Big Data Project Work with Streaming Data using Twitter API to Build a JobPortal
In this Spark Streaming project, we are going to build the backend of an IT job-ad website by streaming data from Twitter for analysis in Spark.
Big Data Project Spark Project -Real-time data collection and Spark Streaming Aggregation
In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.
Big Data Project Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Curriculum For This Mini Project

  What are log files and types of log files
  Contents of a log file
  Uses of log files
  Process log file using Flume
  Ingest log data using Flume
  Using Spark to process data
  Downloads and Installations
  DoS Attacks and log files
  Using Apache Kafka for complex event processing
  Using Oozie to coordinate tasks
  Log file use-case
  Clone github repository and summary overview
  Lambda Architecture for Data Infrastructure
  Solution Architecture overview
  Implement Flume Agent
  Troubleshooting Flume
  Spark Scale Execution
  Accumulators and executing Hive tables
  Impala execution
  Coordination tasks using Oozie
  Hue Workflow
  Running Oozie on command line
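The curriculum's accumulator step can be sketched as counting malformed lines while parsing, without a second pass over the data. In Spark this is what a `LongAccumulator` is for; the plain counter below shows the same idea on one machine. The "ip status" line format is an illustrative assumption, not the project's real log layout.

```python
from typing import List, Sequence, Tuple

def parse_all(lines: Sequence[str]) -> Tuple[List[int], int]:
    """Return (status_codes, malformed_count) in a single pass.

    Mimics the role of a Spark accumulator: bad records are counted
    as a side tally rather than propagated downstream."""
    codes: List[int] = []
    malformed = 0
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            codes.append(int(parts[1]))  # well-formed: keep the status code
        else:
            malformed += 1               # malformed: counted, then skipped
    return codes, malformed
```

The malformed count is exactly the kind of data-quality metric that would later be surfaced in a Hive table or an Impala query alongside the main aggregates.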