Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample log file from an application server to demonstrate a scaled-down server log processing pipeline.
  • Videos: each project comes with 2-5 hours of micro-videos explaining the solution.
  • Code & Dataset: get access to 50+ solved projects with iPython notebooks and datasets.
  • Project Experience: add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • The benefits of log mining in certain industries
  • A full log-mining application use case
  • Using Flume to ingest log data
  • Using Spark to process data (see the sketch after this list)
  • Integrating Kafka for complex event alerting
  • Using Impala for low-latency queries of processed log data
  • Coordinating the data processing pipeline with Oozie
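
To make the Spark item above a little more concrete, here is a minimal Scala sketch, not the course solution, that counts requests per HTTP status code from log data landed in HDFS. The input path, the object name and the assumption that the logs follow the Apache common log format are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: count requests per HTTP status code in an access log on HDFS.
    object StatusCodeCounts {
      // Apache common log format, e.g.
      // 127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
      private val LogLine =
        """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+).*""".r

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("status-code-counts"))
        val counts = sc.textFile("hdfs:///user/cloudera/weblogs/")   // assumed ingestion path
          .flatMap {
            case LogLine(ip, ts, method, uri, status, bytes) => Seq((status, 1L))
            case _                                            => Nil // skip malformed lines
          }
          .reduceByKey(_ + _)

        counts.collect().sortBy(_._1).foreach { case (status, n) => println(s"$status\t$n") }
        sc.stop()
      }
    }

On the Cloudera QuickStart VM a job like this would typically be packaged (for example with sbt) and run through spark-submit; writing the aggregated output back to HDFS is what later makes it queryable from Hive/Impala.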

Project Description

Storing, processing and mining data from web server logs has become mainstream for many companies today. Industry giants have used this engineering, together with the accompanying science of machine learning, to extract information that has helped with ad targeting, improved search, application optimization and a general improvement in the user experience of their applications.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Getting from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka and Oozie, with HDFS for storage; that is what we will build here, looking at the pipeline both holistically and at each stage in turn.
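
One way to picture the alerting leg of that pipeline is a small Kafka producer that publishes a message to an alerts topic when the number of 5xx responses in a processing window crosses a threshold. This is only a sketch: the broker address, the "log-alerts" topic, the threshold and the hard-coded count are placeholders, and in the actual project the count would come from the Spark stage.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Sketch of the complex-event-alerting step: raise a Kafka alert when too many
    // 5xx responses show up in one processing window. All names and values are placeholders.
    object ErrorRateAlert {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "quickstart.cloudera:9092")   // assumed broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        val serverErrorsInWindow = 120   // in practice, computed by the Spark stage
        val threshold = 100

        if (serverErrorsInWindow > threshold) {
          val alert = s"""{"alert":"high_5xx_rate","count":$serverErrorsInWindow}"""
          producer.send(new ProducerRecord[String, String]("log-alerts", alert))
        }
        producer.flush()
        producer.close()
      }
    }

A downstream consumer (or a dashboard) subscribed to the alerts topic can then react to these events with low latency, independently of the batch reporting path.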

Prerequisites:

  1. It is expected that students have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera QuickStart VM is essential to get the best out of this class. Instructions on how to set up a Scala SDK and runtime can be found here.

 

Curriculum For This Mini Project

 
  • What are log files and types of log files (08m)
  • Contents of a log file (09m)
  • Uses of log files (19m)
  • Process log file using Flume (10m)
  • Ingest log data using Flume (07m)
  • Using Spark to process data (07m)
  • Downloads and Installations (02m)
  • DoS Attacks and log files (07m)
  • Using Apache Kafka for complex event processing (06m)
  • Using Oozie to coordinate tasks (16m)
  • Log file use case (21m)
  • Clone GitHub repository and summary overview (06m)
  • Lambda Architecture for Data Infrastructure (05m)
  • Solution Architecture overview (06m)
  • Implement Flume Agent (27m)
  • Troubleshooting Flume (29m)
  • Spark Scale Execution (20m)
  • Accumulator and execute Hive table (14m)
  • Impala execution (15m)
  • Coordinating tasks using Oozie (16m)
  • Hue Workflow (02m)
  • Running Oozie on the command line (05m)