Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the problem statement
What log files are and the different types of log files
How to process log files and the importance of processing them
What a referrer and a user agent are
The contents of a log file and the uses of a log file
Why Flume, how Flume works, and the role of the Flume agent
Processing and ingestion of log data using Flume (see the configuration sketch after this list)
Processing data with MapReduce, and using Spark for data processing
Downloading the dataset and installing Scala on the Cloudera QuickStart VM
What a DoS attack is, performing one, and detecting and preventing it
Using Apache Kafka for complex event processing
What Oozie is, and using it to coordinate tasks and understand the data flow
What the Lambda Architecture is and its use in batch and streaming processing
Dividing data into the Batch Layer, Speed Layer, and Serving Layer
Implementing and troubleshooting a Flume agent
Using accumulators and executing Hive tables
Using Impala for low-latency queries of processed log data
Coordinating the data processing pipeline with Oozie
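To make the ingestion step concrete, here is a minimal sketch of a Flume agent configuration that tails a web server access log into HDFS. The agent name, source command, log path, and HDFS directory are illustrative assumptions, not the project's actual configuration.

```
# Sketch: tail an access log into HDFS (names and paths are assumptions)
agent.sources = weblog
agent.channels = mem
agent.sinks = toHdfs

# Source: follow the access log as the web server appends to it
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

# Channel: in-memory buffer between source and sink
agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000
agent.channels.mem.transactionCapacity = 1000

# Sink: write plain-text events into an HDFS directory
agent.sinks.toHdfs.type = hdfs
agent.sinks.toHdfs.channel = mem
agent.sinks.toHdfs.hdfs.path = /user/cloudera/weblogs
agent.sinks.toHdfs.hdfs.fileType = DataStream
agent.sinks.toHdfs.hdfs.rollInterval = 300
```

Started with `flume-ng agent --name agent --conf-file weblog.conf`, an agent like this continuously moves new log lines into HDFS, where the rest of the pipeline can pick them up.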

Project Description

Storing, processing, and mining data from web server logs has become mainstream for many companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped in ad targeting, improved search, application optimization, and general improvement of the application's user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools such as Flume, Pig, Spark, Hive/Impala, Kafka, and Oozie, with HDFS for storage. We will look at this pipeline both holistically and at each stage specifically.
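Since the description is high-level, here is a hedged Scala sketch of what the Spark stage of such a pipeline can look like: parsing combined-format access log lines that Flume has landed in HDFS, then counting requests per client IP as a crude first signal of a DoS source. The object name, paths, regex, and the 1000-request threshold are illustrative assumptions, not the project's actual code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WeblogProcessor {
  // Apache Combined Log Format:
  // host ident user [timestamp] "request" status bytes "referrer" "user agent"
  val LogPattern =
    """^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WeblogProcessor"))

    // Directory the Flume agent writes into (assumed path)
    val lines = sc.textFile("hdfs:///user/cloudera/weblogs/")

    // Keep only lines matching the combined log format; extract the
    // client IP and HTTP status code, dropping malformed lines.
    val parsed = lines.flatMap {
      case LogPattern(ip, _, _, status, _, _, _) => Some((ip, status.toInt))
      case _                                     => None
    }

    // Requests per client IP; an abnormally high count is a crude DoS signal.
    val hitsPerIp = parsed.map { case (ip, _) => (ip, 1L) }.reduceByKey(_ + _)
    val suspects  = hitsPerIp.filter { case (_, n) => n > 1000L }

    suspects.saveAsTextFile("hdfs:///user/cloudera/weblog_suspects")
    sc.stop()
  }
}
```

Per the learning outcomes above, output like this then feeds a Hive table that can be queried at low latency with Impala.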

Prerequisite:

  1. Students are expected to have a fair knowledge of Big Data and Hadoop.
  2. Installing the Cloudera QuickStart VM is essential to get the best out of this class. Instructions on how to set up a Scala SDK and runtime can be found here.


Similar Projects

This is a continuation of the previous Hive project, "Tough engineering choices with large datasets in Hive Part - 1", in which we work on processing big datasets using Hive.

In this Spark project, we bring processing to the speed layer of the Lambda Architecture, which opens up capabilities to monitor application performance in real time, measure real-time user comfort with applications, and raise real-time alerts in case of security incidents.

In this Hive project, you will design a data warehouse for e-commerce environments.

Curriculum For This Mini Project

What are log files and types of log files (08m)
Contents of a log file (09m)
Uses of log files (19m)
Process log files using Flume (10m)
Ingest log data using Flume (07m)
Using Spark to process data (07m)
Downloads and installations (02m)
DoS attacks and log files (07m)
Using Apache Kafka for complex event processing (06m)
Using Oozie to coordinate tasks (16m)
Log file use case (21m)
Clone GitHub repository and summary overview (06m)
Lambda Architecture for data infrastructure (05m)
Solution architecture overview (06m)
Implement Flume agent (27m)
Troubleshooting Flume (29m)
Spark scale execution (20m)
Accumulators and executing Hive tables (14m)
Impala execution (15m)
Coordination tasks using Oozie (see the workflow sketch after this list) (16m)
Hue workflow (02m)
Running Oozie on the command line (05m)
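The curriculum closes by coordinating the pipeline with Oozie, both from the Hue workflow editor and from the command line. As an illustration, here is a minimal sketch of an Oozie workflow.xml with a single shell action that launches the processing job; the workflow name, script name, and parameters (${jobTracker}, ${nameNode}, ${appPath}) are placeholder assumptions, not the project's actual workflow.

```xml
<workflow-app name="weblog-pipeline" xmlns="uri:oozie:workflow:0.4">
    <start to="process-logs"/>

    <!-- Single action: run the processing job via a shell script -->
    <action name="process-logs">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_log_job.sh</exec>
            <file>${appPath}/run_log_job.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Log processing failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

A workflow like this can be submitted from the command line with `oozie job -config job.properties -run`, which is roughly the approach covered in the final curriculum video.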