Web Server Log Processing using Hadoop

In this Hadoop project, you will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Understanding the problem statement

  • What log files are, and the different types of log files

  • How to process log files, and the importance of processing them

  • What a referrer and a user agent are

  • The contents of a log file and its uses

  • Why Flume, how Flume works, and the role of a Flume agent

  • Processing and ingestion of log data using Flume

  • Processing data with MapReduce, and using Spark for data processing

  • Downloading the dataset and installing Scala on the Cloudera QuickStart VM

  • What a DoS attack is, how one is performed, and how to prevent it

  • Using Apache Kafka for complex event processing

  • What Oozie is, and using it to coordinate tasks and understand data flow

  • What the Lambda Architecture is, and its use in batch and streaming processing

  • Dividing data into the Batch Layer, Speed Layer, and Serving Layer

  • Implementing and Troubleshooting Flume Agent

  • Using accumulators and executing Hive tables

  • Using Impala for low-latency queries of processed log data

  • Coordinating the data processing pipeline with Oozie
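
Several of the topics above concern the anatomy of a log line (referrer, user agent, status code). As an illustration only (this is not the course's code), here is a minimal Python sketch that parses one line in the widely used Apache/Nginx "combined" log format; the sample line and field names are hypothetical:

```python
import re

# Fields of the "combined" log format: client IP, identity, user,
# timestamp, request line, status code, response size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one combined-format log line, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

# A made-up sample line for demonstration.
sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start" "Mozilla/5.0"')

fields = parse_line(sample)
print(fields["ip"], fields["status"], fields["referrer"], fields["agent"])
```

In the project itself, this kind of field extraction happens at scale inside the MapReduce/Spark stage rather than line-by-line in a single script.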

Project Description

Storing, processing and mining data from web server logs has become mainstream for a lot of companies today. Industry giants have used this engineering, and the accompanying science of machine learning, to extract information that has helped in ads targeting, improved search, application optimization and general improvements in the application's user experience.
In this Hadoop project, we will use a sample application log file from an application server to demonstrate a scaled-down server log processing pipeline. Going from ingestion to insight usually requires Hadoop-ecosystem tools like Flume, Pig, Spark, Hive/Impala, Kafka, Oozie, and HDFS for storage; we will look at this pipeline both holistically and specifically at each stage.
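
As a toy illustration of the per-IP counting that such a pipeline performs (and that underpins the DoS-detection discussion), here is a minimal Python sketch — not the project's Spark/Scala code — where the sample lines and the threshold are made up:

```python
from collections import Counter

# Toy stand-in for the MapReduce/Spark stage: count requests per client IP
# and flag IPs whose request volume exceeds a threshold (a naive DoS signal).
lines = [
    '198.51.100.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    '198.51.100.1 - - [10/Oct/2023:13:55:37 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "GET /a HTTP/1.1" 200 128 "-" "Mozilla/5.0"',
]

# "Map": emit the client IP per line; "Reduce": sum the counts per IP.
hits = Counter(line.split()[0] for line in lines)

THRESHOLD = 2  # hypothetical cutoff chosen for this tiny dataset
suspects = [ip for ip, n in hits.items() if n >= THRESHOLD]
print(hits, suspects)
```

In the real pipeline the same map-then-reduce shape runs distributed over HDFS data, with Flume handling ingestion and Oozie coordinating the stages.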

Prerequisite:

  1. It is expected that students have a fair knowledge of Big Data and Hadoop.
  2. Installation of the Cloudera QuickStart VM is essential to get the best from this class. Instructions on how to set up a Scala SDK and runtime can be found here.

 

Similar Projects

Big Data Project Real-time Auto Tracking with Spark-Redis
This Spark project discusses real-time monitoring of taxis in a city. The real-time data streaming will be simulated using Flume, and ingestion will be done using Spark Streaming.
Big Data Project Work with Streaming Data using Twitter API to Build a JobPortal
In this Spark Streaming project, we are going to build the backend of an IT job ad website by streaming data from Twitter for analysis in Spark.
Big Data Project Spark Project -Real-time data collection and Spark Streaming Aggregation
In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.
Big Data Project Tough engineering choices with large datasets in Hive Part - 2
This is in continuation of the previous Hive project "Tough engineering choices with large datasets in Hive Part - 1", where we will work on processing big data sets using Hive.

Curriculum For This Mini Project

 
  What are log files and types of log files
08m
  Contents of a log file
09m
  Uses of log files
19m
  Process log file using Flume
10m
  Ingest log data using Flume
07m
  Using Spark to process data
07m
  Downloads and Installations
02m
  DoS Attacks and log files
07m
  Using Apache Kafka for complex event processing
06m
  Using Oozie to coordinate tasks
16m
  Log file use-case
21m
  Clone github repository and summary overview
06m
  Lambda Architecture for Data Infrastructure
05m
  Solution Architecture overview
06m
  Implement Flume Agent
27m
  Troubleshooting Flume
29m
  Spark Scale Execution
20m
  Accumulators and executing Hive tables
14m
  Impala execution
15m
  Coordination tasks using Oozie
16m
  Hue Workflow
02m
  Running Oozie on command line
05m