Real-Time Log Processing using Spark Streaming Architecture

In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities such as monitoring application performance in real time, measuring real-time comfort with applications, and raising real-time alerts in case of a security breach.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Concept of rollover and batch processing for web server log processing

  • Downloading the necessary Dataset

  • Understanding the dataset and its variables

  • Integrating the complete system for Real-time Log tracking

  • Fetching Real-time Log files using Flume Log4j appenders

  • Using Kafka for Log Aggregation

  • Real-Time Log Processing using Flume and integrating it with Kafka

  • Performing Data Analysis before storing the data in HBase in order of time

  • Understanding Cassandra and HBase, their differences, similarities, and uses in different scenarios

  • Understanding components of a database and related terminologies

  • Understanding an HBase design

  • Variables of the EDGAR data files and their descriptions

  • Storing the EDGAR log file dataset

  • Selecting the row key by combining different variables for saving in the database

  • Understanding the Streaming Application Code

  • Integrating Hive and HBase for data retrieval using queries

  • Using the same created Architecture in different sectors
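
As a rough illustration of the row-key and time-ordering items above: HBase keeps rows sorted lexicographically by row key, so a key that starts with a fixed-width timestamp keeps rows in time order. The sketch below is an assumption, not the project's actual design. The function name `make_row_key`, the chosen fields (`date`, `time`, `ip`, `accession`), the `|` separator, and the sample values are all hypothetical.

```python
from datetime import datetime


def make_row_key(date: str, time: str, ip: str, accession: str) -> bytes:
    """Build a time-ordered row key from log fields (hypothetical layout)."""
    # Normalise the timestamp to a fixed-width, zero-padded form so that
    # lexicographic byte order matches chronological order.
    ts = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    stamp = ts.strftime("%Y%m%d%H%M%S")
    # Append further variables so that two requests arriving in the same
    # second still map to distinct keys.
    return f"{stamp}|{ip}|{accession}".encode("utf-8")


# Made-up sample records, deliberately out of time order.
keys = [
    make_row_key("2017-06-30", "00:00:05", "10.0.0.2", "acc-2"),
    make_row_key("2017-06-30", "00:00:02", "10.0.0.1", "acc-1"),
]

# Sorting the raw bytes (as HBase does) restores chronological order.
for key in sorted(keys):
    print(key)
```

One trade-off of a timestamp-first key is that consecutive writes all land in the same key range; whether that matters depends on the write volume and on how the table is split.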

Project Description

A while back, we did web server access log processing using Spark and Hive. However, that was batch processing, which in the lambda architecture confines us to the batch and serving layers.

In this big data project, we go one step further by bringing processing to the speed layer of the lambda architecture, which opens up more capabilities. One such capability is the ability to monitor application performance in real time, measure real-time comfort with applications, or raise real-time alerts in case of a security breach.

The abilities and functionalities will be explored using Spark Streaming in a streaming architecture. 

Note: The Cloudera QuickStart VM does not include Kafka. As stated in our objective, we will make the case for using Kafka, but our implementation will not use it. Instead, we will integrate the log agent with Spark Streaming in this big data project.

Similar Projects

Big Data Project Real-Time Log Processing in Kafka for Streaming Architecture
The goal of this Apache Kafka project is to process log entries from applications in real time using Kafka for the streaming architecture in a microservice sense.
Big Data Project Airline Dataset Analysis using Hadoop, Hive, Pig and Impala
Hadoop project: perform basic big data analysis on an airline dataset using the big data tools Pig, Hive and Impala.
Big Data Project Yelp Data Processing Using Spark And Hive Part 1
In this big data project, we will continue from a previous Hive project, "Data engineering on Yelp Datasets using Hadoop tools", and do the entire data processing using Spark.
Big Data Project SQL vs NoSQL-Choosing the right DBMS for your Project
In this project, we will walk through the various classes of NoSQL databases and try to establish where each is the best fit.

Curriculum For This Mini Project

  Web Server Log Processing in Batch Mode and the Concept of Rollover
  Downloading NASA Dataset
  Understanding the Contents of the Log File - Common and Combined Log Format
  Making a Case for Real-Time Processing of Log Files
  Getting Logs in Real Time using Flume Log4j Appenders
  Making a case for Kafka for Log Aggregation
  Starting Flume Agent for Log Processing in Real-Time
  Analyse Data before Storing to HBase - Cracking the Design
  Discussion on the topics for next session
  Recap of previous session
  Difference between Cassandra and HBase
  Agenda for the Session
  Why HBase?
  HBase Design
  How to store EDGAR log file dataset?
  Understanding the Streaming Application Code
  Hive and HBase Integration
  Architectural Extensions
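
To give a feel for the log formats named in the curriculum above, here is a minimal sketch of parsing one Common Log Format line with a regular expression. It is only an illustration and not the course's code: the names `CLF_PATTERN` and `parse_clf` are hypothetical, and the sample line is made up.

```python
import re
from datetime import datetime
from typing import Optional

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line: str) -> Optional[dict]:
    """Return the fields of one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 01/Jul/1995:00:00:01 -0400
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Made-up sample line in Common Log Format.
sample = '192.0.2.10 - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024'
record = parse_clf(sample)
print(record["host"], record["status"], record["size"])
```

Because the pattern is not anchored at the end of the line, a Combined Log Format line (which appends quoted referrer and user-agent fields) still matches, and those extra fields are simply ignored here.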