Real-Time Log Processing using Spark Streaming Architecture

In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities such as monitoring application performance in real time, measuring real-time user comfort with applications, and raising real-time alerts in case of a security breach.


Each project comes with 2-5 hours of micro-videos explaining the solution.


Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.


Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop... Read More


Hiren Ahir

Microsoft Azure SQL Server Developer, BI Developer

I'm a graduate student who came into the job market and found that a university degree wasn't sufficient to get a good-paying job. I aimed at the hottest technology in the market, Big Data, but the word BigData... Read More

What will you learn

Concept of rollover and batch processing for web server log processing
Downloading the necessary Dataset
Understanding the dataset and its variables
Integrating the complete system for Real-time Log tracking
Fetching real-time log files using Flume Log4j appenders
Using Kafka for Log Aggregation
Real-Time Log Processing using Flume and integrating it with Kafka
Performing Data Analysis before storing the data in HBase in order of time
Understanding Cassandra and HBase: differences, similarities, and their use in different scenarios
Understanding the components of a database and related terminology
Understanding an HBase design
Variables of the EDGAR data files and their descriptions
Storing the EDGAR log file dataset
Selecting the row key by combining different variables for saving in the database
Understanding the Streaming Application Code
Integrating Hive and HBase for data retrieval using query
Using the same created Architecture in different sectors
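One item above is selecting a row key by combining variables so events are saved in time order. As a hedged Python sketch (the exact fields combined in the project are our assumption, not taken from it), a common HBase pattern is to join a source identifier with a reversed, zero-padded timestamp so the newest events for each host sort first:

```python
# Illustrative row-key construction; field choices are assumptions, not the project's actual schema.
MAX_TS = 10**13  # upper bound on epoch milliseconds, used to reverse the sort order

def make_row_key(host: str, epoch_ms: int) -> str:
    """Combine host and a reversed, zero-padded timestamp so that
    rows for the same host sort newest-first lexicographically."""
    reverse_ts = MAX_TS - epoch_ms
    return f"{host}|{reverse_ts:013d}"

key_new = make_row_key("192.168.0.1", 1_700_000_000_000)
key_old = make_row_key("192.168.0.1", 1_600_000_000_000)
# The newer event's key sorts before the older one's.
assert key_new < key_old
```

Because HBase stores rows sorted by key bytes, this kind of composite key lets a scan over one host return its most recent log entries first.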

Project Description

A while back, we did web server access log processing using Spark and Hive. However, that was batch processing, so within the lambda architecture we could only operate in the batch and serving layers.
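As a refresher on what that batch job parses, here is a minimal Python sketch of splitting one Common Log Format line into fields with a regular expression (the field names are our own; the project's actual parser may differ):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)

def parse_clf(line: str) -> dict:
    """Return a dict of fields from one Common Log Format line, or raise ValueError."""
    m = CLF_PATTERN.match(line)
    if not m:
        raise ValueError(f"not a CLF line: {line!r}")
    return m.groupdict()

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
fields = parse_clf(sample)
# e.g. fields["host"] == "127.0.0.1", fields["status"] == "200"
```

In the batch job, a function like this would be mapped over an RDD or DataFrame of raw lines; the Combined Log Format adds referrer and user-agent fields to the same pattern.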

In this big data project, we go one step further by bringing processing to the speed layer of the lambda architecture, which opens up more capabilities. Among them are the ability to monitor application performance in real time, measure real-time user comfort with applications, and raise real-time alerts in case of a security breach.

These abilities and functionalities will be explored using Spark Streaming in a streaming architecture.
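To make the speed-layer idea concrete without requiring a Spark cluster, here is a plain-Python sketch (not the project's actual code) of the kind of per-window computation Spark Streaming performs on each micro-batch: counting failed-authentication responses in a window and flagging it when an alert threshold is crossed.

```python
from collections import Counter

ALERT_THRESHOLD = 3  # illustrative threshold, not taken from the project

def alert_on_window(status_codes, threshold=ALERT_THRESHOLD):
    """Count 401/403 responses in one micro-batch window and
    return True when the count reaches the alert threshold."""
    counts = Counter(status_codes)
    failures = counts["401"] + counts["403"]
    return failures >= threshold

# Two simulated micro-batches of HTTP status codes:
quiet_window = ["200", "200", "304", "401"]
noisy_window = ["401", "403", "401", "200", "403"]
assert alert_on_window(quiet_window) is False
assert alert_on_window(noisy_window) is True
```

In the actual streaming job, each micro-batch DStream of parsed log records would feed a computation of this shape, with the alert side emitted in real time rather than asserted.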

Note: the Cloudera QuickStart VM does not ship with Kafka. In line with our objective, we will still make the case for using Kafka, but our implementation will not use it; instead, we will integrate the log agent directly with Spark Streaming in this big data project.

Similar Projects

In this big data project, we will see how data ingestion and loading is done with Kafka connect APIs while transformation will be done with Kafka Streaming API.

In this Spark Streaming project, we are going to build the backend of an IT job-ad website by streaming data from Twitter for analysis in Spark.

In this big data project, we'll work through a real-world scenario using the Cortana Intelligence Suite tools, including the Microsoft Azure Portal, PowerShell, and Visual Studio.

Curriculum For This Mini Project

Web Server Log Processing in Batch Mode and the Concept of Rollover
Downloading NASA Dataset
Understanding the Contents of the Log File - Common and Combined Log Format
Making a case for real-time processing of log file
Getting logs at real-time using Flume Log4j Appenders
Making a case for Kafka for Log Aggregation
Starting Flume Agent for Log Processing in Real-Time
Analyse Data before Storing to HBase - Cracking the Design
Discussion on the topics for next session
Recap of previous session
Difference between Cassandra and HBase
Agenda for the Session
Why HBase?
HBase Design
How to store EDGAR log file dataset?
Understanding the Streaming Application Code
Hive and HBase Integration
Architectural Extensions