Real-Time Log Processing using Spark Streaming Architecture

In this Spark project, we bring processing to the speed layer of the lambda architecture, which opens up capabilities such as monitoring application performance in real time, measuring real-time user comfort with applications, and raising real-time alerts in case of a security breach.

What will you learn

  • Making a case for real-time processing of log files
  • Getting logs in real time using Flume Log4j appenders
  • Making a case for Kafka for log aggregation
  • Storing log events as time-series datasets in HBase (see the write sketch after this list)
  • Integrating Hive and HBase for data retrieval using queries
  • Troubleshooting
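
To make the HBase storage idea concrete, here is a minimal write sketch in Scala using the standard HBase client API. The table name log_events, the column family l, and the host-plus-reversed-timestamp row key are illustrative assumptions, not the project's actual schema.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    object LogEventWriter {
      // Hypothetical table "log_events" with a single column family "l"
      def writeEvent(host: String, timestampMillis: Long, rawLine: String, status: String): Unit = {
        val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = connection.getTable(TableName.valueOf("log_events"))
        try {
          // Time-series row key: host, then a reversed timestamp, so a scan
          // starting at "<host>|" returns that host's most recent events first
          val rowKey = Bytes.toBytes(s"$host|${Long.MaxValue - timestampMillis}")
          val put = new Put(rowKey)
          put.addColumn(Bytes.toBytes("l"), Bytes.toBytes("raw"), Bytes.toBytes(rawLine))
          put.addColumn(Bytes.toBytes("l"), Bytes.toBytes("status"), Bytes.toBytes(status))
          table.put(put)
        } finally {
          table.close()
          connection.close()
        }
      }
    }

In a streaming job one would not open a connection per event as this helper does; the connection would typically be created once per partition (for example inside foreachPartition) and reused.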

Project Description

A while back, we did web server access log processing using Spark and Hive. However, that was batch processing, so within the lambda architecture we were only able to operate in the batch and serving layers.

In this big data project, we go one step further by bringing processing to the speed layer of the lambda architecture, which opens up more capabilities. These include the ability to monitor application performance in real time, measure real-time user comfort with applications, and raise real-time alerts in case of a security breach.

We will explore these capabilities using Spark Streaming in a streaming architecture.
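
As a flavour of what the speed layer looks like, the sketch below is a minimal Spark Streaming job in Scala that parses Common Log Format lines and counts 4xx/5xx responses per micro-batch. The socket source, host, port, batch interval, and regex are illustrative assumptions, not the project's actual code.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogErrorMonitor {
      // Illustrative Common Log Format pattern:
      // host identity user [time] "request" status bytes
      val clf = """(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+).*""".r

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogErrorMonitor")
        val ssc  = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

        // Stand-in source: any DStream of raw access-log lines (socket, Flume, Kafka, ...)
        val lines = ssc.socketTextStream("localhost", 9999)

        // Count 4xx/5xx responses per micro-batch -- the basis for real-time alerting
        val errorCounts = lines.flatMap {
          case clf(_, _, _, _, _, status, _) if status.startsWith("4") || status.startsWith("5") =>
            Some((status, 1))
          case _ => None
        }.reduceByKey(_ + _)

        errorCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }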

Note: the Cloudera QuickStart VM does not ship with Kafka. We will still make the case for using Kafka for log aggregation, but our implementation will not use it; instead, we will integrate the Flume log agent directly with Spark Streaming in this big data project.
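
A minimal sketch of that integration, assuming the push-based spark-streaming-flume receiver (FlumeUtils.createStream); the host name and port are illustrative values that the Flume agent's Avro sink would point at.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    object FlumeLogStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("FlumeLogStream")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Push-based receiver: the Flume agent's Avro sink sends events to this
        // host and port (illustrative values for the QuickStart VM)
        val flumeStream = FlumeUtils.createStream(ssc, "quickstart.cloudera", 41414)

        // Each SparkFlumeEvent wraps an Avro event whose body is the raw log line
        val logLines = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))

        logLines.count().map(c => s"Received $c log events in this batch").print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

On the agent side, the Flume configuration would pair the source fed by the Log4j appender with an Avro sink targeting the same host and port as the receiver above.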

Curriculum For This Mini Project

  • Web Server Log Processing in Batch Mode and the Concept of Rollover (22m)
  • Downloading NASA Dataset (02m)
  • Understanding the Contents of the Log File - Common and Combined Log Format (08m)
  • Making a case for real-time processing of log file (08m)
  • Getting logs at real-time using Flume Log4j Appenders (51m)
  • Making a case for Kafka for Log Aggregation (14m)
  • Starting Flume Agent for Log Processing in Real-Time (07m)
  • Analyse Data before Storing to HBase - Cracking the Design (05m)
  • Discussion on the topics for next session (01m)
  • Recap of previous session (04m)
  • Difference between Cassandra and HBase (02m)
  • Agenda for the Session (04m)
  • Why HBase? (01m)
  • HBase Design (13m)
  • How to store EDGAR log file dataset? (26m)
  • Understanding the Streaming Application Code (28m)
  • Hive and HBase Integration (14m)
  • Architectural Extensions (02m)