Spark Project - Real-time data collection and Spark Streaming Aggregation

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Understanding the problem statement
Understanding what real-time data processing is
Architecture and data flow in a big data project
Basic EDA of the dataset and understanding the required format of the output
Understanding the tools required for a big data project
Kafka's role as the message broker and the use of ZooKeeper
Setting up a virtual environment on your computer and connecting Kafka, Spark, HBase, and Hadoop
Creating a data simulation demo and running the demo
Creating and using your own ZooKeeper instance
Testing HBase and streaming directly to HBase using the Spark shell
Initiating Spark Streaming to fetch data
Analyzing the data on the Spark stream using the grouping method to fetch insights
Visualizing the dashboard after Kafka sends messages, with real-time changes on the dashboard
Visualizing the final output using pie charts
Understanding other alternatives for real-time data analytics, such as Apache Hadoop and Spark RDDs
Understanding the Kafka consumer, how it works, and creating parallel threads for the Kafka consumer
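
The "grouping method" used to fetch insights from the stream can be sketched in plain Python. This is a minimal, hypothetical stand-in for the grouped aggregation Spark performs on the stream; the record layout and user IDs are invented for illustration:

```python
from collections import defaultdict

# Hypothetical sample records: (user_id, distance_km) pairs, standing in
# for the parsed trajectory points arriving on the stream.
records = [("user_000", 1.2), ("user_001", 0.8), ("user_000", 2.5)]

def aggregate_by_user(records):
    # Group by user and sum distances -- the same shape of aggregation
    # the project performs with Spark's grouping on the stream.
    totals = defaultdict(float)
    for user, km in records:
        totals[user] += km
    return dict(totals)
```

In Spark this becomes a `groupBy` over micro-batches, but the per-key accumulation it performs is exactly this.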

Project Description

In this Spark project, we will embark on real-time data collection and aggregation from a simulated real-time system.

The dataset for the project, which will simulate our sensor data delivery, is from the Microsoft Research Asia GeoLife project. According to the accompanying paper, the dataset recorded a broad range of users' outdoor movements, including not only life routines such as going home and going to work but also entertainment and sports activities, such as shopping, sightseeing, dining, hiking, and cycling. This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation.
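
As a sketch of the basic EDA step, one GeoLife trajectory record can be parsed as below. The field layout (latitude, longitude, an unused flag, altitude in feet, a serial day count, then date and time strings) follows the GeoLife user guide, but treat this as an illustrative assumption rather than the project's actual parsing code:

```python
from datetime import datetime

def parse_plt_line(line):
    # GeoLife .plt records start after 6 header lines; each record is:
    # latitude, longitude, unused flag, altitude in feet,
    # days since 1899-12-30, date string, time string.
    lat, lon, _flag, alt_ft, _days, date, time = line.strip().split(",")
    return {
        "lat": float(lat),
        "lon": float(lon),
        "alt_ft": float(alt_ft),  # -777 marks an invalid altitude
        "ts": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
    }

sample = "39.984702,116.318417,0,492,39744.1201851852,2008-10-23,02:53:04"
point = parse_plt_line(sample)
```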

As part of this big data project, we will use the data to provide real-time aggregates of the movements along a number of dimensions, such as effective distance, duration, and trajectories. All streamed data will be stored in the NoSQL database HBase.
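
One of the aggregates named above, effective distance, can be derived from consecutive GPS fixes with the haversine formula. A minimal sketch, assuming distance is accumulated pairwise along each trajectory (the function name is ours, not the project's):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two GPS fixes --
    # one way to compute the effective-distance aggregate from
    # consecutive trajectory points.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

At the equator one degree of longitude spans roughly 111 km, which makes a quick sanity check for the function.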

Similar Projects

The goal of this Apache Kafka project is to process log entries from applications in real time, using Kafka as the streaming backbone in a microservices architecture.

Hive Project - Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

In this Databricks Azure tutorial project, you will use Spark SQL to analyse the MovieLens dataset to provide movie recommendations. As part of this, you will deploy Azure Data Factory and data pipelines and visualise the analysis.

Curriculum For This Mini Project

Download code from GitHub and the dataset from Microsoft Research
01m
Project Agenda
03m
What is real-time data processing?
06m
Explore the GeoLife Trajectories dataset
11m
Discuss outputs of the project
07m
Data formats of the dataset
05m
Tools used in the project solution
02m
Data flow architecture
08m
Understanding Kafka's role as a message broker
10m
How does Kafka use ZooKeeper?
01m
GeoLife Trajectories dashboard
13m
Setup environment
06m
Data streaming simulation demo
15m
Run the simulation demo
12m
Produce streaming data using the application
05m
Test HBase
01m
Start spark-shell
07m
Streaming to HBase
25m
Starting the services
16m
Data Analysis - distribution of user trajectories
21m
Move data to Kafka
09m
Troubleshooting
05m
Run the Spark Streaming application
12m
Running analysis on the Spark stream
06m
How to integrate with the dashboard
10m
Other tools that can be used for distributed streaming analysis
03m
Code walkthrough of the Kafka consumer
05m
Data Analysis - user by period
17m