
Spark Project - Real-time data collection and Spark Streaming aggregation

In this big data project, we will collect and aggregate data in real time from a simulated real-time system using Spark Streaming.
What will you learn

  • Spark Streaming using Scala
  • Stateful transformation of streams
  • HBase and Spark integration using the Spark HBase connector
  • Decoupling the real-time data pipeline with a message broker
  • Display of information on a dashboard
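
To give a rough idea of what a stateful transformation means, the sketch below keeps a running total per key across successive micro-batches. It is plain Python rather than the project's Scala code, and the function name `update_totals`, the record layout and the sample values are all hypothetical. In Spark Streaming, `updateStateByKey` is one operation that plays a comparable role; whether the project uses it is not stated here.

```python
def update_totals(state, batch):
    """Fold one micro-batch of (user_id, distance_km) records into the state.

    `state` maps user_id -> cumulative distance seen so far. A new dict is
    returned so the previous state is left untouched.
    """
    new_state = dict(state)
    for user_id, distance_km in batch:
        new_state[user_id] = new_state.get(user_id, 0.0) + distance_km
    return new_state


# Hypothetical micro-batches, as they might arrive from a stream.
batches = [
    [("user_a", 1.5), ("user_b", 0.5)],
    [("user_a", 2.0)],
    [("user_b", 1.0), ("user_c", 4.0)],
]

state = {}
for batch in batches:
    # The state carried over from earlier batches is what makes this stateful.
    state = update_totals(state, batch)
    print(state)
```

A stateless transformation would only see the records of the current batch; here each result depends on everything that arrived before it.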

What will you get

  • Access to recording of the complete project
  • Access to all material related to the project, such as data files and solution files

Prerequisites

  • Students are expected to have a fair knowledge of Big Data and Hadoop

Project Description

In this Spark project, we will collect and aggregate data in real time from a simulated real-time system.

The dataset for the project, which will simulate our sensor data delivery, is from the Microsoft Research Asia GeoLife project. According to the paper, the dataset recorded a broad range of users' outdoor movements, including not only life routines like going home and going to work but also entertainment and sports activities such as shopping, sightseeing, dining, hiking, and cycling. This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation.

As part of this big data project, we will use the data to provide real-time aggregates of the movements along a number of dimensions, such as effective distance, duration, trajectories and more. All streamed data will be stored in the NoSQL database HBase.
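
As an illustration of the kind of aggregate mentioned above, here is a minimal sketch that computes the distance and duration of a single trajectory. It is plain Python, not the project's Scala solution. It assumes each point is a `(timestamp, latitude, longitude)` tuple and takes "distance" to mean the summed great-circle (haversine) length of the path; the project may define effective distance differently. The names `haversine_km` and `summarize_trajectory` and the sample points are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def summarize_trajectory(points):
    """Return (path length in km, duration in seconds) for one trajectory.

    `points` is a list of (timestamp, lat, lon) tuples sorted by time.
    """
    # Sum the distance of every consecutive pair of points.
    distance = sum(
        haversine_km(a[1], a[2], b[1], b[2])
        for a, b in zip(points, points[1:])
    )
    # Duration is the time between the first and the last point.
    duration = (points[-1][0] - points[0][0]).total_seconds() if points else 0.0
    return distance, duration


# Hypothetical sample points, not taken from the dataset.
sample = [
    (datetime(2008, 1, 1, 0, 0, 0), 0.0, 0.0),
    (datetime(2008, 1, 1, 0, 0, 30), 0.0, 1.0),
    (datetime(2008, 1, 1, 0, 1, 0), 0.0, 2.0),
]
distance_km, duration_s = summarize_trajectory(sample)
print(f"{distance_km:.2f} km in {duration_s:.0f} s")
```

In a streaming setting the same per-trajectory figures would be updated incrementally as new points arrive instead of being computed over a complete list.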

Instructors

 
Michael

Big Data & Enterprise Software Engineer

I am passionate about software development, databases, data analysis and the Android platform. My native language is Java, but no one has stopped me so far from learning and using Angular and Node.js. Data and data analysis are thrilling, and so are my experiences with SQL on Oracle, Microsoft SQL Server, Postgres and MyS...

Curriculum For This Mini Project

 
  Download code from GitHub and dataset from Microsoft Research
00:01:24
  Project Agenda
00:03:31
  What is real-time data processing?
00:06:05
  Explore the GeoLife Trajectories dataset
00:11:01
  Discuss outputs of the project
00:07:02
  Data formats of the dataset
00:05:34
  Tools used in the project solution
00:02:55
  Data flow architecture
00:08:51
  Understanding Kafka's role as a message broker
00:10:17
  How does Kafka use ZooKeeper
00:01:14
  GeoLife Trajectories dashboard
00:13:49
  Setup environment
00:06:27
  Data streaming simulation demo
00:15:39
  Run the simulation demo
00:12:00
  Produce streaming data using the application
00:05:44
  Test HBase
00:01:24
  Start spark-shell
00:07:20
  Streaming to HBase
00:25:31
  Starting the services
00:16:30
  Data Analysis - distribution of user trajectories
00:21:29
  Move data to Kafka
00:09:25
  Troubleshooting
00:05:24
  Run the Spark Streaming application
00:12:24
  Running analysis on the Spark stream
00:06:50
  How to integrate with the dashboard
00:10:36
  Other tools that can be used for distributed streaming analysis
00:03:56
  Code walkthrough of the Kafka consumer
00:05:24
  Data Analysis - user by period
00:17:04