Spark Project - Real-time Data Collection and Spark Streaming Aggregation

In this big data project, we will use Spark Streaming to collect and aggregate data in real time from a simulated real-time system.

What will you learn

  • Spark Streaming using Scala
  • Stateful transformation of streams
  • HBase and Spark integration using the Spark HBase connector
  • Decoupling the real-time data pipeline with a message broker
  • Display of information on a dashboard

What will you get

  • Access to a recording of the complete project
  • Access to all project material, such as data files and solution files

Project Description

In this Spark project, we will collect and aggregate data in real time from a simulated real-time system.

The dataset that simulates our sensor data delivery comes from the Microsoft Research Asia GeoLife project. According to the paper, the dataset recorded a broad range of users’ outdoor movements, including not only life routines such as going home and going to work but also entertainment and sports activities, such as shopping, sightseeing, dining, hiking, and cycling. This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation.

As part of this big data project, we will use the data to provide real-time aggregates of the movements along a number of dimensions, such as effective distance, duration, and trajectories. All streamed data will be stored in HBase, a NoSQL database.
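To make these aggregates concrete, here is a minimal sketch of how distance and duration could be derived from one user's GPS fixes. It is plain Python for illustration only and is not the project's Scala solution. The `Point` fields, the function names, and the reading of "effective distance" as the straight-line distance between the first and last fix are all assumptions, not definitions taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


@dataclass(frozen=True)
class Point:
    """One GPS fix. Field names are illustrative, not the dataset's schema."""
    lat: float    # latitude in decimal degrees
    lon: float    # longitude in decimal degrees
    ts: datetime  # timestamp of the fix


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two fixes, in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def summarize(points):
    """Aggregate one trajectory into path length, effective distance, duration."""
    if not points:
        raise ValueError("a trajectory needs at least one point")
    pts = sorted(points, key=lambda p: p.ts)  # order fixes by time
    # Path length: sum of distances between consecutive fixes.
    path_km = sum(haversine_km(a, b) for a, b in zip(pts, pts[1:]))
    return {
        "path_km": path_km,
        # Assumption: "effective distance" = first fix to last fix.
        "effective_km": haversine_km(pts[0], pts[-1]),
        "duration_s": (pts[-1].ts - pts[0].ts).total_seconds(),
    }
```

In a streaming setting, a summary like this would be updated incrementally per user as new fixes arrive rather than recomputed from the full trajectory; the batch form above only shows what each aggregate means.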

Curriculum For This Mini Project

 
  Download code from GitHub and dataset from Microsoft Research (01m)
  Project agenda (03m)
  What is real-time data processing? (06m)
  Explore the GeoLife Trajectories dataset (11m)
  Discuss outputs of the project (07m)
  Data formats of the dataset (05m)
  Tools used in the project solution (02m)
  Data flow architecture (08m)
  Understanding Kafka's role as a message broker (10m)
  How does Kafka use ZooKeeper? (01m)
  GeoLife Trajectories dashboard (13m)
  Set up the environment (06m)
  Data streaming simulation demo (15m)
  Run the simulation demo (12m)
  Produce streaming data using the application (05m)
  Test HBase (01m)
  Start spark-shell (07m)
  Streaming to HBase (25m)
  Starting the services (16m)
  Data analysis - distribution of user trajectories (21m)
  Move data to Kafka (09m)
  Troubleshooting (05m)
  Run the Spark Streaming application (12m)
  Running analysis on the Spark stream (06m)
  How to integrate with the dashboard (10m)
  Other tools that can be used for distributed streaming analysis (03m)
  Code walkthrough of the Kafka consumer (05m)
  Data analysis - user by period (17m)