Spark Project - Real-time Data Collection and Spark Streaming Aggregation

In this big data project, we will embark on real-time data collection and aggregation from a simulated real-time system using Spark Streaming.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

  • Understanding the problem statement

  • Understanding what real-time data processing is

  • Architecture and data flow of the big data project

  • Basic EDA of the dataset and understanding the required format of the output

  • Understanding the tools required for the big data project

  • Kafka's role as the messenger and the use of ZooKeeper

  • Setting up a virtual environment on your computer and connecting Kafka, Spark, HBase, and Hadoop

  • Creating and running the data simulation demo

  • Creating and using your own ZooKeeper

  • Testing HBase and streaming directly to HBase using the Spark shell

  • Initiating Spark Streaming to fetch data

  • Analyzing the data in Spark Streaming using the grouping method to fetch insights

  • Visualizing the dashboard after Kafka sends the messages and watching the dashboard change in real time

  • Visualizing the final output using Pie Charts

  • Understanding alternatives for real-time data analytics, such as Apache Hadoop and Spark RDD

  • Understanding the Kafka consumer, how it works, and how to create parallel threads for it

Project Description

In this Spark project, we will embark on real-time data collection and aggregation from a simulated real-time system.

The dataset for the project, which will simulate our sensor data delivery, is from the Microsoft Research Asia GeoLife project. According to the paper, the dataset recorded a broad range of users’ outdoor movements, including not only life routines like going home and going to work but also entertainment and sports activities, such as shopping, sightseeing, dining, hiking, and cycling. This trajectory dataset can be used in many research fields, such as mobility pattern mining, user activity recognition, location-based social networks, location privacy, and location recommendation.

As part of this big data project, we will use the data to provide real-time aggregates of the movements along a number of dimensions, such as effective distance, duration, trajectories, and more. All streamed data will be stored in HBase, a NoSQL database.
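
As a rough illustration of the kind of aggregate described above, the following is a minimal sketch in plain Python, not the project's actual Spark code. It assumes each trajectory point is a comma-separated line with latitude in the first field, longitude in the second, and the date and time in the sixth and seventh fields (the layout commonly documented for GeoLife PLT files; check it against your copy of the dataset). It also assumes "distance" means the sum of great-circle distances between consecutive points, which may differ from the project's definition of effective distance. The names `haversine_km`, `parse_point`, `aggregate_trajectory`, and the sample values are hypothetical.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parse_point(line):
    """Parse one point line (assumed layout: lat, lon, _, _, _, date, time)."""
    fields = line.strip().split(",")
    lat, lon = float(fields[0]), float(fields[1])
    ts = datetime.strptime(fields[5] + " " + fields[6], "%Y-%m-%d %H:%M:%S")
    return lat, lon, ts


def aggregate_trajectory(lines):
    """Return point count, summed distance (km), and duration (s) for one trajectory."""
    points = [parse_point(line) for line in lines if line.strip()]
    distance = sum(
        haversine_km(a[0], a[1], b[0], b[1]) for a, b in zip(points, points[1:])
    )
    duration = (points[-1][2] - points[0][2]).total_seconds() if points else 0.0
    return {"points": len(points), "distance_km": distance, "duration_s": duration}


# Illustrative values only, not taken from the dataset.
SAMPLE = [
    "39.9000,116.4000,0,100,0,2008-10-23,02:53:00",
    "39.9000,116.4100,0,100,0,2008-10-23,02:53:30",
    "39.9100,116.4100,0,100,0,2008-10-23,02:54:00",
]

summary = aggregate_trajectory(SAMPLE)
print(summary)
```

In a streaming setting, the same per-trajectory logic would typically run inside a grouped aggregation keyed by user or trajectory, with the resulting summaries written to HBase.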

Similar Projects

Big Data Project Implementing OLAP on Hadoop using Apache Kylin
In this big data project, we will design an OLAP cube using the AdventureWorks database. The deliverable for this session will be to design a cube, build and implement it using Kylin, query the cube, and even connect familiar tools (like Excel) to our new cube.
Big Data Project Data Analysis and Visualisation using Spark and Zeppelin
In this big data project, we will talk about Apache Zeppelin. We will write code, write notes, build charts, and share them, all in a single data analytics environment using Hive, Spark, and Pig.
Big Data Project Design a Network Crawler by Mining GitHub Social Profiles
In this big data project, we will look at how to mine and make sense of connections in a simple way by building a Spark GraphX Algorithm and a Network Crawler.
Big Data Project Hive Project - Denormalize JSON Data and Analyse It with Hive Scripts
In this Hive project, you will work on denormalizing JSON data and creating Hive scripts with the ORC file format.

Curriculum For This Mini Project

  Download code from GitHub and dataset from Microsoft Research
  Project Agenda
  What is real-time data processing?
  Explore the GeoLife Trajectories dataset
  Discuss outputs of the project
  Data formats of the dataset
  Tools used in the project solution
  Data flow architecture
  Understanding Kafka's role as a message broker
  How does Kafka use ZooKeeper
  GeoLife Trajectories dashboard
  Setup environment
  Data streaming simulation demo
  Run the simulation demo
  Produce streaming data using the application
  Test HBase
  Start spark-shell
  Streaming to HBase
  Starting the services
  Data Analysis - distribution of user trajectories
  Move data to Kafka
  Run the Spark Streaming application
  Running analysis on the Spark stream
  How to integrate with the dashboard
  Other tools that can be used for distributed streaming analysis
  Code walkthrough of the Kafka consumer
  Data Analysis - user by period