HANDS-ON-LAB

Stream Kafka data to Cassandra and HDFS

Problem Statement

This hands-on lab, Stream Kafka Data to Cassandra and HDFS, aims to build a Spark Streaming application that consumes Airbnb listing records from the "airbnb_data" Kafka topic, applies any required processing and transformations, and stores the results in both a Cassandra table and HDFS.

Tasks

  1. Set up Spark Streaming: Ensure that Spark and Spark Streaming are installed and configured properly in your environment.

  2. Create a Spark Streaming application: Write a Spark Streaming application to consume data from the "airbnb_data" Kafka topic.

  3. Define Cassandra schema: Define the schema for the Cassandra table with the required columns (Id, host_id, host_identity_verified, host_name, neighbourhood_group, neighbourhood, lat, long, country, country_code, instant_bookable, cancellation_policy, room_type, construction_year, price, service_fee, minimum_nights, number_of_reviews, last_review, reviews_per_month, review_rate_number, calculated_host_listings_count, availability_days, time_added).

  4. Read data from Kafka: Use the Spark Streaming application to read data from the "airbnb_data" Kafka topic.

  5. Process and transform data: Apply any necessary processing or transformations to the data as per your requirements.

  6. Store data in Cassandra: Save the processed data to the Cassandra table using the defined schema.

  7. Store data in HDFS: Save the processed data to HDFS, specifying the appropriate file format and location.

  8. Run the Spark Streaming application: Start the Spark Streaming application to initiate the data streaming and processing.

  9. Monitor data storage: Monitor Cassandra and HDFS to ensure that the data is being successfully stored in both locations.
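The processing step in task 5 can be sketched as a pure record-cleaning function applied to each Kafka JSON record before writing. The field names come from the schema in task 3; the cleaning rules themselves (which fields to cast, how to handle malformed values) are assumptions, not requirements of the lab:

```python
def clean_record(raw: dict) -> dict:
    """Normalize one Kafka JSON record before writing (hypothetical rules)."""
    rec = dict(raw)
    # Numeric casts; malformed values become None rather than failing the batch.
    for field in ("lat", "long", "price", "service_fee", "reviews_per_month"):
        try:
            rec[field] = float(rec[field]) if rec.get(field) not in (None, "") else None
        except (TypeError, ValueError):
            rec[field] = None
    for field in ("minimum_nights", "number_of_reviews", "availability_days"):
        try:
            rec[field] = int(rec[field]) if rec.get(field) not in (None, "") else None
        except (TypeError, ValueError):
            rec[field] = None
    # Normalize booleans that arrive as strings (e.g. "True"/"False").
    if isinstance(rec.get("instant_bookable"), str):
        rec["instant_bookable"] = rec["instant_bookable"].strip().lower() == "true"
    return rec
```

Keeping this logic in a plain function makes it easy to unit-test independently of the streaming job.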

Join our hands-on lab and master the art of streaming data from Kafka to Cassandra and HDFS using Spark. Gain practical experience in setting up Spark Streaming, defining schemas, and monitoring data storage.
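The tasks above can be sketched end-to-end with Spark Structured Streaming (the successor to the classic DStream API). The broker address, Cassandra host, keyspace and table names, and HDFS paths below are placeholders, not values from the lab; pyspark is imported inside main() so the column list can be inspected without a Spark installation:

```python
# Sketch: consume "airbnb_data" from Kafka, write each micro-batch to
# Cassandra and HDFS. Assumes the Kafka source and spark-cassandra-connector
# packages are on the classpath, e.g. via spark-submit --packages.

# Column names from task 3 (lowercase "id", since Cassandra
# lowercases unquoted identifiers).
COLUMNS = [
    "id", "host_id", "host_identity_verified", "host_name",
    "neighbourhood_group", "neighbourhood", "lat", "long", "country",
    "country_code", "instant_bookable", "cancellation_policy", "room_type",
    "construction_year", "price", "service_fee", "minimum_nights",
    "number_of_reviews", "last_review", "reviews_per_month",
    "review_rate_number", "calculated_host_listings_count",
    "availability_days", "time_added",
]

def main():
    # pyspark is imported here so the module also loads without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, current_timestamp, from_json
    from pyspark.sql.types import StringType, StructType

    spark = (SparkSession.builder
             .appName("airbnb-kafka-to-cassandra-hdfs")
             .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder
             .getOrCreate())

    # Read every field as a string for simplicity; cast downstream as needed.
    schema = StructType()
    for name in COLUMNS:
        schema = schema.add(name, StringType())

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
           .option("subscribe", "airbnb_data")
           .load())

    parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(from_json(col("json"), schema).alias("r"))
              .select("r.*")
              .withColumn("time_added", current_timestamp()))

    def write_batch(batch_df, _batch_id):
        # Cassandra sink via spark-cassandra-connector.
        (batch_df.write.format("org.apache.spark.sql.cassandra")
         .options(keyspace="airbnb", table="listings")  # placeholder names
         .mode("append").save())
        # HDFS sink as Parquet.
        batch_df.write.mode("append").parquet("hdfs://namenode:8020/data/airbnb/")

    (parsed.writeStream
     .foreachBatch(write_batch)
     .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/airbnb/")
     .start()
     .awaitTermination())

# Invoke main() at the bottom of your script when launching with spark-submit.
```

Using foreachBatch lets one streaming query fan out to two sinks (Cassandra and HDFS) with a single checkpoint, rather than running two separate queries against the same topic.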

Learnings

  • Spark Streaming setup and configuration: Set up and configure Spark Streaming for real-time data processing.

  • Integration with Kafka: Integrate Spark Streaming with Kafka to consume data from a Kafka topic.

  • Cassandra data storage: Define the Cassandra schema and save the processed data to Cassandra.

  • HDFS data storage: Save the processed data to HDFS in the desired file format and location.

  • Monitoring data streaming and storage: Monitor the Spark Streaming application, Cassandra, and HDFS to ensure the successful streaming and storage of data.

FAQs

Q1. What is Spark Streaming?

Spark Streaming is an extension of the core Apache Spark API that enables scalable, high-throughput, fault-tolerant processing of live data streams. It integrates easily with a variety of data sources, including Kafka.

Q2. Why use Cassandra for data storage?

Cassandra is a highly scalable and distributed NoSQL database that offers high write throughput and low latency. It is suitable for storing large amounts of data and handling high-speed data ingestion, making it ideal for streaming applications.
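The table from task 3 of the lab might be defined as in the following CQL, wrapped here as a Python string so it can be run through any Cassandra driver or pasted into cqlsh. The keyspace name, table name, primary key choice, and column types are all assumptions:

```python
# Hypothetical CQL for the lab's table; keyspace/types are illustrative only.
CREATE_TABLE_CQL = """
CREATE TABLE IF NOT EXISTS airbnb.listings (
    id bigint PRIMARY KEY,
    host_id bigint,
    host_identity_verified text,
    host_name text,
    neighbourhood_group text,
    neighbourhood text,
    lat double,
    long double,
    country text,
    country_code text,
    instant_bookable boolean,
    cancellation_policy text,
    room_type text,
    construction_year int,
    price double,
    service_fee double,
    minimum_nights int,
    number_of_reviews int,
    last_review text,
    reviews_per_month double,
    review_rate_number double,
    calculated_host_listings_count int,
    availability_days int,
    time_added timestamp
);
"""
```

In a real deployment the primary key should match the query pattern (for example, a composite key on neighbourhood and id) rather than defaulting to a single-column key.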

 

Q3. What is HDFS?

HDFS (Hadoop Distributed File System) is a distributed file system designed to store and process large datasets across multiple machines in a Hadoop cluster. It provides fault tolerance, high availability, and scalability for storing big data.