HANDS-ON-LAB

Real-Time Data Ingestion with Kinesis Firehose

Problem Statement

This hands-on lab, Real-Time Data Ingestion with Kinesis Firehose, aims to ingest a CSV file of Sacramento real estate transactions in near real time. A Kinesis Agent running on an EC2 instance reads the file and sends its records to a Kinesis Firehose Delivery Stream, which delivers them to a landing folder in an S3 bucket.

The source data (a CSV file) is located in the "Data" folder of the lab materials.

Tasks

  1. Set up an EC2 machine: Create a new EC2 instance and install the Kinesis Agent on it.

  2. Configure Kinesis Firehose: Set up a Kinesis Firehose Delivery Stream named "SacramentoRealEstateTransactions" to receive the data from the Kinesis Agent.

  3. Define the S3 destination: Specify a new S3 bucket and the appropriate folder structure ("sacramento/realEstate/landing/") as the destination for the data ingested through Kinesis Firehose.

  4. Validate data ingestion: Ensure that the data from the CSV file in the "Data" folder is successfully ingested and landed in the specified location in the S3 bucket.
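Tasks 2 and 3 can be sketched programmatically. The snippet below builds the request payload you would pass to boto3's firehose.create_delivery_stream; the bucket name and IAM role ARN are placeholders you must supply, so treat this as a sketch under those assumptions rather than a ready-to-run script.

```python
def build_firehose_request(bucket_name, role_arn):
    """Sketch: assemble the create_delivery_stream request for the lab's
    Firehose stream. bucket_name and role_arn are placeholders."""
    return {
        "DeliveryStreamName": "SacramentoRealEstateTransactions",
        "DeliveryStreamType": "DirectPut",  # the Kinesis Agent writes directly
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": f"arn:aws:s3:::{bucket_name}",
            # Folder structure from Task 3:
            "Prefix": "sacramento/realEstate/landing/",
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        },
    }

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("firehose").create_delivery_stream(
#       **build_firehose_request("my-landing-bucket",
#                                "arn:aws:iam::123456789012:role/firehose-s3")
#   )
```

The buffering hints above mean Firehose flushes a batch to S3 every 5 MB or 60 seconds, whichever comes first.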


Dive into real-time data ingestion with Kinesis Firehose and S3. Start setting up your EC2 instance and installing the Kinesis Agent today.

Learnings

By completing this exercise, you will gain experience in:

  • Setting up an EC2 instance and installing the Kinesis Agent.

  • Configuring a Kinesis Firehose Delivery Stream to receive data.

  • Defining the destination in S3 for the ingested data.

  • Validating successful data ingestion into the specified S3 bucket.

FAQs

Q1. What is the role of Kinesis Firehose in real-time data ingestion?

Kinesis Firehose is a fully managed service that simplifies the process of ingesting and delivering streaming data at scale. It seamlessly collects, transforms, and loads data from various sources, such as the CSV file in the Data folder, into an S3 bucket.
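For illustration, the same delivery could also be driven from code instead of the agent: Firehose's PutRecordBatch API accepts records as raw bytes, one per CSV line. The helper below is a hypothetical name, not part of the lab, and simply packages CSV lines into the record shape boto3 expects.

```python
import csv
import io

def csv_to_firehose_records(csv_text):
    """Package each CSV row as a Firehose record (newline-terminated
    bytes) suitable for boto3 firehose.put_record_batch.
    Note: PutRecordBatch accepts at most 500 records per call."""
    reader = csv.reader(io.StringIO(csv_text))
    records = []
    for row in reader:
        line = ",".join(row) + "\n"
        records.append({"Data": line.encode("utf-8")})
    return records

# Usage (requires boto3 and AWS credentials; path is whatever CSV
# you have in the Data folder):
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="SacramentoRealEstateTransactions",
#       Records=csv_to_firehose_records(open("path/to/file.csv").read()),
#   )
```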

 

Q2. How do the EC2 machine and the Kinesis Agent fit into the data ingestion process?

The EC2 machine serves as the host for the Kinesis Agent, which is responsible for collecting and sending data to Kinesis Firehose. By setting up an EC2 instance and installing the Kinesis Agent on it, you enable the continuous ingestion of real estate transaction data.
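Concretely, once installed on the EC2 instance, the agent is driven by /etc/aws-kinesis/agent.json. A minimal flow for this lab might look like the following sketch; the region endpoint and the filePattern path are assumptions about your setup, not values from the lab.

```json
{
  "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
  "flows": [
    {
      "filePattern": "/tmp/data/*.csv",
      "deliveryStream": "SacramentoRealEstateTransactions"
    }
  ]
}
```

The agent tails any file matching filePattern and forwards new lines to the named delivery stream; restart the agent after editing this file so the change takes effect.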

 

Q3. Why is S3 chosen as the destination for the ingested data?

S3 provides scalable storage capabilities, high durability, and easy accessibility for data. By defining the appropriate folder structure within the S3 bucket ("sacramento/realEstate/landing/"), the ingested data can be efficiently organized and made available for further processing and analysis.
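By default (when no custom prefix expressions are configured), Firehose appends a UTC YYYY/MM/DD/HH/ path after the configured prefix, which is useful to know when validating Task 4. The small helper below computes where an object delivered at a given time should land.

```python
from datetime import datetime, timezone

def expected_landing_prefix(when):
    """Firehose's default naming appends a UTC YYYY/MM/DD/HH/ path to
    the configured S3 prefix before the object name."""
    return "sacramento/realEstate/landing/" + when.strftime("%Y/%m/%d/%H/")

# An object delivered at 2024-03-05 09:xx UTC would land under:
print(expected_landing_prefix(datetime(2024, 3, 5, 9, 30, tzinfo=timezone.utc)))
```

To validate, list that prefix in the bucket (for example with "aws s3 ls" using your own bucket name) and confirm the delivered objects appear.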