HANDS-ON-LAB

Kinesis Firehose to OpenSearch

Problem Statement

This hands-on process Kinesis Firehose to OpenSearch code aims to create a Lambda function to cleanse YouTube statistics reference data and store it in an S3 bucket in CSV format. Additionally, the cleansed data should be exposed in the Glue catalog. 

The statistics reference data (the JSON files) is placed in the raw S3 bucket:

s3://<raw_bucket_name>/youtube/raw_statistics_reference_data/

Tasks

  1. Create a new Kinesis Firehose: Set up a new Kinesis Firehose delivery stream to receive data from the Spark job and deliver it to the OpenSearch index. Configure the delivery stream with the appropriate settings, including the OpenSearch destination.

  2. Set up a new OpenSearch index: Create a new OpenSearch index where the GPS data will be stored. Define the appropriate index mapping and settings to accommodate the data format and desired query operations.

  3. Enhance the Spark job: Modify the Spark job from the previous exercise to include the functionality for writing data to the new OpenSearch index via Kinesis Firehose. Configure the necessary parameters, such as the delivery stream name and the OpenSearch endpoint.

  4. Execute the Spark job: Run the Spark job to read the modified GPS data files, process the data, and send it to the Kinesis Firehose delivery stream for ingestion into the OpenSearch index.

  5. Query the OpenSearch index: Use OpenSearch query operations to validate the data in the index and calculate the total number of records inserted. Perform a query to retrieve all the records and count the number of results. The expected total number of records should be 224.

Join the hands-on lab to learn how to process GPS data using Kinesis Firehose, OpenSearch, and Spark.

Learnings

  • Configuring and utilizing Kinesis Firehose for data delivery.

  • Setting up and managing an OpenSearch index.

  • Enhancing a Spark job to write data to an OpenSearch index via Kinesis Firehose.

  • Querying an OpenSearch index to validate data and perform calculations.

  • Verifying the successful ingestion of data into the OpenSearch index.

FAQs

Q1. What is the purpose of using Kinesis Firehose in this exercise?

Kinesis Firehose is used as a delivery stream to ingest the processed GPS data from the Spark job into the OpenSearch index. It simplifies the data delivery process and ensures efficient and reliable data transfer.

 

Q2. How is the data validation performed in the OpenSearch index?

To validate the data, OpenSearch query operations are used. A query is executed to retrieve all the records in the index, and the count of the results is calculated. The expected total number of records should be 224.

 

Q3. What are the learning outcomes of this exercise?

By completing this exercise, you will gain experience in configuring and utilizing Kinesis Firehose for data delivery, setting up and managing an OpenSearch index, enhancing a Spark job to write data to the index via Kinesis Firehose, and querying the index for data validation and calculations.