HANDS-ON-LAB

Master Real-Time Data Processing with AWS

Problem Statement

This hands-on Stream Data Pipeline using Flink and Kinesis example aims to build a real-time data pipeline on AWS: a Python producer reads a dataset containing the total number of edits made on different Wikipedia pages and pushes the records to a Kinesis data stream, and a Flink SQL application consumes the stream and makes the data queryable in a table.

Tasks

  1. Create a new S3 bucket and upload the dataset file containing the total number of edits made on different Wikipedia pages.

  2. Develop a Python script to read the dataset file, convert the necessary columns to appropriate data types (string for all columns except delta, added, and deleted, which should be integers), and push the records to a Kinesis data stream. Include an extra column called "txn_timestamp" to record the insertion time of each record.

  3. Set up a Flink SQL environment.

  4. Write a Flink SQL query to create a table with appropriate column types matching the dataset structure.

  5. Configure Flink to consume records from the Kinesis data stream and insert them into the created table.

  6. Execute a select statement in Flink SQL to display the contents of the table.
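
The producer in task 2 can be sketched as follows. This is a minimal, illustrative script: the file path, stream name, and partition-key column (and any column names other than delta, added, and deleted) are assumptions, not part of the lab.

```python
import csv
import json
from datetime import datetime, timezone

# Columns that must be converted to integers; every other column stays a string.
INT_COLUMNS = {"delta", "added", "deleted"}


def prepare_record(row):
    """Cast integer columns, keep the rest as strings, and stamp the insert time."""
    record = {
        key: int(value) if key in INT_COLUMNS else str(value)
        for key, value in row.items()
    }
    # Extra column tracking the time of record insertion, per task 2.
    record["txn_timestamp"] = datetime.now(timezone.utc).isoformat()
    return record


def push_file_to_kinesis(path, stream_name, partition_key_column):
    """Read the dataset file and push each record to a Kinesis data stream."""
    import boto3  # imported here so prepare_record works without the AWS SDK

    kinesis = boto3.client("kinesis")
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            record = prepare_record(row)
            kinesis.put_record(
                StreamName=stream_name,
                Data=json.dumps(record),
                PartitionKey=record[partition_key_column],
            )


if __name__ == "__main__":
    # Hypothetical file, stream, and key-column names -- adjust to your setup.
    push_file_to_kinesis("wikipedia_edits.csv", "wikipedia-edits-stream", "page")
```

Keeping the type conversion in its own function makes it easy to test locally before wiring in the Kinesis client.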


Level up your data engineering skills with Flink and Kinesis. Create a real-time data pipeline in no time.

Learnings

  • Knowledge of setting up and utilizing AWS services such as S3, Kinesis, and Flink.

  • Proficiency in Python scripting for data ingestion to Kinesis.

  • Understanding of Flink SQL syntax for creating tables and querying data.

  • Experience in working with different data types in Flink SQL.

  • Familiarity with real-time data processing pipelines using AWS services.

FAQs

Q1. What is the purpose of the Python script in this exercise?

The Python script reads the dataset file, converts columns to appropriate data types, and pushes the records to a Kinesis data stream for further processing.


Q2. How can I display the contents of the table in Flink SQL?

You can execute a select statement in Flink SQL to retrieve and display the contents of the table created from the Kinesis data stream.


Q3. Can I use different column types in the Flink SQL table?

Yes. In this exercise, all columns are treated as strings except "delta," "added," and "deleted," which are integers. The Flink SQL table definition should use column types that match the dataset structure.
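
A table definition along these lines would match that structure. This is a sketch, not the lab's exact DDL: the column names other than delta, added, and deleted, the stream name, and the region are assumptions to adjust to your setup.

```sql
-- Hypothetical Flink SQL DDL for the Kinesis-backed table (tasks 4-5).
CREATE TABLE wikipedia_edits (
    page          STRING,
    user_name     STRING,
    delta         INT,
    added         INT,
    deleted       INT,
    txn_timestamp STRING
) WITH (
    'connector'           = 'kinesis',
    'stream'              = 'wikipedia-edits-stream',
    'aws.region'          = 'us-east-1',
    'scan.stream.initpos' = 'TRIM_HORIZON',
    'format'              = 'json'
);

-- Display the contents of the table (task 6).
SELECT * FROM wikipedia_edits;
```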