GCP Data Ingestion with SQL using Google Cloud Dataflow


In this GCP project, you will learn to build a data processing pipeline with Apache Beam, Dataflow and BigQuery on GCP using the Yelp dataset.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.


Code & Dataset

Get access to 50+ solved projects with iPython notebooks and datasets.


Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love


Ray Han

Tech Leader | Stanford / Yale University

I think that they are fantastic. I attended Yale and Stanford and have worked at Honeywell, Oracle, and Arthur Andersen (Accenture) in the US. I have taken Big Data and Hadoop, NoSQL, Spark, Hadoop...


SUBHABRATA BISWAS

Lead Consultant, ITC Infotech

The project orientation is very unique and it helps to understand the real-time scenarios most industries are dealing with. And there is no limit, one can go through as many projects...

What will you learn

Understanding the project and how to use Google Cloud Storage
Visualizing the complete architecture of the system
Introduction to the Google Cloud SDK
Using the Google Cloud SDK and connecting it to the service account
Exploring the Yelp dataset and using its JSON stream and JSON file
Creating a Google Cloud Storage bucket
Understanding PubSub and using it for data ingestion by creating topics
Understanding Apache Beam and using it for executing data processing pipelines
Creating Dataflow stream jobs and understanding them
Creating Dataflow batch jobs and understanding them
Understanding Cloud Composer/Airflow and using it for orchestrating batch workloads
Creating a Cloud Composer instance and using it for scheduling a job
Tracking DAG runs and other configurations in Cloud Composer
Integrating PubSub and Cloud Composer/Airflow with Apache Beam
Integrating Google Cloud Dataflow and Google BigQuery
Understanding Google BigQuery and using it as a data warehouse
Integrating BigQuery and Data Studio
Displaying live stream results using Google Data Studio

Project Description

What is Data Ingestion?

Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed and analyzed by an organization. The storage medium acts as the destination, typically a data warehouse, database, data mart, or document store. The data can come from many kinds of sources, such as relational databases (RDBMS), S3 buckets, CSV files, etc.


Data Pipeline:

A data pipeline is a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real time (streaming) instead of in batches. A data pipeline covers everything from extracting or capturing data using various tools, storing the raw data, cleaning and validating it, and transforming it into a query-worthy format, to visualising KPIs and orchestrating all of the above.
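
To make the idea concrete, here is a minimal sketch of such a pipeline written with Apache Beam in Python. It reads a newline-delimited Yelp JSON file, keeps a few fields, and writes the result back out; the file names and the selected fields are illustrative placeholders, and by default it runs locally with the DirectRunner rather than on Dataflow.

```python
# Minimal Apache Beam batch pipeline: read newline-delimited Yelp JSON,
# keep a few fields, and write the cleaned records back out.
# File names are placeholders; runs locally with the DirectRunner by default.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def extract_business_fields(line):
    """Parse one JSON record and keep only a few fields of interest."""
    record = json.loads(line)
    return {
        "business_id": record.get("business_id"),
        "name": record.get("name"),
        "city": record.get("city"),
        "stars": record.get("stars"),
    }


if __name__ == "__main__":
    # Pass --runner=DataflowRunner (plus project/region/temp_location) to run on GCP.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadRawJson" >> beam.io.ReadFromText("yelp_academic_dataset_business.json")
            | "ParseAndClean" >> beam.Map(extract_business_fields)
            | "SerializeJson" >> beam.Map(json.dumps)
            | "WriteResults" >> beam.io.WriteToText("business_cleaned", file_name_suffix=".json")
        )
```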

Data Architecture:

[Architecture diagram: Yelp JSON file and stream → Cloud Storage / PubSub → Apache Beam on Dataflow → BigQuery → Data Studio, orchestrated with Cloud Composer]

What is the Agenda of the project?

The project builds a data ingestion and processing pipeline on Google Cloud Platform that handles both real-time streaming and batch loads, using the Yelp dataset, which is widely used for academic and research purposes. We first create a service account on GCP and download the Google Cloud SDK (software development kit). Python and the other dependencies are then installed and connected to the GCP account for the later steps. Next, the Yelp dataset is downloaded in JSON format: the JSON file is transferred through the Cloud SDK to Cloud Storage, which is connected to Cloud Composer, while the Yelp JSON stream is published to a PubSub topic. The outputs of Cloud Composer and PubSub feed Apache Beam pipelines that run on Google Dataflow, and Google BigQuery receives the structured data from the Dataflow workers. Finally, the data is passed to Google Data Studio for visualization.
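
As an illustration of the storage step, the sketch below uses the google-cloud-storage Python client to create a bucket and upload the raw Yelp JSON file. The project ID, bucket name, and file name are placeholders, and the snippet assumes the service-account key created earlier is exposed through the GOOGLE_APPLICATION_CREDENTIALS environment variable.

```python
# Sketch: create a Cloud Storage bucket and upload the raw Yelp JSON file.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at the service-account key
# downloaded earlier; project, bucket, and file names are placeholders.
from google.cloud import storage

PROJECT_ID = "my-gcp-project"                        # placeholder
BUCKET_NAME = "yelp-raw-data-bucket"                 # placeholder
LOCAL_FILE = "yelp_academic_dataset_business.json"   # placeholder

client = storage.Client(project=PROJECT_ID)

# Create the bucket only if it does not already exist.
bucket = client.lookup_bucket(BUCKET_NAME)
if bucket is None:
    bucket = client.create_bucket(BUCKET_NAME, location="US")

# Upload the raw Yelp JSON so Dataflow and Composer can read it later.
blob = bucket.blob(f"raw/{LOCAL_FILE}")
blob.upload_from_filename(LOCAL_FILE)
print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")
```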

Usage of Dataset:

Here we are going to use the Yelp data in JSON format in the following ways:

- Yelp dataset file: the JSON file is transferred via Cloud Storage FUSE (gcsfuse) or the Cloud SDK to Google Cloud Storage, which stores the incoming raw data; Google Cloud Composer (Airflow) is then connected to Cloud Storage to schedule and orchestrate the batch workloads (a minimal DAG sketch follows this list).

- Yelp dataset stream: the JSON stream is published to a Google PubSub topic for real-time data ingestion and is then consumed by Apache Beam for further processing (a publisher sketch also follows this list).
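
To illustrate the stream path, the sketch below uses the google-cloud-pubsub Python client to read the Yelp JSON file line by line and publish each record to a PubSub topic; the project, topic, and file names are placeholders for whatever was created in the PubSub step.

```python
# Sketch: publish Yelp JSON records line by line to a Pub/Sub topic so the
# streaming pipeline can consume them. Project, topic, and file names are
# placeholders.
from google.cloud import pubsub_v1

PROJECT_ID = "my-gcp-project"                       # placeholder
TOPIC_ID = "yelp-stream-topic"                      # placeholder
SOURCE_FILE = "yelp_academic_dataset_review.json"   # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

with open(SOURCE_FILE, "r") as stream:
    for line in stream:
        # Each line is one JSON record; Pub/Sub messages are raw bytes.
        future = publisher.publish(topic_path, line.strip().encode("utf-8"))
        future.result()  # block until the message is accepted
```

For the file path, the batch orchestration in Cloud Composer comes down to an Airflow DAG. The sketch below assumes, purely for illustration, that the batch Beam pipeline lives in a script named yelp_batch_pipeline.py in the Composer DAGs bucket and launches it on Dataflow once a day; the schedule, paths, and project settings are placeholders.

```python
# Sketch: a minimal Cloud Composer (Airflow 2) DAG that launches the batch
# Dataflow job on a daily schedule. Script path, project, region, and bucket
# are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="yelp_batch_ingestion",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dataflow_batch = BashOperator(
        task_id="run_dataflow_batch_job",
        bash_command=(
            "python /home/airflow/gcs/dags/yelp_batch_pipeline.py "
            "--runner=DataflowRunner --project=my-gcp-project "
            "--region=us-central1 --temp_location=gs://yelp-raw-data-bucket/tmp"
        ),
    )
```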

 

Data Analysis:

- The Yelp dataset is downloaded from the Yelp website in JSON format. The Yelp JSON file is transferred via the Google Cloud SDK or gcsfuse to Google Cloud Storage, which is connected to Google Cloud Composer/Airflow for scheduling and orchestration of the batch workloads.

- The Yelp dataset JSON stream is published to Google PubSub, which handles real-time ingestion of the streaming data.

- A data pipeline is built with Apache Beam, taking the real-time data from Google PubSub and the files in Google Cloud Storage as its inputs; Google Dataflow stream and batch jobs are then created from it, scaling the compute based on throughput (see the pipeline sketch after this list).

- Apache Beam defines the stream and batch jobs, and Google Dataflow distributes the work across its workers.

- Google BigQuery acts as the data warehouse: it stores the structured data received from the Dataflow workers and makes it queryable.

- Finally, the data is visualized with different graphs and table definitions in Google Data Studio.
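
Putting the streaming pieces together, the sketch below shows one possible shape of the streaming Beam pipeline: it reads Yelp messages from a PubSub subscription, parses them into rows, and appends them to a BigQuery table when submitted as a Dataflow streaming job. The project, region, subscription, bucket, table names, and the column schema shown are placeholders rather than the exact values used in the project.

```python
# Sketch: streaming Apache Beam pipeline that reads Yelp JSON messages from
# Pub/Sub and appends them to a BigQuery table, run as a Dataflow streaming
# job. All names (project, subscription, bucket, table) are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

SUBSCRIPTION = "projects/my-gcp-project/subscriptions/yelp-stream-sub"  # placeholder
TABLE = "my-gcp-project:yelp_dataset.business"                          # placeholder


def parse_message(message):
    """Decode one Pub/Sub message (bytes) into a BigQuery row (dict)."""
    record = json.loads(message.decode("utf-8"))
    return {
        "business_id": record.get("business_id"),
        "name": record.get("name"),
        "city": record.get("city"),
        "stars": record.get("stars"),
    }


if __name__ == "__main__":
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",                        # placeholder
        region="us-central1",                            # placeholder
        temp_location="gs://yelp-raw-data-bucket/tmp",   # placeholder
        streaming=True,
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=SUBSCRIPTION)
            | "ParseJson" >> beam.Map(parse_message)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                TABLE,
                schema="business_id:STRING,name:STRING,city:STRING,stars:FLOAT",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```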

Similar Projects

In this AWS Project, you will build an end-to-end log analytics solution to collect, ingest and process data. The processed data can be analysed to monitor the health of production systems on AWS.

This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include Nifi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Use the dataset on aviation for analytics to simulate a complex real-world big data pipeline based on messaging with AWS Quicksight, Druid, NiFi, Kafka, and Hive.

Curriculum For This Mini Project

Overview of Architecture and Problem
08m
Creating Service account and setting of Cloud SDK
08m
Installing Python and other Dependencies
05m
Downloading and Understanding Yelp Dataset
05m
Creating Cloud bucket and transferring data to Cloud bucket
06m
Understanding Google Cloud PubSub for Streaming Data
07m
Creating PubSub topic subscription and publishing messages Part 1
08m
Creating PubSub topic subscription and publishing messages Part 2
02m
Overview of Apache beam with Dataflow
05m
Creation of dataflow stream jobs and explanation of parameters
09m
Understanding the Dataflow job in Console
06m
Creation of Dataflow Batch job with its execution
09m
Overview of Google Cloud Composer Airflow for Scheduling
06m
Creation of Google Cloud composer instance
05m
Scheduling or manually running a job from a Google Cloud Composer instance
09m
Overview of the Google Cloud Composer UI: monitoring DAG runs and other configurations
05m
Overview of Google BigQuery as a data warehouse, with the data types it supports
06m
Querying and Visualisation of data in Google BigQuery
08m