Each project comes with 2-5 hours of micro-videos explaining the solution.
Get access to 50+ solved projects with iPython notebooks and datasets.
Add project experience to your LinkedIn/GitHub profiles.
What is Data Ingestion?
Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed and analyzed by an organization. The storage medium acts as the destination, which is typically a data warehouse, database, data mart, or document store. The data can come from a variety of sources, such as an RDBMS and other types of databases, S3 buckets, CSV files, etc.
A data pipeline refers to a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real time (streaming) rather than in batches. A data pipeline spans everything from extracting or capturing data with various tools, storing the raw data, cleaning and validating it, and transforming it into a query-worthy format, to visualising KPIs, along with the orchestration of the whole process.
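As a rough illustration, the stages just described can be sketched as a tiny Python pipeline. The record fields and the in-memory "warehouse" list are hypothetical stand-ins for real sources and storage:

```python
import json

def extract(raw_lines):
    """Capture raw data: parse newline-delimited JSON records."""
    return [json.loads(line) for line in raw_lines]

def validate(records):
    """Keep only records that carry the fields downstream steps need."""
    return [r for r in records if "business_id" in r and "stars" in r]

def transform(records):
    """Shape records into a query-worthy format."""
    return [{"business_id": r["business_id"], "stars": float(r["stars"])}
            for r in records]

def load(rows, warehouse):
    """Store the transformed rows in the destination."""
    warehouse.extend(rows)
    return warehouse

# Orchestrate the stages end to end on a couple of sample records.
raw = ['{"business_id": "b1", "stars": 4}', '{"business_id": "b2"}']
warehouse = load(transform(validate(extract(raw))), [])
print(warehouse)  # → [{'business_id': 'b1', 'stars': 4.0}]
```

In a real pipeline each stage would be a separate, scheduled job; the point here is only the extract → validate → transform → load shape.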
What is the Agenda of the project?
The project involves building a data ingestion and processing pipeline on Google Cloud Platform (GCP) with real-time streaming and batch loads, using the Yelp dataset, which is published for academic and research purposes. We first create a service account on GCP and download the Google Cloud SDK (Software Development Kit). Python and all other dependencies are then installed and connected to the GCP account for the steps that follow. Next, the Yelp dataset is downloaded in JSON format: the JSON file is moved via the Cloud SDK to Cloud Storage, which is connected to Cloud Composer, while the Yelp JSON stream is published to a Pub/Sub topic. The Cloud Composer and Pub/Sub outputs feed Apache Beam, which connects to Google Dataflow. Google BigQuery receives the structured data from the Dataflow workers. Finally, the data is passed to Google Data Studio for visualization.
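One of the steps above, publishing Yelp JSON records to a Pub/Sub topic, might look like the following sketch. The encoding helper is plain Python; the commented client calls use the `google-cloud-pubsub` package, and the project and topic names are placeholders, not values from this project:

```python
import json

def to_pubsub_message(record):
    """Pub/Sub message payloads are bytes, so serialize the JSON record."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

record = {"business_id": "b1", "stars": 4.5}
data = to_pubsub_message(record)
print(data)

# Publishing with the google-cloud-pubsub client (placeholder names):
# from google.cloud import pubsub_v1
# publisher = pubsub_v1.PublisherClient()
# topic_path = publisher.topic_path("my-gcp-project", "yelp-stream")
# future = publisher.publish(topic_path, data=data)
# future.result()  # blocks until the message is accepted by Pub/Sub
```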
Usage of Dataset:
Here we are going to use Yelp data in JSON format in the following ways:
- Yelp dataset file: the JSON file is transferred via Cloud Storage FUSE or the Cloud SDK to Google Cloud Storage, which stores the incoming raw data; Google Cloud Composer (managed Airflow) then connects to Cloud Storage to schedule and orchestrate the batch workloads.
- Yelp dataset stream: JSON streams are published to a Google Pub/Sub topic for real-time data ingestion and are then consumed by Apache Beam for further processing.
- From the given website, the Yelp dataset is downloaded in JSON format. The Yelp JSON file is moved via the Google Cloud SDK or gcsfuse to Google Cloud Storage, which is connected to Google Cloud Composer/Airflow for scheduling and orchestration of the batch workloads.
- Yelp dataset JSON streams are published to Google Pub/Sub, which handles real-time ingestion of streaming data.
- A data pipeline is created with Apache Beam, which takes the real-time data from Google Pub/Sub and the data from Google Cloud Storage as inputs; Google Dataflow stream and batch jobs are then created, scaling the compute based on throughput.
- Apache Beam orchestrates the stream and batch jobs, and Google Dataflow distributes their output across its workers.
- Google BigQuery acts as the data warehouse: it stores the structured data received from the Dataflow workers and makes it available for querying.
- Finally, the data is visualized using various charts and table definitions in Google Data Studio.
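The per-record logic that the Beam/Dataflow stage applies before writing to BigQuery can be sketched as plain functions. The field names and the BigQuery row layout are assumptions about the Yelp business records, and the Beam pipeline wiring is indicated only in comments:

```python
import json

def parse_message(payload: bytes):
    """Decode a Pub/Sub payload into a Python dict (one Yelp record)."""
    return json.loads(payload.decode("utf-8"))

def to_bigquery_row(record):
    """Map a record onto the (assumed) BigQuery table schema."""
    return {
        "business_id": record.get("business_id"),
        "name": record.get("name"),
        "stars": record.get("stars"),
    }

row = to_bigquery_row(parse_message(b'{"business_id": "b1", "name": "Cafe", "stars": 4.5}'))
print(row)

# Inside an Apache Beam pipeline these functions would sit in the graph,
# roughly (sketch, not verbatim Beam code):
#   (p | beam.io.ReadFromPubSub(topic=...)
#      | beam.Map(parse_message)
#      | beam.Map(to_bigquery_row)
#      | beam.io.WriteToBigQuery(table=...))
```

Keeping the parsing and mapping logic in standalone functions like this also makes the transform easy to unit-test outside of Dataflow.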