A data pipeline is the set of actions performed from the time data becomes available for ingestion until value is derived from it. These actions include Extraction (pulling the fields of value out of the dataset), Transformation, and Loading (putting the valuable data into a form that is useful for downstream use).
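The three stages can be sketched in a few lines of plain Python. This is only an illustration, not the project's implementation: the sample records, the field names (`type`, `actor.login`), and the function names are all hypothetical, and the "load" step writes CSV to an in-memory buffer rather than to real storage.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical sample records, shaped loosely like event logs; not real data.
RAW_LINES = [
    '{"type": "PushEvent", "actor": {"login": "alice"}, "payload": {"size": 2}}',
    '{"type": "WatchEvent", "actor": {"login": "bob"}, "payload": {}}',
    '{"type": "PushEvent", "actor": {"login": "bob"}, "payload": {"size": 1}}',
]


def extract(lines):
    """Extraction: keep only the fields of value from each raw record."""
    for line in lines:
        record = json.loads(line)
        yield {"type": record["type"], "user": record["actor"]["login"]}


def transform(rows):
    """Transformation: aggregate the extracted rows into counts per event type."""
    return Counter(row["type"] for row in rows)


def load(counts, sink):
    """Loading: write the result as CSV so a downstream consumer can read it."""
    writer = csv.writer(sink, lineterminator="\n")
    writer.writerow(["event_type", "count"])
    for event_type, count in sorted(counts.items()):
        writer.writerow([event_type, count])


def run_pipeline(lines):
    """Chain the three stages and return the loaded output as a string."""
    sink = io.StringIO()
    load(transform(extract(lines)), sink)
    return sink.getvalue()


if __name__ == "__main__":
    print(run_pipeline(RAW_LINES))
```

In a real batch pipeline each stage would typically be a separate job reading from and writing to durable storage, so that a failed stage can be rerun without repeating the ones before it.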
In this big data project, we will simulate a simple batch data pipeline. Our dataset of interest comes from https://www.githubarchive.org/, which records over 20 kinds of events.
The objective of this Spark project is to create a small but real-world pipeline that downloads the dataset files as they become available, applies the various transformations, and loads the results into the storage formats required for further use.
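Downloading files "as they become available" usually means working out which batches have been published since the last run. The sketch below shows one way to do that, assuming the archive publishes one file per hour named `<YYYY-MM-DD>-<H>.json.gz` (unpadded hour) under `https://data.gharchive.org`. Both the host and the naming scheme are assumptions that should be checked against the archive's own documentation, and the function names are hypothetical. The code only builds URLs; it does not perform any download.

```python
from datetime import datetime, timedelta

# Assumption: hourly archives live at <BASE_URL>/<YYYY-MM-DD>-<H>.json.gz,
# where <H> is the hour of day (0-23) without zero padding.
BASE_URL = "https://data.gharchive.org"


def archive_url(hour):
    """Build the (assumed) URL of the archive file covering the given hour."""
    return f"{BASE_URL}/{hour:%Y-%m-%d}-{hour.hour}.json.gz"


def pending_urls(last_processed, now):
    """List URLs for every complete hour after last_processed and before now.

    The hour containing `now` is skipped because it is still in progress.
    """
    urls = []
    hour = last_processed.replace(minute=0, second=0, microsecond=0)
    hour += timedelta(hours=1)
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    while hour < current_hour:
        urls.append(archive_url(hour))
        hour += timedelta(hours=1)
    return urls


if __name__ == "__main__":
    for url in pending_urls(datetime(2015, 1, 1, 22), datetime(2015, 1, 2, 1, 30)):
        print(url)
```

Tracking the last processed hour, rather than always fetching "the latest" file, lets a scheduled run catch up after an outage without skipping or duplicating batches.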
This Spark project involves some software development, but the focus is on exploring Spark and Hive and discussing design decisions along the way.