Data processing with Spark SQL

In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.

Videos

Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

Customer Love

Dhiraj Tandon

Solution Architect-Cyber Security at ColorTokens

My interaction was very short but left a positive impression. I enrolled and asked for a refund since I could not find the time. What happened next: they initiated the refund immediately. Their...

Camille St. Omer

Artificial Intelligence Researcher, Quora 'Most Viewed Writer' in 'Data Mining'

I came to the platform with no experience and now I am knowledgeable in Machine Learning with Python. No easy thing, I must say: the sessions are challenging and go into depth. I looked at graduate...

What will you learn

Introduction to the project and its roadmap
Setting up your virtual environment in Cloudera Quickstart
Spark as a distributed Big Data processing engine
Basics of graph theory and Directed Acyclic Graphs
The Resilient Distributed Dataset (RDD), the basic data unit in Spark
The RDD as the working unit and performing transformations on RDDs
Understanding how Spark works, with an example
Introduction to the Spark Streaming module and Spark MLlib
Introduction to GraphX for graphs and graph-parallel computation
Introduction to Spark SQL and understanding its functionality
Setting up the Spark SQL Thrift server
Reading a JSON file and creating a Resilient Distributed Dataset
Understanding and defining a schema for Spark SQL
Creating an RDD for our dataset and converting it to a DataFrame
Performance tuning the model for optimum output
Benchmarking queries in Hive, Spark SQL, and Impala

Project Description

Spark SQL provides a way to impose structure on any dataset, regardless of its source or form. Once the data is structured, it can be queried using tools like Hive, Impala, and other Hadoop data warehouse tools.

In this Spark project, we will go through Spark SQL syntax to process the dataset, perform some joins with supplementary data, and make the data available for querying through the Spark SQL Thrift server. Once the data is provisioned, we will run some interesting queries and go through some performance tuning techniques for Spark SQL.

Similar Projects

The goal of this Spark project for students is to explore the features of Spark SQL in practice on the latest version of Spark, i.e. Spark 2.0.

Hive Project: Understand the various types of SCDs and implement these slowly changing dimensions in Hadoop Hive and Spark.

This Elasticsearch example deploys the AWS ELK stack to analyse streaming event data. Tools used include NiFi, PySpark, Elasticsearch, Logstash and Kibana for visualisation.

Curriculum For This Mini Project

Introduction to the Project (01m)
Starting Cloudera Quickstart VM (00m)
Overview of the Datasets Used for the Project (04m)
What is Spark? (17m)
Introduction to Directed Acyclic Graph (DAG) (03m)
Introduction to RDDs in Spark (06m)
RDDs in Action (04m)
Transformations in RDDs (09m)
Example on how Spark works (05m)
Introduction to Spark Streaming Module (04m)
Introduction to Spark MLlib (01m)
Introduction to GraphX (04m)
Introduction to Spark SQL (04m)
Example on how Spark SQL works (05m)
Read JSON File and Create RDDs (09m)
How to define schema in Spark SQL? (07m)
Creating an RDD from Movie Dataset (03m)
Converting the RDD into a DataFrame (03m)
Defining the DataFrame Schema: Build Schemas Using the Million Song Dataset (17m)
Read Million Song Dataset CSV File (21m)
Loading Data (05m)
Working with DataFrames (01h 00m)
Working with Crime Dataset (01m)
Start Spark Shell and Connect to Hive (08m)
Hive Querying Using Spark Context (03m)
Read a File and Save Data as Parquet (08m)
Load Data (03m)
Streaming Data (04m)
Setting up the Spark SQL Thrift Server (09m)