Data processing with Spark SQL

In this Apache Spark SQL project, we will go through provisioning data for retrieval using Spark SQL.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Introduction to the project and its roadmap
Setting up your virtual environment in Cloudera Quickstart
Spark as a distributed processing engine for big data
Basics of graph theory and Directed Acyclic Graphs
The Resilient Distributed Dataset (RDD), the basic data unit in Spark
RDDs as the working unit and performing transformations on RDDs
Understanding how Spark works, with an example
Introduction to the Spark Streaming module and Spark MLlib
Introduction to GraphX for graphs and graph-parallel computation
Introduction to Spark SQL and understanding its functionality
Setting up the Spark SQL Thrift server
Reading a JSON file and creating a Resilient Distributed Dataset (RDD)
Understanding and defining a schema for Spark SQL
Creating an RDD for our dataset and converting it to a DataFrame
Performance tuning Spark SQL for optimum output
Benchmarking queries in Hive, Spark SQL, and Impala

Project Description

Spark SQL provides a way to impose structure on any dataset, regardless of its source or format. Once the data is structured, it can be queried with tools such as Hive, Impala, and other Hadoop data warehouse tools.

In this Spark project, we will go through Spark SQL syntax to process the dataset, perform some joins with other supplementary data, and make the data available for querying through the Spark SQL Thrift server. Once the data is provisioned, we will run some interesting queries and go through some performance tuning techniques for Spark SQL.
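
As a rough illustration of the workflow described above, here is a minimal PySpark sketch: it gives two small datasets an explicit schema, registers them as views, and joins them with a Spark SQL query. The table names, column names, and sample rows are all hypothetical and are not taken from the project's actual datasets. It also assumes a local Spark installation with the `pyspark` package available.

```python
# Hypothetical column layouts; the project's real datasets may differ.
MOVIE_FIELDS = [("movie_id", "INT"), ("title", "STRING"), ("year", "INT")]
RATING_FIELDS = [("movie_id", "INT"), ("rating", "DOUBLE")]

# Join the main dataset with the supplementary one and aggregate per title.
JOIN_QUERY = """
SELECT m.title, AVG(r.rating) AS avg_rating
FROM movies m
JOIN ratings r ON m.movie_id = r.movie_id
GROUP BY m.title
ORDER BY avg_rating DESC
"""


def build_ddl_schema(fields):
    """Render (name, type) pairs as a DDL schema string, e.g. 'a INT, b STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


def run_demo():
    # Imported here so the helpers above can be used without a Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-sql-sketch")
        .getOrCreate()
    )

    # Made-up sample rows, structured with the DDL schema strings.
    movies = spark.createDataFrame(
        [(1, "Alien", 1979), (2, "Heat", 1995)],
        schema=build_ddl_schema(MOVIE_FIELDS),
    )
    ratings = spark.createDataFrame(
        [(1, 4.5), (1, 4.0), (2, 5.0)],
        schema=build_ddl_schema(RATING_FIELDS),
    )

    # Temporary views make the DataFrames addressable by name from SQL.
    movies.createOrReplaceTempView("movies")
    ratings.createOrReplaceTempView("ratings")

    spark.sql(JOIN_QUERY).show()
    spark.stop()
```

Calling `run_demo()` in an environment with PySpark installed runs the join locally. Temporary views only live for the duration of the session, so exposing the result through the Thrift server would likely involve an extra step, such as saving the DataFrame as a table (for example with `DataFrame.write.saveAsTable`) in a metastore that the server shares. That step is left out here because it depends on how the cluster is configured.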

Similar Projects

In this Hadoop project, we continue the series on data engineering by discussing and implementing various ways to solve the Hadoop small file problem.

In this big data project, we'll work with Apache Airflow and write a scheduled workflow that downloads data from Wikipedia archives, uploads it to S3, processes it in Hive, and finally analyzes it in Zeppelin notebooks.

In this NoSQL project, we will use two NoSQL databases (HBase and MongoDB) to store Yelp business attributes and learn how to retrieve this data for processing or querying.

Curriculum For This Mini Project

Introduction to the Project
Starting Cloudera Quickstart VM
Overview of the Datasets Used for the Project
What is Spark?
Introduction to Directed Acyclic Graph (DAG)
Introduction to RDDs in Spark
RDDs in Action
Transformations in RDDs
Example of how Spark works
Introduction to Spark Streaming Module
Introduction to Spark MLlib
Introduction to GraphX
Introduction to Spark SQL
Example of how Spark SQL works
Read JSON File and Create RDDs
How to define a schema in Spark SQL?
Creating an RDD from Movie Dataset
Converting the RDD into a DataFrame
Defining the DataFrame Schema: Build Schemas using the Million Song Dataset
Read the Million Song Dataset CSV File
Loading Data
Working with DataFrames
Working with Crime Dataset
Start Spark Shell and Connect to Hive
Hive Querying using Spark Context
Read a file and Save data as Parquet
Load Data
Streaming Data
Setting up the Spark SQL Thrift Server