PySpark Tutorial - Learn to use Apache Spark with Python

PySpark project: get a handle on using Python with Spark through this hands-on data processing tutorial.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Overview of the project, its motivation, and expected output
What is PySpark
Spark as a big data cluster computing framework
Installing Anaconda and Spark
Interacting with the Spark shell using the Python API
Understanding transformations and actions in Spark
Establishing the Spark environment and creating a handshake function between Python and Spark
What is a Resilient Distributed Dataset (RDD) and performing RDD operations
Creating RDD partitions and instances
Performing basic descriptive statistics using PySpark
Performing basic statistical tests in PySpark
Understanding linear relationships and calculating correlation
Performing the chi-squared test for non-linear relationships
Importing the necessary libraries for implementing a model on data points
Using map and lambda functions to read a dataset
Applying the logistic regression model for training and making final predictions
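
The statistics topics in the list above (descriptive statistics, correlation, and the chi-squared test) can be sketched roughly as follows. This is a minimal illustration, not the project's own code: it assumes the RDD-based `pyspark.mllib.stat.Statistics` API, a local Spark master, and a hypothetical input of two-column numeric CSV lines. The names `parse_row`, `run_statistics`, and the contingency-table counts are made up for the example.

```python
def parse_row(line, sep=","):
    """Turn one delimited text line into a list of floats."""
    return [float(field) for field in line.split(sep)]


def run_statistics(lines):
    """Column statistics, Pearson correlation, and a chi-squared test.

    `lines` is a hypothetical list of numeric CSV rows with two columns.
    """
    # Imported here so parse_row stays usable without Spark installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.linalg import Matrices, Vectors
    from pyspark.mllib.stat import Statistics

    # Assumed setup: a local master with two worker threads.
    conf = SparkConf().setMaster("local[2]").setAppName("stats-demo")
    sc = SparkContext.getOrCreate(conf)
    try:
        rows = sc.parallelize(lines).map(parse_row).cache()

        # Column-wise descriptive statistics over an RDD of vectors.
        summary = Statistics.colStats(rows.map(lambda r: Vectors.dense(r)))

        # Pearson correlation between the first and second column.
        x = rows.map(lambda r: r[0])
        y = rows.map(lambda r: r[1])
        pearson = Statistics.corr(x, y, method="pearson")

        # Chi-squared independence test on a 2x2 contingency table.
        # The counts are illustrative only (column-major order).
        table = Matrices.dense(2, 2, [20.0, 30.0, 25.0, 25.0])
        chi = Statistics.chiSqTest(table)

        return summary.mean(), summary.variance(), pearson, chi.pValue
    finally:
        sc.stop()
```

Calling `run_statistics(["1.0,2.1", "2.0,3.9", "3.0,6.2"])` on a machine with PySpark installed would return the column means, column variances, the correlation coefficient, and the p-value of the independence test.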

Project Description

This series of PySpark projects covers installing Apache Spark on a cluster and explores data analysis tasks using PySpark for a range of big data and data science applications.

This PySpark video tutorial explains, with multiple examples, the transformations and actions that can be performed using PySpark.
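
As a rough illustration of the difference between transformations and actions, the sketch below builds a small RDD, chains two lazy transformations, and then triggers the computation with actions. It assumes PySpark is installed and runs against a local master. The helper names `square`, `is_even`, and `run_example` are hypothetical and not taken from the tutorial videos.

```python
def square(x):
    """Per-element logic applied by the map() transformation."""
    return x * x


def is_even(x):
    """Predicate used by the filter() transformation."""
    return x % 2 == 0


def run_example():
    """Build a small RDD pipeline on a local Spark master (assumed setup)."""
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[2]").setAppName("transformations-demo")
    sc = SparkContext.getOrCreate(conf)
    try:
        # parallelize() distributes a local collection across 4 partitions.
        numbers = sc.parallelize(range(1, 11), numSlices=4)

        # Transformations are lazy: this line only records the lineage.
        even_squares = numbers.map(square).filter(is_even)

        # Actions trigger the computation and return results to the driver.
        return (
            numbers.getNumPartitions(),
            even_squares.count(),
            even_squares.collect(),
        )
    finally:
        sc.stop()
```

Calling `run_example()` returns the partition count, the number of even squares, and the even squares themselves. No work happens on the cluster until `count()` or `collect()` is called.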

Curriculum For This Mini Project

Overview of Project
What is PySpark
Install PySpark
Handshake between Python and Spark
RDD - Resilient Distributed Dataset
RDD operations
Basic Statistics using PySpark
Basic Statistical Test
Calculate Correlation
Chi-Squared Test
Implement Machine Learning
Logistic Regression Model
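
The last two curriculum items, reading a dataset with map and lambda functions and fitting a logistic regression model, might look roughly like the sketch below. It assumes the RDD-based `pyspark.mllib` API (`LabeledPoint` and `LogisticRegressionWithLBFGS`), a local master, and a hypothetical input format of `label,feature1,feature2,...` per line. The project itself may use a different dataset layout or the newer DataFrame-based `pyspark.ml` API.

```python
def parse_labeled(line, sep=","):
    """Split 'label,f1,f2,...' into (label, [features]) as floats."""
    values = [float(field) for field in line.split(sep)]
    return values[0], values[1:]


def train_and_predict(lines):
    """Fit a binary logistic regression model and predict on the same rows.

    `lines` is a hypothetical list of CSV strings with labels 0 or 1.
    """
    # Imported here so parse_labeled stays usable without Spark installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    conf = SparkConf().setMaster("local[2]").setAppName("logreg-demo")
    sc = SparkContext.getOrCreate(conf)
    try:
        # map + lambda turn raw text lines into LabeledPoint records.
        points = (
            sc.parallelize(lines)
            .map(parse_labeled)
            .map(lambda pair: LabeledPoint(pair[0], pair[1]))
            .cache()
        )

        model = LogisticRegressionWithLBFGS.train(points, iterations=50)

        # Predicting on the training rows keeps the sketch short; a real
        # project would evaluate on a held-out test split instead.
        return [model.predict(p.features) for p in points.collect()]
    finally:
        sc.stop()
```

For example, `train_and_predict(["0,1.0,0.5", "1,3.0,2.5", "0,0.8,0.2", "1,2.9,3.1"])` returns one predicted class label per input row.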