PySpark Tutorial - Learn to use Apache Spark with Python

PySpark Project - Get a handle on using Python with Spark through this hands-on data processing PySpark tutorial.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Overview of the project, its motivation and expected output
What is PySpark
Spark as a big data cluster-computing framework
Installing Anaconda and Spark
Interacting with the Spark shell using the Python API
Understanding transformations and actions in Spark
Establishing the Spark environment and creating a handshake function between Python and Spark
What is a Resilient Distributed Dataset (RDD) and performing RDD operations
Creating RDD partitions and instances
Performing basic descriptive statistics using PySpark
Performing basic statistical tests in PySpark
Understanding linear relationships and calculating correlation
Performing the Chi-squared test for non-linear relationships
Importing the necessary libraries for implementing a model on data points
Using map and lambda functions to read a dataset
Applying a logistic regression model for training and making final predictions

Project Description

This series of PySpark projects looks at installing Apache Spark on a cluster and explores various data analysis tasks using PySpark for big data and data science applications.

This PySpark video tutorial explains, with multiple examples, the various transformations and actions that can be performed using PySpark.

Similar Projects

In this machine learning project, we will implement the back-propagation algorithm from scratch for classification problems.

In this machine learning churn project, we implement a churn prediction model in Python using ensemble techniques.

The goal of this Spark project is to analyze business reviews from the Yelp dataset and ingest the final output of the data processing into Elasticsearch. We also use the visualization tool in the ELK stack to visualize various kinds of ad-hoc reports from the data.

Curriculum For This Mini Project

Overview of Project
What is PySpark
Install PySpark
Handshake between Python and Spark
RDD - Resilient Distributed Dataset
RDD operations
Basic Statistics using PySpark
Basic Statistical Tests
Calculate Correlation
Chi-Squared Test
Implement Machine Learning
Logistic Regression Model