PySpark Tutorial - Learn to use Apache Spark with Python

PySpark Project: get a handle on using Python with Spark through this hands-on data processing tutorial.


Each project comes with 2-5 hours of micro-videos explaining the solution.

Code & Dataset

Get access to 50+ solved projects with IPython notebooks and datasets.

Project Experience

Add project experience to your LinkedIn/GitHub profiles.

What will you learn

Overview of the project, its motive and expected output
What is PySpark
Spark as a big data cluster computing framework
Installing Anaconda and Spark
Interaction with Spark Shell using Python API
Understanding Transformations and Actions in Spark
Establishing Spark Environment and creating a handshake function between Python and Spark
What is a Resilient Distributed Dataset (RDD) and performing RDD operations
Creating RDD partitions and Instances
Performing Basic Descriptive Statistics using PySpark
Performing Basic Statistical Tests in PySpark
Understanding Linear Relationships and calculating Correlation
Performing the Chi-Squared test of independence for categorical variables
Importing the necessary libraries for implementing a model on data points
Using map and lambda functions to read a dataset
Applying the Logistic Regression model for training and making final predictions

Project Description

This series of PySpark projects looks at installing Apache Spark on a cluster and explores data analysis tasks using PySpark for big data and data science applications.

This PySpark video tutorial explains, with multiple examples, the transformations and actions that can be performed using PySpark.

Curriculum For This Mini Project

Overview of Project
What is PySpark
Install PySpark
Handshake between Python and Spark
RDD - Resilient Distributed Dataset
RDD operations
Basic Statistics using PySpark
Basic Statistical Test
Calculate Correlation
Chi Squared Test
Implement Machine Learning
Logistic Regression Model