How to Save a PySpark Dataframe to a CSV File?


Recipe Objective: How to Save a PySpark Dataframe to a CSV File? 

Are you working with PySpark and looking for a seamless way to save your DataFrame to a CSV file? You're in the right place! Check out this recipe to explore various methods and optimizations for efficiently saving your PySpark DataFrame as a CSV file.

Prerequisites - Saving PySpark Dataframe to CSV 

Before proceeding with the recipe, make sure the following installations are done on your EC2 instance.

Steps to set up an environment 

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in via PuTTY/terminal and check whether PySpark is installed; if not, install it before proceeding.

  • Type "<your public IP>:7180" in the web browser and log in to Cloudera Manager, where you can check whether Hadoop, Hive, and Spark are installed.

  • If they are not visible in the Cloudera cluster, add them by clicking "Add Services" in the cluster.


How to Save a PySpark Dataframe as a CSV File - Step-by-Step Guide 

Here is a step-by-step implementation of saving a PySpark DataFrame to a CSV file.

Step 1: Set up the environment 

This step involves setting the environment variables for PySpark, Java, Spark, and the Python libraries.


Please note that these paths may vary in one's EC2 instance. Provide the full path where these are stored in your instance.
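In Python, these variables can be set with `os.environ` before Spark is initialized. A minimal sketch (the paths below are placeholders, not the actual locations on your instance):

```python
import os

# Placeholder paths -- replace each with the full path on your own EC2 instance
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
```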

Step 2: Import the Spark session 

This step involves importing and initializing the Spark session. You can name your application and set the master at this step. In this recipe, the appName is "demo" and the master is set to "local".


Step 3: Create a DataFrame 

Let's demonstrate this recipe by creating a DataFrame from the "users_json.json" file. Make sure that the file is present in HDFS, which you can verify with the command:

hadoop fs -ls <full path to the location of the file in HDFS>

The JSON file "users_json.json" provides the user records from which the DataFrame is created in this recipe.


Step 4: Read the JSON File 

The next step involves reading the JSON file into a DataFrame (here, "df") using the code spark.read.json("users_json.json") and checking the data present in this DataFrame.



Step 5: Store the DataFrame as a CSV File 

Store this DataFrame as a CSV file using the code df.write.csv("csv_users.csv"), where "df" is our DataFrame and "csv_users.csv" is the name of the CSV output we create upon saving it.


Step 6: Check the Schema 

Now check the schema and data in the dataframe upon saving it as a CSV file.


This is how a dataframe can be saved as a CSV file using PySpark. 


Elevate Your PySpark Skills with ProjectPro!

Saving a PySpark DataFrame to a CSV file is a fundamental operation in data processing. By understanding the methods available and incorporating optimizations, you can streamline this process and enhance the efficiency of your PySpark workflows. However, real-world projects provide the practical experience needed to navigate the complexities of PySpark effectively. ProjectPro, with its repository of 270+ data science and big data projects, offers a dynamic platform for learners to apply their knowledge in a hands-on, professional context. Redirect your learning journey to ProjectPro to gain practical expertise, build a compelling portfolio, and stand out in the competitive landscape of big data analytics. 

