How to save a dataframe as a JSON file using PySpark


Recipe Objective: How to save a dataframe as a JSON file using PySpark?

In this recipe, we learn how to save a dataframe as a JSON file using PySpark.

Prerequisites:

Before proceeding with the recipe, make sure the following installations are done on your EC2 instance.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to putty/terminal and check whether PySpark is installed. If it is not, use the installation links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop, Hive, and Spark are installed.
  • If they are not visible in the Cloudera cluster, click “Add Services” in the cluster to add the required services to your instance.

Steps to save a dataframe as a JSON file:

Step 1: Set up the environment variables for PySpark, Java, Spark, and the Python library, as shown below:

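A minimal sketch of what this setup might look like from a Python script or notebook. The paths and the py4j zip name below are placeholders and will almost certainly differ on your instance:

import os
import sys

# Placeholder paths -- replace with the actual install locations on your instance.
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_232-cloudera"
os.environ["SPARK_HOME"] = "/usr/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and py4j libraries importable (zip file names vary by Spark version).
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")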

Please note that these paths may vary in one’s EC2 instance. Provide the full path where these are stored in your instance.

Step 2: Import SparkSession and initialize it. You can name your application and set the master at this step. In this recipe, appName is set to “demo” and the master is set to “local.”

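For reference, the initialization might look like the following, with the application name and master set as described above:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session named "demo" that runs against the local master.
spark = SparkSession.builder \
    .appName("demo") \
    .master("local") \
    .getOrCreate()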

Step 3: This recipe is demonstrated using the “user.csv” file. Make sure that the file is present in HDFS. Check for it using the command:

hadoop fs -ls <full path to the location of the file in HDFS>

Read the CSV file into a dataframe using the function spark.read.load().

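A sketch of the read, assuming a hypothetical HDFS location for the file and that it contains a header row; adjust both to match your data:

# "/user/root/user.csv" is a hypothetical HDFS path -- use the path from the hadoop fs -ls check above.
df = spark.read.load(
    "/user/root/user.csv",
    format="csv",
    header="true",
    inferSchema="true"
)
df.show()
df.printSchema()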

Step 4: Call the method dataframe.write.json() and pass the path under which you wish to store the file as the argument.

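A sketch of the write call, using “users_json.json” as the output name (the same name read back in the next step). Note that Spark writes a directory with that name containing one or more part files, rather than a single file:

# Write the dataframe out as JSON under the given path in HDFS.
df.write.json("users_json.json")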

Now check the JSON file created in the HDFS and read the “users_json.json” file.

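A sketch of that check, assuming the output landed under “users_json.json” in your HDFS user directory:

# From the terminal: list the output directory in HDFS.
#   hadoop fs -ls users_json.json
# Back in PySpark: read the JSON output into a dataframe and inspect it.
json_df = spark.read.json("users_json.json")
json_df.show()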

This is how a dataframe can be converted to JSON file format and stored in HDFS.

