How to save a dataframe as a JSON file using PySpark


Recipe Objective: How to save a dataframe as a JSON file using PySpark?

In this recipe, we learn how to save a dataframe as a JSON file using PySpark.

Prerequisites:

Before proceeding with the recipe, make sure the following installations are done on your EC2 instance.

Steps to set up an environment:

  • In AWS, create an EC2 instance and log in to Cloudera Manager using the public IP of the EC2 instance. Log in to putty/terminal and check whether PySpark is installed. If it is not, use the installation links provided above.
  • Type “<your public IP>:7180” in the web browser and log in to Cloudera Manager, where you can check whether Hadoop, Hive, and Spark are installed.
  • If they are not visible in the Cloudera cluster, click “Add Services” in the cluster to add the required services to your instance.

Steps to save a dataframe as a JSON file:

Step 1: Set up the environment variables for PySpark, Java, Spark, and the Python library, as shown below:

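A minimal sketch of what this setup might look like from a Python script or notebook. The paths and the py4j zip name below are placeholders and will almost certainly differ on your instance:

import os
import sys

# Placeholder paths -- replace with the actual install locations on your instance.
os.environ["JAVA_HOME"] = "/usr/java/jdk1.8.0_232-cloudera"
os.environ["SPARK_HOME"] = "/usr/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and py4j libraries importable (zip file names vary by Spark version).
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")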

Please note that these paths may vary in one’s EC2 instance. Provide the full path where these are stored in your instance.

Step 2: Import SparkSession and initialize it. You can name your application and set the master at this step. In this recipe, appName is set to “demo” and the master is set to “local.”

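For reference, the initialization might look like the following, with the application name and master set as described above:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session named "demo" that runs against the local master.
spark = SparkSession.builder \
    .appName("demo") \
    .master("local") \
    .getOrCreate()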

Step 3: This recipe is demonstrated using the “user.csv” file. Make sure that the file is present in HDFS. Check for it using the command:

hadoop fs -ls <full path to the location of the file in HDFS>

Read the CSV file into a dataframe using the function spark.read.load().

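A sketch of the read, assuming a hypothetical HDFS location for the file and that it contains a header row; adjust both to match your data:

# "/user/root/user.csv" is a hypothetical HDFS path -- use the path from the hadoop fs -ls check above.
df = spark.read.load(
    "/user/root/user.csv",
    format="csv",
    header="true",
    inferSchema="true"
)
df.show()
df.printSchema()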

Step 4: Call the method dataframe.write.json() and pass the path under which you wish to store the file as the argument.

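A sketch of the write call, using “users_json.json” as the output name (the same name read back in the next step). Note that Spark writes a directory with that name containing one or more part files, rather than a single file:

# Write the dataframe out as JSON under the given path in HDFS.
df.write.json("users_json.json")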

Now check the JSON file created in the HDFS and read the “users_json.json” file.

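A sketch of that check, assuming the output landed under “users_json.json” in your HDFS user directory:

# From the terminal: list the output directory in HDFS.
#   hadoop fs -ls users_json.json
# Back in PySpark: read the JSON output into a dataframe and inspect it.
json_df = spark.read.json("users_json.json")
json_df.show()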

This is how a dataframe can be converted to JSON file format and stored in HDFS.

