Explain JSON functions in PySpark in Databricks

This recipe explains what the JSON functions are in PySpark in Databricks and how to use them.

Recipe Objective - Explain JSON functions in PySpark in Databricks

The JSON functions in Apache Spark are popularly used to query or extract elements from the JSON string of a DataFrame column by path and to convert it to a struct type, map type, etc. The from_json() function in PySpark converts a JSON string into a StructType or MapType column. The to_json() function in PySpark converts a MapType or StructType column into a JSON string. The json_tuple() function in PySpark extracts elements from a JSON string and creates them as new columns. The get_json_object() function in PySpark extracts a JSON element from a JSON string based on the specified JSON path. The schema_of_json() function derives the schema of a JSON string and returns it as a schema string.
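As a quick sketch of the StructType conversion mentioned above (the explicit schema and the variable names zip_schema and parsed are illustrative, reusing the sample record from this recipe):

# Sketch: from_json() with an explicit StructType schema
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName('from_json sketch').getOrCreate()
# Illustrative schema matching the sample JSON record used in this recipe
zip_schema = StructType([
StructField("Zipcode", IntegerType()),
StructField("ZipCodeType", StringType()),
StructField("City", StringType()),
StructField("State", StringType()),
])
df = spark.createDataFrame([(1, """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""")], ["id", "value"])
# Parse the JSON string column into a struct column, then read nested fields
parsed = df.withColumn("value", from_json(col("value"), zip_schema))
parsed.printSchema()
parsed.select("id", "value.City", "value.State").show(truncate=False)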

System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains the JSON functions and how to use them in PySpark.

Implementing the JSON functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.functions import json_tuple, get_json_object
from pyspark.sql.functions import schema_of_json, lit

The SparkSession, Row, MapType, StringType, from_json, to_json, col, json_tuple, get_json_object, schema_of_json, and lit packages are imported into the environment to demonstrate the JSON functions in PySpark.

# Implementing the JSON functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark JSON').getOrCreate()
Sample_Json_String = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
dataframe = spark.createDataFrame([(1, Sample_Json_String)], ["id", "value"])
dataframe.show(truncate=False)
# Using from_json() function
dataframe2 = dataframe.withColumn("value", from_json(dataframe.value, MapType(StringType(), StringType())))
dataframe2.printSchema()
dataframe2.show(truncate=False)
# Using to_json() function
dataframe2.withColumn("value", to_json(col("value"))) \
.show(truncate=False)
# Using json_tuple() function
dataframe.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City")) \
.toDF("id", "Zipcode", "ZipCodeType", "City") \
.show(truncate=False)
# Using get_json_object() function
dataframe.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").alias("ZipCodeType")) \
.show(truncate=False)
# Using schema_of_json() function
Schema_Str = spark.range(1) \
.select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
.collect()[0][0]
print(Schema_Str)

The "dataframe" value is created in which the Sample_Json_String is defined. Using the from_json() function, it converts JSON string to the Map key-value pair and defining "dataframe2" value. The to_json() function converts the DataFrame columns MapType or Struct type to the JSON string. The json_tuple() function returns the query or extracts the present elements from the JSON column and creates the new columns. The get_json_object() function extracts the JSON string based on the path from the JSON column. The schema_of_json() function creates the schema string from the JSON string column.
