Explain JSON functions in PySpark in Databricks

This recipe explains what JSON functions are in PySpark in Databricks.

Recipe Objective - Explain JSON functions in PySpark in Databricks

The JSON functions in Apache Spark are commonly used to query or extract elements from the JSON string of a DataFrame column by path, and to convert it to a struct, map type, etc. The from_json() function in PySpark converts a JSON string into a StructType or MapType column. The to_json() function converts a MapType or StructType column into a JSON string. The json_tuple() function extracts elements from a JSON string and returns them as new columns. The get_json_object() function extracts a JSON element from a JSON string based on the specified JSON path. The schema_of_json() function derives the schema of a JSON string and returns it as a DDL-formatted schema string.


System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains the JSON functions and how to use them in PySpark.

Implementing the JSON functions in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType
from pyspark.sql.functions import from_json, to_json, col
from pyspark.sql.functions import json_tuple, get_json_object
from pyspark.sql.functions import schema_of_json, lit

The SparkSession, Row, MapType, StringType, from_json, to_json, col, json_tuple, get_json_object, schema_of_json, and lit packages are imported into the environment to demonstrate the JSON functions in PySpark.

# Implementing the JSON functions in Databricks in PySpark
spark = SparkSession.builder.appName('PySpark JSON').getOrCreate()
Sample_Json_String = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
dataframe = spark.createDataFrame([(1, Sample_Json_String)],["id","value"])
dataframe.show(truncate=False)
# Using from_json() function
dataframe2 = dataframe.withColumn("value", from_json(dataframe.value,MapType(StringType(), StringType())))
dataframe2.printSchema()
dataframe2.show(truncate=False)
# Using to_json() function
dataframe2.withColumn("value", to_json(col("value"))) \
.show(truncate=False)
# Using json_tuple() function
dataframe.select(col("id"),json_tuple(col("value"),"Zipcode","ZipCodeType","City")) \
.toDF("id","Zipcode","ZipCodeType","City") \
.show(truncate=False)
# Using get_json_object() function
dataframe.select(col("id"), get_json_object(col("value"),"$.ZipCodeType").alias("ZipCodeType")) \
.show(truncate=False)
# Using schema_of_json() function
Schema_Str = spark.range(1) \
.select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))) \
.collect()[0][0]
print(Schema_Str)

The "dataframe" value is created from the Sample_Json_String. The from_json() function converts the JSON string into map key-value pairs, defining the "dataframe2" value. The to_json() function converts the DataFrame's MapType or StructType column back into a JSON string. The json_tuple() function extracts the requested elements from the JSON column and returns them as new columns. The get_json_object() function extracts a JSON element from the JSON column based on the specified path. The schema_of_json() function creates the schema string from the JSON string.

