Explain ArrayType functions in PySpark in Databricks


Recipe Objective: Explain ArrayType functions in PySpark in Databricks

The PySpark ArrayType is a widely used collection data type that extends the DataType class, the superclass of all types in PySpark. All elements of an ArrayType column must be of the same type. The "pyspark.sql.types.ArrayType" class is used to define an array data type column on a DataFrame that holds elements of the same type. Several SQL functions work with such columns: the explode() function creates a new row for each element in the given array column; the split() function returns an array type after splitting a string column by a delimiter; the array() function creates a new array column by merging the data from multiple columns, all of which must have the same data type; and the array_contains() function checks whether an array column contains a value, returning null if the array is null, true if the array contains the value, and false otherwise.

System Requirements

  • Python (3.x version)
  • Apache Spark (3.1.1 version)

This recipe explains what ArrayType functions are and how to use them in PySpark.

Implementing the ArrayType functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import explode, split, array, array_contains

The SparkSession, StringType, ArrayType, StructType, StructField, explode, split, array and array_contains are imported to perform the ArrayType functions in PySpark.

# Implementing the ArrayType functions in Databricks in PySpark
arrayCol = ArrayType(StringType(), False)  # array of strings; containsNull=False disallows null elements
Sample_data = [
    ("Rahul,, Gupta", ["C", "C++", "Python"], ["Spark", "C"], "RJ", "DL"),
    ("Manan,, Aggarwal", ["Spark", "C", "C++"], ["Spark", "C"], "MH", "UK"),
    ("Hemant,, Singh", ["Scala", "Go"], ["Spark", "Matlab"], "AP", "JH")
]
Sample_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Languages_at_School", ArrayType(StringType()), True),
    StructField("Languages_at_Home", ArrayType(StringType()), True),
    StructField("Current_State", StringType(), True),
    StructField("Previous_Travelled_State", StringType(), True)
])
# In Databricks, `spark` is the predefined SparkSession
dataframe = spark.createDataFrame(data = Sample_data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show()
# Using explode() function
dataframe.select(dataframe.Name, explode(dataframe.Languages_at_School)).show()
# Using split() function
dataframe.select(split(dataframe.Name,",").alias("NameAsArray")).show()
# Using array() function
dataframe.select(dataframe.Name,array(dataframe.Current_State, dataframe.Previous_Travelled_State).alias("States")).show()
# Using array_contains() function
dataframe.select(dataframe.Name, array_contains(dataframe.Languages_at_School,"C")
.alias("Array_Contains")).show()


The "dataframe" value is created using the Sample_data and Sample_schema defined above. The explode() function returns a new row for each element in the given array column. The split() SQL function returns an array type after splitting the string column by the delimiter. The array() function creates a new array column by merging the data from multiple columns, all of which must have the same data type. The array_contains() SQL function returns null if the array is null, true if the array contains the value, and false otherwise.
