Explain ArrayType functions in PySpark in Databricks

This recipe explains what ArrayType functions are in PySpark in Databricks.

Recipe Objective - Explain ArrayType functions in PySpark in Databricks

The PySpark ArrayType is a widely used collection data type that extends the DataType class, the superclass of all types in PySpark. All elements of an ArrayType column must be of the same type. "pyspark.sql.types.ArrayType" is used to define an array column on a DataFrame that holds elements of a single type. Four functions are commonly used with it:

  • explode() creates a new row for each element in the given array column.
  • split() is a SQL function that returns an array type after splitting a string column by a delimiter.
  • array() creates a new array column by merging the data from multiple columns; all input columns must have the same data type.
  • array_contains() is a SQL function that checks whether an array column contains a value: it returns null if the array is null, true if the array contains the value, and false otherwise.

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what ArrayType functions are and how to use them in PySpark.

Implementing the ArrayType functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import explode, split, array, array_contains

The SparkSession, StringType, ArrayType, StructType, StructField, explode, split, array and array_contains are imported to perform the ArrayType functions in PySpark.

# Implementing the ArrayType functions in Databricks in PySpark
spark = SparkSession.builder.appName("ArrayType functions").getOrCreate()
# An array column type whose string elements may not be null
arrayCol = ArrayType(StringType(), False)
Sample_data = [
    ("Rahul,, Gupta", ["C", "C++", "Python"], ["Spark", "C"], "RJ", "DL"),
    ("Manan,, Aggarwal", ["Spark", "C", "C++"], ["Spark", "C"], "MH", "UK"),
    ("Hemant,, Singh", ["Scala", "Go"], ["Spark", "Matlab"], "AP", "JH")
]
Sample_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Languages_at_School", ArrayType(StringType()), True),
    StructField("Languages_at_Home", ArrayType(StringType()), True),
    StructField("Current_State", StringType(), True),
    StructField("Previous_Travelled_State", StringType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show()
# Using explode() function: one row per element of Languages_at_School
dataframe.select(dataframe.Name, explode(dataframe.Languages_at_School)).show()
# Using split() function: split the Name string on the "," delimiter
dataframe.select(split(dataframe.Name, ",").alias("NameAsArray")).show()
# Using array() function: merge two string columns into one array column
dataframe.select(dataframe.Name,
                 array(dataframe.Current_State, dataframe.Previous_Travelled_State).alias("States")).show()
# Using array_contains() function: check whether "C" is in Languages_at_School
dataframe.select(dataframe.Name,
                 array_contains(dataframe.Languages_at_School, "C").alias("Array_Contains")).show()


The "dataframe" value is created from the Sample_data and Sample_schema defined above. The explode() function returns a new row for each element in the given array column. The split() SQL function returns an array type after splitting the string column by the delimiter. The array() function creates a new array column by merging the data from multiple columns, where all input columns must have the same data type. The array_contains() SQL function returns null if the array is null, true if the array contains the value, and false otherwise.

