Explain ArrayType functions in PySpark in Databricks

This recipe explains what ArrayType functions are in PySpark in Databricks.

Recipe Objective - Explain ArrayType functions in PySpark in Databricks

The PySpark ArrayType is a widely used collection data type that extends the DataType class, the superclass of all types in PySpark. All elements of an ArrayType column must be of the same type. "pyspark.sql.types.ArrayType" is used to define an array column on a DataFrame that holds elements of a single type. Four functions are commonly used with it:

  • explode() creates a new row for each element in the given array column.
  • split() is a SQL function that returns an array type after splitting a string column by a delimiter.
  • array() creates a new array column by merging the data from multiple columns; all input columns must have the same data type.
  • array_contains() is a SQL function that checks whether an array column contains a value: it returns null if the array is null, true if the array contains the value, and false otherwise.

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what ArrayType functions are and how to use them in PySpark.

Implementing the ArrayType functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import explode, split, array, array_contains

The SparkSession, StringType, ArrayType, StructType, StructField, explode, split, array and array_contains are imported to perform the ArrayType functions in PySpark.

# Implementing the ArrayType functions in Databricks in PySpark
spark = SparkSession.builder.appName("ArrayType functions").getOrCreate()
# An array column type whose string elements may not be null
arrayCol = ArrayType(StringType(), False)
Sample_data = [
    ("Rahul,, Gupta", ["C", "C++", "Python"], ["Spark", "C"], "RJ", "DL"),
    ("Manan,, Aggarwal", ["Spark", "C", "C++"], ["Spark", "C"], "MH", "UK"),
    ("Hemant,, Singh", ["Scala", "Go"], ["Spark", "Matlab"], "AP", "JH")
]
Sample_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Languages_at_School", ArrayType(StringType()), True),
    StructField("Languages_at_Home", ArrayType(StringType()), True),
    StructField("Current_State", StringType(), True),
    StructField("Previous_Travelled_State", StringType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show()
# Using explode() function: one row per element of Languages_at_School
dataframe.select(dataframe.Name, explode(dataframe.Languages_at_School)).show()
# Using split() function: split the Name string on the "," delimiter
dataframe.select(split(dataframe.Name, ",").alias("NameAsArray")).show()
# Using array() function: merge two string columns into one array column
dataframe.select(dataframe.Name,
                 array(dataframe.Current_State, dataframe.Previous_Travelled_State).alias("States")).show()
# Using array_contains() function: check whether "C" is in Languages_at_School
dataframe.select(dataframe.Name,
                 array_contains(dataframe.Languages_at_School, "C").alias("Array_Contains")).show()


The "dataframe" value is created from the Sample_data and Sample_schema defined above. The explode() function returns a new row for each element in the given array column. The split() SQL function returns an array type after splitting the string column by the delimiter. The array() function creates a new array column by merging the data from multiple columns, where all input columns must have the same data type. The array_contains() SQL function returns null if the array is null, true if the array contains the value, and false otherwise.

