Explain ArrayType functions in PySpark in Databricks


Recipe Objective: Explain ArrayType functions in PySpark in Databricks

The PySpark ArrayType is a widely used collection data type that extends the DataType class, the superclass of all types in PySpark. All elements of an ArrayType column must be of the same type. The "pyspark.sql.types.ArrayType" class is used to define an array data type column on a DataFrame that holds elements of the same type. Several SQL functions work with such columns: the explode() function creates a new row for each element in the given array column; the split() function returns an array type after splitting a string column by a delimiter; the array() function creates a new array column by merging the data from multiple columns, all of which must have the same data type; and the array_contains() function checks whether an array column contains a value, returning null if the array is null, true if the array contains the value, and false otherwise.

System Requirements

  • Python (3.x version)
  • Apache Spark (3.1.1 version)

This recipe explains what ArrayType functions are and how to use them in PySpark.

Implementing the ArrayType functions in Databricks in PySpark

# Importing packages
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import explode, split, array, array_contains

The SparkSession, StringType, ArrayType, StructType, StructField, explode, split, array and array_contains are imported to perform the ArrayType functions in PySpark.

# Implementing the ArrayType functions in Databricks in PySpark
arrayCol = ArrayType(StringType(), False)  # array of strings; containsNull=False disallows null elements
Sample_data = [
    ("Rahul,, Gupta", ["C", "C++", "Python"], ["Spark", "C"], "RJ", "DL"),
    ("Manan,, Aggarwal", ["Spark", "C", "C++"], ["Spark", "C"], "MH", "UK"),
    ("Hemant,, Singh", ["Scala", "Go"], ["Spark", "Matlab"], "AP", "JH")
]
Sample_schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Languages_at_School", ArrayType(StringType()), True),
    StructField("Languages_at_Home", ArrayType(StringType()), True),
    StructField("Current_State", StringType(), True),
    StructField("Previous_Travelled_State", StringType(), True)
])
# In Databricks, `spark` is the predefined SparkSession
dataframe = spark.createDataFrame(data = Sample_data, schema = Sample_schema)
dataframe.printSchema()
dataframe.show()
# Using explode() function
dataframe.select(dataframe.Name, explode(dataframe.Languages_at_School)).show()
# Using split() function
dataframe.select(split(dataframe.Name,",").alias("NameAsArray")).show()
# Using array() function
dataframe.select(dataframe.Name,array(dataframe.Current_State, dataframe.Previous_Travelled_State).alias("States")).show()
# Using array_contains() function
dataframe.select(dataframe.Name, array_contains(dataframe.Languages_at_School,"C")
.alias("Array_Contains")).show()


The "dataframe" value is created using the Sample_data and Sample_schema defined above. The explode() function returns a new row for each element in the given array column. The split() SQL function returns an array type after splitting the string column by the delimiter. The array() function creates a new array column by merging the data from multiple columns, all of which must have the same data type. The array_contains() SQL function returns null if the array is null, true if the array contains the value, and false otherwise.
