Explain StructType and StructField in PySpark in Databricks

This recipe explains what StructType and StructField are in PySpark in Databricks.

Recipe Objective - Explain StructType and StructField in PySpark in Databricks

The StructType and StructField classes in PySpark are popularly used to specify the schema of a DataFrame programmatically and to create complex columns such as nested struct, array, and map columns. A StructType in PySpark is a collection of StructFields, each of which defines a column name, a column data type, a boolean flag specifying whether the field can be nullable, and optional metadata. A StructField thus comprises three main parts: name (a string), dataType (a DataType), and nullable (a bool). The name field is the name of the StructField, the dataType field specifies its data type, and the nullable field specifies whether the values of the StructField can contain None values.
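Since a StructField is just this triple of name, dataType, and nullable, the idea can be demonstrated in isolation before building a full DataFrame. The following is a minimal sketch; the "name" and "age" fields here are illustrative examples, not part of the recipe's dataset.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# A single field: column name, data type, and whether None values are allowed
age_field = StructField("age", IntegerType(), True)
print(age_field.name)      # age
print(age_field.dataType)  # IntegerType
print(age_field.nullable)  # True

# A StructType is a collection of StructFields and acts as a DataFrame schema
schema = StructType([StructField("name", StringType(), False), age_field])
print(schema.fieldNames())     # ['name', 'age']
print(schema["age"].dataType)  # IntegerType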


System Requirements

  • Python 3.0 (or later)
  • Apache Spark 3.1.1

This recipe explains StructType and StructField and how to use them in PySpark.

Implementing the StructType and StructField in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, struct, when

The SparkSession, StructType, StructField, StringType, IntegerType, col, struct, and when classes and functions are imported into the environment to demonstrate StructType and StructField in PySpark.


# Implementing the StructType and StructField in Databricks in PySpark
spark = SparkSession.builder.master("local[1]") \
    .appName('StructType and StructField') \
    .getOrCreate()

# Creating a DataFrame using StructType and StructField
Sample_data = [("Ram", "", "Aggarwal", "45458", "M", 4000),
               ("Shyam", "Gupta", "", "45698", "M", 5000),
               ("Vijay", "", "Pandit", "42365", "M", 5000),
               ("Roshni", "Singh", "kaur", "36987", "F", 5000),
               ("Ishwar", "Goel", "Brown", "", "F", -2)
               ]
Sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Nested StructType: the "name" column is itself a struct of three fields
Structure_Data = [
    (("Ram", "", "Aggarwal"), "45458", "M", 4100),
    (("Shyam", "Gupta", ""), "45698", "M", 5300),
    (("Vijay", "", "Pandit"), "42365", "M", 2400),
    (("Roshni", "Singh", "Kaur"), "36987", "F", 6500),
    (("Ishwar", "Goel", "Brown"), "", "F", -2)
]
Structure_Schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe2 = spark.createDataFrame(data=Structure_Data, schema=Structure_Schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
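# A quick illustrative check (the column names match Structure_Schema above):
# the fields of a nested struct column can be selected with dot notation
dataframe2.select("name.firstname", "name.lastname").show(truncate=False)
dataframe2.select(col("name.firstname").alias("fname")).show(truncate=False)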
# Updating the struct of a DataFrame using the struct() function
Updated_DF = dataframe2.withColumn("OtherInfo",
    struct(col("id").alias("identifier"),
           col("gender").alias("gender"),
           col("salary").alias("salary"),
           when(col("salary").cast(IntegerType()) < 3000, "Low")
           .when(col("salary").cast(IntegerType()) < 4000, "Medium")
           .otherwise("High").alias("SalaryGrade")
           )).drop("id", "gender", "salary")
Updated_DF.printSchema()
Updated_DF.show(truncate=False)


The "dataframe" value is created with the Sample_data and Sample_schema defined above. The "dataframe2" value, which uses a nested StructType, is created from the Structure_Data and Structure_Schema. Using the struct() function, the structure of the existing dataframe2 is then updated: the id, gender, and salary columns are copied into a new "OtherInfo" struct column, and a derived "SalaryGrade" column is added to it, with cast() converting salary to an integer before the when() comparisons.
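As a side note, the same schema definitions can be expressed and reused in other forms. The sketch below assumes the spark session, Sample_data, and dataframe2 from the listing above are still in scope: a flat schema can be passed to createDataFrame() as a DDL-formatted string, and an existing schema can be serialized to JSON and rebuilt with StructType.fromJson().

import json
from pyspark.sql.types import StructType

# The flat schema expressed as a DDL string; createDataFrame() accepts it directly
ddl_schema = "firstname string, middlename string, lastname string, " \
             "id string, gender string, salary int"
dataframe3 = spark.createDataFrame(data=Sample_data, schema=ddl_schema)
dataframe3.printSchema()

# Serializing a schema to JSON and rebuilding it, e.g. to store it outside the job
schema_json = dataframe2.schema.json()
Restored_Schema = StructType.fromJson(json.loads(schema_json))
print(Restored_Schema == dataframe2.schema)  # True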

