Explain StructType and StructField in PySpark in Databricks

This recipe explains what StructType and StructField are in PySpark in Databricks.

Recipe Objective - Explain StructType and StructField in PySpark in Databricks

The StructType and StructField classes in PySpark are commonly used to specify a DataFrame's schema programmatically and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructFields, each of which defines a column name, a column data type, a boolean flag indicating whether the field can be null, and optional metadata. A StructField therefore represents a single field within a StructType: its name (a string), its dataType (a DataType), and its nullable flag (a bool). The nullable flag specifies whether the values of that field may contain None.
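As a quick illustration of those three attributes (this snippet is not part of the original recipe, and the Age_field name is made up for the example), a single StructField can be constructed and inspected directly:

# Illustrative example: inspecting the three attributes of a StructField
from pyspark.sql.types import StructField, IntegerType
Age_field = StructField("age", IntegerType(), True)
print(Age_field.name)      # prints: age
print(Age_field.dataType)  # prints the field's DataType, e.g. IntegerType
print(Age_field.nullable)  # prints: True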


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains StructType and StructField and how to use them in PySpark.

Implementing the StructType and StructField in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, struct, when

The SparkSession, StructType, StructField, StringType, IntegerType, col, struct, and when classes and functions are imported into the environment to demonstrate StructType and StructField in PySpark.


# Implementing the StructType and StructField in Databricks in PySpark
spark = SparkSession.builder.master("local[1]") \
    .appName('StructType and StructField') \
    .getOrCreate()
# Creating StructType and StructField on dataframe
Sample_data = [("Ram", "", "Aggarwal", "45458", "M", 4000),
               ("Shyam", "Gupta", "", "45698", "M", 5000),
               ("Vijay", "", "Pandit", "42365", "M", 5000),
               ("Roshni", "Singh", "kaur", "36987", "F", 5000),
               ("Ishwar", "Goel", "Brown", "", "F", -2)
               ]
Sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
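# Expected printSchema() output for the flat DataFrame (a sketch of what
# Spark 3.1 prints for the schema defined above):
# root
#  |-- firstname: string (nullable = true)
#  |-- middlename: string (nullable = true)
#  |-- lastname: string (nullable = true)
#  |-- id: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: integer (nullable = true)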
# Nested StructType
Structure_Data = [
    (("Ram", "", "Aggarwal"), "45458", "M", 4100),
    (("Shyam", "Gupta", ""), "45698", "M", 5300),
    (("Vijay", "", "Pandit"), "42365", "M", 2400),
    (("Roshni", "Singh", "Kaur"), "36987", "F", 6500),
    (("Ishwar", "Goel", "Brown"), "", "F", -2)
]
Structure_Schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe2 = spark.createDataFrame(data=Structure_Data, schema=Structure_Schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
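# Expected printSchema() output for the nested DataFrame (a sketch): the
# "name" column is itself a struct containing three string fields.
# root
#  |-- name: struct (nullable = true)
#  |    |-- firstname: string (nullable = true)
#  |    |-- middlename: string (nullable = true)
#  |    |-- lastname: string (nullable = true)
#  |-- id: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: integer (nullable = true)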
# Updating struct of a dataframe using struct() function
Updated_DF = dataframe2.withColumn("OtherInfo",
    struct(col("id").alias("identifier"),
           col("gender").alias("gender"),
           col("salary").alias("salary"),
           when(col("salary").cast(IntegerType()) < 3000, "Low")
           .when(col("salary").cast(IntegerType()) < 4000, "Medium")
           .otherwise("High").alias("SalaryGrade")
           )).drop("id", "gender", "salary")
Updated_DF.printSchema()
Updated_DF.show(truncate=False)
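# Expected schema after the update (a sketch; nullable flags are shown as
# Spark 3.1 reports them, and may vary slightly by version). The "name"
# struct is kept, while id, gender, and salary are folded into "OtherInfo":
# root
#  |-- name: struct (nullable = true)
#  |    |-- firstname: string (nullable = true)
#  |    |-- middlename: string (nullable = true)
#  |    |-- lastname: string (nullable = true)
#  |-- OtherInfo: struct (nullable = false)
#  |    |-- identifier: string (nullable = true)
#  |    |-- gender: string (nullable = true)
#  |    |-- salary: integer (nullable = true)
#  |    |-- SalaryGrade: string (nullable = false)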


The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The "dataframe2" value in which Nested StructType is defined is created in which the Structure_Data and Structure_Schema are defined. Using the struct() function, updation of struct of the existing dataFrame2 takes place and some additions of new StructType to it. Further, the copy of the columns from one structure to another and adding a new column takes place using the cast() function.
