Explain StructType and StructField in PySpark in Databricks

This recipe explains what StructType and StructField are in PySpark in Databricks.

Recipe Objective - Explain StructType and StructField in PySpark in Databricks

The StructType and StructField classes in PySpark are popularly used to specify the schema of a DataFrame programmatically and to create complex columns such as nested struct, array, and map columns. A StructType in PySpark is a collection of StructFields, each of which defines a column name, a column data type, a boolean flag specifying whether the field can be nullable, and optional metadata. A StructField thus comprises three main parts: name (a string), dataType (a DataType), and nullable (a bool). The name field is the name of the StructField, the dataType field specifies its data type, and the nullable field specifies whether the values of the StructField can contain None values.
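Since a StructField is just this triple of name, dataType, and nullable, the idea can be demonstrated in isolation before building a full DataFrame. The following is a minimal sketch; the "name" and "age" fields here are illustrative examples, not part of the recipe's dataset.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# A single field: column name, data type, and whether None values are allowed
age_field = StructField("age", IntegerType(), True)
print(age_field.name)      # age
print(age_field.dataType)  # IntegerType
print(age_field.nullable)  # True

# A StructType is a collection of StructFields and acts as a DataFrame schema
schema = StructType([StructField("name", StringType(), False), age_field])
print(schema.fieldNames())     # ['name', 'age']
print(schema["age"].dataType)  # IntegerType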


System Requirements

  • Python 3.0 (or later)
  • Apache Spark 3.1.1

This recipe explains StructType and StructField and how to use them in PySpark.

Implementing the StructType and StructField in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, struct, when

The SparkSession, StructType, StructField, StringType, IntegerType, col, struct, and when classes and functions are imported into the environment to demonstrate StructType and StructField in PySpark.


# Implementing the StructType and StructField in Databricks in PySpark
spark = SparkSession.builder.master("local[1]") \
    .appName('StructType and StructField') \
    .getOrCreate()

# Creating a DataFrame using StructType and StructField
Sample_data = [("Ram", "", "Aggarwal", "45458", "M", 4000),
               ("Shyam", "Gupta", "", "45698", "M", 5000),
               ("Vijay", "", "Pandit", "42365", "M", 5000),
               ("Roshni", "Singh", "kaur", "36987", "F", 5000),
               ("Ishwar", "Goel", "Brown", "", "F", -2)
               ]
Sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
# Nested StructType: the "name" column is itself a struct of three fields
Structure_Data = [
    (("Ram", "", "Aggarwal"), "45458", "M", 4100),
    (("Shyam", "Gupta", ""), "45698", "M", 5300),
    (("Vijay", "", "Pandit"), "42365", "M", 2400),
    (("Roshni", "Singh", "Kaur"), "36987", "F", 6500),
    (("Ishwar", "Goel", "Brown"), "", "F", -2)
]
Structure_Schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe2 = spark.createDataFrame(data=Structure_Data, schema=Structure_Schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
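# A quick illustrative check (the column names match Structure_Schema above):
# the fields of a nested struct column can be selected with dot notation
dataframe2.select("name.firstname", "name.lastname").show(truncate=False)
dataframe2.select(col("name.firstname").alias("fname")).show(truncate=False)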
# Updating the struct of a DataFrame using the struct() function
Updated_DF = dataframe2.withColumn("OtherInfo",
    struct(col("id").alias("identifier"),
           col("gender").alias("gender"),
           col("salary").alias("salary"),
           when(col("salary").cast(IntegerType()) < 3000, "Low")
           .when(col("salary").cast(IntegerType()) < 4000, "Medium")
           .otherwise("High").alias("SalaryGrade")
           )).drop("id", "gender", "salary")
Updated_DF.printSchema()
Updated_DF.show(truncate=False)


The "dataframe" value is created with the Sample_data and Sample_schema defined above. The "dataframe2" value, which uses a nested StructType, is created from the Structure_Data and Structure_Schema. Using the struct() function, the structure of the existing dataframe2 is then updated: the id, gender, and salary columns are copied into a new "OtherInfo" struct column, and a derived "SalaryGrade" column is added to it, with cast() converting salary to an integer before the when() comparisons.
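As a side note, the same schema definitions can be expressed and reused in other forms. The sketch below assumes the spark session, Sample_data, and dataframe2 from the listing above are still in scope: a flat schema can be passed to createDataFrame() as a DDL-formatted string, and an existing schema can be serialized to JSON and rebuilt with StructType.fromJson().

import json
from pyspark.sql.types import StructType

# The flat schema expressed as a DDL string; createDataFrame() accepts it directly
ddl_schema = "firstname string, middlename string, lastname string, " \
             "id string, gender string, salary int"
dataframe3 = spark.createDataFrame(data=Sample_data, schema=ddl_schema)
dataframe3.printSchema()

# Serializing a schema to JSON and rebuilding it, e.g. to store it outside the job
schema_json = dataframe2.schema.json()
Restored_Schema = StructType.fromJson(json.loads(schema_json))
print(Restored_Schema == dataframe2.schema)  # True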

