Explain StructType and StructField in PySpark in Databricks

This recipe explains what StructType and StructField are in PySpark in Databricks.

Recipe Objective - Explain StructType and StructField in PySpark in Databricks

The StructType and StructField classes in PySpark are commonly used to specify a DataFrame's schema programmatically and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructFields, each of which defines a column name, a column data type, a boolean flag indicating whether the field can be null, and optional metadata. A StructField therefore represents a single field within a StructType: its name (a string), its dataType (a DataType), and its nullable flag (a bool). The nullable flag specifies whether the values of that field may contain None.
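As a quick illustration of those three attributes (this snippet is not part of the original recipe, and the Age_field name is made up for the example), a single StructField can be constructed and inspected directly:

# Illustrative example: inspecting the three attributes of a StructField
from pyspark.sql.types import StructField, IntegerType
Age_field = StructField("age", IntegerType(), True)
print(Age_field.name)      # prints: age
print(Age_field.dataType)  # prints the field's DataType, e.g. IntegerType
print(Age_field.nullable)  # prints: True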


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains StructType and StructField and how to use them in PySpark.

Implementing the StructType and StructField in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, struct, when

The SparkSession, StructType, StructField, StringType, IntegerType, col, struct, and when classes and functions are imported into the environment to demonstrate StructType and StructField in PySpark.


# Implementing the StructType and StructField in Databricks in PySpark
spark = SparkSession.builder.master("local[1]") \
    .appName('StructType and StructField') \
    .getOrCreate()
# Creating StructType and StructField on dataframe
Sample_data = [("Ram", "", "Aggarwal", "45458", "M", 4000),
               ("Shyam", "Gupta", "", "45698", "M", 5000),
               ("Vijay", "", "Pandit", "42365", "M", 5000),
               ("Roshni", "Singh", "kaur", "36987", "F", 5000),
               ("Ishwar", "Goel", "Brown", "", "F", -2)
               ]
Sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
dataframe = spark.createDataFrame(data=Sample_data, schema=Sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)
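# Expected printSchema() output for the flat DataFrame (a sketch of what
# Spark 3.1 prints for the schema defined above):
# root
#  |-- firstname: string (nullable = true)
#  |-- middlename: string (nullable = true)
#  |-- lastname: string (nullable = true)
#  |-- id: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: integer (nullable = true)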
# Nested StructType
Structure_Data = [
    (("Ram", "", "Aggarwal"), "45458", "M", 4100),
    (("Shyam", "Gupta", ""), "45698", "M", 5300),
    (("Vijay", "", "Pandit"), "42365", "M", 2400),
    (("Roshni", "Singh", "Kaur"), "36987", "F", 6500),
    (("Ishwar", "Goel", "Brown"), "", "F", -2)
]
Structure_Schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])
dataframe2 = spark.createDataFrame(data=Structure_Data, schema=Structure_Schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
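# Expected printSchema() output for the nested DataFrame (a sketch): the
# "name" column is itself a struct containing three string fields.
# root
#  |-- name: struct (nullable = true)
#  |    |-- firstname: string (nullable = true)
#  |    |-- middlename: string (nullable = true)
#  |    |-- lastname: string (nullable = true)
#  |-- id: string (nullable = true)
#  |-- gender: string (nullable = true)
#  |-- salary: integer (nullable = true)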
# Updating struct of a dataframe using struct() function
Updated_DF = dataframe2.withColumn("OtherInfo",
    struct(col("id").alias("identifier"),
           col("gender").alias("gender"),
           col("salary").alias("salary"),
           when(col("salary").cast(IntegerType()) < 3000, "Low")
           .when(col("salary").cast(IntegerType()) < 4000, "Medium")
           .otherwise("High").alias("SalaryGrade")
           )).drop("id", "gender", "salary")
Updated_DF.printSchema()
Updated_DF.show(truncate=False)
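# Expected schema after the update (a sketch; nullable flags are shown as
# Spark 3.1 reports them, and may vary slightly by version). The "name"
# struct is kept, while id, gender, and salary are folded into "OtherInfo":
# root
#  |-- name: struct (nullable = true)
#  |    |-- firstname: string (nullable = true)
#  |    |-- middlename: string (nullable = true)
#  |    |-- lastname: string (nullable = true)
#  |-- OtherInfo: struct (nullable = false)
#  |    |-- identifier: string (nullable = true)
#  |    |-- gender: string (nullable = true)
#  |    |-- salary: integer (nullable = true)
#  |    |-- SalaryGrade: string (nullable = false)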


The "dataframe" value is created in which the Sample_data and Sample_schema are defined. The "dataframe2" value in which Nested StructType is defined is created in which the Structure_Data and Structure_Schema are defined. Using the struct() function, updation of struct of the existing dataFrame2 takes place and some additions of new StructType to it. Further, the copy of the columns from one structure to another and adding a new column takes place using the cast() function.
