Explain the selection of columns from Dataframe in PySpark in Databricks

This recipe explains the selection of columns from a DataFrame in PySpark in Databricks.

Recipe Objective - Explain the selection of columns from a DataFrame in PySpark in Databricks

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame containing only the selected columns. Single or multiple columns can be selected by passing their names to select(); because the DataFrame is immutable, this always produces a new DataFrame rather than modifying the original. The show() function is used to display the DataFrame contents, as in the sketch below.
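As a minimal sketch (assuming an existing DataFrame named df with hypothetical columns "firstname" and "lastname"), the same selection can be expressed with explicit column names, a Python list of names, or column indexes:

# Minimal sketch: different ways to pass columns to select()
# df is an assumed DataFrame with columns "firstname" and "lastname"
df.select("firstname", "lastname").show()   # by explicit column names
columns_to_keep = ["firstname", "lastname"]
df.select(columns_to_keep).show()           # all columns from a Python list
df.select(df.columns[:2]).show()            # columns by index, via the columns attribute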

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the select() function is and demonstrates the selection of columns from a DataFrame in PySpark.

Implementing the selection of columns from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
from pyspark.sql.types import MapType, StringType, StructType, StructField

The SparkSession, Row, MapType, col, StringType, StructType, and StructField are imported into the environment to select columns from the DataFrame.

# Implementing the selection of columns from DataFrame in Databricks in PySpark
spark = SparkSession.builder.appName('Select Column PySpark').getOrCreate()
sample_data = [("Ram","Gupta","India","Delhi"),
("Shyam","Aggarwal","India","Delhi"),
("Amit","Kabuliwala","India","Uttar Pradesh"),
("Babu","Dabbawala","India","Rajasthan")]
sample_columns = ["firstname","lastname","country","state"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.show(truncate=False)
# Selecting Single column and Multiple columns
dataframe.select("firstname").show()
dataframe.select("firstname","lastname").show()
# Using DataFrame object name to select columns
dataframe.select(dataframe.firstname, dataframe.lastname).show()
# Using col function
dataframe.select(col("firstname"),col("lastname")).show()
# Selecting the Nested Struct Columns in PySpark
sample_data1 = [(("Rame",None,"Gupta"),"Rajasthan","M"),
(("Anita","Garg",""),"Delhi","F"),
(("Pooja","","Aggarwal"),"Delhi","F"),
(("Saurabh","Anne","Jones"),"Jammu","M"),
(("Shahrukh","Khan","Brown"),"Maharashtra","M"),
(("Salman","Gupta","Williams"),"Delhi","M")
]
sample_schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])
dataframe2 = spark.createDataFrame(data = sample_data1, schema = sample_schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
dataframe2.select("name").show(truncate=False)
dataframe2.select("name.firstname","name.lastname").show(truncate=False)
dataframe2.select("name.*").show(truncate=False)

The Spark session is created first. The "sample_data" and "sample_columns" are defined and used to build the "dataframe". Single or multiple columns of the DataFrame are selected by passing the column names to the select() function, and since the DataFrame is immutable, this creates a new DataFrame with only the selected columns. The show() function displays the DataFrame contents. Next, "sample_data1" and "sample_schema" are defined and used to build "dataframe2", from which the nested struct columns are selected, either field by field ("name.firstname", "name.lastname") or all at once ("name.*").
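The nested fields can also be referenced through col() and renamed with alias(). A minimal sketch, assuming dataframe2 from above (the alias names are illustrative, not part of the original recipe):

# Minimal sketch: selecting nested struct fields with col() and alias()
# dataframe2 is the DataFrame built above; the alias names are hypothetical
dataframe2.select(
    col("name.firstname").alias("first_name"),
    col("name.lastname").alias("last_name"),
    col("state")
).show(truncate=False)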
