Explain the selection of columns from Dataframe in PySpark in Databricks


Recipe Objective - Explain the selection of columns from a DataFrame in PySpark in Databricks

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, that is, it returns a new DataFrame containing only the selected columns. Single or multiple columns can be selected by passing their names to select(). Because a DataFrame is immutable, select() does not modify the original; it creates a new DataFrame with the chosen columns. The show() function is used to display the DataFrame contents.

System Requirements

  • Python (3.x version)
  • Apache Spark (3.1.1 version)

This recipe explains what the select() function is and demonstrates the selection of columns from a DataFrame in PySpark.

Implementing the selection of columns from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

The SparkSession, col, StructType, StructField and StringType are imported into the environment to select columns from the DataFrame.

# Implementing the selection of columns from DataFrame in Databricks in PySpark
spark = SparkSession.builder.appName('Select Column PySpark').getOrCreate()
sample_data = [("Ram", "Gupta", "India", "Delhi"),
               ("Shyam", "Aggarwal", "India", "Delhi"),
               ("Amit", "Kabuliwala", "India", "Uttar Pradesh"),
               ("Babu", "Dabbawala", "India", "Rajasthan")]
sample_columns = ["firstname","lastname","country","state"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.show(truncate=False)
# Selecting Single column and Multiple columns
dataframe.select("firstname").show()
dataframe.select("firstname","lastname").show()
#Using Dataframe object name to select column
dataframe.select(dataframe.firstname, dataframe.lastname).show()
# Using col function
dataframe.select(col("firstname"),col("lastname")).show()
# Selecting the Nested Struct Columns in PySpark
sample_data1 = [(("Rame", None, "Gupta"), "Rajasthan", "M"),
                (("Anita", "Garg", ""), "Delhi", "F"),
                (("Pooja", "", "Aggarwal"), "Delhi", "F"),
                (("Saurabh", "Anne", "Jones"), "Jammu", "M"),
                (("Shahrukh", "Khan", "Brown"), "Maharashtra", "M"),
                (("Salman", "Gupta", "Williams"), "Delhi", "M")]
sample_schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)
])
dataframe2 = spark.createDataFrame(data = sample_data1, schema = sample_schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
dataframe2.select("name").show(truncate=False)
dataframe2.select("name.firstname","name.lastname").show(truncate=False)
dataframe2.select("name.*").show(truncate=False)

The Spark session is created. The "sample_data" and "sample_columns" are defined and used to build the "dataframe". Single or multiple columns of the DataFrame are selected by passing the column names to the select() function; since the DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents. The "sample_data1" and "sample_schema" are defined and used to build "dataframe2", from which the nested struct columns are selected.

