Explain the selection of columns from Dataframe in PySpark in Databricks


Recipe Objective - Explain the selection of columns from a DataFrame in PySpark in Databricks

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, that is, it returns a new DataFrame containing only the selected columns. Single or multiple columns can be selected by passing their names to select(). Because a DataFrame is immutable, select() does not modify the original; it creates a new DataFrame with the chosen columns. The show() function is used to display the DataFrame contents.

System Requirements

  • Python (3.x version)
  • Apache Spark (3.1.1 version)

This recipe explains what the select() function is and demonstrates the selection of columns from a DataFrame in PySpark.

Implementing the selection of columns from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

The SparkSession, col, StructType, StructField and StringType are imported into the environment to select columns from the DataFrame.

# Implementing the selection of columns from DataFrame in Databricks in PySpark
spark = SparkSession.builder.appName('Select Column PySpark').getOrCreate()
sample_data = [("Ram", "Gupta", "India", "Delhi"),
               ("Shyam", "Aggarwal", "India", "Delhi"),
               ("Amit", "Kabuliwala", "India", "Uttar Pradesh"),
               ("Babu", "Dabbawala", "India", "Rajasthan")]
sample_columns = ["firstname","lastname","country","state"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.show(truncate=False)
# Selecting Single column and Multiple columns
dataframe.select("firstname").show()
dataframe.select("firstname","lastname").show()
#Using Dataframe object name to select column
dataframe.select(dataframe.firstname, dataframe.lastname).show()
# Using col function
dataframe.select(col("firstname"),col("lastname")).show()
# Selecting the Nested Struct Columns in PySpark
sample_data1 = [(("Rame", None, "Gupta"), "Rajasthan", "M"),
                (("Anita", "Garg", ""), "Delhi", "F"),
                (("Pooja", "", "Aggarwal"), "Delhi", "F"),
                (("Saurabh", "Anne", "Jones"), "Jammu", "M"),
                (("Shahrukh", "Khan", "Brown"), "Maharashtra", "M"),
                (("Salman", "Gupta", "Williams"), "Delhi", "M")]
sample_schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)
])
dataframe2 = spark.createDataFrame(data = sample_data1, schema = sample_schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
dataframe2.select("name").show(truncate=False)
dataframe2.select("name.firstname","name.lastname").show(truncate=False)
dataframe2.select("name.*").show(truncate=False)

The Spark session is created. The "sample_data" and "sample_columns" are defined and used to build the "dataframe". Single or multiple columns of the DataFrame are selected by passing the column names to the select() function; since the DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents. The "sample_data1" and "sample_schema" are defined and used to build "dataframe2", from which the nested struct columns are selected.

