Explain the selection of columns from Dataframe in PySpark in Databricks

This recipe explains the selection of columns from a DataFrame in PySpark in Databricks.

Recipe Objective - Explain the selection of columns from a DataFrame in PySpark in Databricks

In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns from a list, and nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame containing only the selected columns. Single or multiple columns can be selected by passing their names to select(); because the DataFrame is immutable, this always produces a new DataFrame rather than modifying the original. The show() function is used to display the DataFrame contents, as in the sketch below.
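As a minimal sketch (assuming an existing DataFrame named df with hypothetical columns "firstname" and "lastname"), the same selection can be expressed with explicit column names, a Python list of names, or column indexes:

# Minimal sketch: different ways to pass columns to select()
# df is an assumed DataFrame with columns "firstname" and "lastname"
df.select("firstname", "lastname").show()   # by explicit column names
columns_to_keep = ["firstname", "lastname"]
df.select(columns_to_keep).show()           # all columns from a Python list
df.select(df.columns[:2]).show()            # columns by index, via the columns attribute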

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what the select() function is and demonstrates the selection of columns from a DataFrame in PySpark.

Implementing the selection of columns from DataFrame in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
from pyspark.sql.types import MapType, StringType, StructType, StructField

The SparkSession, Row, MapType, col, StringType, StructType, and StructField are imported into the environment to select columns from the DataFrame.

# Implementing the selection of columns from DataFrame in Databricks in PySpark
spark = SparkSession.builder.appName('Select Column PySpark').getOrCreate()
sample_data = [("Ram","Gupta","India","Delhi"),
("Shyam","Aggarwal","India","Delhi"),
("Amit","Kabuliwala","India","Uttar Pradesh"),
("Babu","Dabbawala","India","Rajasthan")]
sample_columns = ["firstname","lastname","country","state"]
dataframe = spark.createDataFrame(data = sample_data, schema = sample_columns)
dataframe.show(truncate=False)
# Selecting Single column and Multiple columns
dataframe.select("firstname").show()
dataframe.select("firstname","lastname").show()
# Using DataFrame object name to select columns
dataframe.select(dataframe.firstname, dataframe.lastname).show()
# Using col function
dataframe.select(col("firstname"),col("lastname")).show()
# Selecting the Nested Struct Columns in PySpark
sample_data1 = [(("Rame",None,"Gupta"),"Rajasthan","M"),
(("Anita","Garg",""),"Delhi","F"),
(("Pooja","","Aggarwal"),"Delhi","F"),
(("Saurabh","Anne","Jones"),"Jammu","M"),
(("Shahrukh","Khan","Brown"),"Maharashtra","M"),
(("Salman","Gupta","Williams"),"Delhi","M")
]
sample_schema = StructType([
StructField('name', StructType([
StructField('firstname', StringType(), True),
StructField('middlename', StringType(), True),
StructField('lastname', StringType(), True)
])),
StructField('state', StringType(), True),
StructField('gender', StringType(), True)
])
dataframe2 = spark.createDataFrame(data = sample_data1, schema = sample_schema)
dataframe2.printSchema()
dataframe2.show(truncate=False)
dataframe2.select("name").show(truncate=False)
dataframe2.select("name.firstname","name.lastname").show(truncate=False)
dataframe2.select("name.*").show(truncate=False)

The Spark session is created first. The "sample_data" and "sample_columns" are defined and used to build the "dataframe". Single or multiple columns of the DataFrame are selected by passing the column names to the select() function, and since the DataFrame is immutable, this creates a new DataFrame with only the selected columns. The show() function displays the DataFrame contents. Next, "sample_data1" and "sample_schema" are defined and used to build "dataframe2", from which the nested struct columns are selected, either field by field ("name.firstname", "name.lastname") or all at once ("name.*").
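The nested fields can also be referenced through col() and renamed with alias(). A minimal sketch, assuming dataframe2 from above (the alias names are illustrative, not part of the original recipe):

# Minimal sketch: selecting nested struct fields with col() and alias()
# dataframe2 is the DataFrame built above; the alias names are hypothetical
dataframe2.select(
    col("name.firstname").alias("first_name"),
    col("name.lastname").alias("last_name"),
    col("state")
).show(truncate=False)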
