Explain the creation of DataFrames in PySpark in Databricks

This recipe explains the creation of DataFrames in PySpark in Databricks.

Recipe Objective - Explain the creation of DataFrames in PySpark in Databricks

The PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in Python or R. DataFrames in PySpark can also be constructed from a wide array of sources, such as structured data files, tables in Apache Hive, external databases, or existing Resilient Distributed Datasets (RDDs). The DataFrame API (Application Programming Interface) is available in Java, Scala, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows; in the Scala API, DataFrame is a type alias of Dataset[Row]. A PySpark DataFrame is created using the toDF() and createDataFrame() methods, each of which accepts different signatures to create a DataFrame from an existing RDD, list, or DataFrame.

System Requirements

  • Python (3.0 version)
  • Apache Spark (3.1.1 version)

This recipe explains what a PySpark DataFrame is and how to create one in PySpark.


Implementing the creation of Dataframes in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

SparkSession, Row, StructType, StructField, StringType, and IntegerType are imported into the environment to create DataFrames in PySpark.

# Implementing the creation of DataFrames in Databricks in PySpark
spark = SparkSession.builder.appName('Creation Dataframe PySpark').getOrCreate()
columns = ["language", "users_count"]
data = [("R", "30000"), ("Go", "200000"), ("Matlab", "2000")]
rdd = spark.sparkContext.parallelize(data)
# Using the toDF() function (default column names _1, _2)
dataframeFromRDD1 = rdd.toDF()
dataframeFromRDD1.printSchema()
# Create DataFrame using the Spark Session, renaming the columns
dataframeFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
# Create DataFrame with an explicit schema
sample_data = [
    ("Amit", "", "Gupta", "36678", "M", 4000),
    ("Anita", "Mathews", "", "40299", "F", 5000),
    ("Ram", "", "Aggarwal", "42124", "M", 5000),
    ("Pooja", "Anne", "Goel", "39298", "F", 5000),
    ("Geeta", "Banuwala", "Brown", "", "F", -2)
]
sample_schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
dataframe = spark.createDataFrame(data=sample_data, schema=sample_schema)
dataframe.printSchema()
dataframe.show(truncate=False)

The Spark Session is defined, followed by the "columns", "data", and "rdd" variables. Using the toDF() function, "dataframeFromRDD1" is created from the rdd with default column names. "dataframeFromRDD2" is then created through the Spark Session with createDataFrame(), and its columns are renamed with toDF(*columns). Finally, "dataframe" is created from "sample_data" with an explicit "sample_schema" using the createDataFrame() function.

