Explain the flatMap() transformation in PySpark in Databricks

This recipe explains what the flatMap() transformation in PySpark in Databricks is and how to use it.

Recipe Objective - Explain the flatMap() transformation in PySpark in Databricks

In PySpark, flatMap() is defined as the transformation operation that applies a function to every element of a Resilient Distributed Dataset (RDD) and then flattens the results (for example, array or map values), returning a new PySpark RDD in which each item of the returned collections becomes its own record. The PySpark DataFrame is a distributed collection of data organized into named columns and is conceptually equivalent to a table in a relational database or a data frame in Python or R. DataFrames in PySpark can also be constructed from a wide array of sources, such as structured data files, tables in Apache Hive, external databases, or existing Resilient Distributed Datasets. The DataFrame API (Application Programming Interface) is available in Java, Scala, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows; in the Scala API, DataFrame is simply a type alias of Dataset[Row].
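
To make the flattening behaviour concrete, here is a minimal sketch (not part of the recipe code below; it assumes a local SparkSession) contrasting map(), which returns exactly one output element per input record, with flatMap(), which flattens the returned lists into individual records:

# Minimal sketch: map() vs flatMap()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('map vs flatMap sketch').getOrCreate()
rdd = spark.sparkContext.parallelize(["a b", "c d"])
# map() keeps one element per input record: [['a', 'b'], ['c', 'd']]
print(rdd.map(lambda x: x.split(" ")).collect())
# flatMap() flattens the returned lists: ['a', 'b', 'c', 'd']
print(rdd.flatMap(lambda x: x.split(" ")).collect())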


System Requirements

  • Python (version 3.0)
  • Apache Spark (version 3.1.1)

This recipe explains what the flatMap() transformation is and demonstrates its usage in PySpark.

Implementing the flatmap() transformation in Databricks in PySpark

# Importing packages
import pyspark
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import MapType, StringType, StructType, StructField
from pyspark.sql.functions import col, explode

SparkSession, Row, MapType, StringType, StructType, StructField, col, and explode are imported into the environment to use the flatMap() transformation in PySpark.

# Implementing the flatmap() transformation in Databricks in PySpark
spark = SparkSession.builder.appName('flatmap() PySpark').getOrCreate()
sample_data = ["Project Gutenberg’s",
               "Alice’s Adventures in Wonderland",
               "Project Gutenberg’s",
               "Adventures in Wonderland",
               "Project Gutenberg’s"]
Rdd = spark.sparkContext.parallelize(sample_data)
# Printing the original records
for element in Rdd.collect():
    print(element)
# Using the flatMap() transformation to split each record on spaces and flatten the result
Rdd2 = Rdd.flatMap(lambda x: x.split(" "))
# Printing the flattened records (one word per record)
for element in Rdd2.collect():
    print(element)

The Spark Session is created and the "sample_data" list is defined. An RDD named "Rdd" is then built from sample_data using parallelize(), and its records are printed. Applying the flatMap() transformation splits each record on spaces and flattens the result, producing a new RDD ("Rdd2") that holds a single word per record.
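
The flatMap() transformation itself is only defined on RDDs. For the array/map DataFrame columns mentioned earlier, the usual flattening route is the explode() function, which the imports above already cover. The following is an illustrative sketch under that assumption; the sample DataFrame here is hypothetical and not part of the original recipe:

# Illustrative sketch: flattening an array column of a DataFrame with explode()
# (the sample rows below are hypothetical and not part of the recipe data)
array_df = spark.createDataFrame([
    Row(name="James", languages=["Java", "Scala"]),
    Row(name="Anna", languages=["PySpark", "SQL"])
])
# explode() produces one output row per element of the "languages" array
array_df.select(col("name"), explode(col("languages")).alias("language")).show()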
