Explain Spark UI

This recipe explains the Spark UI and each of its tabs.

Recipe Objective: Explain Spark UI

Apache Spark provides a suite of web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) for monitoring the status of your Spark application, the resource consumption of the Spark cluster, and Spark configurations. So, here we are going to learn about the Spark UI.


Implementation Info:

  1. Databricks Community Edition
  2. Scala
  3. Storage: Databricks File System (DBFS)

This set of user interfaces comes in handy for understanding how Spark executes Spark/PySpark jobs. Here, I will run a small application and explain how Spark executes it by walking through the different sections of the Spark web UI.

Before going into the Spark UI, let us first revise two concepts.

Transformations: A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output; each time we apply a transformation, a new RDD is created. Applying transformations builds an RDD lineage covering all the parent RDDs of the final RDD(s). This RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD. Transformations are lazy; they are not executed immediately, but only when we call an action.

Actions: Transformations create RDDs from one another, but an action is what we perform when we want to work with the actual dataset. When an action is triggered, no new RDD is formed, unlike with a transformation. Actions are the Spark RDD operations that return non-RDD values; the results of actions are stored to the driver or to an external storage system.
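The lazy/eager split can be mimicked in plain Scala without Spark (a sketch only; `LazyDemo` and its counter are illustrative, not a Spark API): building an `Iterator` pipeline records work the way a transformation records lineage, and only materialising the result runs the work, the way an action triggers a job.

```scala
// Plain-Scala analogy for lazy transformations vs. eager actions (no Spark needed).
object LazyDemo {
  var mapCalls = 0 // counts how many times the "transformation" function actually runs

  // "Transformation": building the iterator pipeline does NOT run the function yet,
  // just as rdd.map(...) only records a step in the RDD lineage.
  def transformed: Iterator[Int] =
    Iterator(1, 2, 3).map { x => mapCalls += 1; x * 10 }

  // "Action": materialising the result forces every recorded step to run,
  // just as rdd.collect() or df.show() triggers a Spark job.
  def run(): List[Int] = transformed.toList
}
```

Until `run()` is called, `mapCalls` stays at 0; after one call it is 3, one invocation per element, mirroring how no work happens in Spark until an action fires.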

Creating a Test DataFrame

Here, we create two DataFrames and perform an inner join, and the action show(false) is called to display the final DataFrame after joining.

val employee = Seq(
  (1, "ramu",    3, "2018", 10001, "M", 25000),
  (2, "raju",    1, "2010", 20001, "M", 40000),
  (3, "mahesh",  1, "2010", 10001, "M", 35000),
  (4, "suresh",  2, "2005", 10001, "M", 45000),
  (5, "likitha", 2, "2010", 40001, "F", 35000),
  (6, "lavanya", 2, "2010", 50001, "F", 40000),
  (8, "madhu",   1, "2011", 50001, "",  40000))
val emp_schema = Seq("emp_id", "name", "reporting_head_id", "year_joined", "dept_id", "gender", "salary")
val employeeDF = employee.toDF(emp_schema: _*)

val dept = Seq(
  ("Accounts",    10001),
  ("Marketing",   20001),
  ("Finance",     30001),
  ("Engineering", 40001))
val dept_schema = Seq("department", "dept_id")
val dept_df = dept.toDF(dept_schema: _*)

// Inner Join
println("Inner Join")
val inner_df = employeeDF.join(dept_df, employeeDF("dept_id") === dept_df("dept_id"), "inner")
inner_df.show(false)

bigdata_01.PNG

Here we are going to study Spark UI in databricks. Spark UI is separated into tabs as below.

  1. Spark Jobs
  2. Stages
  3. Tasks
  4. Storage
  5. Environment
  6. Executors
  7. SQL

1. Spark Jobs Tab

The details to be aware of under the Jobs section are the scheduling mode, the number of Spark jobs, the number of stages each job has, and the description of your Spark job. Scheduling Mode -> In Databricks, the Community Edition is by default provided with only one node.

The Number of Spark Jobs -> Always keep in mind that the number of Spark jobs equals the number of actions in the application, and each Spark job has at least one stage. In our application above, we have performed 3 Spark jobs (0, 1, 2).

Number of Stages -> Each wide transformation results in a separate stage. In our case, Spark job 0 and Spark job 1 each have a single stage.

bigdata_02.PNG

Description -> The description links to the complete details of the associated Spark job, such as the job status, the DAG visualization, and the completed stages. The image below shows how a job appears in the Spark UI.

bigdata_03.PNG

2. Stages Tab

bigdata_04.PNG

The above image gives the UI representation of the Spark Stages tab. We can navigate to the Stages tab in two ways: select the description of the respective Spark job (which shows the stages only for that job), or select the Stages option at the top of the page (which shows all stages in the application). In our application, we have a total of 3 stages.

The Stages tab displays a summary page that shows the current state of all stages of all Spark jobs in the application.

The number of tasks you see in each stage is the number of partitions Spark is going to work on; each task inside a stage does the same work as the others, but on a different partition of the data.
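This one-task-per-partition idea can be sketched in plain Scala (an illustration only; `partitionData` and `runStage` are hypothetical helpers, not Spark's scheduler):

```scala
// Sketch: a "stage" runs the same function once per partition (one task each).
object TaskDemo {
  // Split the input into roughly equal "partitions", as parallelize would.
  def partitionData[A](data: Seq[A], numPartitions: Int): Seq[Seq[A]] = {
    val groupSize = math.ceil(data.size.toDouble / numPartitions).toInt
    data.grouped(groupSize).toSeq
  }

  // Each "task" applies the same function to a different partition.
  def runStage[A, B](partitions: Seq[Seq[A]])(task: Seq[A] => Seq[B]): Seq[Seq[B]] =
    partitions.map(task)
}
```

Splitting 8 records into 4 partitions and mapping over them yields 4 tasks, each working on its own slice of the data; Spark runs the real thing in parallel across executors.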

In our case, all three stages (0, 1, 2) contain a MapPartitionsRDD and a ParallelCollectionRDD inside a WholeStageCodegen step, as shown below.

bigdata_05.PNG

MapPartitionsRDD -> A MapPartitionsRDD is created when you use a partition-wise transformation such as mapPartitions.

ParallelCollectionRDD -> A ParallelCollectionRDD is created when you use the parallelize method to turn a local collection into partitioned data.

WholeStageCodegen -> A physical query optimization in Spark SQL that fuses multiple physical operators together into a single generated function.

3. Tasks

Tasks are located at the bottom of the page for the respective stage. The key things to look at on the task page are:

1. Input Size – the input for the stage, and 2. Shuffle Write – the output written by the stage. The image below shows the tasks executed in stage 2 of our Spark application.

bigdata_06.PNG

4. Storage

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all persisted RDDs, and the details page shows the sizes and the executors used for all partitions in an RDD or DataFrame.
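In Spark you would populate this tab with something like `inner_df.cache()` (or `persist()` with an explicit storage level) followed by an action. The payoff is the usual compute-once, reuse-many-times trade-off, which a plain-Scala `lazy val` sketches (an analogy only, not Spark's caching machinery):

```scala
// Plain-Scala analogy for persist()/cache(): compute once, serve from cache after.
object CacheDemo {
  var computeCount = 0 // how many times the expensive work actually ran

  private def expensive(): Vector[Int] = {
    computeCount += 1
    (1 to 1000).map(_ * 2).toVector // stand-in for recomputing a full lineage
  }

  // lazy val: computed on first access, cached for every later access,
  // roughly what a persisted DataFrame does for downstream actions.
  lazy val cached: Vector[Int] = expensive()
}
```

Reading `CacheDemo.cached` repeatedly runs `expensive()` only once; without caching, every Spark action would recompute the full lineage instead.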

5. Environment

The Environment page has five parts and is an excellent place to check whether your properties have been set correctly.

  • Runtime Information: contains the runtime properties like versions of Java and Scala.
  • Spark Properties: lists the application properties like ‘spark.app.name’ and ‘spark.driver.memory’.
  • Hadoop Properties: displays properties relating to Hadoop and YARN. Note: properties whose names start with ‘spark.hadoop’ are shown not in this part but under ‘Spark Properties.’
  • System Properties: shows more details about the JVM.
  • Classpath Entries: lists the classes loaded from different sources, which is very useful to resolve class conflicts.
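For example, the entries under ‘Spark Properties’ can be set in spark-defaults.conf (or via --conf / SparkConf); the values below are illustrative, not the ones from this application:

```
spark.app.name        SparkUIDemo
spark.driver.memory   4g
spark.executor.cores  2
```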

bigdata_07.PNG

6. Executors

The Executors tab displays summary information about the executors created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.

The Executors tab provides resource information, like the amount of memory, disk, and cores of each executor, as well as performance information. In our case, the Executors tab shows Number of cores = 8 and Number of tasks = 7.

bigdata_08.PNG

7. SQL Tab

If the application executes Spark SQL queries, the SQL tab displays information such as the duration, the Spark jobs, and the physical and logical plans for the queries. In our application, we created DataFrames from raw data, joined them, and called show(). Here our Spark application executed seven SQL queries.

bigdata_09.PNG

Conclusion

This completes the description of the Spark web UI and its tabs, covering the information they contain about the executing Spark application: jobs, stages, tasks, executors, garbage collection, and data-shuffling information.

