Explain Spark UI

This recipe explains the Spark UI and each of its tabs.

Recipe Objective: Explain Spark UI

Apache Spark provides a suite of web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) for monitoring the status of your Spark application, the resource consumption of the Spark cluster, and Spark configurations. So, here we are going to learn about the Spark UI.


Implementation Info:

  1. Databricks Community Edition
  2. Scala
  3. Storage: Databricks File System (DBFS)

This set of user interfaces comes in handy for understanding how Spark executes Spark/PySpark jobs. Here, I will run a small application and explain how Spark executes it by walking through the different sections of the Spark web UI.

Before going into the Spark UI, let us first revise two concepts.

Transformations: A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output; each time we apply a transformation, a new RDD is created. Applying transformations builds an RDD lineage covering all the parent RDDs of the final RDD(s). This RDD lineage, also known as the RDD operator graph or RDD dependency graph, is a logical execution plan: a Directed Acyclic Graph (DAG) of all the parent RDDs of an RDD. Transformations are lazy; they are not executed immediately, but only when we call an action.

Actions: Transformations create RDDs from one another, but an action is what we perform when we want to work with the actual dataset. When an action is triggered, no new RDD is formed, unlike with a transformation. Actions are the Spark RDD operations that return non-RDD values; the results of actions are stored to the driver or to an external storage system.
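The lazy/eager split can be mimicked in plain Scala without Spark (a sketch only; `LazyDemo` and its counter are illustrative, not a Spark API): building an `Iterator` pipeline records work the way a transformation records lineage, and only materialising the result runs the work, the way an action triggers a job.

```scala
// Plain-Scala analogy for lazy transformations vs. eager actions (no Spark needed).
object LazyDemo {
  var mapCalls = 0 // counts how many times the "transformation" function actually runs

  // "Transformation": building the iterator pipeline does NOT run the function yet,
  // just as rdd.map(...) only records a step in the RDD lineage.
  def transformed: Iterator[Int] =
    Iterator(1, 2, 3).map { x => mapCalls += 1; x * 10 }

  // "Action": materialising the result forces every recorded step to run,
  // just as rdd.collect() or df.show() triggers a Spark job.
  def run(): List[Int] = transformed.toList
}
```

Until `run()` is called, `mapCalls` stays at 0; after one call it is 3, one invocation per element, mirroring how no work happens in Spark until an action fires.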

Creating a Test DataFrame

Here, we create two DataFrames and perform an inner join, and the action show(false) is called to display the final DataFrame after joining.

val employee = Seq(
  (1, "ramu",    3, "2018", 10001, "M", 25000),
  (2, "raju",    1, "2010", 20001, "M", 40000),
  (3, "mahesh",  1, "2010", 10001, "M", 35000),
  (4, "suresh",  2, "2005", 10001, "M", 45000),
  (5, "likitha", 2, "2010", 40001, "F", 35000),
  (6, "lavanya", 2, "2010", 50001, "F", 40000),
  (8, "madhu",   1, "2011", 50001, "",  40000))
val emp_schema = Seq("emp_id", "name", "reporting_head_id", "year_joined", "dept_id", "gender", "salary")
val employeeDF = employee.toDF(emp_schema: _*)

val dept = Seq(
  ("Accounts",    10001),
  ("Marketing",   20001),
  ("Finance",     30001),
  ("Engineering", 40001))
val dept_schema = Seq("department", "dept_id")
val dept_df = dept.toDF(dept_schema: _*)

// Inner Join
println("Inner Join")
val inner_df = employeeDF.join(dept_df, employeeDF("dept_id") === dept_df("dept_id"), "inner")
inner_df.show(false)

bigdata_01.PNG

Here we are going to study Spark UI in databricks. Spark UI is separated into tabs as below.

  1. Spark Jobs
  2. Stages
  3. Tasks
  4. Storage
  5. Environment
  6. Executors
  7. SQL

1. Spark Jobs Tab

The details to be aware of under the Jobs section are the scheduling mode, the number of Spark jobs, the number of stages each job has, and the description of your Spark job. Scheduling Mode -> In Databricks, the Community Edition is by default provided with only one node.

The Number of Spark Jobs -> Always keep in mind that the number of Spark jobs equals the number of actions in the application, and each Spark job has at least one stage. In our application above, we have performed 3 Spark jobs (0, 1, 2).

Number of Stages -> Each wide transformation results in a separate stage. In our case, Spark job 0 and Spark job 1 each have a single stage.

bigdata_02.PNG

Description -> The description links to the complete details of the associated Spark job, such as the job status, the DAG visualization, and the completed stages. The image below shows how a job appears in the Spark UI.

bigdata_03.PNG

2. Stages Tab

bigdata_04.PNG

The above image gives the UI representation of the Spark Stages tab. We can navigate to the Stages tab in two ways: select the description of the respective Spark job (which shows the stages only for that job), or select the Stages option at the top of the page (which shows all stages in the application). In our application, we have a total of 3 stages.

The Stages tab displays a summary page that shows the current state of all stages of all Spark jobs in the application.

The number of tasks you see in each stage is the number of partitions Spark is going to work on; each task inside a stage does the same work as the others, but on a different partition of the data.
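This one-task-per-partition idea can be sketched in plain Scala (an illustration only; `partitionData` and `runStage` are hypothetical helpers, not Spark's scheduler):

```scala
// Sketch: a "stage" runs the same function once per partition (one task each).
object TaskDemo {
  // Split the input into roughly equal "partitions", as parallelize would.
  def partitionData[A](data: Seq[A], numPartitions: Int): Seq[Seq[A]] = {
    val groupSize = math.ceil(data.size.toDouble / numPartitions).toInt
    data.grouped(groupSize).toSeq
  }

  // Each "task" applies the same function to a different partition.
  def runStage[A, B](partitions: Seq[Seq[A]])(task: Seq[A] => Seq[B]): Seq[Seq[B]] =
    partitions.map(task)
}
```

Splitting 8 records into 4 partitions and mapping over them yields 4 tasks, each working on its own slice of the data; Spark runs the real thing in parallel across executors.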

In our case, all three stages (0, 1, 2) contain a MapPartitionsRDD and a ParallelCollectionRDD inside a WholeStageCodegen step, as shown below.

bigdata_05.PNG

MapPartitionsRDD -> A MapPartitionsRDD is created when you use a partition-wise transformation such as mapPartitions.

ParallelCollectionRDD -> A ParallelCollectionRDD is created when you use the parallelize method to turn a local collection into partitioned data.

WholeStageCodegen -> A physical query optimization in Spark SQL that fuses multiple physical operators together into a single generated function.

3. Tasks

Tasks are located at the bottom of the page for the respective stage. The key things to look at on the task page are:

1. Input Size – the input for the stage, and 2. Shuffle Write – the output written by the stage. The image below shows the tasks executed in stage 2 of our Spark application.

bigdata_06.PNG

4. Storage

The Storage tab displays the persisted RDDs and DataFrames, if any, in the application. The summary page shows the storage levels, sizes, and partitions of all persisted RDDs, and the details page shows the sizes and the executors used for all partitions in an RDD or DataFrame.
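In Spark you would populate this tab with something like `inner_df.cache()` (or `persist()` with an explicit storage level) followed by an action. The payoff is the usual compute-once, reuse-many-times trade-off, which a plain-Scala `lazy val` sketches (an analogy only, not Spark's caching machinery):

```scala
// Plain-Scala analogy for persist()/cache(): compute once, serve from cache after.
object CacheDemo {
  var computeCount = 0 // how many times the expensive work actually ran

  private def expensive(): Vector[Int] = {
    computeCount += 1
    (1 to 1000).map(_ * 2).toVector // stand-in for recomputing a full lineage
  }

  // lazy val: computed on first access, cached for every later access,
  // roughly what a persisted DataFrame does for downstream actions.
  lazy val cached: Vector[Int] = expensive()
}
```

Reading `CacheDemo.cached` repeatedly runs `expensive()` only once; without caching, every Spark action would recompute the full lineage instead.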

5. Environment

The Environment page has five parts and is an excellent place to check whether your properties have been set correctly.

  • Runtime Information: contains the runtime properties like versions of Java and Scala.
  • Spark Properties: lists the application properties like ‘spark.app.name’ and ‘spark.driver.memory’.
  • Hadoop Properties: displays properties relating to Hadoop and YARN. Note: properties whose names start with ‘spark.hadoop’ are shown not in this part but under ‘Spark Properties.’
  • System Properties: shows more details about the JVM.
  • Classpath Entries: lists the classes loaded from different sources, which is very useful to resolve class conflicts.
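For example, the entries under ‘Spark Properties’ can be set in spark-defaults.conf (or via --conf / SparkConf); the values below are illustrative, not the ones from this application:

```
spark.app.name        SparkUIDemo
spark.driver.memory   4g
spark.executor.cores  2
```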

bigdata_07.PNG

6. Executors

The Executors tab displays summary information about the executors created for the application, including memory and disk usage and task and shuffle information. The Storage Memory column shows the amount of memory used and reserved for caching data.

The Executors tab provides resource information, like the amount of memory, disk, and cores of each executor, as well as performance information. In our case, the Executors tab shows Number of cores = 8 and Number of tasks = 7.

bigdata_08.PNG

7. SQL Tab

If the application executes Spark SQL queries, the SQL tab displays information such as the duration, the Spark jobs, and the physical and logical plans for the queries. In our application, we created DataFrames from raw data, joined them, and called show(). Here our Spark application executed seven SQL queries.

bigdata_09.PNG

Conclusion

This completes the description of the Spark web UI and its tabs, covering the information they contain about the executing Spark application: jobs, stages, tasks, executors, garbage collection, and data-shuffling information.

