Understanding Apache Spark’s Execution Model Using SparkListeners
When you execute an action on an RDD, Apache Spark runs a job, which the DAGScheduler splits into stages and the TaskScheduler executes as tasks. These are low-level details that often become useful to understand when a seemingly simple transformation is no longer simple performance-wise and takes ages to complete.
There are a few ways to monitor Spark. The WebUI is the most obvious choice, with toDebugString and logs at the other end of the spectrum – still useful, but requiring more skill than opening a browser at http://localhost:4040 and looking at the Details for Stage page in the Stages tab for a given job.
There is, however, another way that is not used so often, which I am going to present in this blog post – Scheduler Listeners.
Scheduler Listeners
A scheduler listener (also known as a SparkListener) is a class that listens to execution events from Spark’s DAGScheduler – the main part of Spark’s execution engine. It extends org.apache.spark.scheduler.SparkListener.
A SparkListener can receive events about when applications, jobs, stages, and tasks start and complete, as well as other infrastructure-centric events like drivers being added or removed, an RDD being unpersisted, or environment properties changing. In fact, all the information the WebUI presents about the health of Spark applications and the entire infrastructure comes from these very events – the WebUI itself is backed by a SparkListener.
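To get a feel for the interface, here is a minimal sketch of a custom listener (CompletionListener is my own name, not a Spark class) that prints a line whenever a job or a stage finishes. You would register it with sc.addSparkListener(new CompletionListener) – more on registration below.

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

// Hypothetical listener that reports job and stage completions.
// SparkListener's callbacks are no-ops by default, so you only
// override the events you are interested in.
class CompletionListener extends SparkListener {

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended with result ${jobEnd.jobResult}")

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} (${info.name}) completed with ${info.numTasks} tasks")
  }
}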
spark.extraListeners
By default, Spark starts with no listeners but the ones for the WebUI. You can, however, change the default behaviour using the spark.extraListeners setting (default: empty).
spark.extraListeners is a comma-separated list of listener class names that are registered with Spark’s listener bus when SparkContext is initialized. Each class has to have either a zero-argument constructor or a constructor that accepts a SparkConf, so Spark can instantiate it.
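For example, a minimal sketch of setting it programmatically when building the SparkConf (the application name is made up; StatsReportListener ships with Spark and is described below):

import org.apache.spark.{SparkConf, SparkContext}

// Spark instantiates every class listed in spark.extraListeners
// when the SparkContext starts up.
val conf = new SparkConf()
  .setAppName("extra-listeners-demo") // hypothetical application name
  .setMaster("local[*]")
  .set("spark.extraListeners", "org.apache.spark.scheduler.StatsReportListener")
val sc = new SparkContext(conf)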
You can see the extra listeners being registered in the logs:
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener
StatsReportListener – Logging summary statistics
Interestingly, Spark comes with two listeners that are worth knowing about – org.apache.spark.scheduler.StatsReportListener and org.apache.spark.scheduler.EventLoggingListener.
Let’s focus on StatsReportListener first, and leave EventLoggingListener for the next blog post.
org.apache.spark.scheduler.StatsReportListener (see the class’ scaladoc) is a SparkListener that logs summary statistics when a stage completes.
It listens to SparkListenerTaskEnd and SparkListenerStageCompleted events, and prints the summary at INFO level to the logs:
15/11/04 15:39:45 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler.StageInfo@4d3956a4
15/11/04 15:39:45 INFO StatsReportListener: task runtime:(count: 8, mean: 36.625000, stdev: 5.893588, max: 52.000000, min: 33.000000)
15/11/04 15:39:45 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
15/11/04 15:39:45 INFO StatsReportListener:     33.0 ms 33.0 ms 33.0 ms 34.0 ms 35.0 ms 36.0 ms 52.0 ms 52.0 ms 52.0 ms
15/11/04 15:39:45 INFO StatsReportListener: task result size:(count: 8, mean: 953.000000, stdev: 0.000000, max: 953.000000, min: 953.000000)
15/11/04 15:39:45 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
15/11/04 15:39:45 INFO StatsReportListener:     953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B 953.0 B
15/11/04 15:39:45 INFO StatsReportListener: executor (non-fetch) time pct: (count: 8, mean: 17.660220, stdev: 1.948627, max: 20.000000, min: 13.461538)
15/11/04 15:39:45 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
15/11/04 15:39:45 INFO StatsReportListener:     13 %    13 %    13 %    17 %    18 %    20 %    20 %    20 %    20 %
15/11/04 15:39:45 INFO StatsReportListener: other time pct: (count: 8, mean: 82.339780, stdev: 1.948627, max: 86.538462, min: 80.000000)
15/11/04 15:39:45 INFO StatsReportListener:     0%      5%      10%     25%     50%     75%     90%     95%     100%
15/11/04 15:39:45 INFO StatsReportListener:     80 %    80 %    80 %    82 %    82 %    83 %    87 %    87 %    87 %
To enable the listener, you register it with SparkContext – either programmatically, using the SparkContext.addSparkListener(listener: SparkListener) method inside your Spark application, or with the --conf command-line option.
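Programmatically, a minimal sketch (assuming a live SparkContext called sc, as in spark-shell):

import org.apache.spark.scheduler.StatsReportListener

// Attach the listener to an already-running SparkContext.
sc.addSparkListener(new StatsReportListener)

Or equivalently, from the command line: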
$ ./bin/spark-shell --conf spark.extraListeners=org.apache.spark.scheduler.StatsReportListener
...
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener
Once the listener is registered, you should see the INFO message above, followed by the summary statistics after every stage completes.
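One caveat: StatsReportListener logs at INFO level, so if your logging configuration filters out INFO messages, the summaries will not show up. A quick way to allow them through from within the application (sc.setLogLevel has been available since Spark 1.4):

// Allow INFO messages through the logging configuration.
sc.setLogLevel("INFO")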
With the listener, your Spark operations toolbox now has another tool to fight bottlenecks in Spark applications, besides the WebUI and logs. Happy tuning!