
Spark Performance & Troubleshooting Guide

Key Spark Configuration Parameters

The exam tests your knowledge of basic Spark tuning parameters:

- spark.sql.shuffle.partitions (default: 200): Number of partitions produced by a shuffle in DataFrame and SQL operations such as joins and aggregations. Reduce it for small datasets and increase it for large ones.
- spark.default.parallelism: Default number of partitions for RDD operations such as join, reduceByKey, and parallelize when no partition count is given.
- spark.executor.memory: Memory allocated per executor.
- spark.driver.memory: Memory allocated to the driver.
- spark.sql.autoBroadcastJoinThreshold (default: 10MB): Tables whose estimated size is within this threshold are automatically broadcast during joins. Setting it to -1 disables automatic broadcasting.
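As a rough illustration of how the two defaults above come into play, the sketch below derives a shuffle partition count from an estimated shuffle size and checks a table size against the broadcast threshold. The helper names, the 128 MiB target partition size, and the 50 GiB sample size are assumptions made for this example, not Spark settings; only the parameter names and their defaults (200 and 10MB) come from the list above.

```python
import math

# Defaults quoted in the parameter list above.
DEFAULT_SHUFFLE_PARTITIONS = 200
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10MB

# Assumed rule of thumb, not a Spark setting: aim for about 128 MiB per shuffle partition.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Hypothetical helper: partitions needed so each holds about target_bytes."""
    return max(1, math.ceil(shuffle_bytes / target_bytes))


def is_auto_broadcast_candidate(table_bytes, threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """True when the estimated size is at most the threshold; -1 disables broadcasting."""
    return 0 <= table_bytes <= threshold_bytes


def runtime_settings(shuffle_bytes):
    """Build the SQL settings that can be changed on a running session."""
    return {
        "spark.sql.shuffle.partitions": str(suggest_shuffle_partitions(shuffle_bytes)),
        "spark.sql.autoBroadcastJoinThreshold": str(DEFAULT_BROADCAST_THRESHOLD_BYTES),
    }


# Assumed workload: a stage that shuffles roughly 50 GiB.
settings = runtime_settings(shuffle_bytes=50 * 1024**3)
for key, value in settings.items():
    print(f"{key} = {value}")
```

With an active SparkSession named spark, each pair could be applied with spark.conf.set(key, value). spark.executor.memory and spark.driver.memory are fixed when the application or cluster starts, so they belong in the cluster or spark-submit configuration rather than in a runtime call.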

Exam Tip: You need to know the default values for spark.sql.shuffle.partitions (200) and spark.sql.autoBroadcastJoinThreshold (10MB). The exam asks you to identify the correct parameter to tune for a given performance problem.

Common Performance Bottlenecks

Three main bottlenecks to identify from the Spark UI:

1. Data Skew: One partition holds significantly more data than the others, so one task runs much longer than the rest. Fix: Salt the skewed keys, repartition, or use adaptive query execution (AQE), which can split skewed join partitions.
2. Shuffle Spill (Disk Spill): Data exceeds executor memory during a shuffle and spills to disk. Fix: Increase executor memory or reduce partition size, for example by raising spark.sql.shuffle.partitions.
3. Excessive Shuffling: Too many shuffle operations in the query plan. Fix: Reduce joins, use broadcast joins for small tables, or cache intermediate results.
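Key salting, listed above as a fix for data skew, can be sketched in plain Python without a cluster. The example below is a simplified simulation under stated assumptions: the dataset, the salt range of 16, and all function names are made up for illustration, and the size of the largest key group stands in for the input of the slowest task.

```python
import random
from collections import Counter

NUM_SALTS = 16  # assumed salt range; in practice it depends on how severe the skew is


def salt_key(key, rng, num_salts=NUM_SALTS):
    """Skewed side: append a random salt so one hot key becomes num_salts keys."""
    return f"{key}#{rng.randrange(num_salts)}"


def replicate_key(key, num_salts=NUM_SALTS):
    """Other side of the join: emit every salted variant so no match is lost."""
    return [f"{key}#{i}" for i in range(num_salts)]


def largest_group(keys):
    """Rows in the biggest key group, a stand-in for the slowest task's input."""
    return max(Counter(keys).values())


rng = random.Random(42)  # fixed seed so the run is repeatable

# Made-up skewed dataset: one hot key with 9,000 rows plus 1,000 unique keys.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
salted = [salt_key(k, rng) for k in keys]

print("largest group before salting:", largest_group(keys))
print("largest group after salting: ", largest_group(salted))
```

The trade-off is that the non-skewed side of the join grows by a factor of the salt range, because every key there is replicated once per salt value.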

Exam Tip: The exam shows Spark UI screenshots and asks you to identify the bottleneck. Look for: uneven task durations (skew), spill metrics (disk spill), or high shuffle read/write (excessive shuffling).

Using the Spark UI

The Spark UI provides several views for diagnosing performance:

- Jobs tab: Overview of all jobs, their stages, and completion status.
- Stages tab: Detailed view of each stage, including task metrics, shuffle sizes, and duration distribution.
- SQL/DataFrame tab: Query execution plans showing physical operators.
- Storage tab: Cached DataFrames and their memory usage.

Key metrics to check in the Stages tab:

- Duration distribution (min, median, max): large variance indicates skew.
- Shuffle Read/Write sizes: large values indicate expensive shuffles.
- Spill (Memory) and Spill (Disk): non-zero values indicate memory pressure.
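The Stages tab checks above can be written down as a small decision rule. The sketch below is only an illustration: the function name, the skew ratio of 5, and the 10 GiB shuffle threshold are assumptions chosen for the example, not values defined by Spark or Databricks, and the inputs are numbers you would read off the Stages tab by hand.

```python
from statistics import median

# Assumed thresholds for illustration only.
SKEW_RATIO = 5.0                     # max task duration divided by median task duration
LARGE_SHUFFLE_BYTES = 10 * 1024**3   # 10 GiB of shuffle read plus write


def diagnose_stage(task_durations_s, spill_disk_bytes, shuffle_read_bytes, shuffle_write_bytes):
    """Map Stages tab metrics to the bottlenecks they most likely indicate."""
    findings = []

    # Uneven task durations: one task far slower than the typical task suggests skew.
    if task_durations_s:
        typical = median(task_durations_s)
        if typical > 0 and max(task_durations_s) / typical >= SKEW_RATIO:
            findings.append("data skew")

    # Any disk spill means the shuffle data did not fit in executor memory.
    if spill_disk_bytes > 0:
        findings.append("shuffle spill")

    # Large shuffle volume points to expensive or excessive shuffling.
    if shuffle_read_bytes + shuffle_write_bytes >= LARGE_SHUFFLE_BYTES:
        findings.append("excessive shuffling")

    return findings


# Made-up stage: four tasks near 12 s and one straggler at 240 s, no spill, 3 GiB shuffled.
print(diagnose_stage([12, 11, 13, 12, 240], 0, 2 * 1024**3, 1 * 1024**3))
```

A stage can match more than one rule at once, which is why the function returns a list rather than a single label.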

Cluster Diagnostics

Common cluster issues and their causes:

- Cluster startup failure: Insufficient quota in the cloud provider, an invalid instance type, or VPC/networking issues.
- Library conflicts: Incompatible Python package versions between cluster libraries and notebook libraries.
- Out-of-memory (OOM): Driver OOM from collecting too much data; executor OOM from insufficient memory for the workload.

Diagnostic tools: Cluster event log, driver logs, and Ganglia metrics (cluster-level resource usage; on Databricks Runtime 13 and later, the compute metrics UI replaces Ganglia).
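One way to investigate a suspected library conflict is to compare the versions you expect with the versions the notebook's Python process actually sees. The sketch below uses only the standard library; the function names and the idea of keeping expected versions in a dictionary are assumptions for illustration, not a Databricks feature.

```python
from importlib import metadata


def installed_version(package):
    """Return the version visible to this Python process, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected_pins, lookup=installed_version):
    """Compare expected pins (package -> version) with what is actually installed.

    Returns {package: (expected, actual)} for every package whose version differs.
    """
    mismatches = {}
    for package, expected in expected_pins.items():
        actual = lookup(package)
        if actual != expected:
            mismatches[package] = (expected, actual)
    return mismatches


# Hypothetical comparison with made-up versions; the lookup is stubbed so the
# example does not depend on what is installed where it runs.
seen_in_notebook = {"examplelib": "2.1.0", "otherlib": "1.4.2"}
expected_on_cluster = {"examplelib": "2.0.0", "otherlib": "1.4.2"}
print(find_mismatches(expected_on_cluster, lookup=seen_in_notebook.get))
```

Calling find_mismatches without the lookup argument checks the packages installed in the current environment instead of the stubbed dictionary.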

Exam Tip: If the exam mentions collecting data to the driver (e.g., df.collect() or df.toPandas()), the likely issue is driver OOM. The fix is to avoid collecting large datasets or increase driver memory.
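To make the "avoid collecting large datasets" advice concrete, the sketch below wraps collection in a row-count guard. The function name, the exception class, and the 10,000-row default are assumptions for illustration; the only PySpark DataFrame methods it relies on are limit() and collect().

```python
class ResultTooLargeError(RuntimeError):
    """Raised when a result has more rows than the caller allowed."""


def collect_with_limit(df, max_rows=10_000):
    """Collect at most max_rows rows to the driver, failing fast on larger results.

    df is expected to be a PySpark DataFrame; only limit() and collect() are used.
    """
    # Request one extra row so an oversized result is detected without
    # pulling the whole dataset into driver memory.
    rows = df.limit(max_rows + 1).collect()
    if len(rows) > max_rows:
        raise ResultTooLargeError(
            f"result has more than {max_rows} rows; "
            "filter, aggregate, or write it to storage instead of collecting"
        )
    return rows
```

A row cap is only a proxy for memory use, since very wide rows can still exhaust the driver, so the limit should be chosen with the row size in mind.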

Predictive Optimization

Predictive Optimization is a Databricks feature that automatically runs maintenance operations (OPTIMIZE, VACUUM) on Delta tables based on usage patterns. When enabled, Databricks monitors table access patterns and automatically:

- Compacts small files (OPTIMIZE)
- Removes data files that the table no longer references (VACUUM)
- Applies Liquid Clustering to tables that have it enabled

This reduces manual maintenance overhead for production tables.
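For comparison, the sketch below builds the SQL statements that would otherwise be run by hand for the maintenance described above, plus a statement for turning the feature on for a schema. The function names and the table and schema names are hypothetical, and the ENABLE PREDICTIVE OPTIMIZATION form reflects the Databricks SQL syntax as understood at the time of writing, so check it against the current documentation before relying on it.

```python
def manual_maintenance_sql(table_name, retain_hours=None):
    """SQL for the maintenance that Predictive Optimization would otherwise schedule."""
    vacuum = f"VACUUM {table_name}"
    if retain_hours is not None:
        # Optional retention window for files no longer referenced by the table.
        vacuum += f" RETAIN {retain_hours} HOURS"
    return [f"OPTIMIZE {table_name}", vacuum]


def enable_predictive_optimization_sql(schema_name):
    """Statement form assumed from Databricks documentation; verify before use."""
    return f"ALTER SCHEMA {schema_name} ENABLE PREDICTIVE OPTIMIZATION"


# Hypothetical object names used only for illustration.
for statement in manual_maintenance_sql("main.sales.orders", retain_hours=168):
    print(statement)
print(enable_predictive_optimization_sql("main.sales"))
```

In a Databricks notebook, each string could be passed to spark.sql(...). Once Predictive Optimization is enabled, the first two statements no longer need to be scheduled manually for the tables it covers.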

Exam Tip: Know that Predictive Optimization exists and what it does: it automates OPTIMIZE and VACUUM. The exam may ask about it as a maintenance strategy.