Key Spark Configuration Parameters
The exam tests your knowledge of basic Spark tuning parameters:
- spark.sql.shuffle.partitions (default: 200): Number of partitions produced by a shuffle in DataFrame and SQL operations (joins, aggregations). Reduce it for small datasets, increase it for large ones.
- spark.default.parallelism: Default number of partitions for RDD operations such as join, reduceByKey, and parallelize when the caller does not set one.
- spark.executor.memory: Memory allocated per executor.
- spark.driver.memory: Memory allocated to the driver.
- spark.sql.autoBroadcastJoinThreshold (default: 10MB): Tables whose estimated size is below this threshold are automatically broadcast during joins. Setting it to -1 disables automatic broadcasting.
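The two size-based parameters above can be reasoned about with simple arithmetic. The sketch below is plain Python, not a Spark API: the helper names and the 128 MiB target partition size are assumptions chosen for illustration (a commonly quoted rule of thumb, not a Spark default), while the 10 MiB threshold matches the default stated above.

```python
import math

MIB = 1024 * 1024

def suggest_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * MIB):
    """Hypothetical helper: partition count that keeps each shuffle
    partition near the assumed target size (128 MiB here)."""
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

def fits_broadcast_threshold(table_bytes, threshold_bytes=10 * MIB):
    """Hypothetical helper mirroring the idea behind
    spark.sql.autoBroadcastJoinThreshold; a negative threshold
    means automatic broadcasting is disabled."""
    if threshold_bytes < 0:
        return False
    return 0 <= table_bytes <= threshold_bytes

print(suggest_shuffle_partitions(10 * 1024 * MIB))  # 10 GiB shuffle -> 80
print(fits_broadcast_threshold(2 * MIB))            # True
print(fits_broadcast_threshold(500 * MIB))          # False
```

In a live session the suggested count would typically be applied with spark.conf.set("spark.sql.shuffle.partitions", n) before running the query.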
Common Performance Bottlenecks
Three main bottlenecks to identify from the Spark UI:
1. Data Skew: One partition holds significantly more data than the others, so one task runs much longer than the rest of its stage. Fix: Salt keys, repartition, or use adaptive query execution.
2. Shuffle Spill (Disk Spill): Data exceeds executor memory during a shuffle and spills to disk. Fix: Increase executor memory, or reduce partition size by increasing the number of shuffle partitions.
3. Excessive Shuffling: Too many shuffle operations in the query plan. Fix: Reduce joins, use broadcast joins for small tables, or cache intermediate results.
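Key salting, the first fix listed for data skew, can be illustrated without a cluster. The following is a minimal pure-Python sketch, assuming a hypothetical salt_keys helper and a round-robin salt so the result is deterministic; real Spark jobs usually add a random salt column and must also replicate the other side of the join once per salt value.

```python
from collections import Counter

def salt_keys(keys, num_salts):
    """Hypothetical helper: append a salt suffix (0..num_salts-1) to each
    key, assigned round-robin per key, so a hot key becomes several keys."""
    seen = Counter()
    salted = []
    for key in keys:
        salt = seen[key] % num_salts
        seen[key] += 1
        salted.append(f"{key}_{salt}")
    return salted

def largest_group(keys):
    """Size of the biggest group of identical keys, i.e. the worst-case
    number of records that must land in a single partition."""
    return Counter(keys).most_common(1)[0][1]

# Illustrative data: one hot key dominates the dataset.
keys = ["hot"] * 8 + ["a", "b"]
salted = salt_keys(keys, num_salts=4)

print(largest_group(keys))    # 8
print(largest_group(salted))  # 2
```

The hot key's 8 records are split across 4 salted keys of 2 records each, which is the effect that lets the shuffle spread them over several tasks.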
Using the Spark UI
The Spark UI provides several views for diagnosing performance:
- Jobs tab: Overview of all jobs, their stages, and completion status
- Stages tab: Detailed view of each stage including task metrics, shuffle sizes, and duration distribution
- SQL/DataFrame tab: Query execution plans showing physical operators
- Storage tab: Cached DataFrames and their memory usage
Key metrics to check in the Stages tab:
- Duration distribution (min, median, max): large variance indicates skew
- Shuffle Read/Write sizes: large values indicate expensive shuffles
- Spill (Memory) and Spill (Disk): non-zero values indicate memory pressure
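The duration check from the Stages tab can be expressed as a small calculation. This is only a sketch in plain Python: the looks_skewed helper, the sample durations, and the 3x max-to-median threshold are all assumptions for illustration, not values defined by Spark or the Spark UI.

```python
from statistics import median

def looks_skewed(task_durations_s, ratio_threshold=3.0):
    """Hypothetical heuristic: flag a stage when its slowest task takes
    much longer than the median task (threshold is an assumption)."""
    if not task_durations_s:
        return False
    mid = median(task_durations_s)
    if mid <= 0:
        return False
    return max(task_durations_s) / mid > ratio_threshold

# Illustrative task durations in seconds, as read from a stage summary.
balanced_stage = [10, 11, 12, 11, 13]
skewed_stage = [10, 11, 12, 11, 95]

print(looks_skewed(balanced_stage))  # False
print(looks_skewed(skewed_stage))    # True
```

Comparing against the median rather than the mean keeps a single straggler from masking itself by inflating the average.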
Cluster Diagnostics
Common cluster issues and their causes:
- Cluster startup failure: Insufficient quota in the cloud provider, invalid instance type, or VPC/networking issues
- Library conflicts: Incompatible Python package versions between cluster-scoped libraries and notebook-scoped libraries
- Out-of-memory (OOM): Driver OOM from collecting too much data to the driver (for example, calling collect() on a large DataFrame); Executor OOM from insufficient memory for the workload
Diagnostic tools: Cluster event log, driver logs, Ganglia metrics (cluster-level resource usage; newer Databricks Runtime versions replace Ganglia with the built-in compute metrics view)
Predictive Optimization
Predictive Optimization is a Databricks feature that automatically runs maintenance operations (OPTIMIZE, VACUUM) on Unity Catalog managed Delta tables based on usage patterns. When enabled, Databricks monitors table access patterns and automatically:
- Compacts small files (OPTIMIZE)
- Removes old files that the table no longer references (VACUUM)
- Incrementally clusters data on tables that have Liquid Clustering enabled (as part of OPTIMIZE)
This reduces manual maintenance overhead for production tables.