You Can Pass This Exam For Free
Choose Your Study Path
No prior Spark or distributed computing experience. You'll build foundational knowledge from scratch over 4 weeks.
Exam Overview
Format
60 questions, 120 minutes. Multiple choice (single select and multiple select). Code-heavy — most questions present PySpark or Spark SQL code snippets and ask you to predict behavior, fix errors, or choose the correct implementation.
Scoring
Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure.
Domains & Weights
- Apache Spark Architecture: 17%
- Apache Spark DataFrame API: 34%
- Apache Spark SQL: 17%
- Spark Structured Streaming: 8%
- Spark Performance and Optimization: 12%
- Delta Lake and Spark Ecosystem: 12%
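To turn these weights into a rough study budget, the sketch below converts each percentage into an approximate question count, using the 60-question format and 70% passing score stated above. It assumes every question carries equal weight, which the guide does not state, so treat the numbers as estimates.

```python
# Assumption: all questions are weighted equally (not confirmed by the guide).
TOTAL_QUESTIONS = 60
PASSING_PERCENT = 70

DOMAIN_WEIGHTS = {
    "Apache Spark Architecture": 17,
    "Apache Spark DataFrame API": 34,
    "Apache Spark SQL": 17,
    "Spark Structured Streaming": 8,
    "Spark Performance and Optimization": 12,
    "Delta Lake and Spark Ecosystem": 12,
}

# Approximate question count per domain, rounded to the nearest whole question.
expected_questions = {
    domain: round(TOTAL_QUESTIONS * weight / 100)
    for domain, weight in DOMAIN_WEIGHTS.items()
}

# Smallest number of correct answers that reaches the passing percentage
# (integer ceiling division).
min_correct = (TOTAL_QUESTIONS * PASSING_PERCENT + 99) // 100

for domain, count in expected_questions.items():
    print(f"{domain}: ~{count} questions")
print(f"Correct answers needed to pass: {min_correct}")
```

Because each domain is rounded separately, the per-domain estimates need not add up to exactly 60.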
Registration
$200 USD. Available at Kryterion testing centers or through online proctoring. Schedule at databricks.com/certification.
Topic Priority Table
Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.
Apache Spark Architecture
At 17% of the exam, this domain tests your understanding of Spark's distributed execution model. You must know the driver-executor architecture, how jobs are broken into stages and tasks, deployment modes, fault tolerance, and the role of the cluster manager. Expect conceptual questions about execution flow.
Key Topics
Must-Know Concepts
- SparkSession is the unified entry point for Spark functionality. It supersedes SQLContext and HiveContext from earlier Spark versions and wraps SparkContext, which remains accessible as spark.sparkContext
- The driver process runs on one node, creates the execution plan (DAG), and distributes tasks to executors. Executors run on worker nodes and perform actual data processing
- Execution hierarchy: Application > Job > Stage > Task. Jobs are triggered by actions, stages are separated by shuffle boundaries, tasks operate on individual partitions
- Deployment modes: client mode runs the driver on the submitting machine; cluster mode runs the driver on a cluster node. Client mode is for interactive development, cluster mode for production
- Fault tolerance: if an executor fails, Spark recomputes lost partitions using lineage (the DAG). If the driver fails, the entire application fails
- The number of tasks in a stage equals the number of partitions being processed — each task processes exactly one partition on one core
Common Exam Traps
Apache Spark DataFrame API
The largest domain at 34% — one-third of the exam. Tests your ability to create, transform, and manipulate DataFrames using PySpark. Expect questions on select, filter, join, groupBy, withColumn, reading/writing data, column expressions, null handling, UDFs, and schema operations. You must read code and predict output.
Key Topics
Must-Know Concepts
- Creating DataFrames: spark.read.format().option().load(), spark.createDataFrame(), and reading from Delta/Parquet/CSV/JSON
- Column selection and manipulation: select(), withColumn(), withColumnRenamed(), drop(), cast(), alias(). Know both col() and df['col'] syntax
- Filtering: filter() and where() are identical. Use Column expressions: col('age') > 18, col('name').isNotNull(), col('status').isin(['active', 'pending'])
- Joins: df1.join(df2, on='id', how='inner'). Types: inner, left, right, full, cross, semi (left_semi), anti (left_anti). Know duplicate column handling
- Aggregations: groupBy().agg(count(), sum(), avg(), max(), min(), collect_list(), collect_set()). After groupBy, only grouped and aggregated columns remain
- Reading/writing: spark.read.csv(path, header=True, schema=schema), df.write.mode('overwrite').partitionBy('date').parquet(path). Know all save modes
- Schema: StructType([StructField('name', StringType(), True)]), DDL strings ('name STRING, age INT'), and df.printSchema()/df.schema
- Null handling: isNull(), isNotNull(), na.drop(), na.fill(), coalesce() for choosing the first non-null value from multiple columns
Common Exam Traps
Apache Spark SQL
At 17%, this domain tests your ability to write SQL queries using Spark SQL. You must register DataFrames as views, write complex queries with joins, aggregations, window functions, and subqueries. Questions test both correct syntax and understanding of SQL semantics in a distributed context.
Key Topics
Must-Know Concepts
- Register DataFrames for SQL: df.createOrReplaceTempView('my_table') then spark.sql('SELECT * FROM my_table'). The result is a DataFrame
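- Window functions: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER(), AVG() OVER(). Every window function needs an OVER clause. Ranking and offset functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD) require ORDER BY inside it; PARTITION BY is optional, and aggregates such as SUM() can use OVER() with neither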
- CTEs: WITH cte_name AS (SELECT ...) SELECT * FROM cte_name. Useful for readability and deduplication patterns
- Subqueries: scalar subqueries in SELECT, correlated subqueries in WHERE, and IN/EXISTS subqueries for semi-join patterns
- CASE WHEN syntax: CASE WHEN condition THEN value WHEN condition2 THEN value2 ELSE default END
- spark.sql() returns a DataFrame — you can chain DataFrame methods: spark.sql('SELECT * FROM t').filter(col('x') > 10)
Common Exam Traps
Spark Structured Streaming
At 8%, this is the smallest domain but still yields ~5 questions. Tests your understanding of Structured Streaming concepts: the unbounded table model, readStream/writeStream, output modes, triggers, watermarks, and checkpointing. Expect conceptual questions about when to use each output mode.
Key Topics
Must-Know Concepts
- Structured Streaming treats streaming data as an unbounded table — new rows are appended continuously and queries produce incremental results
- readStream starts a streaming source: spark.readStream.format('kafka').option(...).load(). writeStream starts the output: df.writeStream.format('delta').start()
- Output modes: append (only new rows, default), complete (rewrites the full result; requires an aggregation), update (only rows changed since the last trigger; without an aggregation it behaves like append)
- Triggers: processingTime='10 seconds' for fixed-interval micro-batches, availableNow=True to process all available data and then stop, continuous (with a checkpoint interval) for low-latency processing (experimental)
- Checkpointing: stores streaming progress (offsets, state) for fault tolerance. Set via .option('checkpointLocation', '/path'). Required for exactly-once guarantees
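- Watermarks: handle late-arriving data with withWatermark('eventTime', '10 minutes'). The watermark trails the maximum event time seen so far by the given delay; data older than the watermark may be dropped, which lets Spark discard old state. Required for event-time windowed aggregations in append mode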
Common Exam Traps
Spark Performance and Optimization
At 12%, this domain tests your ability to diagnose and fix performance issues. Covers shuffle operations, data skew, partitioning strategies, caching, broadcast joins, Adaptive Query Execution, and reading the Spark UI. Expect scenario-based questions: 'a job is slow — what's the most likely cause?'
Key Topics
Must-Know Concepts
- Shuffle operations (wide transformations) are the primary performance bottleneck in Spark — data must be redistributed across the network between executors
- spark.sql.shuffle.partitions defaults to 200 — reduce for small datasets (to avoid tiny partitions) and increase for large datasets (to avoid oversized partitions)
- Data skew: when one partition has significantly more data than others, causing a single task to run much longer. Fix with salting, repartitioning, or broadcast join
- Broadcast joins eliminate shuffles for the large table — the smaller table is replicated to all executors. Default auto-broadcast threshold is 10MB
- Caching: use cache()/persist() when a DataFrame is reused multiple times. Caching a DataFrame used only once wastes memory. unpersist() to release
- Adaptive Query Execution (AQE): runtime optimization enabled by default (since Spark 3.2 in open-source Spark). Coalesces shuffle partitions, handles skewed joins, and converts joins to broadcast when one side is small
Common Exam Traps
Delta Lake and Spark Ecosystem
At 12%, this domain tests your knowledge of Delta Lake features and the broader Spark ecosystem. Covers ACID transactions, time travel, schema enforcement and evolution, MERGE INTO, OPTIMIZE, VACUUM, and how Delta Lake integrates with the Spark processing engine. Know the SQL syntax for Delta-specific operations.
Key Topics
Must-Know Concepts
- Delta Lake adds ACID transactions to Spark via a transaction log (_delta_log directory) stored alongside the Parquet data files
- Time travel: query historical versions with SELECT * FROM t VERSION AS OF 3 or TIMESTAMP AS OF '2026-01-01'. Use DESCRIBE HISTORY t to see all versions
- Schema enforcement: by default, writes that don't match the table schema are rejected. This prevents data corruption from schema mismatches
- Schema evolution: enable with .option('mergeSchema', 'true') on write or spark.databricks.delta.schema.autoMerge.enabled = true globally. New columns are added automatically
- MERGE INTO: upsert pattern — MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
- OPTIMIZE compacts small files into larger ones. VACUUM removes old data files beyond the retention period (default 7 days). Both are maintenance operations
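The Delta-specific SQL above can be issued from PySpark through spark.sql(). The sketch below only builds the statement strings, since running them requires a Delta-enabled SparkSession that is not set up here. The helper names `merge_sql` and `maintenance_sql`, and the table names in the usage lines, are hypothetical.

```python
def merge_sql(target, source, key):
    """Build an upsert in the MERGE INTO form quoted above.

    Hypothetical helper: identifiers are interpolated without validation,
    so pass only trusted table and column names.
    """
    return (
        f"MERGE INTO {target} AS t USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def maintenance_sql(table, version):
    """Return history, time-travel, and maintenance statements for a table."""
    return [
        f"DESCRIBE HISTORY {table}",
        f"SELECT * FROM {table} VERSION AS OF {version}",
        f"OPTIMIZE {table}",
        f"VACUUM {table}",  # uses the default retention period
    ]


upsert = merge_sql("customers", "customer_updates", "id")
statements = maintenance_sql("customers", 3)
print(upsert)

# With a Delta-enabled session named `spark` (an assumption, not created here):
#     spark.sql(upsert)
#     for stmt in statements:
#         spark.sql(stmt)
```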
Common Exam Traps
Key Spark Concepts Compared
These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.
Top Mistakes to Avoid
Exam-Ready Checklist
Recommended Resources
Free & Official Resources
Paid Courses & Practice Exams
These are recommended if you prefer a structured learning path. They can save time but are not required to pass.