CertPrepNow (free)
Databricks: Spark Dev Associate (updated 2026-06-05)

Spark Dev Associate Study Guide

Everything you need to pass the Databricks Certified Associate Developer for Apache Spark exam. Structured study plans, key topics, common traps, and practice questions.

You Can Pass This Exam For Free

The Databricks Certified Associate Developer for Apache Spark exam is passable with free resources if you study consistently for 2-4 weeks:

  • Apache Spark official documentation (spark.apache.org) — definitive reference for APIs and concepts
  • Databricks Academy free learning paths (Apache Spark Developer track)
  • Databricks Community Edition (free cluster for hands-on PySpark and Spark SQL practice)
  • 340+ free practice questions on this site covering all 6 exam domains

This exam is heavily code-focused — you must be able to read and write PySpark DataFrame operations and Spark SQL queries. Spin up a free Community Edition cluster and write real transformations, not just read about them.

Choose Your Study Path

No prior Spark or distributed computing experience. You'll build foundational knowledge from scratch over 4 weeks.

Day 1-2: Learn distributed computing fundamentals: why single-machine processing isn't enough, how data is split across nodes, the driver/executor model, and what a SparkSession is
Day 3-4: Spark architecture deep dive: cluster manager types, deployment modes (client vs cluster), executors, tasks, stages, and the execution hierarchy (Job > Stage > Task)
Day 5-7: DataFrame basics: creating DataFrames from files (CSV, JSON, Parquet), schema definition with StructType, selecting columns, filtering rows, and adding columns with withColumn
Day 8-9: Lazy evaluation and the execution model: understand transformations vs actions, narrow vs wide transformations, and why Spark builds a DAG before executing
Day 10-12: DataFrame transformations: joins (inner, left, right, anti, cross), groupBy with aggregations (count, sum, avg, max), union/unionByName, and handling nulls
Day 13-14: Column operations: withColumnRenamed, cast, when/otherwise, coalesce, lit, col, expr. Practice writing complex column expressions
Day 15-16: Reading and writing data: DataFrameReader/Writer API, file formats (Parquet, CSV, JSON, Delta), partitionBy on write, save modes (overwrite, append, errorIfExists)
Day 17-18: Spark SQL: creating temp views, running SQL queries, window functions (ROW_NUMBER, RANK, LAG, LEAD), subqueries, and CTEs
Day 19-20: UDFs and higher-order functions: writing Python UDFs, understanding performance implications, when to use built-in functions instead
Day 21-22: Structured Streaming basics: readStream/writeStream, output modes (append, complete, update), triggers, checkpointing, and watermarks
Day 23-24: Performance and optimization: partitioning, repartition vs coalesce, broadcast joins, caching/persisting, Adaptive Query Execution, and reading the Spark UI
Day 25-26: Delta Lake fundamentals: ACID transactions, time travel, schema enforcement vs evolution, OPTIMIZE, VACUUM, and MERGE INTO
Day 27: Practice questions across all 6 domains, review explanations carefully
Day 28: Take a full mock exam. Review all wrong answers. Retake if below 75%

Exam Overview

Format

60 questions, 120 minutes. Multiple choice (single select and multiple select). Code-heavy — most questions present PySpark or Spark SQL code snippets and ask you to predict behavior, fix errors, or choose the correct implementation.

Scoring

Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure.

Domains & Weights

  • Apache Spark Architecture: 17%
  • Apache Spark DataFrame API: 34%
  • Apache Spark SQL: 17%
  • Spark Structured Streaming: 8%
  • Spark Performance and Optimization: 12%
  • Delta Lake and Spark Ecosystem: 12%
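
To turn the weights into a study budget, the arithmetic can be sketched as below. This is a rough estimate only: it assumes the 60-question format stated above and that questions are allocated exactly in proportion to the published weights, which the exam does not guarantee.

```python
# Domain weights copied from the list above; 60 questions is the figure from the Format section.
TOTAL_QUESTIONS = 60

weights = {
    "Apache Spark Architecture": 17,
    "Apache Spark DataFrame API": 34,
    "Apache Spark SQL": 17,
    "Spark Structured Streaming": 8,
    "Spark Performance and Optimization": 12,
    "Delta Lake and Spark Ecosystem": 12,
}

# Expected question count per domain if allocation is strictly proportional.
expected = {domain: TOTAL_QUESTIONS * pct / 100 for domain, pct in weights.items()}

for domain, n in expected.items():
    print(f"{domain}: ~{n:.1f} questions")
```

Under that assumption the DataFrame API domain alone accounts for roughly 20 questions, which is why it deserves the largest share of practice time.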

Registration

Cost: $200 USD. Available through Kryterion testing centers or via online proctoring. Schedule at databricks.com/certification.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1: Must Know. You must understand these deeply: what they do, exact syntax, and common pitfalls. These appear across multiple domains and make up the bulk of exam questions.
Tier 2: Should Know. Understand what they do and key use cases. May appear in 3-8 questions each.
Tier 3: Recognize Only. Know what they do at a high level. Rarely more than 1-2 questions each.

Domain 1: 17% of exam

Apache Spark Architecture

At 17% of the exam, this domain tests your understanding of Spark's distributed execution model. You must know the driver-executor architecture, how jobs are broken into stages and tasks, deployment modes, fault tolerance, and the role of the cluster manager. Expect conceptual questions about execution flow.

Key Topics

SparkSession, Driver Process, Executors, Cluster Manager, DAG Scheduler, Task Scheduler

Must-Know Concepts

  • SparkSession is the single entry point for all Spark functionality. It unifies SQLContext and HiveContext from earlier Spark versions and wraps the SparkContext, which is still reachable as spark.sparkContext
  • The driver process runs on one node, creates the execution plan (DAG), and distributes tasks to executors. Executors run on worker nodes and perform actual data processing
  • Execution hierarchy: Application > Job > Stage > Task. Jobs are triggered by actions, stages are separated by shuffle boundaries, tasks operate on individual partitions
  • Deployment modes: client mode runs the driver on the submitting machine; cluster mode runs the driver on a cluster node. Client mode is for interactive development, cluster mode for production
  • Fault tolerance: if an executor fails, Spark recomputes lost partitions using lineage (the DAG). If the driver fails, the entire application fails
  • The number of tasks in a stage equals the number of partitions being processed — each task processes exactly one partition on one core

Common Exam Traps

The driver is a single point of failure — Spark cannot automatically recover from a driver crash. External tools (YARN, Kubernetes) must restart the application
Executors do NOT share memory with each other. Each executor's memory is isolated. A skewed partition that exceeds one executor's memory causes OOM even if other executors have spare memory
Slots (cores) determine parallelism, not the number of executors. 4 executors with 2 cores each = 8 tasks running in parallel, same as 2 executors with 4 cores each
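
The slot arithmetic in the last trap can be checked with a few lines of plain Python. This is a simplified model, not Spark code: the helper names are made up for illustration, and the "waves" estimate assumes every task takes about the same time.

```python
import math


def total_slots(executors: int, cores_per_executor: int) -> int:
    """Number of tasks that can run at the same moment (one task per core)."""
    return executors * cores_per_executor


def task_waves(partitions: int, slots: int) -> int:
    """Rounds of tasks needed to process every partition, assuming equal task durations."""
    return math.ceil(partitions / slots)


wide_cluster = total_slots(executors=4, cores_per_executor=2)
tall_cluster = total_slots(executors=2, cores_per_executor=4)
waves = task_waves(partitions=200, slots=wide_cluster)

print(wide_cluster, tall_cluster, waves)
```

Both layouts give 8 slots, so a 200-partition stage needs 25 waves on either one.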
Quick Check: Apache Spark Architecture

In a Spark cluster, the component that initializes the SparkSession and oversees the distribution of work across the cluster while maintaining application state is known as which of the following?

Domain 2: 34% of exam

Apache Spark DataFrame API

The largest domain at 34% — one-third of the exam. Tests your ability to create, transform, and manipulate DataFrames using PySpark. Expect questions on select, filter, join, groupBy, withColumn, reading/writing data, column expressions, null handling, UDFs, and schema operations. You must read code and predict output.

Key Topics

DataFrame, Column, Row, DataFrameReader, DataFrameWriter, Functions (pyspark.sql.functions)

Must-Know Concepts

  • Creating DataFrames: spark.read.format().option().load(), spark.createDataFrame(), and reading from Delta/Parquet/CSV/JSON
  • Column selection and manipulation: select(), withColumn(), withColumnRenamed(), drop(), cast(), alias(). Know both col() and df['col'] syntax
  • Filtering: filter() and where() are identical. Use Column expressions: col('age') > 18, col('name').isNotNull(), col('status').isin(['active', 'pending'])
  • Joins: df1.join(df2, on='id', how='inner'). Types: inner, left, right, full, cross, semi (left_semi), anti (left_anti). Know duplicate column handling
  • Aggregations: groupBy().agg(count(), sum(), avg(), max(), min(), collect_list(), collect_set()). After groupBy, only grouped and aggregated columns remain
  • Reading/writing: spark.read.csv(path, header=True, schema=schema), df.write.mode('overwrite').partitionBy('date').parquet(path). Know all save modes
  • Schema: StructType([StructField('name', StringType(), True)]), DDL strings ('name STRING, age INT'), and df.printSchema()/df.schema
  • Null handling: isNull(), isNotNull(), na.drop(), na.fill(), coalesce() for choosing the first non-null value from multiple columns

Common Exam Traps

withColumn() replaces a column if the name already exists — it does NOT always add a new column. This is a common source of confusion
select() accepts column-name strings or Column objects and always returns a DataFrame: df.select('name') and df.select(col('name')) are equivalent. By contrast, df['name'] on its own returns a Column, not a DataFrame
union() matches columns by position, not by name. unionByName() matches by column name. Using union() with differently-ordered schemas produces wrong results silently
After a join on a column that exists in both DataFrames using a string (on='id'), the join column appears once. Using an expression (df1.id == df2.id), both columns appear — causing ambiguity in subsequent operations
Quick Check: Apache Spark DataFrame API

A developer needs to combine data from two DataFrames based on a common column. Which operation will require a shuffle of data across partitions?

Domain 3: 17% of exam

Apache Spark SQL

At 17%, this domain tests your ability to write SQL queries using Spark SQL. You must register DataFrames as views, write complex queries with joins, aggregations, window functions, and subqueries. Questions test both correct syntax and understanding of SQL semantics in a distributed context.

Key Topics

spark.sql(), Temp Views, Window Functions, Common Table Expressions, Subqueries

Must-Know Concepts

  • Register DataFrames for SQL: df.createOrReplaceTempView('my_table') then spark.sql('SELECT * FROM my_table'). The result is a DataFrame
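  • Window functions: ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER(), AVG() OVER(). Every window function needs an OVER clause; ranking and offset functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD) also need ORDER BY inside it, while PARTITION BY is optional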
  • CTEs: WITH cte_name AS (SELECT ...) SELECT * FROM cte_name. Useful for readability and deduplication patterns
  • Subqueries: scalar subqueries in SELECT, correlated subqueries in WHERE, and IN/EXISTS subqueries for semi-join patterns
  • CASE WHEN syntax: CASE WHEN condition THEN value WHEN condition2 THEN value2 ELSE default END
  • spark.sql() returns a DataFrame — you can chain DataFrame methods: spark.sql('SELECT * FROM t').filter(col('x') > 10)

Common Exam Traps

Window functions cannot be used directly in WHERE clauses — wrap them in a subquery or CTE: SELECT * FROM (SELECT *, ROW_NUMBER() OVER(...) AS rn FROM t) WHERE rn = 1
Global temp views must be accessed with the global_temp prefix: SELECT * FROM global_temp.my_view. Omitting the prefix causes a 'table not found' error
spark.sql() executes SQL lazily — the query is compiled and optimized but not executed until an action is called on the resulting DataFrame
ORDER BY produces a total ordering across the whole dataset, which requires a range-partitioning shuffle of all rows (the result can still span many partitions). Use SORT BY to sort within each partition without a shuffle
Quick Check: Apache Spark SQL

A table contains customer records with duplicates. Which SQL approach correctly keeps only the most recent record per customer_id?

Domain 4: 8% of exam

Spark Structured Streaming

At 8%, this is the smallest domain but still yields ~5 questions. Tests your understanding of Structured Streaming concepts: the unbounded table model, readStream/writeStream, output modes, triggers, watermarks, and checkpointing. Expect conceptual questions about when to use each output mode.

Key Topics

readStream, writeStream, Output Modes, Triggers, Watermarks, Checkpoints

Must-Know Concepts

  • Structured Streaming treats streaming data as an unbounded table — new rows are appended continuously and queries produce incremental results
  • readStream starts a streaming source: spark.readStream.format('kafka').option(...).load(). writeStream starts the output: df.writeStream.format('delta').start()
  • Output modes: append (only new rows, the default), complete (rewrites the full result table; requires an aggregation), update (only rows changed since the last trigger; without an aggregation it behaves like append)
  • Triggers: trigger(processingTime='10 seconds') for micro-batch, trigger(availableNow=True) for process-all-then-stop, trigger(continuous='1 second') for low-latency continuous processing (experimental)
  • Checkpointing: stores streaming progress (offsets, state) for fault tolerance. Set via .option('checkpointLocation', '/path'). Required for exactly-once guarantees
  • Watermarks: handle late-arriving data with withWatermark('eventTime', '10 minutes'). Data arriving later than the watermark may be dropped. A watermark is required for event-time aggregations in append mode and lets Spark discard old state

Common Exam Traps

append mode does NOT support aggregations without a watermark — because Spark cannot guarantee that aggregation results won't change as new data arrives
complete mode rewrites the entire result table on every trigger — only use for aggregations where the full result fits in memory
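Deleting a checkpoint discards the stream's saved offsets and state: on restart the query begins again from the source's configured starting position, which often means reprocessing all available data. Checkpoints are not optional in production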
You cannot sort a streaming DataFrame in append mode — sorting requires seeing all the data, which contradicts the append-only model
Quick Check: Spark Structured Streaming

A developer is writing a Structured Streaming application that aggregates incoming events by category and outputs running totals. Which output mode should be used?

Domain 5: 12% of exam

Spark Performance and Optimization

At 12%, this domain tests your ability to diagnose and fix performance issues. Covers shuffle operations, data skew, partitioning strategies, caching, broadcast joins, Adaptive Query Execution, and reading the Spark UI. Expect scenario-based questions: 'a job is slow — what's the most likely cause?'

Key Topics

Spark UI, Shuffle Operations, Partitioning, Caching, Broadcast Joins, Adaptive Query Execution

Must-Know Concepts

  • Shuffle operations (wide transformations) are the primary performance bottleneck in Spark — data must be redistributed across the network between executors
  • spark.sql.shuffle.partitions defaults to 200 — reduce for small datasets (to avoid tiny partitions) and increase for large datasets (to avoid oversized partitions)
  • Data skew: when one partition has significantly more data than others, causing a single task to run much longer. Fix with salting, repartitioning, or broadcast join
  • Broadcast joins eliminate shuffles for the large table — the smaller table is replicated to all executors. Default auto-broadcast threshold is 10MB
  • Caching: use cache()/persist() when a DataFrame is reused multiple times. Caching a DataFrame used only once wastes memory. unpersist() to release
  • Adaptive Query Execution (AQE): runtime optimization enabled by default. Coalesces shuffle partitions, handles skewed joins, and converts joins to broadcast when one side is small

Common Exam Traps

Adding more executors does NOT fix data skew — the bottleneck is a single oversized partition on one executor. Repartition or salt the skewed key instead
Disk spill in the Spark UI means a task's data exceeds executor memory and is written to disk — fix by increasing executor memory or reducing partition size, not by adding more nodes
coalesce() before a join or groupBy can create a bottleneck — you're reducing parallelism right before an expensive operation. Always repartition() before wide transformations if needed
Caching a DataFrame that's read from Parquet/Delta may not help — these formats already support predicate pushdown and column pruning. Caching prevents pushdown optimizations
Quick Check: Spark Performance and Optimization

A Spark job's stage has 200 tasks, but one task takes 30 minutes while the rest finish in 2 minutes. What is the most likely cause?

Domain 6: 12% of exam

Delta Lake and Spark Ecosystem

At 12%, this domain tests your knowledge of Delta Lake features and the broader Spark ecosystem. Covers ACID transactions, time travel, schema enforcement and evolution, MERGE INTO, OPTIMIZE, VACUUM, and how Delta Lake integrates with the Spark processing engine. Know the SQL syntax for Delta-specific operations.

Key Topics

Delta Lake, Transaction Log, Time Travel, Schema Enforcement, Schema Evolution, MERGE INTO

Must-Know Concepts

  • Delta Lake adds ACID transactions to Spark via a transaction log (_delta_log directory) stored alongside the Parquet data files
  • Time travel: query historical versions with SELECT * FROM t VERSION AS OF 3 or TIMESTAMP AS OF '2026-01-01'. Use DESCRIBE HISTORY t to see all versions
  • Schema enforcement: by default, writes that don't match the table schema are rejected. This prevents data corruption from schema mismatches
  • Schema evolution: enable with .option('mergeSchema', 'true') on write or spark.databricks.delta.schema.autoMerge.enabled = true globally. New columns are added automatically
  • MERGE INTO: upsert pattern — MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
  • OPTIMIZE compacts small files into larger ones. VACUUM removes old data files beyond the retention period (default 7 days). Both are maintenance operations

Common Exam Traps

VACUUM with retention shorter than 7 days requires setting spark.databricks.delta.retentionDurationCheck.enabled = false. Doing so breaks time travel for versions older than the retention period
MERGE INTO requires a deterministic match — if multiple source rows match a single target row, the MERGE fails. Deduplicate the source first
Delta tables store data as Parquet files but add a transaction log. You cannot skip the transaction log and read Delta data as plain Parquet — you'll get inconsistent results
Schema enforcement applies on write, not read. Reading a Delta table always succeeds; writing with mismatched schema fails unless mergeSchema is enabled
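
The MERGE INTO upsert shape is worth memorizing as text. The helper below only assembles the statement as a string, so it runs without Delta Lake installed; build_merge_sql and the table names are hypothetical, and the names are interpolated without validation, so this is a study aid rather than production code.

```python
def build_merge_sql(target: str, source: str, key: str) -> str:
    """Return a Delta Lake upsert statement; all three names are caller-supplied placeholders."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


merge_sql = build_merge_sql("customers", "customer_updates", "customer_id")
print(merge_sql)
```

On a Delta-enabled session the string would be passed to spark.sql(merge_sql), after deduplicating the source on the key (for example with dropDuplicates) so that no target row matches more than one source row.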
Quick Check: Delta Lake and Spark Ecosystem

After running VACUUM on a Delta table with the default retention period, a user cannot query the table as of 10 days ago using time travel. Why?

Key Spark Concepts Compared

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

Transformations vs Actions

Use Transformations when…

Build up a processing pipeline: select(), filter(), join(), groupBy(), withColumn(), orderBy(). These are lazy and build a DAG without executing anything.

Use Actions when…

Trigger actual computation and produce results: show(), collect(), count(), first(), head(), take(), and writes such as write.save(). These execute the entire DAG.

Exam trap

Calling a transformation returns a new DataFrame but processes zero rows. Only when an action is called does Spark execute the plan. cache() is a lazy transformation — it doesn't cache until the next action runs.

repartition() vs coalesce()

Use repartition() when…

You need to increase the number of partitions or want evenly distributed partitions (for a join or groupBy). Performs a full shuffle — expensive but guarantees even distribution.

Use coalesce() when…

You need to reduce the number of partitions (e.g., before writing to avoid too many small files). Avoids a full shuffle by combining adjacent partitions — cheaper but can create uneven sizes.

Exam trap

coalesce() can only reduce partitions — calling coalesce(100) on a DataFrame with 50 partitions will silently keep 50 partitions. Use repartition() to increase. Also, coalesce before a wide transformation may create a bottleneck due to uneven partition sizes.

cache() vs persist()

Use cache() when…

Quick shorthand for caching a DataFrame in MEMORY_AND_DISK. Use when the default storage level is acceptable and you want simple syntax.

Use persist() when…

You need fine-grained control over storage level: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER (Scala/Java API only; PySpark has no _SER levels), or with replication (_2 variants).

Exam trap

Both are lazy — neither actually caches data until an action is triggered. cache() is equivalent to persist(StorageLevel.MEMORY_AND_DISK). With MEMORY_ONLY, partitions that don't fit are recomputed on each access rather than spilled to disk.

Narrow Transformations vs Wide Transformations

Use Narrow Transformations when…

Operations where each input partition contributes to at most one output partition: select, filter, withColumn, map, union. No shuffle required — fast.

Use Wide Transformations when…

Operations where input partitions contribute to multiple output partitions: join, groupBy, repartition, distinct, orderBy. Require data shuffle across the network — expensive.

Exam trap

Wide transformations create new stage boundaries in the execution plan. As a rule of thumb for a linear pipeline, the number of stages equals the number of shuffles plus one; joins with two shuffled inputs, broadcast joins, and AQE can change the count. Minimizing wide transformations is the primary strategy for Spark performance tuning.

createOrReplaceTempView() vs createOrReplaceGlobalTempView()

Use createOrReplaceTempView() when…

Register a DataFrame as a temporary view visible only within the current SparkSession. Most common pattern for running SQL queries against DataFrames.

Use createOrReplaceGlobalTempView() when…

Register a DataFrame as a global temporary view visible across all SparkSessions in the same application. Accessed via the global_temp database prefix.

Exam trap

Temp views are session-scoped — they disappear when the session ends. Global temp views require the prefix: SELECT * FROM global_temp.my_view. Forgetting the prefix causes a 'table not found' error.

Broadcast Join vs Sort-Merge Join

Use Broadcast Join when…

When one DataFrame is small enough to fit in executor memory (default < 10MB). The small table is broadcast to all nodes, avoiding the shuffle of the large table entirely.

Use Sort-Merge Join when…

When both DataFrames are large. Both sides are shuffled and sorted by the join key, then merged. More expensive due to shuffle and sort but works for any data size.

Exam trap

Spark auto-broadcasts tables under 10MB (spark.sql.autoBroadcastJoinThreshold). For larger tables, use broadcast() hint explicitly. Broadcasting a table that's too large causes OOM. In streaming, broadcast joins are only supported when the broadcast side is a batch DataFrame.

Top Mistakes to Avoid

Confusing transformations (lazy, build a plan) with actions (trigger execution) — for example, thinking filter() processes data immediately
Using union() when column order differs between DataFrames — union() matches by position, not name. Use unionByName() for name-based matching
Forgetting that window functions cannot be used directly in WHERE clauses — they must be wrapped in a subquery or CTE
Not understanding that cache() is lazy — calling cache() alone does nothing. An action must follow to actually materialize the cache
Using coalesce() to try to increase partitions — coalesce() can only reduce partition count. Use repartition() to increase
Thinking more executors fix data skew — the bottleneck is one oversized partition on a single executor. Repartition or salt the data instead
Forgetting the global_temp prefix when querying global temporary views — SELECT * FROM my_view fails; SELECT * FROM global_temp.my_view works
Using Python UDFs when built-in functions exist — UDFs are 2-10x slower due to Python serialization overhead and cannot be optimized by Catalyst
Confusing repartition (full shuffle, even distribution) with coalesce (no full shuffle, can be uneven) — use repartition before joins for even distribution
Not deduplicating source data before MERGE INTO — duplicate source matches on the same target row cause the MERGE to fail
Assuming inferSchema is free for CSV — it reads the entire file twice (once to infer types, once to load). Always provide an explicit schema for large CSV files
Deleting a Structured Streaming checkpoint and expecting the stream to resume: without the checkpoint, offsets and state are lost and the query restarts from the source's starting position, typically reprocessing everything

Exam-Ready Checklist

Can explain the Spark driver/executor architecture and how jobs break into stages and tasks
Know transformations vs actions cold — can instantly classify any DataFrame operation
Can write DataFrame transformations from memory: select, filter, join, groupBy, withColumn, withColumnRenamed, drop, union, distinct
Understand lazy evaluation: know what triggers computation and what just builds a plan
Can configure DataFrameReader for CSV, JSON, Parquet, and Delta with correct options (header, inferSchema, schema, multiLine)
Know all join types (inner, left, right, full, cross, semi, anti) and when each is appropriate
Can write window functions in both PySpark and SQL syntax with correct PARTITION BY and ORDER BY
Understand broadcast joins: when to use them, the auto-broadcast threshold, and how to force them
Know repartition vs coalesce: when each is appropriate and their performance tradeoffs
Can write MERGE INTO statements with correct syntax for upserts
Understand Structured Streaming output modes and when each applies
Can diagnose performance issues: data skew, shuffle overhead, memory spill, and over-partitioning
Scored 80%+ on at least two full practice exams
Reviewed all incorrect answers and understand why the right answer is right
Can complete the exam within time: average 2 minutes per question
