CertPrepNow
DatabricksSpark Dev Associate6 domains

Spark Dev Associate Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Associate Developer for Apache Spark exam.

General Exam Tips

  • 1.Read ALL answer choices before selecting — many questions have two plausible-looking options where one has a subtle method signature or behavioral difference
  • 2.Mentally classify every operation as a transformation (lazy) or action (eager) before reasoning about when execution occurs
  • 3.No documentation access during the exam — you must know default values (shuffle.partitions = 200, autoBroadcastJoinThreshold = 10MB) cold
  • 4.When a question asks 'what happens when…', trace the execution: is an action called? Is there a shuffle? Where is the stage boundary?
  • 5.Code-snippet questions often test one subtle difference between two nearly identical answers — the wrong option uses an invalid method name or slightly wrong syntax
  • 6.Time budget: 45 questions in 90 minutes = 2 minutes each. Flag hard questions and return — don't stall
  • 7.Always guess if unsure — no penalty for wrong answers
  • 8.~60-70% of questions involve code: predicting output, identifying errors, or choosing the correct implementation. Practice reading unfamiliar PySpark code quickly
  • 9.When the question mentions 'most efficient' or 'best approach', the answer almost always avoids unnecessary shuffles
  • 10.For streaming output mode questions, first ask: does the query have an aggregation? If yes, complete or update. If no, append only
Domain 134% of exam

Apache Spark DataFrame API

Must-Know Facts

  • withColumn() REPLACES the column if the name already exists — it does not always add. This produces silent data corruption bugs
  • union() matches columns by POSITION not by name. If two DataFrames have the same columns in different order, union() produces silently wrong results
  • unionByName(df2, allowMissingColumns=True) is the safe way to combine DataFrames with differently ordered or mismatched columns
  • filter() and where() are completely identical — the exam uses both and expects you to treat them as interchangeable
  • select() on a string ('name') and on col('name') both return a DataFrame — both syntaxes are valid and produce identical results
  • After groupBy().agg(), ONLY the grouping columns and aggregated columns remain — all other columns are silently dropped
  • distinct() removes rows that are fully identical. dropDuplicates(['col']) removes rows with the same value in specified columns only — these behave very differently
  • isNull() and isNotNull() are the correct null checks — never use == None or != None in Spark column expressions
  • array_explode() is NOT a valid PySpark function — the correct function is explode(). The exam uses array_explode() as a distractor
  • sample(withReplacement, fraction) does NOT guarantee exactly fraction * count rows — the result is probabilistic. withReplacement=True allows the same row to appear multiple times
  • DataFrameReader uses .load() to execute; DataFrameWriter uses .save() to execute. These are not interchangeable

Common Traps

TrapCalling withColumn('existing_col', new_expr) to add a column
RealitywithColumn() replaces the column if the name already exists. Use withColumnRenamed() to rename, and choose a new name in withColumn() to avoid overwriting
TrapUsing union() to combine DataFrames with columns in different orders
Realityunion() aligns by position — if DataFrame A has (id, name) and B has (name, id), the result maps A.id to B.name and vice versa. Always use unionByName() when column order may differ
TrapAssuming df.filter() or df.select() triggers computation
RealityBoth are lazy transformations. Nothing executes until an action (show, collect, count, write, head, first, take) is called
TrapJoining on a string column name ('id') vs a column expression (df1['id'] == df2['id'])
RealityJoining with on='id' (string) produces ONE 'id' column. Joining with df1['id'] == df2['id'] (expression) produces TWO 'id' columns, causing downstream ambiguity errors
TrapUsing .collect() to retrieve a small sample for inspection
Realitycollect() returns ALL rows to the driver — it has no row limit parameter. For n rows use head(n), take(n), or limit(n).collect(). Calling collect() on a large DataFrame crashes the driver
TrapThinking coalesce() can increase partitions
Realitycoalesce() can only reduce partitions. Calling coalesce(100) on a 50-partition DataFrame silently keeps 50 partitions. Use repartition() to increase
TrapFiltering with == None instead of isNull()
RealityIn Spark column expressions, == None always evaluates to null (not True/False) due to SQL null semantics. Use col('x').isNull() and col('x').isNotNull()
TrapAssuming orderBy() result persists across write operations
RealityorderBy() sorts globally but the sort is not guaranteed to persist in the output files unless you use coalesce(1) first. Multiple output partitions can arrive in any order

Confusing Pairs

union()unionByName()

union() = matches by column POSITION (unsafe if schemas differ in order). unionByName() = matches by column NAME (safe, use allowMissingColumns=True for partial schema overlap). Exam-critical: union() is a narrow transformation (no shuffle). Both are transformations, not actions

distinct()dropDuplicates()

distinct() removes rows where ALL column values are identical. dropDuplicates(['col1', 'col2']) removes rows with duplicate values in the specified columns only — other columns can differ. dropDuplicates() with no args is equivalent to distinct()

head(n)limit(n)

head(n) is an ACTION — it triggers execution and returns a Python list of Row objects. limit(n) is a TRANSFORMATION — it returns a new DataFrame and triggers nothing. limit(n).collect() is equivalent to head(n) but much more verbose

explode()posexplode()

explode(col) flattens an array into one row per element with just the value. posexplode(col) produces TWO columns: pos (position index) and col (value). Neither is called array_explode() — that is not a valid function

when().otherwise()coalesce()

when(condition, value).otherwise(default) = conditional column based on a Boolean expression (like CASE WHEN). coalesce(col_a, col_b, col_c) = returns the first non-null value from a list of columns. Different problem shapes

na.fill()na.drop()

na.fill(value) or na.fill({'col': value}) replaces nulls with the given value. na.drop() removes rows containing ANY null. na.drop(subset=['col']) removes rows where specified columns are null. Subset parameter works for both

Scenario Tips

If the question asks about:

Question shows df1.union(df2) and asks what the result contains, given that the DataFrames have columns in different orders

Answer:

The result maps columns by position — if A = (id, name) and B = (name, id), the result column named 'id' will contain A's ids but B's names. The schema name comes from the LEFT DataFrame

Distractor to avoid:

Many assume union() is smart enough to match by name like unionByName(). It is not

If the question asks about:

Question asks which operation would be used to retrieve exactly the first 5 rows and bring them to the driver

Answer:

head(5) or take(5) — both are actions that return a Python list. df.limit(5) is a transformation that does NOT bring data to the driver

Distractor to avoid:

collect(5) looks right but collect() takes no arguments. limit(5) is wrong because it returns a DataFrame, not data

If the question asks about:

Question presents a join using df1.join(df2, df1['id'] == df2['id'], 'inner') and asks why a subsequent .select('id') fails

Answer:

Joining via expression (df1['id'] == df2['id']) keeps both id columns, causing an ambiguity error when selecting 'id'. Fix with .drop(df2['id']) after the join or use on='id' (string) instead

Distractor to avoid:

Candidates often think the join itself fails — the join succeeds, but the ambiguity surfaces only when referencing the column by name afterward

If the question asks about:

Question asks what happens when you call df.withColumn('price', col('price') * 1.1) where 'price' already exists

Answer:

The existing 'price' column is replaced by the new expression. No error occurs, no new column is added

Distractor to avoid:

Common wrong answer: a second 'price' column is added, causing ambiguity. This would only happen if you used .select() to add a duplicate column name

If the question asks about:

Question presents groupBy('dept').agg(sum('salary')) and asks which columns are in the result

Answer:

Only 'dept' and 'sum(salary)' — all other columns (name, hire_date, etc.) are gone. Use a window function if you need to keep all columns while computing per-group aggregates

Distractor to avoid:

Many assume all original columns are preserved after groupBy, like a SQL query with a GROUP BY that allows referencing non-aggregated columns

Last-Minute Facts

1filter() and where() are identical — both accepted as correct answers
2collect() has no limit parameter — it returns ALL rows
3head(n) is an action; limit(n) is a transformation
4union() = position-based; unionByName() = name-based
5withColumn() overwrites if column name exists
6array_explode() is INVALID — correct function is explode()
7dropDuplicates() with no args = distinct()
8After groupBy().agg(), only grouped + aggregated columns remain
9DataFrameReader: .load() | DataFrameWriter: .save()
10== None in Spark expressions is always null — use .isNull()
Domain 217% of exam

Apache Spark Architecture

Must-Know Facts

  • The driver is a single point of failure — if it crashes, the entire Spark application fails. There is no automatic promotion of a worker to driver
  • Executor failures are recoverable via lineage recomputation — Spark recomputes lost partitions from the original data using the DAG
  • Each task processes exactly ONE partition on ONE core — the number of tasks in a stage equals the number of partitions
  • Parallelism = number of slots (cores) across ALL executors, not the number of executors. 4 executors × 2 cores = 8-way parallelism
  • Stage boundaries are created at shuffle points — every wide transformation (join, groupBy, repartition, distinct, orderBy) creates a new stage
  • SparkSession is the unified entry point since Spark 2.0 — it replaces SparkContext, SQLContext, and HiveContext
  • Client mode: driver runs on submission machine. Cluster mode: driver runs on a cluster node. Client mode is for interactive notebooks; cluster mode for production
  • Garbage collection pauses affect task execution times — excessive GC shows up as long task times with low CPU usage in the Spark UI

Common Traps

TrapAssuming Spark automatically recovers from a driver failure
RealityDriver failure kills the application. External cluster managers (YARN, Kubernetes) can restart the application from scratch, but Spark itself has no automatic driver failover
TrapThinking that adding more executors fixes data skew
RealityData skew is a per-partition problem — one executor has an oversized partition. More executors don't help because the skewed partition still runs on one core. Fix with salting, repartitioning, or broadcast joins
TrapConfusing executor memory isolation with shared memory
RealityExecutors do NOT share memory. Each executor's memory is isolated. A skewed partition that exceeds one executor's memory causes OOM even if all other executors have plenty of free memory
TrapThinking client mode is safer because the driver is local
RealityClient mode ties the driver to the submission machine — if the machine or network disconnects, the job fails. Cluster mode is better for production because the driver runs on a stable cluster node

Confusing Pairs

Narrow TransformationWide Transformation

Narrow = each input partition produces at most one output partition, no shuffle (select, filter, withColumn, map, union, coalesce). Wide = input partitions contribute to multiple output partitions, requires shuffle (join, groupBy, repartition, distinct, orderBy). Wide transformations create stage boundaries

JobStageTask

Job = triggered by one action (e.g., count()). Stage = a set of transformations with no shuffle between them (bounded by wide transformations). Task = one unit of work processing one partition on one core. Count: 1 job → N stages → M tasks per stage

Client ModeCluster Mode

Client = driver runs on the machine submitting the job (your laptop or notebook server). Cluster = driver runs on a worker node in the cluster. Use client for interactive dev, cluster for production. In client mode, network latency to the driver is higher

Scenario Tips

If the question asks about:

Question asks how many stages a job has given a chain of transformations

Answer:

Count the wide transformations (joins, groupBys, distinct, orderBy, repartition) and add 1. For example: filter → groupBy → join produces 3 stages

Distractor to avoid:

Many count each transformation as a stage — only WIDE transformations (shuffle boundaries) create stage separations

If the question asks about:

Question describes one task running 10x longer than all others in the same stage

Answer:

Data skew — one partition is disproportionately large. Adding more executors won't help. Fix by salting the skewed key, using broadcast join, or repartitioning on a higher-cardinality key

Distractor to avoid:

Insufficient CPU or network saturation would affect ALL tasks uniformly, not just one

If the question asks about:

Question asks what happens after an executor node crashes mid-job

Answer:

Spark detects the failure, marks the tasks as failed, and reschedules them on surviving executors, recomputing lost partitions from lineage. The job continues

Distractor to avoid:

Some think the entire job fails — only the driver failure is unrecoverable

Last-Minute Facts

1Driver failure = application failure (unrecoverable by Spark itself)
2Executor failure = Spark recomputes lost partitions (recoverable)
3Tasks per stage = number of partitions processed in that stage
4Parallelism = total cores across all executors
5Each wide transformation = one new stage boundary
6SparkSession replaces SparkContext + SQLContext + HiveContext
7Client mode = driver on submission machine
8Cluster mode = driver on a worker node (production best practice)
Domain 317% of exam

Apache Spark SQL

Must-Know Facts

  • Window functions CANNOT appear in WHERE clauses — must wrap in a subquery or CTE: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1
  • Global temp views require the global_temp prefix: SELECT * FROM global_temp.my_view. Omitting the prefix causes 'table not found'
  • spark.sql() returns a DataFrame — you can chain any DataFrame methods on the result
  • Temp views are session-scoped (disappear when session ends). Global temp views persist across sessions within the same application
  • ORDER BY in Spark SQL causes a global sort with a full shuffle. SORT BY sorts within each partition without a full shuffle
  • ROW_NUMBER() gives unique sequential numbers (no ties). RANK() leaves gaps after ties. DENSE_RANK() has no gaps after ties
  • CTEs (WITH cte AS (...)) are executed lazily like subqueries — no performance benefit over subqueries unless Catalyst folds them
  • CREATE GLOBAL TEMPORARY VIEW syntax — not CREATE OR REPLACE GLOBAL TEMP VIEW (the latter is invalid)

Common Traps

TrapUsing window functions directly in WHERE clauses
RealityWHERE rn = 1 fails if rn is a window function applied in the same SELECT. You must wrap it: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1. This is tested frequently
TrapQuerying a global temp view without the global_temp prefix
RealitySELECT * FROM my_view fails with 'table not found'. The correct query is SELECT * FROM global_temp.my_view. The database name is literal 'global_temp'
TrapTreating spark.sql() as an action that executes immediately
Realityspark.sql() is lazy — it compiles and optimizes the SQL but doesn't execute until an action (show, collect, count) is called on the resulting DataFrame
TrapConfusing RANK and DENSE_RANK behavior with ties
RealityWith values 100, 100, 90: RANK() produces 1, 1, 3 (gap after tie). DENSE_RANK() produces 1, 1, 2 (no gap). ROW_NUMBER() produces 1, 2, 3 (arbitrary tiebreaker). Exam uses tie scenarios to distinguish them

Confusing Pairs

createOrReplaceTempView()createOrReplaceGlobalTempView()

Temp view = session-scoped, accessed as SELECT * FROM my_view. Global temp view = application-scoped (across sessions), accessed as SELECT * FROM global_temp.my_view. The global_temp prefix is required and is NOT configurable

ORDER BYSORT BY

ORDER BY = global sort, triggers a shuffle to a single partition, expensive. SORT BY = per-partition sort, no global ordering guaranteed, no shuffle. Exam tests this difference — SORT BY does NOT produce a globally sorted result

RANK()DENSE_RANK()

RANK() = leaves gaps after ties (1, 1, 3). DENSE_RANK() = no gaps (1, 1, 2). ROW_NUMBER() = always unique, arbitrary tiebreaker (1, 2, 3). Use DENSE_RANK when you need consecutive rank numbers

Scenario Tips

If the question asks about:

Question presents SQL that applies ROW_NUMBER() in the WHERE clause directly and asks if it's valid

Answer:

Invalid — window functions cannot be used in WHERE. Must wrap in subquery: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1

Distractor to avoid:

The syntax looks almost right and trips up candidates who haven't internalized this restriction

If the question asks about:

Question asks you to keep only the most recent record per customer_id from a table

Answer:

SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn FROM customers) WHERE rn = 1. ROW_NUMBER + subquery pattern is the standard deduplication approach

Distractor to avoid:

SELECT DISTINCT customer_id or GROUP BY customer_id won't return all columns of the most recent record

If the question asks about:

Question asks where global_temp views are accessible

Answer:

Accessible across all SparkSessions within the same Spark application via the global_temp database prefix. They are NOT cross-application or persistent after the application ends

Distractor to avoid:

Some think global temp views persist between applications or are stored in a metastore — they are still in-memory and application-scoped

Last-Minute Facts

1Window functions in WHERE = invalid — must use subquery or CTE
2Global temp view access: SELECT * FROM global_temp.view_name
3spark.sql() returns a DataFrame (lazy, chainable)
4Temp view = session-scoped | Global temp view = application-scoped
5RANK has gaps, DENSE_RANK has no gaps, ROW_NUMBER is always unique
6ORDER BY = full shuffle (global sort) | SORT BY = per-partition (no global order)
7CREATE GLOBAL TEMPORARY VIEW (not CREATE GLOBAL TEMP VIEW)
Domain 412% of exam

Spark Performance and Optimization

Must-Know Facts

  • spark.sql.shuffle.partitions defaults to 200 — too high for small datasets (200 tiny partitions = massive scheduling overhead), too low for large datasets
  • spark.sql.autoBroadcastJoinThreshold defaults to 10MB — tables smaller than this are auto-broadcast. Set to -1 to disable auto-broadcasting
  • cache() uses MEMORY_AND_DISK_DESER by default (since Spark 3.0). cache() is lazy — does not materialize until the next action executes
  • unpersist() is EAGER — immediately releases cached data without needing an action
  • Disk spill in Spark UI = executor ran out of memory and wrote shuffle data to disk. Fix by increasing executor memory or reducing partition size, NOT by adding nodes
  • Predicate pushdown: Spark pushes filter() operations closer to the data source. UDFs block predicate pushdown — Catalyst cannot push filters through user-defined functions
  • Adaptive Query Execution (AQE) is enabled by default in Spark 3+ — it coalesces small shuffle partitions, converts sort-merge joins to broadcast if runtime stats show small table, and handles join skew
  • repartition(col) puts all rows with the same column value in the same partition — useful before joins to co-locate matching keys

Common Traps

TrapAdding more executors to fix a slow job caused by data skew
RealitySkew means one partition is much larger than others. That one partition runs on one executor — adding more executors gives nothing to the skewed task. Fix with salting, repartition on a different key, or broadcast join
TrapUsing coalesce() before a join to reduce partitions
Realitycoalesce() merges adjacent partitions without redistributing data — it creates uneven partition sizes. Using coalesce() right before a groupBy or join concentrates all the work on a few tasks. Use repartition() before wide transformations when you need even distribution
TrapCaching a DataFrame that is read from Parquet or Delta and used only once
RealityParquet and Delta already support predicate pushdown and column pruning. Caching prevents these source-level optimizations. Only cache DataFrames that are read multiple times in the same job
TrapThinking AQE only matters for large jobs
RealityAQE is enabled by default and helps with small jobs too — it coalesces 200 empty/tiny shuffle partitions down to the actual number needed. AQE changes plans at runtime after each shuffle completes
TrapAssuming Python UDFs benefit from Tungsten and Catalyst
RealityPython UDFs are opaque to Catalyst — it cannot optimize through them or push predicates past them. Tungsten code generation also bypasses Python UDFs. Every row crosses the JVM-Python boundary twice

Confusing Pairs

repartition(n)coalesce(n)

repartition(n) = full shuffle, produces exactly n evenly-distributed partitions, can increase or decrease. coalesce(n) = no full shuffle, merges adjacent partitions, can ONLY decrease, produces uneven partitions. Use repartition before joins, use coalesce before writes to reduce output files

cache()persist()

cache() = shorthand for persist(MEMORY_AND_DISK_DESER), one storage level option. persist(StorageLevel.X) = lets you choose: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, plus _2 replicated variants. Both are LAZY. unpersist() is EAGER. You must call unpersist() before changing storage level

Broadcast JoinSort-Merge Join

Broadcast = small table replicated to all executors (<10MB auto, or use broadcast() hint), eliminates shuffle of large table. Sort-Merge = both sides shuffled and sorted by join key, works for any size, expensive. Exam tests: which side to broadcast (always the SMALLER one), OOM risk of broadcasting too-large tables

Scenario Tips

If the question asks about:

Question describes spark.sql.shuffle.partitions = 200 for a 10,000-row dataset and asks about the impact

Answer:

Poor performance — 200 partitions for 10K rows creates ~50 rows per partition. The scheduler overhead managing 200 tasks far exceeds the computation cost. Reduce to a small number like 10-20

Distractor to avoid:

Trap answer: 'optimal, 200 is the recommended value'. 200 is the default, not universally optimal

If the question asks about:

Question asks which change best reduces shuffle cost when joining a 200GB table with a 5MB lookup table

Answer:

Broadcast the 5MB lookup table using broadcast() hint or rely on auto-broadcast (5MB < 10MB threshold). This eliminates the shuffle of the 200GB table entirely

Distractor to avoid:

Increasing shuffle.partitions or adding executors helps somewhat but doesn't eliminate the shuffle. Broadcast is the optimal answer here

If the question asks about:

Question asks which scenario benefits most from calling cache()

Answer:

A DataFrame that is used in multiple separate actions within the same job — for example, read once and then called with count() and show() separately. Each action without cache() recomputes the full lineage

Distractor to avoid:

A DataFrame used in a single write() call does not benefit from cache() — it adds memory overhead with no performance gain

Last-Minute Facts

1Default shuffle partitions: 200 (spark.sql.shuffle.partitions)
2Default auto-broadcast threshold: 10MB (spark.sql.autoBroadcastJoinThreshold)
3cache() default storage level: MEMORY_AND_DISK_DESER (since Spark 3.0)
4cache() is LAZY — unpersist() is EAGER
5repartition() = shuffle, can increase or decrease, even distribution
6coalesce() = no shuffle, can only DECREASE, uneven distribution
7AQE enabled by default (spark.sql.adaptive.enabled = true)
8Python UDFs bypass both Tungsten and Catalyst optimizer
9Disk spill = executor ran out of memory during shuffle
Domain 512% of exam

Delta Lake and Spark Ecosystem

Must-Know Facts

  • Delta Lake stores data as Parquet files PLUS a transaction log directory (_delta_log). You cannot skip the transaction log and read Delta as plain Parquet — you'll get inconsistent or missing data
  • Schema enforcement (default): writes with extra columns or wrong types FAIL. Schema evolution: enable with .option('mergeSchema', 'true') to allow new columns on write
  • VACUUM default retention is 7 days (168 hours). After VACUUM, time travel to versions older than the retention period is impossible
  • VACUUM with retention < 7 days requires setting delta.retentionDurationCheck.enabled = false. This is intentionally hard to do accidentally
  • MERGE INTO fails if multiple source rows match the same target row — deduplicate the source before running MERGE
  • DESCRIBE HISTORY table_name shows version, timestamp, operation (WRITE, MERGE, DELETE, OPTIMIZE), and user for each version
  • OPTIMIZE compacts small files. It does NOT change the data — queries before and after OPTIMIZE produce identical results
  • Structured Streaming can write to and read from Delta tables natively — Delta is the recommended sink for streaming pipelines

Common Traps

TrapReading a Delta table path as Parquet to bypass the transaction log
RealityReading Delta data as plain Parquet reads raw data files without applying the transaction log, which can include deleted or overwritten data. Always read Delta tables using format('delta') or by table name
TrapAssuming schema enforcement adds new columns automatically
RealitySchema enforcement is the DEFAULT and it REJECTS writes with new columns. You must explicitly enable schema evolution with .option('mergeSchema', 'true') to allow new columns. The default behavior protects the table schema
TrapRunning VACUUM and expecting time travel to still work for old versions
RealityVACUUM removes physical data files beyond the retention period. After VACUUM with 7-day retention, you cannot query versions older than 7 days even though DESCRIBE HISTORY shows them — the data files are gone
TrapUsing MERGE INTO without deduplicating the source
RealityIf two source rows match the same target row, MERGE throws an error: 'MERGE failed because multiple source rows matched the same target row'. The source must have at most one row per match key

Confusing Pairs

Schema EnforcementSchema Evolution

Schema enforcement = DEFAULT, rejects writes that don't match the table schema (protects integrity). Schema evolution = opt-in, add .option('mergeSchema', 'true') or set autoMerge globally, allows new columns to be added automatically. Enforcement applies on write, not read

OPTIMIZEVACUUM

OPTIMIZE = compacts small files into larger ones for faster reads. Does NOT delete data or affect query results. VACUUM = deletes old data files no longer referenced by the current version (or beyond retention). VACUUM affects time travel — OPTIMIZE does not

VERSION AS OFTIMESTAMP AS OF

VERSION AS OF n = query the exact version number (integer). TIMESTAMP AS OF 'YYYY-MM-DD' = query the table state at a specific point in time. Both work for time travel reads. Use DESCRIBE HISTORY to find the version number for a given timestamp

Scenario Tips

If the question asks about:

Question: after VACUUM with default settings, a user tries to query the table as it was 10 days ago. What happens?

Answer:

The query fails — VACUUM with the 7-day default retention deleted the data files needed for time travel beyond 7 days. DESCRIBE HISTORY still shows the version, but the underlying Parquet files are gone

Distractor to avoid:

Time travel fails not because the version doesn't exist in the log, but because the physical files were deleted. The transaction log itself is not deleted by VACUUM

If the question asks about:

Question asks what happens when you write a DataFrame with a new column to a Delta table using default settings (no mergeSchema)

Answer:

The write fails with a schema mismatch error. To succeed, add .option('mergeSchema', 'true') to allow schema evolution and add the new column to the table schema

Distractor to avoid:

Wrong answer: the new column is silently dropped. Delta fails fast — it does not silently drop data

If the question asks about:

Question: what is the correct syntax for upserting records — updating matches and inserting non-matches — in Delta Lake?

Answer:

MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *. This is the only correct upsert syntax in Delta Lake

Distractor to avoid:

INSERT OR UPDATE, UPSERT INTO, INSERT OVERWRITE — none of these are valid Spark SQL or Delta Lake syntax

Last-Minute Facts

1Delta table = Parquet files + _delta_log transaction directory
2Schema enforcement is the DEFAULT — rejects schema mismatches on write
3mergeSchema = true enables schema evolution (new columns allowed)
4VACUUM default retention = 7 days (168 hours)
5VACUUM removes files — breaks time travel for older versions
6OPTIMIZE compacts files, does NOT affect query results
7MERGE fails if multiple source rows match one target row
8DESCRIBE HISTORY shows version number, timestamp, and operation
Domain 68% of exam

Spark Structured Streaming

Must-Know Facts

  • Output modes: append (new rows only, default), complete (full result rewritten every trigger, requires aggregation), update (only changed rows, requires aggregation)
  • append mode does NOT support aggregations without a watermark — because Spark cannot know if an aggregation result is final until it has seen all data
  • complete mode rewrites the ENTIRE result table every trigger — only use when the full result fits in executor memory and the sink supports overwriting
  • Checkpoints store streaming state and progress. Deleting a checkpoint causes the stream to reprocess ALL data from the beginning — checkpoints are not optional in production
  • Watermarks tell Spark how late data can arrive: .withWatermark('event_time', '10 minutes'). Events more than 10 minutes late are dropped. Required for windowed aggregations
  • Joining two streaming DataFrames requires watermarks on BOTH sides
  • Trigger.AvailableNow() (formerly once=True) processes all available data in one or more micro-batches and stops — the modern bridge between batch and streaming
  • You CANNOT sort a streaming DataFrame in append mode — sort requires seeing all data, which contradicts the unbounded table model

Common Traps

TrapUsing append output mode with an aggregation
Realityappend mode without a watermark rejects aggregations — Spark cannot guarantee that aggregated results won't change as new data arrives. Use complete or update mode for aggregations
TrapDeleting the checkpoint directory and expecting the stream to resume where it left off
RealityWithout a checkpoint, Spark loses all offset tracking and state. The stream restarts from the beginning of the source. This causes duplicate processing
TrapTrying to use sort() in append mode streaming
RealitySorting is not supported in append mode because it requires comparing all rows, which is impossible on an unbounded stream. Complete mode is required for sorted output, and only if you have an aggregation

Confusing Pairs

append modecomplete modeupdate mode

append = writes only newly-added rows since last trigger (default, no aggregation without watermark). complete = rewrites the FULL result table every trigger (requires aggregation, expensive for large results). update = writes only rows that changed since last trigger (requires aggregation, more efficient than complete for large state). Exam key: complete and update BOTH require aggregation

processingTime triggeravailableNow trigger

processingTime='10 seconds' = runs micro-batches at a fixed interval, runs indefinitely. availableNow=True = processes all data currently in the source and stops (replaces deprecated once=True). Use availableNow for scheduled batch-style streaming jobs

event-time watermarkprocessing-time semantics

Event-time = uses a timestamp column in the data itself (e.g., 'event_time' field) to define when the event occurred — .withWatermark('event_time', '10 minutes') drops late events. Processing-time = Spark uses the system clock at the time data arrives. Watermarks apply only to event-time. Exam trap: using a processing-time trigger does NOT make a query event-time aware — you need withWatermark() on an event-time column

Scenario Tips

If the question asks about:

Question: a streaming job counts events by category. The developer wants only the changed totals emitted each trigger, not the full table. Which output mode?

Answer:

update mode — emits only rows where the aggregated count changed since the last trigger. More efficient than complete mode which would rewrite all categories every trigger

Distractor to avoid:

append mode sounds right but doesn't work with aggregations without a watermark. complete mode works but wastes I/O by rewriting unchanged rows

If the question asks about:

Question asks when a watermark is required in a Structured Streaming query

Answer:

When performing windowed aggregations on event time — watermarks define how late data can arrive and allow Spark to clean up old window state. Required when joining two streaming DataFrames

Distractor to avoid:

Watermarks are NOT required for all streaming queries — simple pass-through queries in append mode don't need them

Last-Minute Facts

1append mode = new rows only (no aggregation without watermark)
2complete mode = full result rewritten every trigger (requires aggregation)
3update mode = only changed rows (requires aggregation)
4Checkpoint deleted = stream reprocesses ALL data from start
5Cannot sort() in append mode streaming
6Watermark required for windowed aggregations on event time
7Joining two streams requires watermarks on BOTH sides
8availableNow=True processes all pending data then stops (replaces once=True)

Feeling confident?

Put your knowledge to the test with a timed Spark Dev Associate mock exam.