General Exam Tips
- 1.Read ALL answer choices before selecting — many questions have two plausible-looking options where one has a subtle method signature or behavioral difference
- 2.Mentally classify every operation as a transformation (lazy) or action (eager) before reasoning about when execution occurs
- 3.No documentation access during the exam — you must know default values (shuffle.partitions = 200, autoBroadcastJoinThreshold = 10MB) cold
- 4.When a question asks 'what happens when…', trace the execution: is an action called? Is there a shuffle? Where is the stage boundary?
- 5.Code-snippet questions often test one subtle difference between two nearly identical answers — the wrong option uses an invalid method name or slightly wrong syntax
- 6.Time budget: 45 questions in 90 minutes = 2 minutes each. Flag hard questions and return — don't stall
- 7.Always guess if unsure — no penalty for wrong answers
- 8.~60-70% of questions involve code: predicting output, identifying errors, or choosing the correct implementation. Practice reading unfamiliar PySpark code quickly
- 9.When the question mentions 'most efficient' or 'best approach', the answer almost always avoids unnecessary shuffles
- 10.For streaming output mode questions, first ask: does the query have an aggregation? If yes, complete or update. If no, append only
Quick Navigation
Apache Spark DataFrame API
Must-Know Facts
- withColumn() REPLACES the column if the name already exists — it does not always add. This produces silent data corruption bugs
- union() matches columns by POSITION not by name. If two DataFrames have the same columns in different order, union() produces silently wrong results
- unionByName(df2, allowMissingColumns=True) is the safe way to combine DataFrames with differently ordered or mismatched columns
- filter() and where() are completely identical — the exam uses both and expects you to treat them as interchangeable
- select() on a string ('name') and on col('name') both return a DataFrame — both syntaxes are valid and produce identical results
- After groupBy().agg(), ONLY the grouping columns and aggregated columns remain — all other columns are silently dropped
- distinct() removes rows that are fully identical. dropDuplicates(['col']) removes rows with the same value in specified columns only — these behave very differently
- isNull() and isNotNull() are the correct null checks — never use == None or != None in Spark column expressions
- array_explode() is NOT a valid PySpark function — the correct function is explode(). The exam uses array_explode() as a distractor
- sample(withReplacement, fraction) does NOT guarantee exactly fraction * count rows — the result is probabilistic. withReplacement=True allows the same row to appear multiple times
- DataFrameReader uses .load() to execute; DataFrameWriter uses .save() to execute. These are not interchangeable
Common Traps
Confusing Pairs
Scenario Tips
Question shows df1.union(df2) and asks what the result contains, given that the DataFrames have columns in different orders
The result maps columns by position — if A = (id, name) and B = (name, id), the result column named 'id' will contain A's ids but B's names. The schema name comes from the LEFT DataFrame
Many assume union() is smart enough to match by name like unionByName(). It is not
Question asks which operation would be used to retrieve exactly the first 5 rows and bring them to the driver
head(5) or take(5) — both are actions that return a Python list. df.limit(5) is a transformation that does NOT bring data to the driver
collect(5) looks right but collect() takes no arguments. limit(5) is wrong because it returns a DataFrame, not data
Question presents a join using df1.join(df2, df1['id'] == df2['id'], 'inner') and asks why a subsequent .select('id') fails
Joining via expression (df1['id'] == df2['id']) keeps both id columns, causing an ambiguity error when selecting 'id'. Fix with .drop(df2['id']) after the join or use on='id' (string) instead
Candidates often think the join itself fails — the join succeeds, but the ambiguity surfaces only when referencing the column by name afterward
Question asks what happens when you call df.withColumn('price', col('price') * 1.1) where 'price' already exists
The existing 'price' column is replaced by the new expression. No error occurs, no new column is added
Common wrong answer: a second 'price' column is added, causing ambiguity. This would only happen if you used .select() to add a duplicate column name
Question presents groupBy('dept').agg(sum('salary')) and asks which columns are in the result
Only 'dept' and 'sum(salary)' — all other columns (name, hire_date, etc.) are gone. Use a window function if you need to keep all columns while computing per-group aggregates
Many assume all original columns are preserved after groupBy, like a SQL query with a GROUP BY that allows referencing non-aggregated columns
Last-Minute Facts
Apache Spark Architecture
Must-Know Facts
- The driver is a single point of failure — if it crashes, the entire Spark application fails. There is no automatic promotion of a worker to driver
- Executor failures are recoverable via lineage recomputation — Spark recomputes lost partitions from the original data using the DAG
- Each task processes exactly ONE partition on ONE core — the number of tasks in a stage equals the number of partitions
- Parallelism = number of slots (cores) across ALL executors, not the number of executors. 4 executors × 2 cores = 8-way parallelism
- Stage boundaries are created at shuffle points — every wide transformation (join, groupBy, repartition, distinct, orderBy) creates a new stage
- SparkSession is the unified entry point since Spark 2.0 — it replaces SparkContext, SQLContext, and HiveContext
- Client mode: driver runs on submission machine. Cluster mode: driver runs on a cluster node. Client mode is for interactive notebooks; cluster mode for production
- Garbage collection pauses affect task execution times — excessive GC shows up as long task times with low CPU usage in the Spark UI
Common Traps
Confusing Pairs
Scenario Tips
Question asks how many stages a job has given a chain of transformations
Count the wide transformations (joins, groupBys, distinct, orderBy, repartition) and add 1. For example: filter → groupBy → join produces 3 stages
Many count each transformation as a stage — only WIDE transformations (shuffle boundaries) create stage separations
Question describes one task running 10x longer than all others in the same stage
Data skew — one partition is disproportionately large. Adding more executors won't help. Fix by salting the skewed key, using broadcast join, or repartitioning on a higher-cardinality key
Insufficient CPU or network saturation would affect ALL tasks uniformly, not just one
Question asks what happens after an executor node crashes mid-job
Spark detects the failure, marks the tasks as failed, and reschedules them on surviving executors, recomputing lost partitions from lineage. The job continues
Some think the entire job fails — only the driver failure is unrecoverable
Last-Minute Facts
Apache Spark SQL
Must-Know Facts
- Window functions CANNOT appear in WHERE clauses — must wrap in a subquery or CTE: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1
- Global temp views require the global_temp prefix: SELECT * FROM global_temp.my_view. Omitting the prefix causes 'table not found'
- spark.sql() returns a DataFrame — you can chain any DataFrame methods on the result
- Temp views are session-scoped (disappear when session ends). Global temp views persist across sessions within the same application
- ORDER BY in Spark SQL causes a global sort with a full shuffle. SORT BY sorts within each partition without a full shuffle
- ROW_NUMBER() gives unique sequential numbers (no ties). RANK() leaves gaps after ties. DENSE_RANK() has no gaps after ties
- CTEs (WITH cte AS (...)) are executed lazily like subqueries — no performance benefit over subqueries unless Catalyst folds them
- CREATE GLOBAL TEMPORARY VIEW syntax — not CREATE OR REPLACE GLOBAL TEMP VIEW (the latter is invalid)
Common Traps
Confusing Pairs
Scenario Tips
Question presents SQL that applies ROW_NUMBER() in the WHERE clause directly and asks if it's valid
Invalid — window functions cannot be used in WHERE. Must wrap in subquery: SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1
The syntax looks almost right and trips up candidates who haven't internalized this restriction
Question asks you to keep only the most recent record per customer_id from a table
SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn FROM customers) WHERE rn = 1. ROW_NUMBER + subquery pattern is the standard deduplication approach
SELECT DISTINCT customer_id or GROUP BY customer_id won't return all columns of the most recent record
Question asks where global_temp views are accessible
Accessible across all SparkSessions within the same Spark application via the global_temp database prefix. They are NOT cross-application or persistent after the application ends
Some think global temp views persist between applications or are stored in a metastore — they are still in-memory and application-scoped
Last-Minute Facts
Spark Performance and Optimization
Must-Know Facts
- spark.sql.shuffle.partitions defaults to 200 — too high for small datasets (200 tiny partitions = massive scheduling overhead), too low for large datasets
- spark.sql.autoBroadcastJoinThreshold defaults to 10MB — tables smaller than this are auto-broadcast. Set to -1 to disable auto-broadcasting
- cache() uses MEMORY_AND_DISK_DESER by default (since Spark 3.0). cache() is lazy — does not materialize until the next action executes
- unpersist() is EAGER — immediately releases cached data without needing an action
- Disk spill in Spark UI = executor ran out of memory and wrote shuffle data to disk. Fix by increasing executor memory or reducing partition size, NOT by adding nodes
- Predicate pushdown: Spark pushes filter() operations closer to the data source. UDFs block predicate pushdown — Catalyst cannot push filters through user-defined functions
- Adaptive Query Execution (AQE) is enabled by default in Spark 3+ — it coalesces small shuffle partitions, converts sort-merge joins to broadcast if runtime stats show small table, and handles join skew
- repartition(col) puts all rows with the same column value in the same partition — useful before joins to co-locate matching keys
Common Traps
Confusing Pairs
Scenario Tips
Question describes spark.sql.shuffle.partitions = 200 for a 10,000-row dataset and asks about the impact
Poor performance — 200 partitions for 10K rows creates ~50 rows per partition. The scheduler overhead managing 200 tasks far exceeds the computation cost. Reduce to a small number like 10-20
Trap answer: 'optimal, 200 is the recommended value'. 200 is the default, not universally optimal
Question asks which change best reduces shuffle cost when joining a 200GB table with a 5MB lookup table
Broadcast the 5MB lookup table using broadcast() hint or rely on auto-broadcast (5MB < 10MB threshold). This eliminates the shuffle of the 200GB table entirely
Increasing shuffle.partitions or adding executors helps somewhat but doesn't eliminate the shuffle. Broadcast is the optimal answer here
Question asks which scenario benefits most from calling cache()
A DataFrame that is used in multiple separate actions within the same job — for example, read once and then called with count() and show() separately. Each action without cache() recomputes the full lineage
A DataFrame used in a single write() call does not benefit from cache() — it adds memory overhead with no performance gain
Last-Minute Facts
Delta Lake and Spark Ecosystem
Must-Know Facts
- Delta Lake stores data as Parquet files PLUS a transaction log directory (_delta_log). You cannot skip the transaction log and read Delta as plain Parquet — you'll get inconsistent or missing data
- Schema enforcement (default): writes with extra columns or wrong types FAIL. Schema evolution: enable with .option('mergeSchema', 'true') to allow new columns on write
- VACUUM default retention is 7 days (168 hours). After VACUUM, time travel to versions older than the retention period is impossible
- VACUUM with retention < 7 days requires setting delta.retentionDurationCheck.enabled = false. This is intentionally hard to do accidentally
- MERGE INTO fails if multiple source rows match the same target row — deduplicate the source before running MERGE
- DESCRIBE HISTORY table_name shows version, timestamp, operation (WRITE, MERGE, DELETE, OPTIMIZE), and user for each version
- OPTIMIZE compacts small files. It does NOT change the data — queries before and after OPTIMIZE produce identical results
- Structured Streaming can write to and read from Delta tables natively — Delta is the recommended sink for streaming pipelines
Common Traps
Confusing Pairs
Scenario Tips
Question: after VACUUM with default settings, a user tries to query the table as it was 10 days ago. What happens?
The query fails — VACUUM with the 7-day default retention deleted the data files needed for time travel beyond 7 days. DESCRIBE HISTORY still shows the version, but the underlying Parquet files are gone
Time travel fails not because the version doesn't exist in the log, but because the physical files were deleted. The transaction log itself is not deleted by VACUUM
Question asks what happens when you write a DataFrame with a new column to a Delta table using default settings (no mergeSchema)
The write fails with a schema mismatch error. To succeed, add .option('mergeSchema', 'true') to allow schema evolution and add the new column to the table schema
Wrong answer: the new column is silently dropped. Delta fails fast — it does not silently drop data
Question: what is the correct syntax for upserting records — updating matches and inserting non-matches — in Delta Lake?
MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *. This is the only correct upsert syntax in Delta Lake
INSERT OR UPDATE, UPSERT INTO, INSERT OVERWRITE — none of these are valid Spark SQL or Delta Lake syntax
Last-Minute Facts
Spark Structured Streaming
Must-Know Facts
- Output modes: append (new rows only, default), complete (full result rewritten every trigger, requires aggregation), update (only changed rows, requires aggregation)
- append mode does NOT support aggregations without a watermark — because Spark cannot know if an aggregation result is final until it has seen all data
- complete mode rewrites the ENTIRE result table every trigger — only use when the full result fits in executor memory and the sink supports overwriting
- Checkpoints store streaming state and progress. Deleting a checkpoint causes the stream to reprocess ALL data from the beginning — checkpoints are not optional in production
- Watermarks tell Spark how late data can arrive: .withWatermark('event_time', '10 minutes'). Events more than 10 minutes late are dropped. Required for windowed aggregations
- Joining two streaming DataFrames requires watermarks on BOTH sides
- Trigger.AvailableNow() (formerly once=True) processes all available data in one or more micro-batches and stops — the modern bridge between batch and streaming
- You CANNOT sort a streaming DataFrame in append mode — sort requires seeing all data, which contradicts the unbounded table model
Common Traps
Confusing Pairs
Scenario Tips
Question: a streaming job counts events by category. The developer wants only the changed totals emitted each trigger, not the full table. Which output mode?
update mode — emits only rows where the aggregated count changed since the last trigger. More efficient than complete mode which would rewrite all categories every trigger
append mode sounds right but doesn't work with aggregations without a watermark. complete mode works but wastes I/O by rewriting unchanged rows
Question asks when a watermark is required in a Structured Streaming query
When performing windowed aggregations on event time — watermarks define how late data can arrive and allow Spark to clean up old window state. Required when joining two streaming DataFrames
Watermarks are NOT required for all streaming queries — simple pass-through queries in append mode don't need them