General Exam Tips
- 1.Read ALL answer choices before selecting — two options often both work, but one is clearly OPTIMAL. The exam rewards judgment, not just correctness.
- 2.Budget 2 minutes per question. Flag difficult scenario questions and return — don't sink 5 minutes on one complex MERGE question.
- 3.The exam is scenario-heavy. Look for constraint keywords: 'most efficient', 'without duplicating data', 'without restarting', 'best approach'. These words eliminate wrong answers.
- 4.When two answers seem equally valid, ask: which one is more production-appropriate? The exam favors managed, automated solutions over manual approaches.
- 5.Code in questions is primarily Python and SQL. Expect 3–5 sentence production scenario setups before the actual question.
- 6.No penalty for wrong answers — always guess if unsure. Never leave a question blank.
- 7.In the last 15 minutes, your mind is calmer. Return to flagged questions — you'll often spot what you missed initially.
- 8.Data Processing is 30% of the exam. If you're running short on time, prioritize those questions over Data Modeling (16%).
- 9.Right/outer joins are NOT supported with streaming DataFrames. If a question offers a streaming outer join as an option, eliminate it.
- 10.VACUUM on a shallow clone produces an error — if a question asks about maintaining a shallow clone, that's a trap.
Quick Navigation
Data Processing
Must-Know Facts
- foreachBatch is the ONLY correct way to execute MERGE INTO from a streaming context — there is no mode('merge') in writeStream
- trigger(availableNow=True) replaced trigger(once=True): processes data across multiple micro-batches, safe for large backlogs. trigger(once=True) is deprecated and risks OOM
- Stream-static joins are stateless — the static table loads ONCE at stream initialization. New records in the static table are invisible until the stream restarts
- Stream-stream joins require watermarks on BOTH streams to bound state. Without watermarks, state grows unbounded and the job eventually runs out of memory
- Watermarks are an approximate lower bound — Spark may still process some late data beyond the threshold. They are not an exact cutoff
- Output modes: append (new rows only, default for non-aggregation), update (changed rows only, available for aggregations), complete (full result table rewritten, only for aggregations)
- complete output mode rewrites the entire result table every micro-batch — only valid for aggregation queries. Exam trap: applying complete mode to a non-aggregation streaming query causes a runtime error, not a compile-time error. The exam tests whether you know WHICH queries are compatible with each mode.
- dropDuplicates() in streaming deduplicates within a micro-batch only. To prevent duplicates ALREADY in the target table, combine with an insert-only MERGE
- CDC processing pattern: stage incoming changes → deduplicate by key (take latest by timestamp) → MERGE with DELETE/UPDATE/INSERT clauses
- Change Data Feed captures _change_type: insert, update_preimage, update_postimage, delete. Updates produce TWO rows per changed record
- Auto Loader can infer schema from data files — set cloudFiles.schemaLocation to persist the inferred schema across stream restarts
- mergeSchema adds NEW columns during MERGE. overwriteSchema is required to CHANGE an existing column's data type — a critical distinction
- Two streaming queries must NEVER share the same checkpoint location. Sharing checkpoints causes data corruption and undefined behavior
Common Traps
Confusing Pairs
Scenario Tips
When the question asks about processing a large backlog from cloud storage that caused OOM with trigger(once=True)...
Switch to trigger(availableNow=True). It processes the backlog across multiple smaller micro-batches, preventing memory issues while still stopping when done.
Increasing driver/executor memory is a wrong answer — it treats the symptom, not the cause, and is expensive. The root issue is single-batch processing.
When the question describes a streaming MERGE pipeline and asks how to implement it...
Use writeStream.foreachBatch(). Inside the function, create a temp view from the batch DataFrame and execute MERGE INTO using sparkSession.sql().
writeStream.format('delta').mode('merge') does not exist. This is a popular wrong answer. MERGE cannot be called directly from writeStream.
When the question shows a stream-static join and asks why new dimension records are missing from output...
Stream-static joins snapshot the static table at stream start. New records added to the static table during execution are invisible. Fix: restart the stream or convert to a stream-stream join.
Corrupt checkpoint or wrong watermark setting are common distractors. The checkpoint and watermark have nothing to do with static table staleness.
When the question asks about deduplication in a streaming pipeline writing to a Delta table...
Use both dropDuplicates(['key']) to deduplicate within each micro-batch AND an insert-only MERGE to avoid reinserting records already in the target table.
Just using dropDuplicates() alone seems sufficient but only handles within-batch duplicates. A question that asks for 'complete deduplication' requires the MERGE check too.
When the question asks about enriching a streaming event with user profiles that change infrequently (daily updates)...
Stream-static join is appropriate. The profile table changes rarely — daily restarts of the stream will pick up new profiles. If profiles updated every few minutes, a stream-stream join would be needed.
Stream-stream join is not wrong here, but it adds unnecessary complexity for infrequent dimension changes. The exam looks for the SIMPLEST correct solution.
Last-Minute Facts
Testing and Deployment
Must-Know Facts
- The /jobs/create REST API is NOT idempotent — every call creates a new job with a unique job ID. Use /jobs/reset to update an existing job's configuration
- Repair Run re-executes only failed tasks and their downstream dependents — successful tasks are preserved. Fix the root cause before repairing or the same failure will repeat
- Databricks Asset Bundles targets inherit from the default config — only specify overrides per target, not the entire config. Duplicating config across targets is an anti-pattern
- dbutils.widgets.get() retrieves parameters passed from the Jobs API. It does NOT access environment variables or Spark configs
- Shallow clones for testing: zero-copy, fast to create, breaks if source runs VACUUM. The exam judgment call: when a question says 'without duplicating data' or 'minimal storage' → shallow clone. When it says 'independent copy', 'long-lived environment', or 'migration' → deep clone. Never run VACUUM on the clone itself.
- Deep clones for long-lived testing or archives: fully independent copy, uses full storage, changes do not affect the source
- pytest with a local SparkSession (SparkSession.builder.master('local[*]')) is the standard pattern for unit testing PySpark transformations
- Test streaming pipelines with memory sink: writeStream.format('memory').queryName('test_table') — then query spark.table('test_table') for assertions
- Multi-environment promotion: use separate Unity Catalog catalogs (dev_catalog, staging_catalog, prod_catalog) — same code, different catalog parameter
- Job clusters (ephemeral, auto-terminate) for all production jobs. All-purpose clusters only for interactive development — never schedule production jobs on all-purpose clusters
- VACUUM on a shallow clone produces an error — do not run VACUUM on a shallow clone directly
Common Traps
Confusing Pairs
Scenario Tips
When the question asks how to test against production-like data for a 5 TB table without incurring storage costs...
Create a shallow clone. It references the source files without copying them — zero storage cost, production-identical data schema and content.
Deep clone duplicates all 5 TB of storage. Delta Sharing is for cross-organization sharing. Both are wrong for 'no storage duplication' scenarios.
When a multi-task job fails at task 3 and you've fixed the bug — how do you re-run efficiently...
Use Repair Run. It re-executes task 3 and all downstream tasks (4, 5, etc.) while preserving the results of tasks 1 and 2.
Re-running the full job wastes compute re-executing successful tasks 1 and 2. Manually running tasks 3–5 as separate jobs loses dependency tracking.
When a CI/CD pipeline calls the /jobs/create API on every deployment and the team notices accumulating duplicate jobs...
Replace /jobs/create with /jobs/reset. First check if the job exists, then reset its configuration. Or use Databricks Asset Bundles, which handle idempotent deployment internally.
Adding job name uniqueness validation doesn't help — /jobs/create does not enforce name uniqueness and will still create duplicates.
When the question asks how to parameterize a notebook to use different Unity Catalog catalogs in dev vs. production...
Use dbutils.widgets.get('catalog') in the notebook and pass the catalog name via the Jobs API widget configuration. In Asset Bundles, use variables per target environment.
Hardcoding catalog names or using IF/ELSE on cluster environment tags are anti-patterns that bypass the proper parameterization mechanism.
Last-Minute Facts
Data Governance and Security
Must-Know Facts
- Column mask functions MUST return the same data type as the masked column. A STRING mask on an INT column compiles but fails at query time — not at mask creation time
- Row filters evaluate per-row and affect query performance on large tables. Complex filter predicates can cause full table scans
- Secrets are redacted when printed ('[REDACTED]') but the actual value is available in code — this is intentional security design. Exam judgment: a question about why a connection string works but the printed value shows '[REDACTED]' is testing this exact behavior. This is NOT a bug and NOT a reason to use a different secret approach.
- Job ownership cannot be assigned to groups — must be individual users. This limits automation for team-based workflows
- Unity Catalog audit logs: query system.access.audit for compliance reporting — captures who accessed what data and when. Exam judgment: audit logs answer 'who accessed this?' while information_schema answers 'what masks/policies exist?'. Questions about compliance investigations → audit logs. Questions about inventorying security configurations → information_schema.
- DLT pipeline permissions default to owner-only. Other users cannot read materialized views created by DLT unless explicitly granted
- Dynamic views are an alternative to column masking — useful when you want view-level access control rather than table-level masks
- GDPR deletion: DELETE FROM removes from the current version, but old versions in the transaction log still reference the data. Run VACUUM to permanently remove files and enable delta.retentionDurationCheck.enabled = false for immediate removal
- information_schema is the programmatic way to audit metadata: query system.information_schema.columns, tables, table_privileges, and column_masks at scale
- Lakehouse Federation provides READ-ONLY access to external databases via Unity Catalog foreign catalogs — you cannot write back to external sources
Common Traps
Confusing Pairs
Scenario Tips
When the question asks how to show real SSN values to compliance team members and masked values to everyone else, without creating separate tables...
Create a column mask function using is_member('compliance_team') and apply it to the SSN column with ALTER TABLE ... ALTER COLUMN ... SET MASK.
A dynamic view also works but creates a separate object and requires managing view permissions separately — column masking is simpler and applies directly to the table.
When the question asks about permanently deleting a customer's PII for GDPR compliance...
DELETE FROM the table, then run VACUUM with delta.retentionDurationCheck.enabled = false (or RETAIN 0 HOURS to force immediate removal). Verify by querying old versions via time travel — data should be inaccessible.
DELETE alone is not sufficient for GDPR — the data is still accessible via time travel until VACUUM removes the old files.
When the question asks how to programmatically find all tables with column masking applied across an entire catalog...
Query system.information_schema.column_masks (or system.information_schema.columns filtering on masking policy). This is faster and more scalable than running DESCRIBE on each table.
Checking audit logs shows when masks were created but requires parsing events — information_schema provides direct structured metadata access.
Last-Minute Facts
Data Modeling and Design
Must-Know Facts
- SCD Type 1: overwrite current values with MERGE. No history is preserved. WHEN MATCHED THEN UPDATE SET *. Simple.
- SCD Type 2: preserve full history. Requires TWO operations in one MERGE: close the current record (set end_date, is_current=false) AND insert a new version (start_date, is_current=true). Both operations can be expressed in a single MERGE statement.
- Standalone materialized views exist in Databricks: CREATE MATERIALIZED VIEW works in a Pro or Serverless SQL warehouse — no DLT pipeline required. Databricks auto-creates a serverless pipeline for refreshes. Does NOT work with Classic SQL warehouses. The exam may still have questions written when this was DLT-only; recognize the updated behavior.
- Generated columns are computed at WRITE time, not read time. Changing the generation expression does NOT retroactively update existing rows. Manual backfill is required.
- CHECK constraints are enforced on WRITE only — existing rows that violate a newly added constraint are NOT flagged or rejected. Exam judgment: if a question asks 'what happens to existing data when you add a CHECK constraint?', the answer is always 'nothing — existing rows are unaffected'. If a question asks 'what happens when you INSERT a row violating the constraint?', the answer is 'the write fails with an AnalysisException'.
- Liquid Clustering replaces both static partitioning and Z-ORDER. Keys can be changed without rewriting the entire table. Applied incrementally during OPTIMIZE. Exam judgment: any question describing a table with 'changing query patterns' or 'evolving access needs' → Liquid Clustering. Z-ORDER requires re-running OPTIMIZE to re-cluster after every key change and cannot be changed incrementally.
- Star schema: fact tables (large, append-heavy transactions/events) surrounded by dimension tables (smaller, slowly changing attributes). Analytical queries join fact to dimensions.
- Gold layer tables should be pre-aggregated and denormalized for fast consumption — not raw joins on Silver. Views on Silver are too slow for large-scale analytics.
- Foreign key constraints in Delta Lake are informational only — they are NOT enforced. Referential integrity must be managed by application logic.
Common Traps
Confusing Pairs
Scenario Tips
When the question asks how to track customer address changes so historical orders show the address at time of order...
SCD Type 2. Each address change creates a new row with start_date/end_date and is_current flag. Historical orders join to the dimension row active at the order's timestamp.
SCD Type 1 only keeps the current address — historical orders would show the wrong (current) address. A separate address_history table works but is not the canonical pattern.
When the question asks which object to use for a Gold-layer pre-aggregated summary table that BI tools read frequently...
A pre-computed Delta table refreshed on a schedule. Not a view (too slow on large Silver tables) and not a DLT materialized view unless the pipeline is already in DLT.
Views are always 'fresh' but that freshness comes at the cost of recomputing aggregations over billions of rows on every dashboard query.
When the question asks which clustering approach to use for a large Delta table with changing query patterns...
Liquid Clustering. It is applied incrementally, keys can be changed without rewriting all data, and it adapts to evolving access patterns without maintenance overhead.
Z-ORDER requires re-running OPTIMIZE to re-cluster after each change, and changing the Z-ORDER columns requires rewriting more data. It is the legacy approach.
Last-Minute Facts
Monitoring, Logging, and Optimization
Must-Know Facts
- Data skew signature in Spark UI: one or a few tasks process 10–100x more data than others. Fix: repartition, salting on the skewed key, or enable AQE skew handling
- Shuffle spill to disk means executor memory is insufficient. Fix: increase spark.executor.memory or reduce the number of partitions (to make each partition smaller). Adding more executors does NOT help spill.
- AQE (Adaptive Query Execution) is ENABLED by default in Databricks Runtime. It can change the query plan at runtime — the EXPLAIN output may not match actual execution
- AQE three key optimizations: (1) coalesce small shuffle partitions, (2) convert sort-merge join to broadcast hash join if one side is small at runtime, (3) detect and split skewed partitions
- EXPLAIN shows the static query plan. After AQE optimizations, the actual plan in Spark UI may differ. This is expected and correct behavior
- VACUUM removes data files older than the retention period (default 7 days). VACUUM does NOT remove transaction log entries — that is controlled by delta.logRetentionDuration
- Auto Compaction with streaming MERGE workloads auto-tunes target file size DOWN to 32–48 MB instead of the default 1 GB. This is expected and correct — not a problem requiring intervention
- Databricks SQL alerts: schedule a query, set a threshold condition, configure notifications (email, Slack, webhook). Used for SLA monitoring and data quality alerting
- Spark UI task metrics: look at 'Shuffle Read Size / Records' and 'Spill (Memory)' columns for performance diagnosis. High spill = memory pressure. Large shuffle = wide transformation overhead
- Job cluster auto-termination saves cost. All-purpose cluster idle time is a significant cost source in production — always auto-terminate all-purpose clusters after a period of inactivity
Common Traps
Confusing Pairs
Scenario Tips
When Spark UI shows one task taking 45 minutes while 199 others take 30 seconds, and that task processes 50 GB vs 200 MB for others...
Data skew. Repartition the DataFrame by a different key, apply salting to distribute the skewed key, or enable AQE (already on by default) which can automatically split skewed partitions.
Increasing executor count doesn't help with skew — the skewed partition is still one partition that goes to one executor. The bottleneck is partition size, not cluster size.
When the question asks why streaming MERGE files are 35 MB instead of 1 GB and whether to fix it...
Auto Compaction auto-tunes file size to match write patterns. For streaming MERGE with small micro-batches, 32–48 MB files are expected and correct. No action needed.
Manually running OPTIMIZE to force 1 GB files is unnecessary overhead. The examinees who fail this question try to 'fix' something that isn't broken.
When EXPLAIN shows a SortMergeJoin but the Spark UI shows a BroadcastHashJoin was actually executed...
AQE converted the join at runtime. At runtime, after collecting actual data size statistics, AQE determined one side was small enough to broadcast. This is normal AQE behavior — not a bug.
A broadcast hint being added between EXPLAIN and execution is a possible but unlikely scenario. AQE runtime optimization is the canonical explanation for plan discrepancies.