CertPrepNow
DatabricksDB DE Professional5 domains

DB DE Professional Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Data Engineer Professional exam.

General Exam Tips

  • 1.Read ALL answer choices before selecting — two options often both work, but one is clearly OPTIMAL. The exam rewards judgment, not just correctness.
  • 2.Budget 2 minutes per question. Flag difficult scenario questions and return — don't sink 5 minutes on one complex MERGE question.
  • 3.The exam is scenario-heavy. Look for constraint keywords: 'most efficient', 'without duplicating data', 'without restarting', 'best approach'. These words eliminate wrong answers.
  • 4.When two answers seem equally valid, ask: which one is more production-appropriate? The exam favors managed, automated solutions over manual approaches.
  • 5.Code in questions is primarily Python and SQL. Expect 3–5 sentence production scenario setups before the actual question.
  • 6.No penalty for wrong answers — always guess if unsure. Never leave a question blank.
  • 7.In the last 15 minutes, your mind is calmer. Return to flagged questions — you'll often spot what you missed initially.
  • 8.Data Processing is 30% of the exam. If you're running short on time, prioritize those questions over Data Modeling (16%).
  • 9.Right/outer joins are NOT supported with streaming DataFrames. If a question offers a streaming outer join as an option, eliminate it.
  • 10.VACUUM on a shallow clone produces an error — if a question asks about maintaining a shallow clone, that's a trap.
Domain 130% of exam

Data Processing

Must-Know Facts

  • foreachBatch is the ONLY correct way to execute MERGE INTO from a streaming context — there is no mode('merge') in writeStream
  • trigger(availableNow=True) replaced trigger(once=True): processes data across multiple micro-batches, safe for large backlogs. trigger(once=True) is deprecated and risks OOM
  • Stream-static joins are stateless — the static table loads ONCE at stream initialization. New records in the static table are invisible until the stream restarts
  • Stream-stream joins require watermarks on BOTH streams to bound state. Without watermarks, state grows unbounded and the job eventually runs out of memory
  • Watermarks are an approximate lower bound — Spark may still process some late data beyond the threshold. They are not an exact cutoff
  • Output modes: append (new rows only, default for non-aggregation), update (changed rows only, available for aggregations), complete (full result table rewritten, only for aggregations)
  • complete output mode rewrites the entire result table every micro-batch — only valid for aggregation queries. Exam trap: applying complete mode to a non-aggregation streaming query causes a runtime error, not a compile-time error. The exam tests whether you know WHICH queries are compatible with each mode.
  • dropDuplicates() in streaming deduplicates within a micro-batch only. To prevent duplicates ALREADY in the target table, combine with an insert-only MERGE
  • CDC processing pattern: stage incoming changes → deduplicate by key (take latest by timestamp) → MERGE with DELETE/UPDATE/INSERT clauses
  • Change Data Feed captures _change_type: insert, update_preimage, update_postimage, delete. Updates produce TWO rows per changed record
  • Auto Loader can infer schema from data files — set cloudFiles.schemaLocation to persist the inferred schema across stream restarts
  • mergeSchema adds NEW columns during MERGE. overwriteSchema is required to CHANGE an existing column's data type — a critical distinction
  • Two streaming queries must NEVER share the same checkpoint location. Sharing checkpoints causes data corruption and undefined behavior

Common Traps

TrapUsing MERGE without deduplicating the source first
RealityIf the source DataFrame contains duplicate keys matching the same target row, MERGE produces unexpected duplicate inserts — not an error. Always deduplicate with dropDuplicates(['key']) or ROW_NUMBER() before merging.
TrapExpecting stream-static joins to pick up new rows added to the static table
RealityStream-static joins snapshot the static table at stream start. Any additions to the static table during execution are invisible. Restart the stream or switch to a stream-stream join to capture updates.
TrapUsing trigger(once=True) for a large backlog
Realitytrigger(once=True) processes ALL available data in a single micro-batch. On a large backlog (millions of records), this causes OutOfMemoryError. Use trigger(availableNow=True) instead — it processes the same data across multiple safe micro-batches.
TrapAssuming mergeSchema handles column type changes
RealitymergeSchema only adds new columns. Changing an existing column's data type requires overwriteSchema (and will drop existing data). The exam tests this distinction precisely.
TrapThinking dropDuplicates() in streaming prevents all duplicates
RealitydropDuplicates() only deduplicates within the current micro-batch. Records that were already inserted in previous batches are not checked. Use an insert-only MERGE pattern to also check the target table.
TrapThinking a right/outer join on a streaming DataFrame is valid
RealityStreaming DataFrames do not support right outer joins, full outer joins, or left semi/anti joins when the streaming side is on the right. The exam sometimes includes these as plausible-looking wrong answers.

Confusing Pairs

trigger(availableNow=True)trigger(once=True)

availableNow = processes all available data across MULTIPLE micro-batches, then stops. Safe for large backlogs, preferred for scheduled pipelines. once = processes all data in ONE micro-batch, then stops. Deprecated — risks OOM on large datasets. The exam will present backlog scenarios where once=True is the wrong choice.

foreachBatchforeach

foreachBatch = receives an entire micro-batch DataFrame + batch ID. Standard pattern for MERGE, JDBC bulk writes, and any batch operation per micro-batch. foreach = processes one ROW at a time. Use only for per-record custom logic (e.g., send one HTTP request per event). Never use foreach for bulk inserts — it's much slower.

Stream-Static JoinStream-Stream Join

Stream-Static = static table snapshotted at start, never refreshed. No watermark needed. Use only when the static side changes infrequently (e.g., lookup tables). Stream-Stream = both sides continuously produce records. Requires watermarks on both streams. Stateful — bounds memory. Use when both sources are live event streams.

append output modecomplete output mode

append = only write new rows to the sink each micro-batch. Default for non-aggregation queries. complete = rewrite the ENTIRE result table every micro-batch. Only valid for aggregation queries. If you use complete on a non-aggregation streaming query, the job will fail.

mergeSchemaoverwriteSchema

mergeSchema = schema evolution that ADDS new columns. Safe for incremental pipelines. overwriteSchema = replaces the entire schema including changing column types. Drops existing data — use with caution. The exam will ask which option is appropriate for a given schema change scenario.

Change Data Feed (CDF)Time Travel

CDF = gives you the DIFF (what changed between versions). Row-level change records with _change_type. Must be enabled BEFORE changes occur. Perfect for incremental downstream processing. Time Travel = gives you the SNAPSHOT (full table as it existed at a version or timestamp). No enablement needed. Depends on VACUUM retention.

Scenario Tips

If the question asks about:

When the question asks about processing a large backlog from cloud storage that caused OOM with trigger(once=True)...

Answer:

Switch to trigger(availableNow=True). It processes the backlog across multiple smaller micro-batches, preventing memory issues while still stopping when done.

Distractor to avoid:

Increasing driver/executor memory is a wrong answer — it treats the symptom, not the cause, and is expensive. The root issue is single-batch processing.

If the question asks about:

When the question describes a streaming MERGE pipeline and asks how to implement it...

Answer:

Use writeStream.foreachBatch(). Inside the function, create a temp view from the batch DataFrame and execute MERGE INTO using sparkSession.sql().

Distractor to avoid:

writeStream.format('delta').mode('merge') does not exist. This is a popular wrong answer. MERGE cannot be called directly from writeStream.

If the question asks about:

When the question shows a stream-static join and asks why new dimension records are missing from output...

Answer:

Stream-static joins snapshot the static table at stream start. New records added to the static table during execution are invisible. Fix: restart the stream or convert to a stream-stream join.

Distractor to avoid:

Corrupt checkpoint or wrong watermark setting are common distractors. The checkpoint and watermark have nothing to do with static table staleness.

If the question asks about:

When the question asks about deduplication in a streaming pipeline writing to a Delta table...

Answer:

Use both dropDuplicates(['key']) to deduplicate within each micro-batch AND an insert-only MERGE to avoid reinserting records already in the target table.

Distractor to avoid:

Just using dropDuplicates() alone seems sufficient but only handles within-batch duplicates. A question that asks for 'complete deduplication' requires the MERGE check too.

If the question asks about:

When the question asks about enriching a streaming event with user profiles that change infrequently (daily updates)...

Answer:

Stream-static join is appropriate. The profile table changes rarely — daily restarts of the stream will pick up new profiles. If profiles updated every few minutes, a stream-stream join would be needed.

Distractor to avoid:

Stream-stream join is not wrong here, but it adds unnecessary complexity for infrequent dimension changes. The exam looks for the SIMPLEST correct solution.

Last-Minute Facts

1CDF _change_type values: insert, update_preimage, update_postimage, delete
2trigger(availableNow=True) = multiple micro-batches, safe; trigger(once=True) = single micro-batch, deprecated, risky
3foreachBatch function signature: (batch_df: DataFrame, batch_id: int) -> None
4complete output mode rewrites the full result every micro-batch — aggregations only
5Stream-stream joins require watermarks on BOTH streams
6Two streams can NEVER share the same checkpoint location
7Right/outer joins are NOT supported when the streaming side is on the right
8mergeSchema = add columns only; overwriteSchema = change types (and drops data)
Domain 220% of exam

Testing and Deployment

Must-Know Facts

  • The /jobs/create REST API is NOT idempotent — every call creates a new job with a unique job ID. Use /jobs/reset to update an existing job's configuration
  • Repair Run re-executes only failed tasks and their downstream dependents — successful tasks are preserved. Fix the root cause before repairing or the same failure will repeat
  • Databricks Asset Bundles targets inherit from the default config — only specify overrides per target, not the entire config. Duplicating config across targets is an anti-pattern
  • dbutils.widgets.get() retrieves parameters passed from the Jobs API. It does NOT access environment variables or Spark configs
  • Shallow clones for testing: zero-copy, fast to create, breaks if source runs VACUUM. The exam judgment call: when a question says 'without duplicating data' or 'minimal storage' → shallow clone. When it says 'independent copy', 'long-lived environment', or 'migration' → deep clone. Never run VACUUM on the clone itself.
  • Deep clones for long-lived testing or archives: fully independent copy, uses full storage, changes do not affect the source
  • pytest with a local SparkSession (SparkSession.builder.master('local[*]')) is the standard pattern for unit testing PySpark transformations
  • Test streaming pipelines with memory sink: writeStream.format('memory').queryName('test_table') — then query spark.table('test_table') for assertions
  • Multi-environment promotion: use separate Unity Catalog catalogs (dev_catalog, staging_catalog, prod_catalog) — same code, different catalog parameter
  • Job clusters (ephemeral, auto-terminate) for all production jobs. All-purpose clusters only for interactive development — never schedule production jobs on all-purpose clusters
  • VACUUM on a shallow clone produces an error — do not run VACUUM on a shallow clone directly

Common Traps

TrapCalling /jobs/create multiple times with the same config thinking it's idempotent
Reality/jobs/create is not idempotent. Each call creates a new job with a unique ID. You end up with duplicate job definitions. Use /jobs/reset to update existing jobs, or check for existence first.
TrapChoosing deep clone when the question says 'avoid duplicating data' or 'minimal storage'
RealityDeep clones copy all data files — they duplicate storage. When the question mentions cost or storage efficiency, shallow clone is the answer. When the question mentions independence or long-lived environments, deep clone is correct.
TrapThinking Repair Run skips ALL previous tasks including successful ones
RealityRepair Run re-executes ONLY failed tasks and their downstream dependents. Successfully completed tasks are preserved and not re-run. This is the key cost and time saving feature.
TrapDuplicating the entire databricks.yml configuration for each target environment
RealityBundle targets inherit from the default configuration. Only specify the overrides (e.g., workspace host, service principal, cluster size) in each target. Full duplication causes maintenance problems and is flagged as an anti-pattern.
TrapUsing dbutils.widgets.get() to access Spark configs or OS environment variables
Realitydbutils.widgets.get() retrieves parameters passed to the notebook through the Jobs API widget configuration only. It cannot access spark.conf values or os.environ. Use spark.conf.get() for Spark configs.

Confusing Pairs

/jobs/create/jobs/reset

create = creates a NEW job definition with a new job ID every call. Not idempotent — calling twice creates duplicate jobs. reset = updates an EXISTING job's definition using its job ID. Idempotent — safe to call repeatedly. The exam tests whether you know to use reset for updates.

Shallow CloneDeep Clone

Shallow Clone = zero-copy, references source files. Fast to create, minimal storage. Breaks if source runs VACUUM. Use for short-lived testing. Deep Clone = copies all data files. Fully independent. Changes to clone don't affect source. Use for archiving, migration, or long-lived independent environments.

Unit TestIntegration Test

Unit Test = test individual transformation functions in isolation using a local SparkSession. No external dependencies — fast and cheap. Integration Test = test the full pipeline end-to-end using a dedicated test catalog in Unity Catalog. Validates interactions between components against real (but isolated) infrastructure.

Job ClusterAll-Purpose Cluster

Job Cluster = created per run, auto-terminates when done. Cost-efficient. Always use for scheduled production jobs. All-Purpose Cluster = persists between runs, billed continuously. Use only for interactive development and exploration. Never use for scheduled production jobs.

Scenario Tips

If the question asks about:

When the question asks how to test against production-like data for a 5 TB table without incurring storage costs...

Answer:

Create a shallow clone. It references the source files without copying them — zero storage cost, production-identical data schema and content.

Distractor to avoid:

Deep clone duplicates all 5 TB of storage. Delta Sharing is for cross-organization sharing. Both are wrong for 'no storage duplication' scenarios.

If the question asks about:

When a multi-task job fails at task 3 and you've fixed the bug — how do you re-run efficiently...

Answer:

Use Repair Run. It re-executes task 3 and all downstream tasks (4, 5, etc.) while preserving the results of tasks 1 and 2.

Distractor to avoid:

Re-running the full job wastes compute re-executing successful tasks 1 and 2. Manually running tasks 3–5 as separate jobs loses dependency tracking.

If the question asks about:

When a CI/CD pipeline calls the /jobs/create API on every deployment and the team notices accumulating duplicate jobs...

Answer:

Replace /jobs/create with /jobs/reset. First check if the job exists, then reset its configuration. Or use Databricks Asset Bundles, which handle idempotent deployment internally.

Distractor to avoid:

Adding job name uniqueness validation doesn't help — /jobs/create does not enforce name uniqueness and will still create duplicates.

If the question asks about:

When the question asks how to parameterize a notebook to use different Unity Catalog catalogs in dev vs. production...

Answer:

Use dbutils.widgets.get('catalog') in the notebook and pass the catalog name via the Jobs API widget configuration. In Asset Bundles, use variables per target environment.

Distractor to avoid:

Hardcoding catalog names or using IF/ELSE on cluster environment tags are anti-patterns that bypass the proper parameterization mechanism.

Last-Minute Facts

1/jobs/create = new job every call (not idempotent). /jobs/reset = update existing job. /jobs/run-now = trigger run
2Repair Run = re-run failed + downstream tasks only. Successful tasks preserved
3Shallow clone = zero-copy, breaks on source VACUUM. Deep clone = full copy, fully independent
4VACUUM on a shallow clone = ERROR
5pytest + SparkSession.builder.master('local[*]') = unit test pattern
6Memory sink = writeStream.format('memory').queryName('name') for streaming tests
7Job clusters auto-terminate. All-purpose clusters are always-on. Production = job clusters only
8dbutils.widgets.get() = job parameter only, not env vars or spark.conf
Domain 318% of exam

Data Governance and Security

Must-Know Facts

  • Column mask functions MUST return the same data type as the masked column. A STRING mask on an INT column compiles but fails at query time — not at mask creation time
  • Row filters evaluate per-row and affect query performance on large tables. Complex filter predicates can cause full table scans
  • Secrets are redacted when printed ('[REDACTED]') but the actual value is available in code — this is intentional security design. Exam judgment: a question about why a connection string works but the printed value shows '[REDACTED]' is testing this exact behavior. This is NOT a bug and NOT a reason to use a different secret approach.
  • Job ownership cannot be assigned to groups — must be individual users. This limits automation for team-based workflows
  • Unity Catalog audit logs: query system.access.audit for compliance reporting — captures who accessed what data and when. Exam judgment: audit logs answer 'who accessed this?' while information_schema answers 'what masks/policies exist?'. Questions about compliance investigations → audit logs. Questions about inventorying security configurations → information_schema.
  • DLT pipeline permissions default to owner-only. Other users cannot read materialized views created by DLT unless explicitly granted
  • Dynamic views are an alternative to column masking — useful when you want view-level access control rather than table-level masks
  • GDPR deletion: DELETE FROM removes from the current version, but old versions in the transaction log still reference the data. Run VACUUM to permanently remove files and enable delta.retentionDurationCheck.enabled = false for immediate removal
  • information_schema is the programmatic way to audit metadata: query system.information_schema.columns, tables, table_privileges, and column_masks at scale
  • Lakehouse Federation provides READ-ONLY access to external databases via Unity Catalog foreign catalogs — you cannot write back to external sources

Common Traps

TrapAssuming column mask type mismatches fail at mask creation time
RealityColumn masks are created as SQL functions — Databricks validates only the function definition, not its application. A type mismatch (e.g., a mask function returning STRING applied to an INT column) only fails at query time when a user actually queries the table.
TrapBelieving DELETE + VACUUM immediately makes deleted data unrecoverable via time travel
RealityDELETE only removes data from the current version. Time travel can still see old versions until VACUUM runs. After VACUUM with a short retention, old data files are removed. To force immediate removal, set delta.retentionDurationCheck.enabled = false and VACUUM RETAIN 0 HOURS.
TrapThinking secrets can be printed for debugging purposes
RealityDatabricks permanently redacts secret values whenever they are converted to a string or printed. '[REDACTED]' will always appear. The actual value is available for programmatic use in connection strings and API calls — but you can never view it in output.
TrapUsing row filters as a substitute for column masking
RealityRow filters HIDE entire rows — users who fail the filter simply don't see those rows at all. Column masking shows all rows but obfuscates specific column values. These solve different problems: row filters for data classification, column masks for PII within visible rows.
TrapAssigning Unity Catalog job ownership to a group for team management
RealityJob ownership in Unity Catalog is assigned to individual users only, not groups. For team-owned jobs, use a service principal as the job owner, not a group.

Confusing Pairs

Column MaskingRow Filters

Column Masking = ALL rows are visible, but specific column values are obfuscated (e.g., SSN shows as 'XXX-XX-XXXX'). The row itself is returned. Row Filters = rows that fail the filter are COMPLETELY HIDDEN from query results. The user sees no trace of filtered rows. Use column masks for PII visibility control, row filters for data classification or multi-tenant data isolation.

Dynamic ViewColumn Masking Function

Dynamic View = separate object, access-controlled separately from the base table. Users are granted SELECT on the view, not the underlying table. Column Masking Function = applied directly to a column on the base table. Users query the table directly — the mask is transparent. Masking scales better for large numbers of users; dynamic views require managing view-level permissions separately.

DELETE + time travelDELETE + VACUUM

DELETE alone = data is logically removed from the current version but physically still in old data files. Time travel can still retrieve it. DELETE + VACUUM = physical data files referenced by old versions are removed after VACUUM. Time travel is now blocked for those versions. Required for GDPR right-to-erasure.

Secrets scope (Databricks-backed)Secrets scope (Azure Key Vault-backed)

Databricks-backed scope = secrets stored in Databricks. Managed via dbutils.secrets API. Simpler but tied to the Databricks workspace. Azure Key Vault-backed scope = secrets stored in Azure Key Vault, referenced from Databricks. Preferred in enterprise environments for centralized secret management and audit trails.

Scenario Tips

If the question asks about:

When the question asks how to show real SSN values to compliance team members and masked values to everyone else, without creating separate tables...

Answer:

Create a column mask function using is_member('compliance_team') and apply it to the SSN column with ALTER TABLE ... ALTER COLUMN ... SET MASK.

Distractor to avoid:

A dynamic view also works but creates a separate object and requires managing view permissions separately — column masking is simpler and applies directly to the table.

If the question asks about:

When the question asks about permanently deleting a customer's PII for GDPR compliance...

Answer:

DELETE FROM the table, then run VACUUM with delta.retentionDurationCheck.enabled = false (or RETAIN 0 HOURS to force immediate removal). Verify by querying old versions via time travel — data should be inaccessible.

Distractor to avoid:

DELETE alone is not sufficient for GDPR — the data is still accessible via time travel until VACUUM removes the old files.

If the question asks about:

When the question asks how to programmatically find all tables with column masking applied across an entire catalog...

Answer:

Query system.information_schema.column_masks (or system.information_schema.columns filtering on masking policy). This is faster and more scalable than running DESCRIBE on each table.

Distractor to avoid:

Checking audit logs shows when masks were created but requires parsing events — information_schema provides direct structured metadata access.

Last-Minute Facts

1Column mask type mismatch fails at QUERY TIME, not creation time
2system.access.audit = Unity Catalog audit log table
3system.information_schema.column_masks = programmatic mask metadata
4Job owners = individual users only, not groups. Use service principals for team-owned jobs
5DLT materialized view permissions = owner-only by default
6Lakehouse Federation = READ-ONLY access to external sources
7VACUUM removes physical files; DELETE alone only removes from current version
8Secrets print as '[REDACTED]' but the variable holds the actual value
Domain 416% of exam

Data Modeling and Design

Must-Know Facts

  • SCD Type 1: overwrite current values with MERGE. No history is preserved. WHEN MATCHED THEN UPDATE SET *. Simple.
  • SCD Type 2: preserve full history. Requires TWO operations in one MERGE: close the current record (set end_date, is_current=false) AND insert a new version (start_date, is_current=true). Both operations can be expressed in a single MERGE statement.
  • Standalone materialized views exist in Databricks: CREATE MATERIALIZED VIEW works in a Pro or Serverless SQL warehouse — no DLT pipeline required. Databricks auto-creates a serverless pipeline for refreshes. Does NOT work with Classic SQL warehouses. The exam may still have questions written when this was DLT-only; recognize the updated behavior.
  • Generated columns are computed at WRITE time, not read time. Changing the generation expression does NOT retroactively update existing rows. Manual backfill is required.
  • CHECK constraints are enforced on WRITE only — existing rows that violate a newly added constraint are NOT flagged or rejected. Exam judgment: if a question asks 'what happens to existing data when you add a CHECK constraint?', the answer is always 'nothing — existing rows are unaffected'. If a question asks 'what happens when you INSERT a row violating the constraint?', the answer is 'the write fails with an AnalysisException'.
  • Liquid Clustering replaces both static partitioning and Z-ORDER. Keys can be changed without rewriting the entire table. Applied incrementally during OPTIMIZE. Exam judgment: any question describing a table with 'changing query patterns' or 'evolving access needs' → Liquid Clustering. Z-ORDER requires re-running OPTIMIZE to re-cluster after every key change and cannot be changed incrementally.
  • Star schema: fact tables (large, append-heavy transactions/events) surrounded by dimension tables (smaller, slowly changing attributes). Analytical queries join fact to dimensions.
  • Gold layer tables should be pre-aggregated and denormalized for fast consumption — not raw joins on Silver. Views on Silver are too slow for large-scale analytics.
  • Foreign key constraints in Delta Lake are informational only — they are NOT enforced. Referential integrity must be managed by application logic.

Common Traps

TrapThinking materialized views require a DLT pipeline to be useful in production
RealityStandalone materialized views CAN be created outside of DLT — using a Pro or Serverless SQL warehouse with CREATE MATERIALIZED VIEW. Behind the scenes, Databricks automatically creates a serverless pipeline to handle refreshes. The trap is assuming you need to build a full DLT pipeline just to get a precomputed materialized view. You do not — but you still need a Pro or Serverless warehouse, not a Classic SQL warehouse.
TrapThinking generated columns update automatically when the source column changes
RealityGenerated columns are computed at write time. If you change the generation expression, only new writes use the new expression. Existing rows retain their original computed values. Backfilling requires rewriting those rows.
TrapAssuming SCD Type 2 requires two separate MERGE statements
RealityA single MERGE statement can handle both the close-current-record (WHEN MATCHED) and insert-new-version (WHEN NOT MATCHED) operations simultaneously. The exam tests whether you know this is possible in one MERGE.
TrapAdding a CHECK constraint expecting it to validate existing data
RealityCHECK constraints in Delta Lake are enforced on new writes only. Adding a constraint to a table with existing violating data does NOT raise an error for those existing rows. Only future inserts or updates are checked.
TrapTreating foreign keys in Delta Lake as enforced referential integrity
RealityForeign key constraints in Delta Lake are metadata annotations only — they are informational for documentation and tooling but NOT enforced by the runtime. Invalid foreign key values will be inserted without error.

Confusing Pairs

SCD Type 1SCD Type 2

Type 1 = overwrite. One row per entity, always shows current state. Simple MERGE with UPDATE. No history. Use for attributes where history doesn't matter (e.g., preference settings). Type 2 = preserve history. Multiple rows per entity with effective dates (start_date, end_date, is_current). Use when historical accuracy matters (e.g., customer address at time of order).

ViewMaterialized View

View = SQL definition only, recomputes the full query on every access. Always fresh but slow for complex aggregations over large data. Materialized View = precomputed and stored as a Delta table. Fast reads but may be stale until the pipeline refreshes. Can be created EITHER inside a DLT pipeline OR as a standalone object via CREATE MATERIALIZED VIEW using a Pro/Serverless SQL warehouse (Databricks auto-creates a serverless pipeline behind the scenes). The key question is freshness vs. performance — views are always fresh, materialized views are fast but potentially stale.

Liquid ClusteringZ-ORDER

Liquid Clustering = modern, incremental, replaces both static partitioning and Z-ORDER. Keys can be changed without full rewrite. Applied automatically during OPTIMIZE. Z-ORDER = legacy optimization, applied per OPTIMIZE run, cannot be changed incrementally, less efficient for changing access patterns. Prefer Liquid Clustering for new tables.

Fact TableDimension Table

Fact Table = records events/transactions at fine granularity (one row per sale, click, order). Large volume, append-heavy, references dimension keys. Dimension Table = descriptive attributes of entities (customer, product, date). Smaller, updated via SCD patterns, joined to fact tables via foreign keys.

Scenario Tips

If the question asks about:

When the question asks how to track customer address changes so historical orders show the address at time of order...

Answer:

SCD Type 2. Each address change creates a new row with start_date/end_date and is_current flag. Historical orders join to the dimension row active at the order's timestamp.

Distractor to avoid:

SCD Type 1 only keeps the current address — historical orders would show the wrong (current) address. A separate address_history table works but is not the canonical pattern.

If the question asks about:

When the question asks which object to use for a Gold-layer pre-aggregated summary table that BI tools read frequently...

Answer:

A pre-computed Delta table refreshed on a schedule. Not a view (too slow on large Silver tables) and not a DLT materialized view unless the pipeline is already in DLT.

Distractor to avoid:

Views are always 'fresh' but that freshness comes at the cost of recomputing aggregations over billions of rows on every dashboard query.

If the question asks about:

When the question asks which clustering approach to use for a large Delta table with changing query patterns...

Answer:

Liquid Clustering. It is applied incrementally, keys can be changed without rewriting all data, and it adapts to evolving access patterns without maintenance overhead.

Distractor to avoid:

Z-ORDER requires re-running OPTIMIZE to re-cluster after each change, and changing the Z-ORDER columns requires rewriting more data. It is the legacy approach.

Last-Minute Facts

1SCD Type 2 MERGE: WHEN MATCHED (close current) + WHEN NOT MATCHED (insert new) — both in one MERGE
2Materialized views: can be standalone (Pro/Serverless warehouse) OR inside DLT. Both use a serverless pipeline for refreshes. NOT Classic warehouse.
3Generated columns: computed at write time, NOT read time, no retroactive update
4CHECK constraints: new writes only, NOT enforced on existing rows
5Foreign key constraints in Delta Lake = informational only, NOT enforced
6Liquid Clustering = incremental, replaces partitioning + Z-ORDER, keys changeable
7Fact tables = events/transactions, large, append-heavy. Dimension tables = attributes, smaller, SCD patterns
Domain 516% of exam

Monitoring, Logging, and Optimization

Must-Know Facts

  • Data skew signature in Spark UI: one or a few tasks process 10–100x more data than others. Fix: repartition, salting on the skewed key, or enable AQE skew handling
  • Shuffle spill to disk means executor memory is insufficient. Fix: increase spark.executor.memory or reduce the number of partitions (to make each partition smaller). Adding more executors does NOT help spill.
  • AQE (Adaptive Query Execution) is ENABLED by default in Databricks Runtime. It can change the query plan at runtime — the EXPLAIN output may not match actual execution
  • AQE three key optimizations: (1) coalesce small shuffle partitions, (2) convert sort-merge join to broadcast hash join if one side is small at runtime, (3) detect and split skewed partitions
  • EXPLAIN shows the static query plan. After AQE optimizations, the actual plan in Spark UI may differ. This is expected and correct behavior
  • VACUUM removes data files older than the retention period (default 7 days). VACUUM does NOT remove transaction log entries — that is controlled by delta.logRetentionDuration
  • Auto Compaction with streaming MERGE workloads auto-tunes target file size DOWN to 32–48 MB instead of the default 1 GB. This is expected and correct — not a problem requiring intervention
  • Databricks SQL alerts: schedule a query, set a threshold condition, configure notifications (email, Slack, webhook). Used for SLA monitoring and data quality alerting
  • Spark UI task metrics: look at 'Shuffle Read Size / Records' and 'Spill (Memory)' columns for performance diagnosis. High spill = memory pressure. Large shuffle = wide transformation overhead
  • Job cluster auto-termination saves cost. All-purpose cluster idle time is a significant cost source in production — always auto-terminate all-purpose clusters after a period of inactivity

Common Traps

TrapAdding more executors to fix shuffle spill to disk
RealityShuffle spill to disk means each executor lacks memory to hold its partition data. More executors don't help because each executor still processes the same partition size. Fix is to increase spark.executor.memory OR increase the number of partitions (smaller data per executor).
TrapExpecting EXPLAIN output to exactly predict the actual execution plan
RealityEXPLAIN generates a STATIC plan before AQE runtime statistics are collected. At runtime, AQE may convert a sort-merge join to a broadcast hash join, coalesce partitions, or split skewed partitions. The Spark UI shows the actual executed plan, which may differ from EXPLAIN.
TrapTreating 32–48 MB streaming MERGE files as a misconfiguration to fix
RealityAuto Compaction dynamically adjusts target file size based on write volume per micro-batch. Streaming MERGE with small micro-batches legitimately produces smaller files. This is expected, correct behavior. Forcing 1 GB files would cause overhead.
TrapConfusing VACUUM retention with Delta log retention
RealityVACUUM retention period (default 7 days) controls how long DATA FILES are kept before deletion, enabling time travel. delta.logRetentionDuration controls how long the TRANSACTION LOG is kept (default 30 days). These are separate settings with separate defaults.

Confusing Pairs

Data SkewShuffle Spill

Data Skew = uneven data distribution across partitions. One task processes vastly more data. Symptom: one very slow task in the Spark UI. Fix: repartition, salting, AQE skew handling. Shuffle Spill = executor RAM insufficient to hold partition during shuffle. Data spills to disk. Symptom: 'Spill (Memory)' in task metrics. Fix: more memory per executor or smaller partition size.

VACUUM retentiondelta.logRetentionDuration

VACUUM retention = controls how long physical DATA FILES are kept. Default 7 days. Affects time travel window. After VACUUM, you cannot time travel before the retention cutoff. logRetentionDuration = controls how long the TRANSACTION LOG (JSON/checkpoint files) is kept. Default 30 days. Affects the ability to replay transaction history.

OPTIMIZEVACUUM

OPTIMIZE = compacts many small Delta table files into fewer larger files. Improves READ performance. Does not remove data. VACUUM = removes DATA FILES that are no longer referenced by the current or any version within the retention period. Improves STORAGE efficiency and frees disk space. Affects time travel.

AQE (Adaptive Query Execution)Static Query Plan (EXPLAIN)

EXPLAIN = shows the optimizer's STATIC plan generated before execution. Does not reflect runtime statistics or AQE adjustments. AQE = optimizes the plan at RUNTIME using actual shuffle statistics — can coalesce partitions, convert sort-merge join to broadcast join, and split skewed partitions. When EXPLAIN and Spark UI disagree, AQE is the reason. The Spark UI shows the ACTUAL executed plan — always trust Spark UI over EXPLAIN for post-execution diagnosis.

Scenario Tips

If the question asks about:

When Spark UI shows one task taking 45 minutes while 199 others take 30 seconds, and that task processes 50 GB vs 200 MB for others...

Answer:

Data skew. Repartition the DataFrame by a different key, apply salting to distribute the skewed key, or enable AQE (already on by default) which can automatically split skewed partitions.

Distractor to avoid:

Increasing executor count doesn't help with skew — the skewed partition is still one partition that goes to one executor. The bottleneck is partition size, not cluster size.

If the question asks about:

When the question asks why streaming MERGE files are 35 MB instead of 1 GB and whether to fix it...

Answer:

Auto Compaction auto-tunes file size to match write patterns. For streaming MERGE with small micro-batches, 32–48 MB files are expected and correct. No action needed.

Distractor to avoid:

Manually running OPTIMIZE to force 1 GB files is unnecessary overhead. The examinees who fail this question try to 'fix' something that isn't broken.

If the question asks about:

When EXPLAIN shows a SortMergeJoin but the Spark UI shows a BroadcastHashJoin was actually executed...

Answer:

AQE converted the join at runtime. At runtime, after collecting actual data size statistics, AQE determined one side was small enough to broadcast. This is normal AQE behavior — not a bug.

Distractor to avoid:

A broadcast hint being added between EXPLAIN and execution is a possible but unlikely scenario. AQE runtime optimization is the canonical explanation for plan discrepancies.

Last-Minute Facts

1AQE is enabled by default in Databricks Runtime
2AQE changes: (1) coalesce partitions, (2) SortMerge -> BroadcastHash, (3) split skewed partitions
3EXPLAIN = static plan; actual execution (Spark UI) may differ due to AQE
4Spill to disk fix: more executor memory OR more partitions (smaller per executor). NOT more executors
5VACUUM default retention: 7 days. delta.logRetentionDuration default: 30 days
6Auto Compaction streaming MERGE files: 32–48 MB is expected, not a bug
7Data skew: one slow task processing much more data than others. Fix: salting or repartition
8Databricks SQL alerts: query + threshold + notification channel

Feeling confident?

Put your knowledge to the test with a timed DB DE Professional mock exam.