CertPrepNow
DatabricksDB DE Associate7 domains

DB DE Associate Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Data Engineer Associate exam.

General Exam Tips

  • 1.Read ALL answer options before choosing — two options often look correct; the distinction lives in a single word
  • 2.Flag and skip questions you're unsure about. With 45 questions in 90 minutes you have 2 minutes each — don't burn 5 minutes on one
  • 3.Watch for negative phrasing: 'Which is NOT correct?' or 'Which would FAIL?' — easy to miss under pressure
  • 4.The exam is SQL-first. When a question can be solved with SQL or PySpark, the intended answer is almost always SQL
  • 5.Every wrong answer option is a potential separate question — study why each distractor is wrong, not just why the right answer is right
  • 6.When two answers seem valid, ask yourself: which one has a constraint, edge case, or failure mode the other doesn't?
  • 7.Keywords to watch for: 'MOST efficient', 'BEST practice', 'minimum cost' — these flip the answer from correct to most-correct
  • 8.No penalty for wrong answers — always guess rather than leaving blank
  • 9.Aim for 80%+ on practice exams before scheduling. The real exam is slightly harder than most free practice tests
Domain 16% of exam

Databricks Intelligence Platform

Must-Know Facts

  • Lakehouse = data lake flexibility (open formats, cheap storage) + data warehouse reliability (ACID, schema enforcement) — Delta Lake is the storage layer enabling this
  • Unity Catalog three-level namespace is catalog.schema.table — the top level is 'catalog', NOT 'database' or 'workspace'
  • Control plane is Databricks-managed (web app, job scheduler, cluster manager). Data plane is customer cloud (clusters, actual data) — data never leaves the customer's cloud account
  • Compute types: All-Purpose Clusters (interactive dev/notebooks), Job Clusters (automated workloads, auto-terminate), SQL Warehouses (SQL analytics/BI tools), Serverless Compute (instant startup, no config)
  • Serverless compute is available for SQL Warehouses, Jobs, and DLT pipelines — not just SQL
  • Databricks Connect allows running Spark code from local IDEs (VS Code, IntelliJ) against remote Databricks compute — distinct from notebooks

Common Traps

TrapSelecting 'database' as the top-level Unity Catalog namespace
RealityThe three levels are catalog, schema, table — in that order. There is no 'database' level; the legacy hive_metastore had databases but Unity Catalog uses 'catalog' at the top
TrapThinking Databricks is a database or stores data itself
RealityDatabricks is a platform. Data lives in cloud object storage (S3, ADLS, GCS). Delta Lake is the format/protocol on top. The platform processes but doesn't own the storage
TrapConfusing the control plane and data plane when asked about data security
RealityCustomer data always stays in the customer's cloud account (data plane). Only metadata, job configs, and access logs live in the Databricks control plane. This is the key security guarantee

Confusing Pairs

All-Purpose ClustersJob Clusters

All-Purpose = interactive/shared, persist until manually stopped, billed while idle. Job Clusters = ephemeral, created for a run and auto-terminated, more cost-effective for production. Exam question: 'reduce compute cost for scheduled jobs' → Job Cluster

SQL WarehousesAll-Purpose Clusters

SQL Warehouses = optimized for SQL analytics/BI tools, auto-scale, serverless option available. All-Purpose Clusters = run Python/Scala/SQL, used in notebooks. When the scenario mentions dashboards or BI tools, the answer is SQL Warehouse

Scenario Tips

If the question asks about:

Question asks which architecture component enables ACID transactions and time travel on files in S3

Answer:

Delta Lake — it's the storage format/protocol layer that adds transaction log, ACID, and time travel capabilities to raw files in object storage

Distractor to avoid:

Unity Catalog handles governance/access, not ACID storage. Databricks Runtime is the execution engine, not the storage layer

If the question asks about:

Question asks about 'centralized governance' or 'unified access control across catalogs'

Answer:

Unity Catalog — it's the single governance layer across all workspaces in an account

Distractor to avoid:

Workspace-level access control is the old model; Unity Catalog is the centralized modern answer

Last-Minute Facts

1Three-level namespace: catalog > schema > table (never database.schema.table)
2Databricks Runtime versions include Databricks Runtime (DR), Databricks Runtime ML (includes ML libs), and Databricks Runtime for Genomics
3Unity Catalog was introduced in 2022; pre-Unity Catalog used workspace-level Hive metastore
Domain 221% of exam

Data Ingestion and Loading

Must-Know Facts

  • COPY INTO is idempotent: it tracks files already loaded by path and skips them on subsequent runs. But it scans the full directory every run — not scalable for millions of files
  • Auto Loader uses a checkpoint (not directory listing) to track processed files — much more scalable for large directories and continuous ingestion
  • Auto Loader requires a checkpoint location; forgetting to set one in production is the #1 streaming mistake
  • rescuedDataColumn captures records that don't match the schema instead of failing the stream — stored in a _rescued_data column
  • cloudFiles.schemaEvolutionMode = 'addNewColumns' lets Auto Loader automatically add new fields to the schema when new columns appear in source data
  • trigger(availableNow=True) processes all available data then stops — combines streaming checkpoint benefits (exactly-once, file tracking) with batch semantics. Use when you want incremental batch runs
  • Lakeflow Connect provides managed CDC connectors for SaaS sources (Salesforce, JDBC databases) — replaces manual JDBC patterns for supported sources
  • Lakehouse Federation: query external systems (PostgreSQL, Snowflake, BigQuery) in-place via Unity Catalog foreign tables without copying data
  • FORMAT_OPTIONS values must use single quotes: FORMAT_OPTIONS ('header' = 'true') — double quotes cause errors

Common Traps

TrapThinking COPY INTO and Auto Loader are both streaming
RealityCOPY INTO is a batch SQL command — it runs once and exits. Auto Loader is streaming (uses readStream). COPY INTO with trigger(availableNow=True) doesn't exist — that's for streaming queries only
TrapThinking Auto Loader is idempotent because it uses a checkpoint
RealityAuto Loader IS idempotent — but the mechanism is the checkpoint, not file path tracking. If you delete the checkpoint, Auto Loader reprocesses files from the beginning. COPY INTO tracks by file path in the table metadata
TrapUsing double quotes in FORMAT_OPTIONS
RealityFORMAT_OPTIONS requires single-quoted values: FORMAT_OPTIONS ('header' = 'true'). Double quotes cause a syntax error. This is a common SQL syntax trap on the exam
TrapAssuming cloudFiles.schemaLocation and checkpoint location are the same
RealityThey are separate options. schemaLocation stores the inferred schema. The checkpoint location (set via writeStream.option('checkpointLocation', ...)) tracks processed files. Both are required in production
TrapThinking Lakeflow Connect is just another name for COPY INTO
RealityLakeflow Connect provides managed CDC (change data capture) — it continuously replicates changes from source systems, not just one-time batch copies

Confusing Pairs

Auto LoaderCOPY INTO

Auto Loader = streaming, checkpoint-based file tracking, scalable to millions of files, ideal for continuous ingestion. COPY INTO = batch SQL command, directory-scan-based tracking, simpler but slower at scale. Rule: if the question says 'millions of files' or 'continuous', answer is Auto Loader

trigger(availableNow=True)trigger(once=True)

availableNow=True processes all pending data in multiple micro-batches then stops — more efficient. once=True processes all pending data in a single micro-batch then stops — being deprecated. Exam prefers availableNow=True as the modern answer

cloudFiles.schemaEvolutionMode = 'addNewColumns'cloudFiles.schemaEvolutionMode = 'rescue'

addNewColumns = new fields in source get added to the table schema automatically. rescue = new/mismatched fields go into rescuedDataColumn but schema doesn't change. Use addNewColumns when you want the schema to grow; use rescue when you want a stable schema with overflow capture

Lakeflow ConnectCOPY INTO

Lakeflow Connect = managed CDC, continuous replication from SaaS/databases, handles inserts+updates+deletes. COPY INTO = batch file ingestion from cloud storage only, append or overwrite. Use Lakeflow Connect for source systems; COPY INTO for files already in cloud storage

Scenario Tips

If the question asks about:

A directory in S3 receives 10,000 new JSON files per day and currently has 2 million total files. Which ingestion method should be used?

Answer:

Auto Loader — uses file notification to detect new files without listing the full directory. COPY INTO scans all 2M files each run to find unprocessed ones

Distractor to avoid:

COPY INTO is wrong here despite being 'simpler' — directory listing at this scale would be extremely slow

If the question asks about:

An Auto Loader stream encounters records with fields not in the original schema. The stream must not fail and new fields should be added to the table

Answer:

Set cloudFiles.schemaEvolutionMode = 'addNewColumns'. New columns automatically get added to the Delta table schema

Distractor to avoid:

Setting rescuedDataColumn captures the records but does NOT add new columns to the schema — schema stays frozen

If the question asks about:

A team wants to run an incremental data load every night using streaming semantics (checkpoint-based tracking) but stop the stream when done

Answer:

Use trigger(availableNow=True) — processes all pending files, checkpoints progress, then terminates. Gets incremental + exactly-once without a continuously running stream

Distractor to avoid:

Continuous streaming would keep the cluster alive all night. COPY INTO would work but lacks checkpoint-based exactly-once semantics

If the question asks about:

A data engineer needs to ingest data from a Salesforce CRM system on an ongoing basis with change data capture

Answer:

Lakeflow Connect — provides managed CDC connectors for SaaS sources including Salesforce

Distractor to avoid:

COPY INTO only works for files in cloud storage, not SaaS APIs. JDBC reads are point-in-time snapshots, not CDC

Last-Minute Facts

1COPY INTO default behavior: idempotent (skips loaded files), appends by default
2COPY INTO COPY_OPTIONS ('mergeSchema' = 'true') enables schema evolution for batch loads
3Auto Loader default schemaEvolutionMode is 'addNewColumns' when a schemaLocation is provided
4Supported COPY INTO formats: CSV, JSON, PARQUET, AVRO, ORC, TEXT, BINARYFILE
5Auto Loader file notification mode (cloudFiles.useNotifications = 'true') requires cloud-side setup (SNS+SQS on AWS) — file listing mode works without cloud setup
6Lakehouse Federation creates foreign tables — read-only, no data is copied into the lakehouse
Domain 321% of exam

Data Transformation and Modeling

Must-Know Facts

  • MERGE INTO is the only SQL command for upserts in Delta Lake. Know all three clauses: WHEN MATCHED, WHEN NOT MATCHED (source has no match in target), WHEN NOT MATCHED BY SOURCE (target has no match in source — use to delete orphaned rows)
  • Window functions cannot be used directly in WHERE — wrap in a subquery or CTE. SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1
  • Schema evolution with MERGE requires spark.databricks.delta.schema.autoMerge.enabled = true — without it, MERGE fails if source has new columns
  • CHECK constraints and NOT NULL are enforced on write. PRIMARY KEY and UNIQUE are informational only — they are NOT enforced in Delta Lake
  • MERGE fails if multiple source rows match the same target row — deduplicate the source before running MERGE
  • DataFrames are immutable — transformations return a new DataFrame; they do not modify the original. This is a source of bugs that the exam tests
  • Lazy evaluation: all DataFrame transformations (filter, select, withColumn) do nothing until an action (show, count, write) is called
  • Default join type in both Spark SQL and PySpark is INNER JOIN — if no join type is specified, inner is assumed
  • QUALIFY clause (available in Databricks SQL) filters on window function results directly — avoids the subquery wrapping pattern

Common Traps

TrapUsing SELECT DISTINCT to deduplicate when you need the most recent record per key
RealityDISTINCT only removes fully identical rows. To keep the latest record per key, use ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC) in a subquery, then filter WHERE rn = 1
TrapAssuming PRIMARY KEY and UNIQUE constraints are enforced in Delta Lake
RealityThey are declared as informational metadata only. Only CHECK and NOT NULL constraints are actually enforced on write. Duplicate rows can exist even with a PRIMARY KEY declared
TrapCalling a DataFrame transformation and expecting it to execute immediately
RealityTransformations (filter, select, join, withColumn) are lazy — they build an execution plan but don't run. Execution happens when you call an action: show(), count(), collect(), write()
TrapForgetting to set autoMerge when MERGE INTO encounters new columns in the source
RealityWithout spark.databricks.delta.schema.autoMerge.enabled = true, MERGE INTO throws an error when source columns don't exist in the target. Set this property when schema evolution is expected
TrapUsing MAX(timestamp) GROUP BY key thinking it returns the full latest row
RealityGROUP BY with MAX only returns the grouping key and the max value — not the full row. To get the full latest row, use ROW_NUMBER() window function in a subquery

Confusing Pairs

WHEN NOT MATCHEDWHEN NOT MATCHED BY SOURCE

WHEN NOT MATCHED = source row has no match in target → INSERT. WHEN NOT MATCHED BY SOURCE = target row has no match in source → typically DELETE (handle deletes from source). Both can appear in the same MERGE statement

ViewsMaterialized Views

Views = saved query, recomputed fresh every access, always current, no storage cost. Materialized Views = precomputed and stored, must be refreshed, fast to read. 'Always current data' → View. 'Fast dashboard reads' → Materialized View

SEMI JOINANTI JOIN

SEMI JOIN = return rows from left table that HAVE a match in right (like INNER but returns only left columns). ANTI JOIN = return rows from left table that have NO match in right. Use ANTI JOIN to find orphaned records or items not in another table

dropDuplicates()distinct()

dropDuplicates(['col']) = dedup based on specific columns (most common need). distinct() = dedup on ALL columns (only removes fully identical rows). dropDuplicates without args is equivalent to distinct()

Scenario Tips

If the question asks about:

A Silver table has duplicate customer records (same customer_id, different timestamps). The question asks to keep only the most recent record

Answer:

Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn in a subquery, then WHERE rn = 1

Distractor to avoid:

SELECT DISTINCT is wrong — it only removes fully identical rows. MAX(updated_at) GROUP BY customer_id is wrong — returns only two columns, not the full record

If the question asks about:

A MERGE INTO statement fails with an ambiguous merge error when run against a staging table

Answer:

The source (staging) table has multiple rows matching the same target row. Deduplicate the source first using ROW_NUMBER() or dropDuplicates() before the MERGE

Distractor to avoid:

This is not a schema mismatch error. Adding WHEN NOT MATCHED BY SOURCE won't fix it — the issue is multiple matches to one target row

If the question asks about:

A data engineer wants to enforce that an 'amount' column always contains values greater than zero

Answer:

ALTER TABLE t ADD CONSTRAINT positive_amount CHECK (amount > 0) — CHECK constraint is enforced on write

Distractor to avoid:

PRIMARY KEY and UNIQUE constraints are NOT enforced in Delta Lake. NOT NULL only blocks null values, not negative numbers

Last-Minute Facts

1MERGE INTO requires the match condition to be on a unique key — multiple source rows matching one target row causes a runtime error
2spark.databricks.delta.schema.autoMerge.enabled = true enables schema evolution for MERGE INTO
3Window functions: ROW_NUMBER() assigns unique sequential numbers, RANK() assigns same rank to ties (leaves gaps), DENSE_RANK() assigns same rank but no gaps
4Broadcast join threshold default: 10MB — auto-broadcast applies to the small side of a join. Explicitly hint with /*+ BROADCAST(t) */ when AQE doesn't pick it up automatically
5spark.sql.shuffle.partitions default 200 is almost always wrong — the exam may test knowing to tune this, not just state it
6CHECK constraints are enforced on write only — existing violating data is NOT retroactively flagged
Domain 412% of exam

Working with Lakeflow Jobs

Must-Know Facts

  • Continuous trigger: job starts immediately after the previous run finishes — does NOT run concurrently with itself. NOT the same as 'always running'
  • Job clusters are created fresh for each run and auto-terminated afterward — use job clusters for production workloads to avoid paying for idle compute
  • If a task fails, all downstream dependent tasks are SKIPPED by default. They do not fail — they are skipped. Configure 'run if dependencies failed' to change this
  • Task values (dbutils.jobs.taskValues.set/get) are the recommended way to pass data between tasks — lightweight, no Delta table needed
  • Repair Run: re-executes only failed tasks and their downstream dependents, preserving successful task results
  • DLT expectations: @dlt.expect (warn only), @dlt.expect_or_drop (filter out bad rows), @dlt.expect_or_fail (abort entire pipeline). Know which to use for each scenario
  • DLT/Lakeflow Declarative Pipelines support both notebooks and workspace files (.py, .sql) as source libraries — the notebook-only requirement is a legacy constraint no longer applicable
  • A Lakeflow Job CAN include a DLT pipeline as one of its tasks — they are complementary, not competing

Common Traps

TrapThinking 'continuous' trigger means the job runs in parallel with previous runs
RealityContinuous trigger waits for the previous run to complete before starting the next one. It is not concurrent — it just removes the delay between runs
TrapUsing an all-purpose cluster for a production job to avoid startup delay
RealityAll-purpose clusters are billed while idle. Job clusters are more cost-effective for production because they terminate after the run. Startup latency can be reduced with cluster pools, not by using all-purpose clusters
TrapThinking failed upstream tasks cause downstream tasks to fail
RealityBy default, downstream tasks are SKIPPED (not failed) when an upstream dependency fails. The job shows 'Skipped' status for those tasks
TrapUsing @dlt.expect when you actually need @dlt.expect_or_drop
Reality@dlt.expect records bad rows as a metric but KEEPS them in the output. If the requirement is to remove invalid records from the table, you need @dlt.expect_or_drop
TrapAssuming DLT pipelines only accept notebooks as source libraries
RealityDLT/Lakeflow Declarative Pipelines support workspace files (.py and .sql files) as source libraries in addition to notebooks. Older exam questions may reflect the legacy notebook-only behavior

Confusing Pairs

@dlt.expect@dlt.expect_or_drop@dlt.expect_or_fail

expect = log violation metric, keep the row. expect_or_drop = log violation, DROP the row from output. expect_or_fail = if any violation, FAIL the entire pipeline. Use expect for monitoring, expect_or_drop for data quality enforcement, expect_or_fail for critical constraints

Lakeflow JobsDLT (Delta Live Tables / Lakeflow Declarative Pipelines)

Lakeflow Jobs = imperative orchestration, arbitrary tasks (notebooks, scripts, SQL, JARs), control flow logic. DLT = declarative ETL where you define the output, system handles orchestration and data quality. A Job can RUN a DLT pipeline as one of its tasks

Cron triggerContinuous triggerFile arrival trigger

Cron = runs on a fixed schedule (time-based). Continuous = runs immediately after previous run completes (gap-minimizing). File arrival = triggers when new files land in a path (event-driven). A job can have multiple triggers simultaneously

Job clusterAll-purpose clusterCluster pool

Job cluster = ephemeral, created per run, cheapest for prod. All-purpose = shared/persistent, best for interactive dev. Cluster pool = pre-warmed instances, reduces job cluster startup time while keeping cost efficiency

Scenario Tips

If the question asks about:

A pipeline needs to run as frequently as possible but only one run at a time. Which trigger?

Answer:

Continuous trigger — starts the next run immediately when the previous one finishes, but never runs two instances concurrently

Distractor to avoid:

'Scheduled every 1 minute' could cause overlapping runs if a run takes longer than 1 minute; continuous avoids this

If the question asks about:

A job has 5 tasks and fails at task 3. Task 4 and 5 are downstream of task 3. The engineer re-runs the job after fixing the bug

Answer:

Use Repair Run to re-execute only task 3 and its dependents (4 and 5), preserving the successful output of tasks 1 and 2

Distractor to avoid:

Running the full job re-executes all 5 tasks, wasting compute and potentially causing duplicate processing of tasks 1 and 2

If the question asks about:

An Extract task computes a row count that a downstream Validate task needs to check against a threshold

Answer:

Use dbutils.jobs.taskValues.set(key='row_count', value=n) in Extract, and dbutils.jobs.taskValues.get(taskKey='Extract', key='row_count') in Validate

Distractor to avoid:

Writing to a Delta table works but is overkill for a simple integer. Notebook widgets don't reliably pass values between separate task executions

If the question asks about:

A DLT pipeline must remove records where amount is negative but continue processing other records

Answer:

@dlt.expect_or_drop('positive_amount', 'amount > 0') — drops violating rows and continues

Distractor to avoid:

@dlt.expect_or_fail aborts the entire pipeline on any violation — wrong if the goal is to continue. @dlt.expect keeps the bad rows in the output

Last-Minute Facts

1DLT pipeline development mode: starts cluster interactively, updates tables. DLT production mode: auto-restarts on failure, optimized for reliability
2DLT STREAMING LIVE TABLE processes incrementally (new records only). Regular LIVE TABLE is a full recompute each pipeline run
3Task value size limit: 48KB per value — not for large datasets, only lightweight signals
4Repair Run preserves successful task results and re-runs from the first failed task forward
5A job can combine multiple trigger types: a daily cron AND a file arrival trigger on the same job
Domain 512% of exam

Implementing CI/CD

Must-Know Facts

  • Git Folders (formerly Repos): version control for notebooks and code files inside Databricks workspace. Supports clone, branch, commit, push, pull from the UI
  • Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs): infrastructure-as-code for Databricks resources — defines jobs, pipelines, clusters in databricks.yml
  • Bundle config file is always databricks.yml — NOT JSON, not Python, not TOML
  • Deploy command: databricks bundle deploy -t <target_name>. Validate before deploy: databricks bundle validate
  • Environment isolation best practice: separate Unity Catalog catalogs per environment (dev_catalog, staging_catalog, prod_catalog)
  • Git Folders is for code versioning; Automation Bundles is for resource deployment — they are complementary, not alternatives

Common Traps

TrapUsing Git Folders to deploy infrastructure changes to production
RealityGit Folders manages code (notebooks, Python files) — it does not deploy or configure Databricks resources (jobs, clusters, pipelines). Use Automation Bundles for deployment automation
TrapThinking the bundle deploy command is 'databricks deploy bundle'
RealityThe correct CLI command order is: databricks bundle deploy (not deploy bundle). The exam tests this exact syntax
TrapUsing schemas instead of catalogs for environment isolation
RealityBest practice is separate catalogs per environment (dev_catalog, prod_catalog), not just separate schemas. Catalog-level separation provides stronger governance isolation
TrapAssuming bundle.yml is the correct filename
RealityThe configuration file must be named databricks.yml — not bundle.yml or databricks.yaml (though .yaml is technically also accepted)

Confusing Pairs

Git FoldersDeclarative Automation Bundles

Git Folders = version control for code and notebooks (what you write). Automation Bundles = infrastructure-as-code for Databricks resources (how it's deployed). In a CI/CD pipeline: code lives in Git Folders, deployment is managed by Bundles

databricks bundle deploydatabricks bundle run

bundle deploy = provisions/updates resources in the workspace (creates/updates jobs, pipelines, clusters). bundle run = actually executes a deployed job or pipeline. Deploy first, then run

Scenario Tips

If the question asks about:

A team wants to promote the same job configuration from dev to staging to prod with different cluster sizes

Answer:

Declarative Automation Bundles with multiple targets defined in databricks.yml — each target specifies environment-specific overrides (cluster size, workspace URL, permissions)

Distractor to avoid:

Git Folders handles code, not deployment config. Manually copying jobs via JSON export is error-prone and not repeatable

If the question asks about:

A data engineer asks what file format Automation Bundles use

Answer:

YAML (databricks.yml) — this is frequently tested as a single-answer question

Distractor to avoid:

JSON (common for other DevOps tools like Terraform providers), Python (wrong entirely)

If the question asks about:

A team needs to ensure notebook changes are reviewed before going to production

Answer:

Use Git Folders with a pull request workflow — develop on a feature branch, open PR for review, merge to main, then deploy via Automation Bundles

Distractor to avoid:

Automation Bundles alone don't enforce code review. They deploy whatever is in the bundle, regardless of review status

Last-Minute Facts

1databricks bundle init → creates a new bundle project from template
2databricks bundle validate → checks YAML syntax and references before deploying
3databricks bundle deploy -t dev/staging/production → deploys to a named target
4Bundle targets can inherit and override a base configuration — avoids repeating common settings
5Git Folders supports GitHub, GitLab, Bitbucket, Azure DevOps — not just GitHub
Domain 615% of exam

Troubleshooting, Monitoring, and Optimization

Must-Know Facts

  • One task in a stage takes 10x longer than others = data skew (not a slow cluster, not a network issue). Fix: salt the join key or repartition
  • Disk spill visible in Spark UI = executor memory is insufficient. Fix: increase spark.executor.memory or reduce partition size — NOT adding more nodes
  • Liquid Clustering uses CLUSTER BY at table creation — clustering keys can be changed after creation without rewriting the entire table (unlike partitioning)
  • Liquid Clustering still requires OPTIMIZE to actually co-locate the data — it's lazy by default and optimizes on writes incrementally
  • VACUUM with retention < 7 days requires setting delta.retentionDurationCheck.enabled = false — this is dangerous and can break concurrent queries
  • After VACUUM, time travel to versions older than the retention window is impossible because the underlying data files are gone
  • Predictive Optimization automatically runs OPTIMIZE and VACUUM — only available for managed tables in Unity Catalog, not external tables
  • Shuffle operations (join, groupBy, distinct) are wide transformations that require data redistribution — they are the primary source of performance bottlenecks

Common Traps

TrapAdding more executor nodes to fix a slow task caused by data skew
RealitySkew is about one partition having too much data — adding nodes doesn't help because that one oversized partition still runs on one task on one core. Fix: salt the key to distribute the data, or use AQE (adaptive query execution) skew handling
TrapConfusing OPTIMIZE with VACUUM
RealityOPTIMIZE compacts small files into larger ones for read performance. VACUUM deletes old data files no longer needed by the transaction log. They do different things — OPTIMIZE is about reads, VACUUM is about storage reclamation
TrapThinking Liquid Clustering eliminates the need for OPTIMIZE
RealityLiquid Clustering changes HOW data is organized (clustering columns vs partition columns), but OPTIMIZE is still required to trigger the actual compaction and co-location. Without OPTIMIZE, clustering benefits don't fully materialize
TrapRunning VACUUM and then asking why time travel beyond 7 days no longer works
RealityVACUUM deletes files older than the retention period (default 7 days = 168 hours). Time travel to 10 days ago requires those deleted files. VACUUM didn't corrupt anything — the files are simply gone

Confusing Pairs

OPTIMIZEVACUUM

OPTIMIZE = compact small files → improves read speed. VACUUM = delete old data files beyond retention period → reduces storage cost. Run OPTIMIZE frequently, VACUUM less frequently. Never run VACUUM with < 7 days retention without understanding the time travel impact

Liquid ClusteringZ-ORDER (ZORDER BY)

ZORDER = manual, run with OPTIMIZE ZORDER BY, multi-dimensional co-location, static. Liquid Clustering = declarative CLUSTER BY, incremental optimization, adaptive to query changes, can change keys without table rewrite. Exam prefers Liquid Clustering as the modern answer

Data skewDisk spill

Skew = one partition has much more data, causing one task to be a straggler. Spill = executor memory is exceeded, causing writes to disk. Skew fix = repartition/salting. Spill fix = increase memory or reduce partition size. Both show up in Spark UI but look different

Scenario Tips

If the question asks about:

A Spark job has 200 tasks. Task 187 takes 45 minutes; all others complete in 2 minutes

Answer:

Data skew — partition 187 has disproportionately more data. Fix with salting the join key, repartition(), or enable AQE with spark.sql.adaptive.skewJoin.enabled=true

Distractor to avoid:

This is NOT a cluster capacity issue (all other tasks are fast). Not a code bug (other tasks succeed). Not a network issue (would affect all tasks)

If the question asks about:

A Delta table with 5 years of data partitioned by date receives queries filtered by customer_region (not date). Performance is slow

Answer:

Convert to Liquid Clustering with CLUSTER BY (customer_region) — optimizes data layout for region-based queries without rewriting the entire table

Distractor to avoid:

Adding customer_region as a second partition column requires full table rewrite. Z-ORDER works but is less adaptive. Creating a copy doubles storage

If the question asks about:

After running VACUUM with default settings, users report they cannot query the table as of 10 days ago

Answer:

VACUUM with default 7-day retention deleted the data files needed for 10-day time travel. The fix going forward is to increase the retention period (VACUUM RETAIN 720 HOURS) before running VACUUM

Distractor to avoid:

VACUUM does not disable time travel permanently. VACUUM does not corrupt the transaction log. The issue is simply that the required files were deleted

If the question asks about:

A Spark UI shows a stage where tasks are writing significant amounts to 'Spill (Disk)'

Answer:

Executor memory is insufficient — tasks are spilling shuffle data to disk. Fix: increase spark.executor.memory or reduce spark.sql.shuffle.partitions to make each partition smaller

Distractor to avoid:

Adding nodes (executors) won't help if each individual partition doesn't fit in memory. The fix is about memory per task, not total cluster memory

Last-Minute Facts

1Default VACUUM retention: 7 days (168 hours) — the exam tests that you know time travel beyond 7 days fails after VACUUM with defaults
2Default shuffle partitions: 200 — exam trick: a question about slow groupBy on a small dataset is often solved by reducing this, not adding nodes
3Default broadcast join threshold: 10MB — exam trick: a large table join question asking 'why did broadcast hint fail?' is because the table exceeded 10MB
4Predictive Optimization: managed tables in Unity Catalog only, not external tables
5Liquid Clustering: CLUSTER BY at CREATE TABLE or ALTER TABLE t CLUSTER BY (col). OPTIMIZE still needed to apply clustering
6AQE (Adaptive Query Execution): enabled by default in Databricks Runtime — automatically handles skew joins and coalesces small partitions
7OPTIMIZE without ZORDER compacts files but does not reorder data for multi-column filter performance
Domain 712% of exam

Governance and Security

Must-Know Facts

  • GRANT SELECT on a table is NOT enough by itself — users also need USAGE on the catalog and USAGE on the schema to access the table
  • Full GRANT chain: GRANT USE CATALOG ON CATALOG c TO user, then GRANT USE SCHEMA ON SCHEMA c.s TO user, then GRANT SELECT ON TABLE c.s.t TO user
  • Managed table DROP: deletes both metadata AND data files permanently. External table DROP: removes only metadata; data files in cloud storage are untouched
  • Column masking executes at query time — underlying data is never changed. The mask function must return the same data type as the column it masks
  • Row filters apply at query time based on the querying user's identity — rows the user is not authorized to see are simply not returned
  • ABAC (Attribute-Based Access Control) uses tags to control access dynamically — complements, does NOT replace, explicit GRANT statements
  • Data lineage in Unity Catalog is automatic and column-level — tracks which output columns derive from which source columns, across tables, notebooks, and jobs
  • Delta Sharing is read-only for recipients — they cannot modify shared data. Recipients do not need Databricks to access shared data (open protocol)

Common Traps

TrapGranting SELECT on a table and assuming the user can now query it
RealityAccess is inherited but permissions are NOT. A user needs USE CATALOG + USE SCHEMA + SELECT to actually read a table. Missing any one of these three grants will deny access
TrapThinking GRANT OPTION is the same as giving all privileges
RealityGRANT privilege WITH GRANT OPTION allows the grantee to grant that specific privilege to others — it does not grant all privileges. The grantee can only pass on privileges they themselves have been granted
TrapExpecting dropped external table data to be gone
RealityDropping an external table only removes the catalog entry (metadata). The data files remain in the customer's cloud storage. Conversely, dropping a managed table permanently deletes the data files
TrapThinking column masking changes the stored data
RealityColumn masks are applied dynamically at query time — the underlying data is never modified. The mask is a SQL function that transforms the value in the result set only

Confusing Pairs

Managed TablesExternal Tables

Managed = Unity Catalog owns storage location, DROP TABLE deletes data permanently. External = customer owns storage, DROP TABLE removes only catalog metadata, data files remain. Rule: if the question asks 'what happens to data files on DROP' → managed deletes, external preserves

Column MaskingRow Filters

Column masking = user sees all rows but sensitive COLUMNS are obscured (e.g., email shows as '***'). Row filters = user sees all columns but only authorized ROWS are returned. Use masking for PII field protection, row filters for multi-tenant data access

ABAC policiesGRANT/REVOKE (RBAC)

RBAC (GRANT/REVOKE) = explicit grants to specific users/groups. ABAC = tag-based, dynamic — access determined by object tags and user attributes at query time. ABAC scales better for large organizations; RBAC is explicit and easier to audit. Both coexist in Unity Catalog

Data LineageAudit Logs

Data Lineage = tracks how data flows between tables/notebooks (which source columns feed which target columns). Audit Logs (system.access.audit) = records who accessed what and when. Lineage is about data flow; audit logs are about user activity

Scenario Tips

If the question asks about:

An analyst reports they cannot query a table even though a GRANT SELECT was issued to their group

Answer:

They are also missing GRANT USE CATALOG on the catalog and GRANT USE SCHEMA on the schema. All three grants are required before a user can read a table

Distractor to avoid:

SELECT privilege alone is necessary but not sufficient. This is the #1 governance trap on the exam

If the question asks about:

A managed table is accidentally dropped. Can the data be recovered?

Answer:

No — dropping a managed table in Unity Catalog permanently deletes the data files. Data is unrecoverable unless Delta time travel was used before VACUUM deleted old versions

Distractor to avoid:

External table data survives a DROP. Managed table data does not. There is no 30-day recycle bin for managed tables

If the question asks about:

A company needs analysts to see customer purchase amounts but not email addresses. Same underlying table

Answer:

Apply column masking on the email column — analysts see '***' while authorized users see the real value. The single table serves both groups

Distractor to avoid:

Creating a view without the email column works but requires managing a separate object per group. A separate table without email duplicates data. Row filters control rows, not columns

If the question asks about:

A security team needs to dynamically grant or revoke access to hundreds of tables based on data classification tags

Answer:

ABAC policies — tag tables with classification labels and define policies that grant access based on user attributes and table tags, without writing hundreds of individual GRANT statements

Distractor to avoid:

RBAC (individual GRANT statements) doesn't scale to hundreds of tables with dynamic classification changes

Last-Minute Facts

1GRANT syntax: GRANT <privilege> ON <object_type> <object_name> TO <principal>
2Privilege hierarchy: USE CATALOG → USE SCHEMA → SELECT/MODIFY/CREATE (in that order, all required)
3Object owner has ALL privileges by default and can grant to others
4Column mask function must return the SAME data type as the masked column
5Delta Sharing: recipients are READ-ONLY and do not need a Databricks account — exam trap is confusing Delta Sharing (external, open protocol) with Unity Catalog sharing (internal, Databricks-to-Databricks)
6system.access.audit table contains automatic audit logs for all Unity Catalog data access events
7Unity Catalog lineage is column-level (not just table-level) — captured automatically, no setup required

Feeling confident?

Put your knowledge to the test with a timed DB DE Associate mock exam.