General Exam Tips
- 1.Read ALL answer options before choosing — two options often look correct; the distinction lives in a single word
- 2.Flag and skip questions you're unsure about. With 45 questions in 90 minutes you have 2 minutes each — don't burn 5 minutes on one
- 3.Watch for negative phrasing: 'Which is NOT correct?' or 'Which would FAIL?' — easy to miss under pressure
- 4.The exam is SQL-first. When a question can be solved with SQL or PySpark, the intended answer is almost always SQL
- 5.Every wrong answer option is a potential separate question — study why each distractor is wrong, not just why the right answer is right
- 6.When two answers seem valid, ask yourself: which one has a constraint, edge case, or failure mode the other doesn't?
- 7.Keywords to watch for: 'MOST efficient', 'BEST practice', 'minimum cost' — these flip the answer from correct to most-correct
- 8.No penalty for wrong answers — always guess rather than leaving blank
- 9.Aim for 80%+ on practice exams before scheduling. The real exam is slightly harder than most free practice tests
Quick Navigation
Databricks Intelligence Platform
Must-Know Facts
- Lakehouse = data lake flexibility (open formats, cheap storage) + data warehouse reliability (ACID, schema enforcement) — Delta Lake is the storage layer enabling this
- Unity Catalog three-level namespace is catalog.schema.table — the top level is 'catalog', NOT 'database' or 'workspace'
- Control plane is Databricks-managed (web app, job scheduler, cluster manager). Data plane is customer cloud (clusters, actual data) — data never leaves the customer's cloud account
- Compute types: All-Purpose Clusters (interactive dev/notebooks), Job Clusters (automated workloads, auto-terminate), SQL Warehouses (SQL analytics/BI tools), Serverless Compute (instant startup, no config)
- Serverless compute is available for SQL Warehouses, Jobs, and DLT pipelines — not just SQL
- Databricks Connect allows running Spark code from local IDEs (VS Code, IntelliJ) against remote Databricks compute — distinct from notebooks
Common Traps
Confusing Pairs
Scenario Tips
Question asks which architecture component enables ACID transactions and time travel on files in S3
Delta Lake — it's the storage format/protocol layer that adds transaction log, ACID, and time travel capabilities to raw files in object storage
Unity Catalog handles governance/access, not ACID storage. Databricks Runtime is the execution engine, not the storage layer
Question asks about 'centralized governance' or 'unified access control across catalogs'
Unity Catalog — it's the single governance layer across all workspaces in an account
Workspace-level access control is the old model; Unity Catalog is the centralized modern answer
Last-Minute Facts
Data Ingestion and Loading
Must-Know Facts
- COPY INTO is idempotent: it tracks files already loaded by path and skips them on subsequent runs. But it scans the full directory every run — not scalable for millions of files
- Auto Loader uses a checkpoint (not directory listing) to track processed files — much more scalable for large directories and continuous ingestion
- Auto Loader requires a checkpoint location; forgetting to set one in production is the #1 streaming mistake
- rescuedDataColumn captures records that don't match the schema instead of failing the stream — stored in a _rescued_data column
- cloudFiles.schemaEvolutionMode = 'addNewColumns' lets Auto Loader automatically add new fields to the schema when new columns appear in source data
- trigger(availableNow=True) processes all available data then stops — combines streaming checkpoint benefits (exactly-once, file tracking) with batch semantics. Use when you want incremental batch runs
- Lakeflow Connect provides managed CDC connectors for SaaS sources (Salesforce, JDBC databases) — replaces manual JDBC patterns for supported sources
- Lakehouse Federation: query external systems (PostgreSQL, Snowflake, BigQuery) in-place via Unity Catalog foreign tables without copying data
- FORMAT_OPTIONS values must use single quotes: FORMAT_OPTIONS ('header' = 'true') — double quotes cause errors
Common Traps
Confusing Pairs
Scenario Tips
A directory in S3 receives 10,000 new JSON files per day and currently has 2 million total files. Which ingestion method should be used?
Auto Loader — uses file notification to detect new files without listing the full directory. COPY INTO scans all 2M files each run to find unprocessed ones
COPY INTO is wrong here despite being 'simpler' — directory listing at this scale would be extremely slow
An Auto Loader stream encounters records with fields not in the original schema. The stream must not fail and new fields should be added to the table
Set cloudFiles.schemaEvolutionMode = 'addNewColumns'. New columns automatically get added to the Delta table schema
Setting rescuedDataColumn captures the records but does NOT add new columns to the schema — schema stays frozen
A team wants to run an incremental data load every night using streaming semantics (checkpoint-based tracking) but stop the stream when done
Use trigger(availableNow=True) — processes all pending files, checkpoints progress, then terminates. Gets incremental + exactly-once without a continuously running stream
Continuous streaming would keep the cluster alive all night. COPY INTO would work but lacks checkpoint-based exactly-once semantics
A data engineer needs to ingest data from a Salesforce CRM system on an ongoing basis with change data capture
Lakeflow Connect — provides managed CDC connectors for SaaS sources including Salesforce
COPY INTO only works for files in cloud storage, not SaaS APIs. JDBC reads are point-in-time snapshots, not CDC
Last-Minute Facts
Data Transformation and Modeling
Must-Know Facts
- MERGE INTO is the only SQL command for upserts in Delta Lake. Know all three clauses: WHEN MATCHED, WHEN NOT MATCHED (source has no match in target), WHEN NOT MATCHED BY SOURCE (target has no match in source — use to delete orphaned rows)
- Window functions cannot be used directly in WHERE — wrap in a subquery or CTE. SELECT * FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) WHERE rn = 1
- Schema evolution with MERGE requires spark.databricks.delta.schema.autoMerge.enabled = true — without it, MERGE fails if source has new columns
- CHECK constraints and NOT NULL are enforced on write. PRIMARY KEY and UNIQUE are informational only — they are NOT enforced in Delta Lake
- MERGE fails if multiple source rows match the same target row — deduplicate the source before running MERGE
- DataFrames are immutable — transformations return a new DataFrame; they do not modify the original. This is a source of bugs that the exam tests
- Lazy evaluation: all DataFrame transformations (filter, select, withColumn) do nothing until an action (show, count, write) is called
- Default join type in both Spark SQL and PySpark is INNER JOIN — if no join type is specified, inner is assumed
- QUALIFY clause (available in Databricks SQL) filters on window function results directly — avoids the subquery wrapping pattern
Common Traps
Confusing Pairs
Scenario Tips
A Silver table has duplicate customer records (same customer_id, different timestamps). The question asks to keep only the most recent record
Use ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn in a subquery, then WHERE rn = 1
SELECT DISTINCT is wrong — it only removes fully identical rows. MAX(updated_at) GROUP BY customer_id is wrong — returns only two columns, not the full record
A MERGE INTO statement fails with an ambiguous merge error when run against a staging table
The source (staging) table has multiple rows matching the same target row. Deduplicate the source first using ROW_NUMBER() or dropDuplicates() before the MERGE
This is not a schema mismatch error. Adding WHEN NOT MATCHED BY SOURCE won't fix it — the issue is multiple matches to one target row
A data engineer wants to enforce that an 'amount' column always contains values greater than zero
ALTER TABLE t ADD CONSTRAINT positive_amount CHECK (amount > 0) — CHECK constraint is enforced on write
PRIMARY KEY and UNIQUE constraints are NOT enforced in Delta Lake. NOT NULL only blocks null values, not negative numbers
Last-Minute Facts
Working with Lakeflow Jobs
Must-Know Facts
- Continuous trigger: job starts immediately after the previous run finishes — does NOT run concurrently with itself. NOT the same as 'always running'
- Job clusters are created fresh for each run and auto-terminated afterward — use job clusters for production workloads to avoid paying for idle compute
- If a task fails, all downstream dependent tasks are SKIPPED by default. They do not fail — they are skipped. Configure 'run if dependencies failed' to change this
- Task values (dbutils.jobs.taskValues.set/get) are the recommended way to pass data between tasks — lightweight, no Delta table needed
- Repair Run: re-executes only failed tasks and their downstream dependents, preserving successful task results
- DLT expectations: @dlt.expect (warn only), @dlt.expect_or_drop (filter out bad rows), @dlt.expect_or_fail (abort entire pipeline). Know which to use for each scenario
- DLT/Lakeflow Declarative Pipelines support both notebooks and workspace files (.py, .sql) as source libraries — the notebook-only requirement is a legacy constraint no longer applicable
- A Lakeflow Job CAN include a DLT pipeline as one of its tasks — they are complementary, not competing
Common Traps
Confusing Pairs
Scenario Tips
A pipeline needs to run as frequently as possible but only one run at a time. Which trigger?
Continuous trigger — starts the next run immediately when the previous one finishes, but never runs two instances concurrently
'Scheduled every 1 minute' could cause overlapping runs if a run takes longer than 1 minute; continuous avoids this
A job has 5 tasks and fails at task 3. Task 4 and 5 are downstream of task 3. The engineer re-runs the job after fixing the bug
Use Repair Run to re-execute only task 3 and its dependents (4 and 5), preserving the successful output of tasks 1 and 2
Running the full job re-executes all 5 tasks, wasting compute and potentially causing duplicate processing of tasks 1 and 2
An Extract task computes a row count that a downstream Validate task needs to check against a threshold
Use dbutils.jobs.taskValues.set(key='row_count', value=n) in Extract, and dbutils.jobs.taskValues.get(taskKey='Extract', key='row_count') in Validate
Writing to a Delta table works but is overkill for a simple integer. Notebook widgets don't reliably pass values between separate task executions
A DLT pipeline must remove records where amount is negative but continue processing other records
@dlt.expect_or_drop('positive_amount', 'amount > 0') — drops violating rows and continues
@dlt.expect_or_fail aborts the entire pipeline on any violation — wrong if the goal is to continue. @dlt.expect keeps the bad rows in the output
Last-Minute Facts
Implementing CI/CD
Must-Know Facts
- Git Folders (formerly Repos): version control for notebooks and code files inside Databricks workspace. Supports clone, branch, commit, push, pull from the UI
- Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs): infrastructure-as-code for Databricks resources — defines jobs, pipelines, clusters in databricks.yml
- Bundle config file is always databricks.yml — NOT JSON, not Python, not TOML
- Deploy command: databricks bundle deploy -t <target_name>. Validate before deploy: databricks bundle validate
- Environment isolation best practice: separate Unity Catalog catalogs per environment (dev_catalog, staging_catalog, prod_catalog)
- Git Folders is for code versioning; Automation Bundles is for resource deployment — they are complementary, not alternatives
Common Traps
Confusing Pairs
Scenario Tips
A team wants to promote the same job configuration from dev to staging to prod with different cluster sizes
Declarative Automation Bundles with multiple targets defined in databricks.yml — each target specifies environment-specific overrides (cluster size, workspace URL, permissions)
Git Folders handles code, not deployment config. Manually copying jobs via JSON export is error-prone and not repeatable
A data engineer asks what file format Automation Bundles use
YAML (databricks.yml) — this is frequently tested as a single-answer question
JSON (common for other DevOps tools like Terraform providers), Python (wrong entirely)
A team needs to ensure notebook changes are reviewed before going to production
Use Git Folders with a pull request workflow — develop on a feature branch, open PR for review, merge to main, then deploy via Automation Bundles
Automation Bundles alone don't enforce code review. They deploy whatever is in the bundle, regardless of review status
Last-Minute Facts
Troubleshooting, Monitoring, and Optimization
Must-Know Facts
- One task in a stage takes 10x longer than others = data skew (not a slow cluster, not a network issue). Fix: salt the join key or repartition
- Disk spill visible in Spark UI = executor memory is insufficient. Fix: increase spark.executor.memory or reduce partition size — NOT adding more nodes
- Liquid Clustering uses CLUSTER BY at table creation — clustering keys can be changed after creation without rewriting the entire table (unlike partitioning)
- Liquid Clustering still requires OPTIMIZE to actually co-locate the data — it's lazy by default and optimizes on writes incrementally
- VACUUM with retention < 7 days requires setting delta.retentionDurationCheck.enabled = false — this is dangerous and can break concurrent queries
- After VACUUM, time travel to versions older than the retention window is impossible because the underlying data files are gone
- Predictive Optimization automatically runs OPTIMIZE and VACUUM — only available for managed tables in Unity Catalog, not external tables
- Shuffle operations (join, groupBy, distinct) are wide transformations that require data redistribution — they are the primary source of performance bottlenecks
Common Traps
Confusing Pairs
Scenario Tips
A Spark job has 200 tasks. Task 187 takes 45 minutes; all others complete in 2 minutes
Data skew — partition 187 has disproportionately more data. Fix with salting the join key, repartition(), or enable AQE with spark.sql.adaptive.skewJoin.enabled=true
This is NOT a cluster capacity issue (all other tasks are fast). Not a code bug (other tasks succeed). Not a network issue (would affect all tasks)
A Delta table with 5 years of data partitioned by date receives queries filtered by customer_region (not date). Performance is slow
Convert to Liquid Clustering with CLUSTER BY (customer_region) — optimizes data layout for region-based queries without rewriting the entire table
Adding customer_region as a second partition column requires full table rewrite. Z-ORDER works but is less adaptive. Creating a copy doubles storage
After running VACUUM with default settings, users report they cannot query the table as of 10 days ago
VACUUM with default 7-day retention deleted the data files needed for 10-day time travel. The fix going forward is to increase the retention period (VACUUM RETAIN 720 HOURS) before running VACUUM
VACUUM does not disable time travel permanently. VACUUM does not corrupt the transaction log. The issue is simply that the required files were deleted
A Spark UI shows a stage where tasks are writing significant amounts to 'Spill (Disk)'
Executor memory is insufficient — tasks are spilling shuffle data to disk. Fix: increase spark.executor.memory or reduce spark.sql.shuffle.partitions to make each partition smaller
Adding nodes (executors) won't help if each individual partition doesn't fit in memory. The fix is about memory per task, not total cluster memory
Last-Minute Facts
Governance and Security
Must-Know Facts
- GRANT SELECT on a table is NOT enough by itself — users also need USAGE on the catalog and USAGE on the schema to access the table
- Full GRANT chain: GRANT USE CATALOG ON CATALOG c TO user, then GRANT USE SCHEMA ON SCHEMA c.s TO user, then GRANT SELECT ON TABLE c.s.t TO user
- Managed table DROP: deletes both metadata AND data files permanently. External table DROP: removes only metadata; data files in cloud storage are untouched
- Column masking executes at query time — underlying data is never changed. The mask function must return the same data type as the column it masks
- Row filters apply at query time based on the querying user's identity — rows the user is not authorized to see are simply not returned
- ABAC (Attribute-Based Access Control) uses tags to control access dynamically — complements, does NOT replace, explicit GRANT statements
- Data lineage in Unity Catalog is automatic and column-level — tracks which output columns derive from which source columns, across tables, notebooks, and jobs
- Delta Sharing is read-only for recipients — they cannot modify shared data. Recipients do not need Databricks to access shared data (open protocol)
Common Traps
Confusing Pairs
Scenario Tips
An analyst reports they cannot query a table even though a GRANT SELECT was issued to their group
They are also missing GRANT USE CATALOG on the catalog and GRANT USE SCHEMA on the schema. All three grants are required before a user can read a table
SELECT privilege alone is necessary but not sufficient. This is the #1 governance trap on the exam
A managed table is accidentally dropped. Can the data be recovered?
No — dropping a managed table in Unity Catalog permanently deletes the data files. Data is unrecoverable unless Delta time travel was used before VACUUM deleted old versions
External table data survives a DROP. Managed table data does not. There is no 30-day recycle bin for managed tables
A company needs analysts to see customer purchase amounts but not email addresses. Same underlying table
Apply column masking on the email column — analysts see '***' while authorized users see the real value. The single table serves both groups
Creating a view without the email column works but requires managing a separate object per group. A separate table without email duplicates data. Row filters control rows, not columns
A security team needs to dynamically grant or revoke access to hundreds of tables based on data classification tags
ABAC policies — tag tables with classification labels and define policies that grant access based on user attributes and table tags, without writing hundreds of individual GRANT statements
RBAC (individual GRANT statements) doesn't scale to hundreds of tables with dynamic classification changes