CertPrepNowFREE
DatabricksDB DE ProfessionalUpdated 2026-06-05

DB DE Professional Study Guide

Everything you need to pass the Databricks Certified Data Engineer Professional exam. Structured study plans, key services, common traps, and practice questions.

You Can Pass This Exam For Free

The Databricks Certified Data Engineer Professional exam is passable with free resources, but requires significant hands-on production experience (6–12 months minimum):

  • Databricks Academy free learning paths (Data Engineer Professional track)
  • Official Databricks documentation — Delta Lake internals, Structured Streaming, Unity Catalog advanced features
  • Databricks Community Edition for practicing advanced Delta Lake and streaming patterns
  • 500+ free practice questions on this site covering all 5 professional-level domains
  • Databricks open-source repos on GitHub — review Delta Lake, Spark, and DABs source code for deeper understanding

This is a professional-level exam. Unlike the Associate, most questions present complex production scenarios — you need real-world experience debugging streaming pipelines, designing data models, and implementing CI/CD. Book knowledge alone is not enough.

Choose Your Study Path

You passed the DE Associate exam and have 6–12 months of Databricks experience. You need to level up on advanced topics like streaming internals, testing strategies, and production monitoring.

Day 1–2Review the official Professional exam guide. Compare domains to the Associate exam — note what's new: advanced streaming, CI/CD testing, production monitoring, SCD patterns, and data governance at scale
Day 3–4Advanced Delta Lake: Change Data Feed (CDF), table clones (shallow vs deep), generated columns, table constraints, Delta Lake internals (transaction log, checkpoint files)
Day 5–7Structured Streaming deep dive: watermarks, output modes (append/update/complete), stream-static joins vs stream-stream joins, trigger modes, checkpoint recovery, exactly-once semantics
Day 8–9Data modeling patterns: SCD Type 1 vs Type 2 implementations, star schema vs data vault, slowly changing dimensions with MERGE INTO, fact table design
Day 10–11Advanced MERGE INTO: handling duplicates in source, conditional deletes, schema evolution during merge, multi-table CDC pipelines
Day 12–13Testing and CI/CD: unit testing notebooks, integration testing with temporary tables, Databricks Asset Bundles for multi-environment deployment, pytest patterns for PySpark
Day 14–15Production monitoring: Spark UI deep dive (DAG visualization, stage analysis, task metrics), query profiling, event log analysis, alert configuration
Day 16–17Data governance at scale: Unity Catalog advanced features — dynamic views, column masking functions, row-level security, information_schema queries, audit logging
Day 18–19Performance optimization: Adaptive Query Execution (AQE), broadcast hints, partition pruning, Liquid Clustering vs Z-ORDER, file compaction strategies
Day 20–21Practice exams across all 5 domains. Review explanations carefully. Target 80%+ before scheduling the real exam

Exam Overview

Format

60 questions, 120 minutes. Multiple choice (single select and multiple select). Scenario-heavy — most questions present 3–5 sentence production scenarios requiring you to choose the best approach. Covers advanced topics not on the Associate exam.

Scoring

Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure. Questions are weighted equally across all domains.

Domains & Weights

  • Data Modeling and Design16%
  • Data Processing30%
  • Data Governance and Security18%
  • Monitoring, Logging, and Optimization16%
  • Testing and Deployment20%

Registration

$200 USD. Available through Kryterion testing centers or online proctored. Schedule at databricks.com/certification. Prerequisite: None officially required, but Databricks recommends passing the Associate exam first and having 1+ years of production Databricks experience.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1: Must KnowDeep understanding required — these appear across multiple domains and form the foundation of professional-level questions. Know internals, edge cases, and production patterns.
Tier 2: Should KnowUnderstand use cases, configuration, and key behaviors. May appear in 3–8 questions each.
Tier 3: Recognize OnlyKnow at a high level — what it does and when to use it. Rarely more than 1–2 questions each.
Domain 116% of exam

Data Modeling and Design

Covers ~16% with focus on production data modeling patterns. Tests your ability to design star schemas, implement SCD Type 1 and Type 2 patterns, choose between views and materialized views, design fact and dimension tables, and apply the Medallion Architecture at scale. Expect scenario-based questions about trade-offs.

Key Topics

Delta LakeMERGE INTOViewsMaterialized ViewsStar SchemaSCD Patterns

Must-Know Concepts

  • SCD Type 1: overwrite current values using MERGE INTO with WHEN MATCHED THEN UPDATE SET * — no history preserved
  • SCD Type 2: preserve history by closing the current record (set end_date, is_current=false) and inserting a new version with start_date and is_current=true
  • Star schema design: central fact tables (transactions, events) surrounded by dimension tables (customers, products, dates) joined via foreign keys
  • Views vs materialized views: views recompute on every query (always fresh), materialized views precompute results (faster reads, may be stale)
  • Generated columns: automatically computed from other columns using GENERATED ALWAYS AS syntax — useful for derived fields like date parts or hashes
  • Medallion Architecture at scale: Bronze (raw/append-only), Silver (cleaned/validated/deduplicated), Gold (business aggregates) — professional exam tests advanced Silver-to-Gold patterns

Common Exam Traps

SCD Type 2 with MERGE requires careful handling: WHEN MATCHED AND target.is_current = true THEN UPDATE to close the old record, and WHEN NOT MATCHED THEN INSERT for the new version
Materialized views in Databricks are managed by Delta Live Tables — they are NOT standalone objects you create with CREATE MATERIALIZED VIEW in regular SQL notebooks
Generated columns are computed on write, not read. Changing the generation expression requires dropping and recreating the column — existing data is not retroactively updated
Quick Check: Data Modeling and Design

Question 1 of 3

A retail company needs to track customer address changes over time so that historical orders can be associated with the address that was active at the time of the order. Which data modeling pattern should the engineer implement?

Domain 230% of exam

Data Processing

The largest domain at ~30%. Tests advanced data processing patterns: complex Structured Streaming pipelines, CDC processing with MERGE and Change Data Feed, stream-static and stream-stream joins, watermark handling, incremental processing with trigger modes, batch optimization, and multi-hop pipeline design. Expect deep scenario questions.

Key Topics

Structured StreamingMERGE INTOChange Data FeedAuto LoaderforeachBatchWatermarks

Must-Know Concepts

  • Watermarks define the maximum allowed lateness for event-time data: withWatermark('event_time', '10 minutes') drops data arriving more than 10 minutes late
  • Stream-static joins: the static DataFrame is loaded once at stream start and NOT refreshed. Use for slowly-changing reference data only
  • Stream-stream joins require watermarks on both streams to bound state and prevent unbounded memory growth
  • foreachBatch pattern: apply batch operations (MERGE, JDBC writes) to each streaming micro-batch — standard for streaming-to-Delta upserts
  • Change Data Feed: enable with delta.enableChangeDataFeed = true, read with spark.read.option('readChangeFeed', 'true').table('my_table') or table_changes('my_table', start_version)
  • trigger(availableNow=True) processes all available data across multiple micro-batches then stops — preferred over trigger(once=True) for large backlogs
  • CDC processing pattern: stage incoming changes, deduplicate by key taking the latest change, then MERGE into the target table
  • Output modes: append (new rows only, default for non-aggregation), update (changed rows only), complete (full result table, only for aggregations)

Common Exam Traps

Stream-static joins do NOT refresh the static table — if the dimension table changes during stream execution, the stream uses stale data. Restart the stream or switch to a stream-stream join
MERGE with duplicate source keys produces unexpected results (duplicate inserts, not errors) — always deduplicate the source DataFrame before merging
trigger(once=True) processes everything in ONE micro-batch, which can cause OOM for large backlogs. trigger(availableNow=True) splits into multiple micro-batches — safer and more efficient
Watermarks are approximate — Spark may still process some late data beyond the watermark threshold. The watermark is a lower bound on late data rejection, not an exact cutoff
Quick Check: Data Processing

Question 1 of 3

A streaming pipeline enriches clickstream events with user profile data stored in a Delta table. The profile table is updated daily. During a 24-hour streaming run, the engineer notices that new user profiles added during the day are not appearing in the enriched output. What is the cause?

Domain 318% of exam

Data Governance and Security

Covers ~18% of the exam. Tests advanced governance: column masking with custom SQL functions, row-level security with row filters, information_schema for metadata queries, audit logging, secrets management, dynamic views for access control, data lineage, and compliance patterns like GDPR deletion.

Key Topics

Unity CatalogColumn MaskingRow FiltersSecrets APIInformation SchemaAudit LogsDynamic Views

Must-Know Concepts

  • Column masking: CREATE FUNCTION mask_email(email STRING) RETURNS STRING RETURN CASE WHEN is_member('admin') THEN email ELSE '***@***' END; ALTER TABLE t ALTER COLUMN email SET MASK mask_email
  • Row filters: CREATE FUNCTION region_filter(region STRING) RETURNS BOOLEAN RETURN (is_member('global_access') OR region = current_user_attribute('region'))
  • Information schema: query system.information_schema.columns, tables, table_privileges to audit metadata programmatically
  • Secrets management: dbutils.secrets.createScope(), dbutils.secrets.put(), dbutils.secrets.get() — secrets are redacted when printed
  • Dynamic views: CREATE VIEW secure_view AS SELECT CASE WHEN is_member('admin') THEN ssn ELSE 'XXX-XX-XXXX' END AS ssn FROM table
  • Data lineage in Unity Catalog: automatically tracked at column level — visible in Catalog Explorer, queryable via REST API
  • GDPR deletion: use DELETE FROM + VACUUM to permanently remove personal data — verify with time travel that data is inaccessible after VACUUM

Common Exam Traps

Column mask functions must return the same data type as the column — returning STRING for an INT column fails at query time, not at mask creation
Secrets are redacted when printed (shows '[REDACTED]') but the actual value is available for use in connection strings and API calls
Row filters evaluate per-row — complex filter functions can significantly impact query performance on large tables
Unity Catalog audit logs capture who accessed what data and when — they are stored in the system.access.audit table and are essential for compliance
Quick Check: Data Governance and Security

Question 1 of 3

A compliance team requires that customer SSN values are visible only to users in the 'compliance_team' group. All other users should see 'XXX-XX-XXXX'. The solution must not require creating separate tables or views. What is the best approach?

Domain 416% of exam

Monitoring, Logging, and Optimization

Covers ~16% of the exam. Tests your ability to monitor production pipelines, interpret Spark UI metrics, diagnose performance issues (skew, spill, shuffle), configure alerts, optimize queries with AQE, analyze query plans, and manage Delta table maintenance (OPTIMIZE, VACUUM, file sizing).

Key Topics

Spark UIAdaptive Query ExecutionOPTIMIZEVACUUMQuery PlansDatabricks SQL AlertsEvent Logs

Must-Know Concepts

  • Spark UI stages view: identify data skew (one task much slower than others), shuffle spill to disk (memory insufficient), and straggler tasks
  • Adaptive Query Execution (AQE): auto-coalesces shuffle partitions, converts sort-merge joins to broadcast joins at runtime, handles skewed joins automatically
  • EXPLAIN command: analyze query plans to identify full table scans, missing predicates, and suboptimal join strategies
  • Auto Compaction file sizing: adapts target file size based on write patterns — streaming MERGE jobs may produce 32–48 MB files instead of the default 1 GB
  • Databricks SQL alerts: schedule queries and trigger notifications when results exceed thresholds — used for data quality monitoring and SLA tracking
  • OPTIMIZE compacts small files for better read performance. VACUUM removes files older than the retention period. Both are essential for table maintenance
  • Job monitoring: track run duration trends, failure rates, cluster utilization, and cost per job run. Use Databricks SQL dashboards for fleet-wide monitoring

Common Exam Traps

Spill to disk means executor memory is insufficient — the fix is increasing spark.executor.memory or reducing partition size, NOT adding more executors
AQE can change join strategies at runtime — a query plan shown by EXPLAIN may differ from the actual execution plan if AQE optimizations activate
Auto Compaction with streaming MERGE auto-tunes file sizes down based on micro-batch write volume — this is expected behavior, not a problem to fix
VACUUM retention period is different from the Delta log retention: VACUUM removes data files, while delta.logRetentionDuration controls how long the transaction log is kept
Quick Check: Monitoring, Logging, and Optimization

Question 1 of 3

A data engineer observes that a production Spark job has one stage with 200 tasks where 199 complete in 30 seconds but one task takes 45 minutes. The Spark UI shows this task processing 50 GB while others process 200 MB each. What is the root cause and fix?

Domain 520% of exam

Testing and Deployment

Covers ~20% of the exam. Tests production deployment practices: unit testing PySpark code, integration testing with temporary tables, CI/CD with Databricks Asset Bundles, multi-environment promotion (dev/staging/prod), the Databricks REST API, notebook parameterization, job cluster configuration, and repair runs for partial failures.

Key Topics

Databricks Asset BundlesDatabricks CLIREST APIpytestJob ConfigurationRepair RunsNotebook Widgets

Must-Know Concepts

  • Unit testing PySpark: use pytest with a local SparkSession for testing transformation functions. Isolate logic into testable functions, not monolithic notebooks
  • Integration testing: use temporary tables or dedicated test catalogs in Unity Catalog to validate end-to-end pipeline behavior without affecting production data
  • Shallow clones for testing: CREATE TABLE test_table SHALLOW CLONE prod_table — provides a zero-copy snapshot for testing without duplicating production data
  • Databricks Asset Bundles: define resources in databricks.yml with dev/staging/prod targets. Deploy with 'databricks bundle deploy -t production'
  • Notebook parameterization: use dbutils.widgets.get() to retrieve parameters passed via the Jobs API or widget configuration
  • REST API /jobs/create: creates a new job definition each call (not idempotent). Use /jobs/reset to update an existing job. /jobs/run-now triggers immediate execution
  • Job repair runs: re-execute only failed and downstream tasks without re-running successful tasks — use the Repair Run feature from the Jobs UI or API
  • Multi-environment promotion: use separate Unity Catalog catalogs (dev_catalog, staging_catalog, prod_catalog) and Asset Bundle targets for environment isolation

Common Exam Traps

The /jobs/create API is NOT idempotent — calling it twice with the same config creates two separate jobs with different IDs. Always check if the job exists first
dbutils.widgets.get() retrieves parameters passed to notebooks from the Jobs API. It does NOT access environment variables or Spark config values
Shallow clones for testing depend on the source table's files — if the source runs VACUUM, the test clone may break. Use deep clones for long-lived test environments
Repair Run only retries failed tasks and their downstream dependents. If the root cause isn't fixed, the repair will fail again in the same way
Quick Check: Testing and Deployment

Question 1 of 3

A data engineering team wants to test their production ETL pipeline against production-like data without duplicating the 5 TB production table. Which approach is most efficient?

Key Professional-Level Concepts Compared

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

Change Data Feed (CDF) vs Delta Lake Time Travel

Use Change Data Feed (CDF) when…

You need to efficiently read only the rows that changed between two versions of a table. CDF provides row-level change records with _change_type (insert, update_preimage, update_postimage, delete).

Use Delta Lake Time Travel when…

You need to query the entire table as it existed at a specific point in time — for auditing, debugging, or reproducing results from a past state.

Exam trap

CDF gives you the diff (what changed). Time travel gives you the snapshot (what the table looked like). CDF must be enabled before changes occur — it cannot retroactively capture past changes. Time travel depends on VACUUM retention.

Shallow Clone vs Deep Clone

Use Shallow Clone when…

You need a quick copy for testing or development that doesn't duplicate data. Zero-copy — references source files. Fast to create, minimal storage cost.

Use Deep Clone when…

You need a fully independent copy for migration, disaster recovery, or archiving. Copies all data files — changes to the clone don't affect the source and vice versa.

Exam trap

Shallow clones break if the source table runs VACUUM and removes referenced files. Deep clones are independent but use full storage. For production testing, use shallow clones on recent data; for archiving, use deep clones.

Stream-Static Join vs Stream-Stream Join

Use Stream-Static Join when…

You're enriching streaming data with a reference/dimension table that changes infrequently. The static table is loaded once at stream start.

Use Stream-Stream Join when…

You're joining two continuously updating data sources where both sides produce new records over time. Requires watermarks to bound state.

Exam trap

Stream-static joins do NOT refresh the static table during execution — they use the snapshot from stream initialization. If the static table updates frequently, consider a stream-stream join or restart the stream periodically.

SCD Type 1 vs SCD Type 2

Use SCD Type 1 when…

You want the dimension table to always reflect the current state. Overwrites old values with new ones — no history is preserved. Simple MERGE with UPDATE SET *.

Use SCD Type 2 when…

You need to preserve the full history of changes. Each change creates a new row with effective dates (start_date, end_date) and a current flag.

Exam trap

SCD Type 1 is a straightforward MERGE. SCD Type 2 requires a more complex MERGE: close the current record (set end_date, is_current=false) AND insert the new version. The exam tests whether you can implement both patterns correctly.

foreachBatch vs foreach

Use foreachBatch when…

You need to apply arbitrary batch operations (MERGE, JDBC writes, API calls) to each micro-batch of streaming data. The function receives a DataFrame and batch ID.

Use foreach when…

You need to process each individual row in a streaming query — used for row-level custom logic like sending notifications per record.

Exam trap

foreachBatch is the standard pattern for streaming MERGE INTO operations — you write the MERGE logic inside the foreachBatch function. foreach processes one row at a time and is much slower for bulk operations.

trigger(availableNow=True) vs trigger(once=True)

Use trigger(availableNow=True) when…

Process all available data in multiple micro-batches, then stop. More efficient for large backlogs — splits work into manageable batches. Preferred for scheduled incremental processing.

Use trigger(once=True) when…

Process all available data in a single micro-batch, then stop. Simpler but may cause OOM errors with large backlogs since everything is processed at once.

Exam trap

trigger(availableNow=True) replaced trigger(once=True) as the recommended approach. availableNow processes data incrementally across multiple micro-batches, while once tries to process everything in one batch — risking memory issues with large datasets.

Top Mistakes to Avoid

Using trigger(once=True) for large backlogs — it processes everything in one micro-batch and can cause OOM. Use trigger(availableNow=True) instead
Assuming stream-static joins refresh the static table — they do NOT. The static table is loaded once at stream initialization. Restart the stream to pick up changes
Not deduplicating source data before MERGE INTO — duplicate source keys matching the same target row produce unexpected duplicates, not errors
Forgetting to enable Change Data Feed BEFORE making changes — CDF does not retroactively capture past changes
Calling /jobs/create API multiple times thinking it's idempotent — each call creates a new job. Use /jobs/reset to update existing jobs
Expecting column mask functions to fail at creation time — type mismatches (e.g., returning STRING for an INT column) only fail at query time
Confusing shallow clones with deep clones — shallow clones break if the source runs VACUUM. Use deep clones for long-lived independent copies
Thinking EXPLAIN output matches actual execution — Adaptive Query Execution (AQE) may change join strategies at runtime

Exam-Ready Checklist

Can explain all 5 exam domains and their relative weights (Data Processing is 30% — spend the most time here)
Can implement SCD Type 1 and Type 2 patterns with MERGE INTO from memory
Understand Structured Streaming internals: watermarks, output modes, trigger types, checkpoint recovery
Know stream-static vs stream-stream join semantics and when to use each
Can implement the foreachBatch pattern for streaming MERGE operations
Understand Change Data Feed: how to enable, read, and use in incremental processing pipelines
Know shallow vs deep clones and their use cases for testing and development
Can configure multi-environment deployments with Databricks Asset Bundles
Know the Databricks REST API basics: /jobs/create vs /jobs/reset, /jobs/run-now, parameterization
Can interpret Spark UI to diagnose data skew, shuffle spill, and straggler tasks
Understand AQE and its three key optimizations: partition coalescing, join conversion, skew handling
Can implement column masking and row-level security with Unity Catalog
Know secrets management: create scopes, set secrets, retrieve with dbutils.secrets.get(), redaction behavior
Scored 80%+ on at least two full practice exams across all domains
Can complete the exam within time: average 2 minutes per question for 60 questions in 120 minutes

Recommended Resources

Free & Official Resources

Paid Courses & Practice Exams

These are recommended if you prefer a structured learning path. They can save time but are not required to pass.

Frequently Asked Questions