CertPrepNowFREE
DatabricksDB DE AssociateUpdated 2026-05-27

DB DE Associate Study Guide

Everything you need to pass the Databricks Certified Data Engineer Associate exam. Structured study plans, key services, common traps, and practice questions.

You Can Pass This Exam For Free

The Databricks Certified Data Engineer Associate exam is passable with free resources if you study consistently for 2–3 weeks:

  • Databricks Academy free learning paths (Data Engineer Associate track)
  • Official Databricks documentation and SQL language reference
  • Databricks Community Edition (free cluster for hands-on practice)
  • 500+ free practice questions on this site

Hands-on practice is essential — the exam tests SQL syntax and PySpark operations, not just theory. Spin up a free Community Edition cluster and write real queries.

Choose Your Study Path

No prior Spark or data engineering experience. You'll build foundational knowledge from scratch over 3 weeks.

Day 1–2Learn data engineering fundamentals: data lakes vs data warehouses, ETL vs ELT, batch vs streaming, the Medallion Architecture (Bronze/Silver/Gold)
Day 3–4Databricks platform basics: workspaces, clusters, notebooks, Unity Catalog overview. Sign up for Databricks Community Edition
Day 5–7SQL deep dive: CREATE TABLE, SELECT, JOIN types, GROUP BY, window functions, MERGE INTO. Practice in a notebook
Day 8–9Delta Lake fundamentals: ACID transactions, time travel, schema enforcement vs schema evolution, DESCRIBE HISTORY
Day 10–11Data ingestion: COPY INTO vs Auto Loader, file formats (Parquet, JSON, CSV), Lakeflow Connect basics
Day 12–13PySpark DataFrame operations: select, filter, withColumn, groupBy, join. Understand when Python is needed vs SQL
Day 14–15Lakeflow Jobs: creating jobs, task orchestration, DAG dependencies, scheduling triggers, retry policies
Day 16–17Governance and CI/CD: Unity Catalog permissions (GRANT/REVOKE), Git Folders, Declarative Automation Bundles, Databricks CLI
Day 18–19Troubleshooting: Spark UI, stage/task analysis, Liquid Clustering, cluster diagnostics, monitoring job health
Day 20Practice questions across all 7 domains, review explanations carefully
Day 21Take a full mock exam. Review all wrong answers. Retake if below 75%

Exam Overview

Format

45 questions, 90 minutes. Multiple choice (single select and multiple select). SQL-first — questions prefer SQL syntax, falling back to Python when SQL isn't applicable.

Scoring

Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure.

Domains & Weights

  • Databricks Intelligence Platform6%
  • Data Ingestion and Loading21%
  • Data Transformation and Modeling21%
  • Working with Lakeflow Jobs12%
  • Implementing CI/CD12%
  • Troubleshooting, Monitoring, and Optimization15%
  • Governance and Security12%

Registration

$200 USD. Available through Kryterion testing centers or online proctored. Schedule at databricks.com/certification.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1: Must KnowYou must understand these deeply — what they do, how to use them, and their exact syntax. These appear across multiple domains.
Tier 2: Should KnowUnderstand what they do and key use cases. May appear in 2–5 questions.
Tier 3: Recognize OnlyKnow what they do at a high level. Rarely more than 1–2 questions each.
Domain 16% of exam

Databricks Intelligence Platform

The smallest domain at ~6%, but it sets the foundation. Tests your understanding of the Databricks Lakehouse architecture, Delta Lake basics, Unity Catalog fundamentals, and compute services. These are conceptual questions — know the 'what' and 'why' rather than detailed syntax.

Key Topics

Delta LakeUnity CatalogDatabricks WorkspaceCompute ClustersSQL WarehousesServerless Compute

Must-Know Concepts

  • The Lakehouse combines data lake flexibility (open formats, cheap storage) with data warehouse reliability (ACID transactions, schema enforcement)
  • Delta Lake is the storage format that enables Lakehouse features: ACID, time travel, schema enforcement, and audit history
  • Unity Catalog provides centralized governance: three-level namespace (catalog.schema.table), access controls, data lineage
  • Compute types: All-Purpose Clusters (interactive development), Job Clusters (automated workloads), SQL Warehouses (SQL analytics)
  • Serverless compute: instant startup, no cluster configuration required, billed per DBU — available for SQL Warehouses, Jobs, and DLT pipelines
  • Databricks Connect: allows running Spark code from a local IDE (VS Code, IntelliJ) against remote Databricks compute

Common Exam Traps

Databricks is NOT a database — it's a platform that uses Delta Lake as its storage format on top of cloud object storage
Unity Catalog's three-level namespace is catalog.schema.table — not database.schema.table (Databricks uses 'catalog' at the top level)
Control plane (Databricks-managed) hosts the web app, job scheduler, and cluster manager. Data plane (customer cloud account) hosts clusters and data — data never leaves the customer's cloud account
Quick Check: Databricks Intelligence Platform

Question 1 of 2

Which feature of the Databricks Lakehouse architecture enables ACID transactions and time travel on data stored in cloud object storage?

Domain 221% of exam

Data Ingestion and Loading

A major domain at ~21% with 7 sub-objectives. Tests your ability to load data into Delta tables using COPY INTO, Auto Loader, Lakeflow Connect, and JDBC/ODBC. You must know the SQL syntax for each approach, when to use which, and how to handle schema evolution and error records.

Key Topics

Auto LoaderCOPY INTOLakeflow ConnectStructured StreamingDelta Lake

Must-Know Concepts

  • COPY INTO syntax: COPY INTO target_table FROM 'source_path' FILEFORMAT = format FORMAT_OPTIONS ('key' = 'value')
  • Auto Loader uses cloudFiles format: spark.readStream.format('cloudFiles').option('cloudFiles.format', 'json').load('path')
  • Auto Loader schema inference automatically detects schema from data. Use rescuedDataColumn to capture schema mismatches instead of failing
  • COPY INTO is idempotent per file path — it tracks which files have been loaded and skips them on subsequent runs
  • Lakeflow Connect provides managed connectors for SaaS sources (Salesforce, databases) with built-in CDC support
  • File formats supported: JSON, CSV, Parquet, Avro, ORC, text. Know FORMAT_OPTIONS for each (header, delimiter, multiLine)
  • Lakehouse Federation: query external databases (PostgreSQL, Snowflake, BigQuery) in-place via Unity Catalog foreign tables without copying data
  • trigger(availableNow=True): processes all available data then stops — combines streaming benefits (checkpoint tracking) with batch semantics

Common Exam Traps

Auto Loader is more efficient than COPY INTO for directories with many files because it uses file notification instead of directory listing
COPY INTO FORMAT_OPTIONS use single quotes for values, not double quotes: FORMAT_OPTIONS ('header' = 'true')
Auto Loader's rescuedDataColumn captures malformed records in a separate column instead of failing the entire load
When using Auto Loader with schema evolution, set cloudFiles.schemaEvolutionMode to 'addNewColumns' to automatically add new fields
Quick Check: Data Ingestion and Loading

Question 1 of 3

A data engineer needs to continuously ingest JSON files as they arrive in cloud storage. The source directory contains over 1 million files and grows by 10,000 files daily. Which approach is most efficient?

Domain 321% of exam

Data Transformation and Modeling

Another major domain at ~21% with 7 sub-objectives. Covers data cleaning, joins, column manipulation, deduplication, Spark tuning for transformations, building Gold-layer aggregations, and data quality enforcement. Expect SQL-heavy questions on MERGE INTO, window functions, and data quality constraints.

Key Topics

Spark SQLPySpark DataFramesDelta Lake MERGE INTOWindow FunctionsData Quality Constraints

Must-Know Concepts

  • MERGE INTO syntax for upserts: MERGE INTO target USING source ON condition WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT
  • Deduplication using window functions: ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC) to keep the latest record
  • Column manipulation: ALTER TABLE ADD/DROP COLUMN, CAST for type changes, COALESCE for null handling
  • Data quality: CHECK constraints (ALTER TABLE ADD CONSTRAINT), NOT NULL constraints, and expectations in Delta Live Tables
  • Gold layer design: pre-aggregated tables optimized for business reporting, built from Silver-layer cleaned data
  • Join types: INNER, LEFT, RIGHT, FULL OUTER, CROSS, SEMI, ANTI — know when to use each

Common Exam Traps

MERGE INTO requires a deterministic match — if multiple source rows match a single target row, MERGE fails. Deduplicate the source first
Schema evolution with MERGE requires setting spark.databricks.delta.schema.autoMerge.enabled = true to add new columns automatically
Window functions require OVER() clause — forgetting PARTITION BY applies the function to the entire dataset, not per group
CHECK constraints on Delta tables are enforced on write, not on read — existing violating data is not retroactively flagged
Quick Check: Data Transformation and Modeling

Question 1 of 3

A Silver table contains customer records with duplicates due to multiple source systems. Which SQL approach correctly keeps only the most recent record per customer_id?

Domain 412% of exam

Working with Lakeflow Jobs

This domain covers ~12% with 4 sub-objectives. Tests your ability to configure and manage multi-task workflows including control flows (if/else conditions), task configuration, DAG dependencies between tasks, and job scheduling triggers. Know how to build, schedule, and troubleshoot production workflows.

Key Topics

Lakeflow JobsTask OrchestrationJob ClustersScheduling TriggersDelta Live Tables

Must-Know Concepts

  • Jobs consist of one or more tasks with DAG dependencies: sequential, parallel, or conditional (if/else) execution
  • Task types: Notebook, Python script, SQL, JAR, Delta Live Tables pipeline, dbt — know when to use each
  • Control flow tasks: If/Else conditions based on task values, allowing dynamic pipeline branching
  • Scheduling triggers: cron-based (periodic), continuous (runs immediately after previous completes), file arrival (triggers on new data)
  • Job clusters vs all-purpose clusters: job clusters are created for the run and terminated after — more cost-effective for production
  • Retry policies: configure max retries and timeout per task to handle transient failures
  • Delta Live Tables (DLT): declarative ETL framework — define transformations with @dlt.table (Python) or CREATE STREAMING LIVE TABLE (SQL), and the system manages orchestration and data quality
  • DLT expectations: data quality constraints using @dlt.expect, @dlt.expect_or_drop, @dlt.expect_or_fail to validate data during pipeline runs
  • Repair Run: re-execute only failed and downstream tasks in a job, preserving results of successful tasks to avoid reprocessing

Common Exam Traps

Continuous trigger does NOT mean the job runs in parallel with itself — it waits for the previous run to finish before starting the next
Job clusters are ephemeral — they start fresh each run. If you need cached data between runs, use an all-purpose cluster or persist to storage
If a task in a DAG fails, downstream dependent tasks are skipped by default. Configure 'run if dependencies failed' to override this
Task values (dbutils.jobs.taskValues.set/get) pass data between tasks — they are the recommended way to share results within a job
DLT pipelines require a notebook as the source library — you cannot point a DLT pipeline directly at a SQL file or Python script outside a notebook
Quick Check: Working with Lakeflow Jobs

Question 1 of 3

A data engineer has a Lakeflow Job with three tasks: Extract, Transform, and Load. Transform should only run after Extract succeeds, and Load should only run after Transform succeeds. How should the DAG be configured?

Domain 512% of exam

Implementing CI/CD

This domain covers ~12% with 4 sub-objectives. Tests your ability to implement CI/CD workflows using Git Folders (formerly Repos), Declarative Automation Bundles (formerly DABs), the Databricks CLI, and environment configuration for dev/staging/prod promotion.

Key Topics

Git FoldersDeclarative Automation BundlesDatabricks CLIEnvironment Configuration

Must-Know Concepts

  • Git Folders: clone remote repos, create branches, commit changes, push/pull — all within the Databricks workspace UI
  • Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs): define Databricks resources in databricks.yml, deploy across environments with 'databricks bundle deploy'
  • Databricks CLI: authenticate, deploy bundles, manage workspace resources from the command line
  • Environment promotion: use separate catalogs or schemas per environment (dev/staging/prod) in Unity Catalog
  • Bundle targets: define dev, staging, and prod targets in databricks.yml with different workspace URLs and permissions

Common Exam Traps

Git Folders is for version controlling code — Automation Bundles is for deploying infrastructure. They are complementary, not alternatives
Automation Bundles use YAML (databricks.yml), not JSON or Python. The exam may test this
The Databricks CLI command for deploying bundles is 'databricks bundle deploy', not 'databricks deploy bundle'
Environment isolation: use separate Unity Catalog catalogs (dev_catalog, prod_catalog) to isolate data across environments
Quick Check: Implementing CI/CD

Question 1 of 2

A team wants to deploy the same Lakeflow Job to development, staging, and production environments with different cluster sizes and permissions. Which approach is recommended?

Domain 615% of exam

Troubleshooting, Monitoring, and Optimization

This domain covers ~15% with 5 sub-objectives. Tests your ability to analyze job performance trends, monitor pipeline health, interpret Spark UI to find bottlenecks, use Liquid Clustering for optimization, and diagnose cluster issues. Expect scenario-based questions about fixing slow or failing jobs.

Key Topics

Spark UILiquid ClusteringPredictive OptimizationJob Run MonitoringCluster Diagnostics

Must-Know Concepts

  • Spark UI: understand stages, tasks, shuffle read/write, spill to disk, and executor timeline for diagnosing performance issues
  • Data skew: one partition has significantly more data than others, causing a single task to be much slower — fix with salting or repartitioning
  • Shuffle operations: wide transformations (joins, groupBy, distinct) require data redistribution across executors — minimize unnecessary shuffles
  • Liquid Clustering: replaces static partitioning with CLUSTER BY — adapts to query patterns without full data rewrites
  • OPTIMIZE: compacts small files into larger ones for better read performance. VACUUM: removes old data files beyond the retention period

Common Exam Traps

Spill to disk in Spark UI means memory is insufficient — the fix is increasing executor memory or reducing partition size, not adding more executors
Liquid Clustering columns can be changed after table creation — unlike partitioning, which requires a full table rewrite
VACUUM with a retention period shorter than the default 7 days requires setting delta.retentionDurationCheck.enabled = false — this is dangerous and may break time travel
A slow stage with few tasks processing most of the data indicates a partition skew problem — repartition the data or use adaptive query execution
Quick Check: Troubleshooting, Monitoring, and Optimization

Question 1 of 3

A data engineer notices that a Spark job's stage has 200 tasks, but one task takes 30 minutes while the rest finish in 2 minutes. What is the most likely cause?

Domain 712% of exam

Governance and Security

This domain covers ~12% with 4 sub-objectives. Tests your knowledge of managed vs external tables, Unity Catalog access controls (GRANT/REVOKE), column masking for sensitive data, row-level security with row filters, and Attribute-Based Access Control (ABAC) policies. Know the exact SQL syntax for granting and revoking permissions.

Key Topics

Unity CatalogGRANT/REVOKEColumn MaskingRow FiltersABAC PoliciesData Lineage

Must-Know Concepts

  • GRANT syntax: GRANT privilege ON object_type object_name TO principal (e.g., GRANT SELECT ON TABLE catalog.schema.orders TO analyst_group)
  • REVOKE syntax: REVOKE privilege ON object_type object_name FROM principal
  • Managed tables: data stored in Unity Catalog-managed location. DROP TABLE deletes both metadata and data files
  • External tables: data stored in customer-managed location. DROP TABLE removes only metadata from the catalog
  • Column masking: apply a SQL function to mask column values based on the querying user's identity or group membership
  • Row filters: apply a SQL predicate to filter rows based on the querying user — users only see rows they're authorized to access
  • Data lineage: Unity Catalog automatically captures column-level lineage — tracks how data flows between tables, notebooks, and jobs, viewable in Catalog Explorer

Common Exam Traps

GRANT SELECT does not imply GRANT USAGE — users also need USAGE permission on the catalog and schema to access a table
Column masks execute at query time — the underlying data is unchanged. The mask function must return the same data type as the column
ABAC policies use tags to control access dynamically — they complement, not replace, explicit GRANT statements
The owner of a securable object has all privileges by default and can grant privileges to others
Quick Check: Governance and Security

Question 1 of 3

A data engineer needs to give the 'analysts' group read-only access to a table called 'prod.sales.orders'. Which SQL statement is correct?

Key Databricks Concepts Compared

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

Auto Loader vs COPY INTO

Use Auto Loader when…

You need to continuously ingest new files as they arrive in cloud storage. Best for large directories with many files — uses file notification to efficiently detect new files.

Use COPY INTO when…

You need a simple, one-time or periodic batch load of files from cloud storage. Best for small directories with predictable file volumes.

Exam trap

Auto Loader uses a checkpoint to track files (efficient, scalable). COPY INTO scans the source directory each run to detect unprocessed files (simpler but slower at scale). If the question mentions 'millions of files' or 'continuous ingestion,' choose Auto Loader.

Managed Tables vs External Tables

Use Managed Tables when…

You want Databricks to fully manage the table lifecycle. Data files are stored in a Unity Catalog-managed location. Dropping the table deletes the data.

Use External Tables when…

You need the table to reference data stored in your own cloud storage location. Dropping the table removes only metadata — the underlying files remain.

Exam trap

DROP TABLE on a managed table deletes the data permanently. DROP TABLE on an external table only removes it from the catalog — files are untouched. The exam tests this distinction heavily.

Batch Processing vs Streaming

Use Batch Processing when…

Your data arrives in complete, bounded datasets and latency of minutes to hours is acceptable. Simpler to reason about, easier to reprocess.

Use Streaming when…

You need near-real-time processing with second-to-minute latency. Data arrives continuously and must be processed incrementally.

Exam trap

Auto Loader is streaming (uses readStream). COPY INTO is batch (runs as a single command). Structured Streaming with trigger(availableNow=True) bridges the gap — it processes all available data then stops, like batch but with streaming benefits.

Lakeflow Jobs vs Delta Live Tables

Use Lakeflow Jobs when…

You need general-purpose workflow orchestration: multi-task DAGs with notebooks, JARs, Python scripts, SQL queries, and control flow (if/else).

Use Delta Live Tables when…

You need declarative ETL pipelines where you define what the data should look like and the system handles orchestration, error handling, and data quality.

Exam trap

Lakeflow Jobs orchestrate tasks. Delta Live Tables define data transformations declaratively. A Lakeflow Job CAN include a Delta Live Tables pipeline as one of its tasks — they are complementary, not competing.

Views vs Materialized Views

Use Views when…

You want a saved query that runs fresh every time it's accessed. No storage cost but recomputes on each read.

Use Materialized Views when…

You want a precomputed query result stored as a table for faster reads. Trades storage for speed — must be refreshed to reflect source changes.

Exam trap

Views are always up-to-date (recompute each time). Materialized views can be stale until refreshed. If the question asks about 'always current data,' choose a view. If it asks about 'fast dashboard reads,' choose materialized view.

Git Folders vs Declarative Automation Bundles

Use Git Folders when…

You need version control for notebooks and code files within the Databricks workspace. Enables branching, committing, and pulling from remote Git repos.

Use Declarative Automation Bundles when…

You need infrastructure-as-code to define, test, and deploy Databricks resources (jobs, pipelines, clusters) across environments (dev/staging/prod).

Exam trap

Git Folders manage code versioning. Automation Bundles manage resource deployment. They serve different purposes in CI/CD: Git Folders are for source control, Bundles are for deployment automation.

Top Mistakes to Avoid

Confusing Auto Loader (streaming, checkpoint-based, scalable) with COPY INTO (batch, directory-scan, simpler) — know when to use each
Forgetting that dropping a managed table deletes the data, while dropping an external table only removes metadata
Using window functions in WHERE clauses directly — they must be wrapped in a subquery or CTE
Not setting spark.databricks.delta.schema.autoMerge.enabled = true when MERGE INTO encounters new columns
Thinking VACUUM is safe with very short retention — it breaks time travel and can cause concurrent query failures
Confusing Git Folders (code version control) with Declarative Automation Bundles (resource deployment) — they are complementary
Assuming GRANT SELECT on a table is sufficient — users also need USAGE on the catalog and schema
Not deduplicating source data before MERGE INTO — duplicate source matches cause the MERGE to fail

Exam-Ready Checklist

Can explain all 7 exam domains and their relative weights (updated May 2026)
Know Delta Lake fundamentals cold: ACID, time travel, schema evolution, VACUUM, DESCRIBE HISTORY
Can write COPY INTO and MERGE INTO statements from memory with correct syntax
Understand Auto Loader: cloudFiles format, schema inference, rescuedDataColumn, when to use vs COPY INTO
Know Unity Catalog three-level namespace and GRANT/REVOKE syntax
Can configure Lakeflow Jobs with DAG dependencies, control flows, and scheduling triggers
Understand CI/CD: Git Folders for code, Automation Bundles for deployment, CLI for automation
Can interpret Spark UI to diagnose data skew, shuffle spill, and straggler tasks
Know the difference between managed and external tables and what happens on DROP
Scored 80%+ on at least two full practice exams
Reviewed all incorrect answers and understand why the right answer is right
Can complete the exam within time: average 2 minutes per question

Recommended Resources

Free & Official Resources

Paid Courses & Practice Exams

These are recommended if you prefer a structured learning path. They can save time but are not required to pass.

Frequently Asked Questions