CertPrepNowFREE
DatabricksDB DE Associate59 concepts

DB DE Associate Cheat Sheet

Quick reference for the Databricks Certified Data Engineer Associate exam.

Delta Lake Core Commands

CREATE TABLE my_table (id INT, name STRING) USING DELTA
Create a managed Delta table in Unity Catalog
DESCRIBE HISTORY my_table
View table version history — shows operations, timestamps, and user info
SELECT * FROM my_table VERSION AS OF 3
Time travel — query a specific version of the table
SELECT * FROM my_table TIMESTAMP AS OF '2026-01-01'
Time travel — query table state at a specific timestamp
RESTORE TABLE my_table TO VERSION AS OF 3
Restore table to a previous version
OPTIMIZE my_table
Compact small files into larger ones for better read performance
OPTIMIZE my_table ZORDER BY (col1, col2)
Compact files and co-locate related data for faster filter queries
VACUUM my_table RETAIN 168 HOURS
Delete old files no longer referenced by the transaction log (default 7 days)

Data Ingestion — Auto Loader

spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .option('cloudFiles.schemaLocation', '/schema') .load('/data/input')
Auto Loader with schema inference — automatically detects and processes new files
.option('cloudFiles.schemaEvolutionMode', 'addNewColumns')
Enable schema evolution — automatically adds new columns from source data
.option('cloudFiles.useNotifications', 'true')
Use file notification mode (faster for large directories, requires cloud setup)
spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .schema(defined_schema) .load('/data/input')
Auto Loader with explicit schema — no schema inference or evolution
trigger(availableNow=True): Processes all available data then stops — combines streaming checkpoint tracking with batch semantics
Use with Auto Loader or Structured Streaming to process all pending files in one micro-batch then stop
rescuedDataColumn: Captures malformed records in a _rescued_data column instead of failing the stream
Enabled via .option('rescuedDataColumn', '_rescued_data') — keeps bad records for later inspection

Data Ingestion — COPY INTO

COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true')
Incremental file loading — only processes new files since last COPY INTO
COPY INTO my_table FROM 's3://bucket/path' FILEFORMAT = JSON
Load JSON files from S3 cloud storage into a Delta table
Lakeflow Connect: Managed connectors for ingesting from SaaS sources (Salesforce, databases) with built-in CDC. Use for continuous replication from external sources — replaces manual JDBC/ODBC patterns
Fully managed ingestion service for replicating data from external systems into the lakehouse

DLT / Lakeflow Declarative Pipelines

CREATE STREAMING LIVE TABLE my_table AS SELECT * FROM STREAM(source_table)
Create a streaming live table that incrementally processes new data from the source
@dlt.table def my_table(): return spark.read.table('source')
Python DLT syntax for defining a materialized view
Expectations: @dlt.expect('valid_id', 'id IS NOT NULL') @dlt.expect_or_drop('positive', 'amount > 0') @dlt.expect_or_fail('unique', 'count = 1')
Data quality constraints: expect (warn), expect_or_drop (filter), expect_or_fail (abort pipeline)

MERGE INTO (Upsert Pattern)

MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
Upsert — update existing rows, insert new ones. Core SCD Type 1 pattern.
MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED AND s.deleted = true THEN DELETE WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
Upsert with conditional delete — handle soft deletes from source

Medallion Architecture

Bronze → Silver → Gold
Three-layer data architecture. Bronze = raw ingestion, Silver = cleaned/validated, Gold = business-level aggregates
Bronze: Raw data, append-only, no transformations
Ingest as-is from source systems. Keep original schema. Used for replay and audit.
Silver: Cleaned, deduplicated, schema-enforced
Apply data quality rules, join reference data, standardize types. Most transformations happen here.
Gold: Business-level aggregates and features
Materialized views, dashboards, ML features. Optimized for consumption by analysts and BI tools.

Spark SQL Transformations

SELECT *, explode(array_col) AS item FROM my_table
Flatten array column — creates one row per array element
SELECT *, col1:nested_key AS extracted FROM my_table
Extract value from JSON string or VARIANT column using colon syntax (use dot notation for STRUCT fields)
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id
Inner join — only matching rows from both tables
SELECT /*+ BROADCAST(small_table) */ * FROM big JOIN small_table ON big.id = small_table.id
Broadcast join hint — replicate small table to all nodes for faster join
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn FROM my_table QUALIFY rn = 1
Deduplication — keep only the latest row per id

PySpark DataFrame Operations

df.filter(col('status') == 'active') .select('id', 'name', 'amount') .withColumn('amount_usd', col('amount') * 1.1) .dropDuplicates(['id'])
Filter, select, add column, and deduplicate in one chain
df.groupBy('category').agg( count('*').alias('total'), avg('amount').alias('avg_amount'), approx_count_distinct('user_id').alias('unique_users') )
Aggregate with count, average, and approximate distinct count
df.join(other_df, on='id', how='left')
Left join DataFrames on a common column

Lakeflow Jobs & Scheduling

Trigger Types: - Scheduled (cron) - File arrival - Table update
Three job trigger types. Use scheduled for regular cadence, file arrival for event-driven, table update for dependency-based.
Task Types: - Notebook task - SQL query task - Pipeline task (Lakeflow Declarative Pipelines, formerly DLT) - Dashboard task
Common task types in a Lakeflow Job DAG
Control Flow: - Retries (max_retries, retry_on_timeout) - Conditional tasks (if/else branching) - For-each loops
Pipeline orchestration patterns for error handling and dynamic workflows
dbutils.jobs.taskValues.set(key='row_count', value=1000) dbutils.jobs.taskValues.get(taskKey='extract', key='row_count')
Pass lightweight data between tasks in a job. Recommended over Delta tables for simple values.
Repair Run: Re-execute only failed and downstream tasks, preserving results of successful tasks
Use from the Jobs UI or API to retry a partial failure without re-running the entire job

CI/CD & Automation Bundles

databricks bundle init
Initialize a new Declarative Automation Bundle (formerly Databricks Asset Bundles/DABs) project
databricks bundle validate
Validate bundle configuration before deployment
databricks bundle deploy -t production
Deploy bundle to production target environment
Git Folders (formerly Repos): - Clone, branch, commit, push from workspace UI - Pull request workflow via Git integration
Version control integration in Databricks workspace for notebook development

Unity Catalog & Governance

GRANT SELECT ON TABLE catalog.schema.table TO `group_name`
Grant read access to a specific table
GRANT USE SCHEMA ON SCHEMA catalog.schema TO `group_name`
Grant ability to access objects within a schema (required before table grants)
GRANT USE CATALOG ON CATALOG my_catalog TO `group_name`
Required before users can access any schema or table within the catalog
REVOKE ALL PRIVILEGES ON TABLE my_table FROM `user@email.com`
Remove all permissions from a user on a table
3-level namespace: catalog.schema.table
Unity Catalog organizes data in a three-level hierarchy. All data access goes through this namespace.
Managed table: data stored in Unity Catalog managed storage External table: data stored in customer-managed cloud storage
Key difference: dropping a managed table deletes the data; dropping an external table only removes the metadata.
Column Masking: ALTER TABLE t ALTER COLUMN email SET MASK mask_fn
Dynamically mask column values at query time based on user identity. Underlying data unchanged.
Row Filters: ALTER TABLE t SET ROW FILTER filter_fn ON (region)
Filter rows based on querying user's identity — users only see authorized rows
Delta Sharing: Read-only data sharing across organizations using an open protocol. Recipients don't need Databricks.
Share data externally without copying — recipients access via open Delta Sharing protocol
Data Lineage: Automatically captured by Unity Catalog at column level. View in Catalog Explorer — tracks data flow across tables, notebooks, and jobs.
Built-in lineage tracking helps understand data dependencies and impact analysis

Spark Performance & Troubleshooting

spark.sql.shuffle.partitions = 200 (default)
Number of partitions after a shuffle. Reduce for small datasets, increase for large ones.
spark.sql.autoBroadcastJoinThreshold = 10MB
Tables smaller than this threshold are automatically broadcast to all nodes during joins.
Data Skew: One partition has much more data than others
Look for in Spark UI — causes one task to run much longer. Fix with salting or repartitioning.
Disk Spill: Data exceeds executor memory
Visible in Spark UI stage details. Fix by increasing spark.executor.memory or reducing partition size.
Liquid Clustering: ALTER TABLE t CLUSTER BY (col1, col2)
Modern alternative to partitioning and ZORDER. Still uses OPTIMIZE for maintenance. Clustering keys can be changed without full table rewrite.
Predictive Optimization: Auto-runs OPTIMIZE and VACUUM on managed tables based on usage patterns. Requires Unity Catalog managed tables.
Enable at account or catalog level to automatically maintain tables without manual scheduling

Ready to test yourself?

Start a timed DB DE Associate mock exam or review practice questions by domain.