DatabricksDB DE Associate59 concepts

DB DE Associate Cheat Sheet

Quick reference for the Databricks Certified Data Engineer Associate exam.

Quick Navigation

Delta Lake Core Commands Data Ingestion — Auto Loader Data Ingestion — COPY INTO DLT / Lakeflow Declarative Pipelines MERGE INTO (Upsert Pattern)Medallion Architecture Spark SQL Transformations PySpark DataFrame Operations Lakeflow Jobs & Scheduling CI/CD & Automation Bundles Unity Catalog & Governance Spark Performance & Troubleshooting

Delta Lake Core Commands

CREATE TABLE my_table (id INT, name STRING) USING DELTA: Create a managed Delta table in Unity Catalog
DESCRIBE HISTORY my_table: View table version history — shows operations, timestamps, and user info
SELECT * FROM my_table VERSION AS OF 3: Time travel — query a specific version of the table
SELECT * FROM my_table TIMESTAMP AS OF '2026-01-01': Time travel — query table state at a specific timestamp
RESTORE TABLE my_table TO VERSION AS OF 3: Restore table to a previous version
OPTIMIZE my_table: Compact small files into larger ones for better read performance
OPTIMIZE my_table ZORDER BY (col1, col2): Compact files and co-locate related data for faster filter queries
VACUUM my_table RETAIN 168 HOURS: Delete old files no longer referenced by the transaction log (default 7 days)

Data Ingestion — Auto Loader

spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .option('cloudFiles.schemaLocation', '/schema') .load('/data/input'): Auto Loader with schema inference — automatically detects and processes new files
.option('cloudFiles.schemaEvolutionMode', 'addNewColumns'): Enable schema evolution — automatically adds new columns from source data
.option('cloudFiles.useNotifications', 'true'): Use file notification mode (faster for large directories, requires cloud setup)
spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .schema(defined_schema) .load('/data/input'): Auto Loader with explicit schema — no schema inference or evolution
trigger(availableNow=True): Processes all available data then stops — combines streaming checkpoint tracking with batch semantics: Use with Auto Loader or Structured Streaming to process all pending files in one micro-batch then stop
rescuedDataColumn: Captures malformed records in a _rescued_data column instead of failing the stream: Enabled via .option('rescuedDataColumn', '_rescued_data') — keeps bad records for later inspection

Data Ingestion — COPY INTO

COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true'): Incremental file loading — only processes new files since last COPY INTO
COPY INTO my_table FROM 's3://bucket/path' FILEFORMAT = JSON: Load JSON files from S3 cloud storage into a Delta table
Lakeflow Connect: Managed connectors for ingesting from SaaS sources (Salesforce, databases) with built-in CDC. Use for continuous replication from external sources — replaces manual JDBC/ODBC patterns: Fully managed ingestion service for replicating data from external systems into the lakehouse

DLT / Lakeflow Declarative Pipelines

CREATE STREAMING LIVE TABLE my_table AS SELECT * FROM STREAM(source_table): Create a streaming live table that incrementally processes new data from the source
@dlt.table def my_table(): return spark.read.table('source'): Python DLT syntax for defining a materialized view
Expectations: @dlt.expect('valid_id', 'id IS NOT NULL') @dlt.expect_or_drop('positive', 'amount > 0') @dlt.expect_or_fail('unique', 'count = 1'): Data quality constraints: expect (warn), expect_or_drop (filter), expect_or_fail (abort pipeline)

MERGE INTO (Upsert Pattern)

MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *: Upsert — update existing rows, insert new ones. Core SCD Type 1 pattern.
MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED AND s.deleted = true THEN DELETE WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *: Upsert with conditional delete — handle soft deletes from source

Medallion Architecture

Bronze → Silver → Gold: Three-layer data architecture. Bronze = raw ingestion, Silver = cleaned/validated, Gold = business-level aggregates
Bronze: Raw data, append-only, no transformations: Ingest as-is from source systems. Keep original schema. Used for replay and audit.
Silver: Cleaned, deduplicated, schema-enforced: Apply data quality rules, join reference data, standardize types. Most transformations happen here.
Gold: Business-level aggregates and features: Materialized views, dashboards, ML features. Optimized for consumption by analysts and BI tools.

Spark SQL Transformations

SELECT *, explode(array_col) AS item FROM my_table: Flatten array column — creates one row per array element
SELECT *, col1:nested_key AS extracted FROM my_table: Extract value from JSON string or VARIANT column using colon syntax (use dot notation for STRUCT fields)
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id: Inner join — only matching rows from both tables
SELECT /*+ BROADCAST(small_table) */ * FROM big JOIN small_table ON big.id = small_table.id: Broadcast join hint — replicate small table to all nodes for faster join
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn FROM my_table QUALIFY rn = 1: Deduplication — keep only the latest row per id

PySpark DataFrame Operations

df.filter(col('status') == 'active') .select('id', 'name', 'amount') .withColumn('amount_usd', col('amount') * 1.1) .dropDuplicates(['id']): Filter, select, add column, and deduplicate in one chain
df.groupBy('category').agg( count('*').alias('total'), avg('amount').alias('avg_amount'), approx_count_distinct('user_id').alias('unique_users') ): Aggregate with count, average, and approximate distinct count
df.join(other_df, on='id', how='left'): Left join DataFrames on a common column

Lakeflow Jobs & Scheduling

Trigger Types: - Scheduled (cron) - File arrival - Table update: Three job trigger types. Use scheduled for regular cadence, file arrival for event-driven, table update for dependency-based.
Task Types: - Notebook task - SQL query task - Pipeline task (Lakeflow Declarative Pipelines, formerly DLT) - Dashboard task: Common task types in a Lakeflow Job DAG
Control Flow: - Retries (max_retries, retry_on_timeout) - Conditional tasks (if/else branching) - For-each loops: Pipeline orchestration patterns for error handling and dynamic workflows
dbutils.jobs.taskValues.set(key='row_count', value=1000) dbutils.jobs.taskValues.get(taskKey='extract', key='row_count'): Pass lightweight data between tasks in a job. Recommended over Delta tables for simple values.
Repair Run: Re-execute only failed and downstream tasks, preserving results of successful tasks: Use from the Jobs UI or API to retry a partial failure without re-running the entire job

CI/CD & Automation Bundles

databricks bundle init: Initialize a new Declarative Automation Bundle (formerly Databricks Asset Bundles/DABs) project
databricks bundle validate: Validate bundle configuration before deployment
databricks bundle deploy -t production: Deploy bundle to production target environment
Git Folders (formerly Repos): - Clone, branch, commit, push from workspace UI - Pull request workflow via Git integration: Version control integration in Databricks workspace for notebook development

Unity Catalog & Governance

GRANT SELECT ON TABLE catalog.schema.table TO `group_name`: Grant read access to a specific table
GRANT USE SCHEMA ON SCHEMA catalog.schema TO `group_name`: Grant ability to access objects within a schema (required before table grants)
GRANT USE CATALOG ON CATALOG my_catalog TO `group_name`: Required before users can access any schema or table within the catalog
REVOKE ALL PRIVILEGES ON TABLE my_table FROM `user@email.com`: Remove all permissions from a user on a table
3-level namespace: catalog.schema.table: Unity Catalog organizes data in a three-level hierarchy. All data access goes through this namespace.
Managed table: data stored in Unity Catalog managed storage External table: data stored in customer-managed cloud storage: Key difference: dropping a managed table deletes the data; dropping an external table only removes the metadata.
Column Masking: ALTER TABLE t ALTER COLUMN email SET MASK mask_fn: Dynamically mask column values at query time based on user identity. Underlying data unchanged.
Row Filters: ALTER TABLE t SET ROW FILTER filter_fn ON (region): Filter rows based on querying user's identity — users only see authorized rows
Delta Sharing: Read-only data sharing across organizations using an open protocol. Recipients don't need Databricks.: Share data externally without copying — recipients access via open Delta Sharing protocol
Data Lineage: Automatically captured by Unity Catalog at column level. View in Catalog Explorer — tracks data flow across tables, notebooks, and jobs.: Built-in lineage tracking helps understand data dependencies and impact analysis

Spark Performance & Troubleshooting

spark.sql.shuffle.partitions = 200 (default): Number of partitions after a shuffle. Reduce for small datasets, increase for large ones.
spark.sql.autoBroadcastJoinThreshold = 10MB: Tables smaller than this threshold are automatically broadcast to all nodes during joins.
Data Skew: One partition has much more data than others: Look for in Spark UI — causes one task to run much longer. Fix with salting or repartitioning.
Disk Spill: Data exceeds executor memory: Visible in Spark UI stage details. Fix by increasing spark.executor.memory or reducing partition size.
Liquid Clustering: ALTER TABLE t CLUSTER BY (col1, col2): Modern alternative to partitioning and ZORDER. Still uses OPTIMIZE for maintenance. Clustering keys can be changed without full table rewrite.
Predictive Optimization: Auto-runs OPTIMIZE and VACUUM on managed tables based on usage patterns. Requires Unity Catalog managed tables.: Enable at account or catalog level to automatically maintain tables without manual scheduling

Ready to test yourself?

Start a timed DB DE Associate mock exam or review practice questions by domain.