DatabricksDB DE Associate59 concepts
DB DE Associate Cheat Sheet
Quick reference for the Databricks Certified Data Engineer Associate exam.
Quick Navigation
Delta Lake Core CommandsData Ingestion — Auto LoaderData Ingestion — COPY INTODLT / Lakeflow Declarative PipelinesMERGE INTO (Upsert Pattern)Medallion ArchitectureSpark SQL TransformationsPySpark DataFrame OperationsLakeflow Jobs & SchedulingCI/CD & Automation BundlesUnity Catalog & GovernanceSpark Performance & Troubleshooting
Delta Lake Core Commands
- CREATE TABLE my_table (id INT, name STRING) USING DELTA
- Create a managed Delta table in Unity Catalog
- DESCRIBE HISTORY my_table
- View table version history — shows operations, timestamps, and user info
- SELECT * FROM my_table VERSION AS OF 3
- Time travel — query a specific version of the table
- SELECT * FROM my_table TIMESTAMP AS OF '2026-01-01'
- Time travel — query table state at a specific timestamp
- RESTORE TABLE my_table TO VERSION AS OF 3
- Restore table to a previous version
- OPTIMIZE my_table
- Compact small files into larger ones for better read performance
- OPTIMIZE my_table ZORDER BY (col1, col2)
- Compact files and co-locate related data for faster filter queries
- VACUUM my_table RETAIN 168 HOURS
- Delete old files no longer referenced by the transaction log (default 7 days)
Data Ingestion — Auto Loader
- spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .option('cloudFiles.schemaLocation', '/schema') .load('/data/input')
- Auto Loader with schema inference — automatically detects and processes new files
- .option('cloudFiles.schemaEvolutionMode', 'addNewColumns')
- Enable schema evolution — automatically adds new columns from source data
- .option('cloudFiles.useNotifications', 'true')
- Use file notification mode (faster for large directories, requires cloud setup)
- spark.readStream.format('cloudFiles') .option('cloudFiles.format', 'json') .schema(defined_schema) .load('/data/input')
- Auto Loader with explicit schema — no schema inference or evolution
- trigger(availableNow=True): Processes all available data then stops — combines streaming checkpoint tracking with batch semantics
- Use with Auto Loader or Structured Streaming to process all pending files in one micro-batch then stop
- rescuedDataColumn: Captures malformed records in a _rescued_data column instead of failing the stream
- Enabled via .option('rescuedDataColumn', '_rescued_data') — keeps bad records for later inspection
Data Ingestion — COPY INTO
- COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true')
- Incremental file loading — only processes new files since last COPY INTO
- COPY INTO my_table FROM 's3://bucket/path' FILEFORMAT = JSON
- Load JSON files from S3 cloud storage into a Delta table
- Lakeflow Connect: Managed connectors for ingesting from SaaS sources (Salesforce, databases) with built-in CDC. Use for continuous replication from external sources — replaces manual JDBC/ODBC patterns
- Fully managed ingestion service for replicating data from external systems into the lakehouse
DLT / Lakeflow Declarative Pipelines
- CREATE STREAMING LIVE TABLE my_table AS SELECT * FROM STREAM(source_table)
- Create a streaming live table that incrementally processes new data from the source
- @dlt.table def my_table(): return spark.read.table('source')
- Python DLT syntax for defining a materialized view
- Expectations: @dlt.expect('valid_id', 'id IS NOT NULL') @dlt.expect_or_drop('positive', 'amount > 0') @dlt.expect_or_fail('unique', 'count = 1')
- Data quality constraints: expect (warn), expect_or_drop (filter), expect_or_fail (abort pipeline)
MERGE INTO (Upsert Pattern)
- MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
- Upsert — update existing rows, insert new ones. Core SCD Type 1 pattern.
- MERGE INTO target AS t USING source AS s ON t.id = s.id WHEN MATCHED AND s.deleted = true THEN DELETE WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *
- Upsert with conditional delete — handle soft deletes from source
Medallion Architecture
- Bronze → Silver → Gold
- Three-layer data architecture. Bronze = raw ingestion, Silver = cleaned/validated, Gold = business-level aggregates
- Bronze: Raw data, append-only, no transformations
- Ingest as-is from source systems. Keep original schema. Used for replay and audit.
- Silver: Cleaned, deduplicated, schema-enforced
- Apply data quality rules, join reference data, standardize types. Most transformations happen here.
- Gold: Business-level aggregates and features
- Materialized views, dashboards, ML features. Optimized for consumption by analysts and BI tools.
Spark SQL Transformations
- SELECT *, explode(array_col) AS item FROM my_table
- Flatten array column — creates one row per array element
- SELECT *, col1:nested_key AS extracted FROM my_table
- Extract value from JSON string or VARIANT column using colon syntax (use dot notation for STRUCT fields)
- SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id
- Inner join — only matching rows from both tables
- SELECT /*+ BROADCAST(small_table) */ * FROM big JOIN small_table ON big.id = small_table.id
- Broadcast join hint — replicate small table to all nodes for faster join
- SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn FROM my_table QUALIFY rn = 1
- Deduplication — keep only the latest row per id
PySpark DataFrame Operations
- df.filter(col('status') == 'active') .select('id', 'name', 'amount') .withColumn('amount_usd', col('amount') * 1.1) .dropDuplicates(['id'])
- Filter, select, add column, and deduplicate in one chain
- df.groupBy('category').agg( count('*').alias('total'), avg('amount').alias('avg_amount'), approx_count_distinct('user_id').alias('unique_users') )
- Aggregate with count, average, and approximate distinct count
- df.join(other_df, on='id', how='left')
- Left join DataFrames on a common column
Lakeflow Jobs & Scheduling
- Trigger Types: - Scheduled (cron) - File arrival - Table update
- Three job trigger types. Use scheduled for regular cadence, file arrival for event-driven, table update for dependency-based.
- Task Types: - Notebook task - SQL query task - Pipeline task (Lakeflow Declarative Pipelines, formerly DLT) - Dashboard task
- Common task types in a Lakeflow Job DAG
- Control Flow: - Retries (max_retries, retry_on_timeout) - Conditional tasks (if/else branching) - For-each loops
- Pipeline orchestration patterns for error handling and dynamic workflows
- dbutils.jobs.taskValues.set(key='row_count', value=1000) dbutils.jobs.taskValues.get(taskKey='extract', key='row_count')
- Pass lightweight data between tasks in a job. Recommended over Delta tables for simple values.
- Repair Run: Re-execute only failed and downstream tasks, preserving results of successful tasks
- Use from the Jobs UI or API to retry a partial failure without re-running the entire job
CI/CD & Automation Bundles
- databricks bundle init
- Initialize a new Declarative Automation Bundle (formerly Databricks Asset Bundles/DABs) project
- databricks bundle validate
- Validate bundle configuration before deployment
- databricks bundle deploy -t production
- Deploy bundle to production target environment
- Git Folders (formerly Repos): - Clone, branch, commit, push from workspace UI - Pull request workflow via Git integration
- Version control integration in Databricks workspace for notebook development
Unity Catalog & Governance
- GRANT SELECT ON TABLE catalog.schema.table TO `group_name`
- Grant read access to a specific table
- GRANT USE SCHEMA ON SCHEMA catalog.schema TO `group_name`
- Grant ability to access objects within a schema (required before table grants)
- GRANT USE CATALOG ON CATALOG my_catalog TO `group_name`
- Required before users can access any schema or table within the catalog
- REVOKE ALL PRIVILEGES ON TABLE my_table FROM `user@email.com`
- Remove all permissions from a user on a table
- 3-level namespace: catalog.schema.table
- Unity Catalog organizes data in a three-level hierarchy. All data access goes through this namespace.
- Managed table: data stored in Unity Catalog managed storage External table: data stored in customer-managed cloud storage
- Key difference: dropping a managed table deletes the data; dropping an external table only removes the metadata.
- Column Masking: ALTER TABLE t ALTER COLUMN email SET MASK mask_fn
- Dynamically mask column values at query time based on user identity. Underlying data unchanged.
- Row Filters: ALTER TABLE t SET ROW FILTER filter_fn ON (region)
- Filter rows based on querying user's identity — users only see authorized rows
- Delta Sharing: Read-only data sharing across organizations using an open protocol. Recipients don't need Databricks.
- Share data externally without copying — recipients access via open Delta Sharing protocol
- Data Lineage: Automatically captured by Unity Catalog at column level. View in Catalog Explorer — tracks data flow across tables, notebooks, and jobs.
- Built-in lineage tracking helps understand data dependencies and impact analysis
Spark Performance & Troubleshooting
- spark.sql.shuffle.partitions = 200 (default)
- Number of partitions after a shuffle. Reduce for small datasets, increase for large ones.
- spark.sql.autoBroadcastJoinThreshold = 10MB
- Tables smaller than this threshold are automatically broadcast to all nodes during joins.
- Data Skew: One partition has much more data than others
- Look for in Spark UI — causes one task to run much longer. Fix with salting or repartitioning.
- Disk Spill: Data exceeds executor memory
- Visible in Spark UI stage details. Fix by increasing spark.executor.memory or reducing partition size.
- Liquid Clustering: ALTER TABLE t CLUSTER BY (col1, col2)
- Modern alternative to partitioning and ZORDER. Still uses OPTIMIZE for maintenance. Clustering keys can be changed without full table rewrite.
- Predictive Optimization: Auto-runs OPTIMIZE and VACUUM on managed tables based on usage patterns. Requires Unity Catalog managed tables.
- Enable at account or catalog level to automatically maintain tables without manual scheduling