CertPrepNowFREE
DatabricksDatabricks Intelligence Platform

Delta Lake Architecture Complete Guide

What is Delta Lake?

Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes. It runs on top of your existing cloud storage (S3, ADLS, GCS) and adds a transaction log that tracks every change to your data. Delta Lake is the default table format in Databricks — when you CREATE TABLE, you get a Delta table unless you specify otherwise.

Exam Tip: On the exam, assume all tables are Delta tables unless explicitly stated otherwise. Delta Lake is the foundational technology for the entire Databricks platform.

ACID Transactions

Delta Lake provides full ACID transaction support: - Atomicity: Each write operation either fully succeeds or fully fails — no partial writes - Consistency: Data always reflects a valid state after transactions - Isolation: Concurrent reads and writes don't interfere with each other (snapshot isolation) - Durability: Committed changes are permanent and survive failures The transaction log (_delta_log directory) records every operation as a JSON file. Every 10 commits, a checkpoint Parquet file is created for faster log reading.

my_table/
  _delta_log/
    00000000000000000000.json    # First transaction
    00000000000000000001.json    # Second transaction
    ...
    00000000000000000010.checkpoint.parquet  # Checkpoint at commit 10
  part-00000-xxxx.parquet         # Data files
  part-00001-xxxx.parquet
Exam Tip: The transaction log is stored in a _delta_log directory. Checkpoints are created every 10 commits. This is a common exam question topic.

Time Travel

Delta Lake stores historical versions of your data. You can query any past version by version number or timestamp. Time travel is limited by the VACUUM retention period — once you vacuum old files, you can't time travel to those versions. The default retention is 7 days (168 hours).

-- Query a specific version
SELECT * FROM my_table VERSION AS OF 5;

-- Query by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '2026-01-15';

-- View version history
DESCRIBE HISTORY my_table;

-- Restore to a previous version
RESTORE TABLE my_table TO VERSION AS OF 5;
Exam Tip: VACUUM deletes old data files. After VACUUM, you cannot time travel to versions that required those files. The default retention is 168 hours (7 days). Never set retention below 7 days in production.

Schema Enforcement and Evolution

Schema enforcement ensures that every write to a Delta table matches the table's schema. If you try to write data with extra columns or wrong types, the write fails. Schema evolution allows the schema to automatically change when new columns appear in source data. You enable it with mergeSchema option. - Schema enforcement: ON by default, prevents bad data from entering the table - Schema evolution: OFF by default, must be explicitly enabled per write operation

# Schema enforcement (default) — this fails if schema doesn't match
df.write.format('delta').mode('append').saveAsTable('my_table')

# Schema evolution — automatically adds new columns
df.write.format('delta') \
  .option('mergeSchema', 'true') \
  .mode('append') \
  .saveAsTable('my_table')
Exam Tip: Schema enforcement is ON by default. Schema evolution must be explicitly enabled. The exam tests whether you know which is the default behavior.

OPTIMIZE and Z-ORDER

Delta tables can accumulate many small files over time (the 'small file problem'). OPTIMIZE compacts these into larger files for better read performance. Z-ORDER co-locates related data in the same files based on column values. This dramatically speeds up queries that filter on Z-ORDER columns. OPTIMIZE is idempotent — running it multiple times is safe.

-- Compact small files
OPTIMIZE my_table;

-- Compact and co-locate data by columns
OPTIMIZE my_table ZORDER BY (country, date);

-- Delete old files (vacuum)
VACUUM my_table RETAIN 168 HOURS;
Exam Tip: Know the difference between OPTIMIZE (compacts files), ZORDER (co-locates data), and VACUUM (deletes old files). They are separate operations with different purposes.

Liquid Clustering (New in 2026 Exam)

Liquid Clustering is the modern replacement for OPTIMIZE + ZORDER. Instead of manually running OPTIMIZE, Liquid Clustering automatically reorganizes data layout as data is written. Key differences from Z-ORDER: - No manual OPTIMIZE commands needed - Clustering columns can be changed without rewriting all data - Works incrementally — only reorganizes new data - Recommended for all new tables

-- Create a table with Liquid Clustering
CREATE TABLE my_table (id INT, country STRING, amount DECIMAL)
CLUSTER BY (country);

-- Change clustering columns (no rewrite needed)
ALTER TABLE my_table CLUSTER BY (country, date);

-- Remove clustering
ALTER TABLE my_table CLUSTER BY NONE;
Exam Tip: Liquid Clustering is new to the May 2026 exam blueprint. It replaces manual OPTIMIZE + ZORDER. Know that it works incrementally and that clustering columns can be changed without rewriting existing data.