What is Medallion Architecture?
The Medallion Architecture is a data design pattern that organizes data in a lakehouse into three layers of increasing quality: Bronze, Silver, and Gold. This is the recommended approach for structuring data in Databricks and is heavily tested on the exam.
Bronze Layer (Raw)
The Bronze layer holds raw, unprocessed data exactly as it arrives from source systems. Characteristics: - Append-only — never modify raw data - Preserves original schema and data types - Includes metadata columns (ingestion timestamp, source file name) - Used for data replay, auditing, and debugging - Minimal or no transformations
# Bronze: Ingest raw data with metadata
spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'json') \
.option('cloudFiles.schemaLocation', '/schema/bronze') \
.load('/data/raw/') \
.withColumn('_ingestion_time', current_timestamp()) \
.withColumn('_source_file', input_file_name()) \
.writeStream \
.option('checkpointLocation', '/checkpoints/bronze') \
.toTable('bronze_events')Silver Layer (Cleaned)
The Silver layer contains cleaned, validated, and standardized data. Transformations at this layer: - Remove duplicates - Handle null values - Standardize data types and formats - Join with reference/dimension data - Apply data quality rules - Filter out invalid records
-- Silver: Clean and deduplicate
CREATE OR REPLACE TABLE silver_events AS
SELECT
event_id,
CAST(event_date AS DATE) AS event_date,
UPPER(TRIM(country)) AS country,
COALESCE(amount, 0) AS amount,
_ingestion_time
FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY event_id ORDER BY _ingestion_time DESC
) AS rn
FROM bronze_events
WHERE event_id IS NOT NULL
)
WHERE rn = 1;Gold Layer (Business)
The Gold layer contains business-level aggregates and features optimized for consumption by analysts, dashboards, and ML models. Characteristics: - Pre-aggregated metrics and KPIs - Denormalized for query performance - Materialized views for BI tools - Optimized for specific business use cases
-- Gold: Business-level aggregates
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT
event_date,
country,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM silver_events
GROUP BY event_date, country;