CertPrepNowFREE
DatabricksData Transformation and Modeling

Medallion Architecture (Bronze/Silver/Gold)

What is Medallion Architecture?

The Medallion Architecture is a data design pattern that organizes data in a lakehouse into three layers of increasing quality: Bronze, Silver, and Gold. This is the recommended approach for structuring data in Databricks and is heavily tested on the exam.

Exam Tip: Medallion Architecture is a core exam concept. Know which transformations happen at each layer and what each layer is used for.

Bronze Layer (Raw)

The Bronze layer holds raw, unprocessed data exactly as it arrives from source systems. Characteristics: - Append-only — never modify raw data - Preserves original schema and data types - Includes metadata columns (ingestion timestamp, source file name) - Used for data replay, auditing, and debugging - Minimal or no transformations

# Bronze: Ingest raw data with metadata
spark.readStream.format('cloudFiles') \
  .option('cloudFiles.format', 'json') \
  .option('cloudFiles.schemaLocation', '/schema/bronze') \
  .load('/data/raw/') \
  .withColumn('_ingestion_time', current_timestamp()) \
  .withColumn('_source_file', input_file_name()) \
  .writeStream \
  .option('checkpointLocation', '/checkpoints/bronze') \
  .toTable('bronze_events')

Silver Layer (Cleaned)

The Silver layer contains cleaned, validated, and standardized data. Transformations at this layer: - Remove duplicates - Handle null values - Standardize data types and formats - Join with reference/dimension data - Apply data quality rules - Filter out invalid records

-- Silver: Clean and deduplicate
CREATE OR REPLACE TABLE silver_events AS
SELECT
  event_id,
  CAST(event_date AS DATE) AS event_date,
  UPPER(TRIM(country)) AS country,
  COALESCE(amount, 0) AS amount,
  _ingestion_time
FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY event_id ORDER BY _ingestion_time DESC
  ) AS rn
  FROM bronze_events
  WHERE event_id IS NOT NULL
)
WHERE rn = 1;
Exam Tip: Silver layer is where most data quality work happens. Deduplication, type casting, null handling, and joins with reference data all happen here.

Gold Layer (Business)

The Gold layer contains business-level aggregates and features optimized for consumption by analysts, dashboards, and ML models. Characteristics: - Pre-aggregated metrics and KPIs - Denormalized for query performance - Materialized views for BI tools - Optimized for specific business use cases

-- Gold: Business-level aggregates
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT
  event_date,
  country,
  COUNT(*) AS total_orders,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value
FROM silver_events
GROUP BY event_date, country;
Exam Tip: Gold layer objects can be tables, views, materialized views, or streaming tables. Know the difference between these — the exam tests when to use each.