CertPrepNowFREE
DatabricksData Ingestion and Loading

Auto Loader vs COPY INTO: When to Use Each

Overview

Both Auto Loader and COPY INTO are used to incrementally load new files from cloud storage into Delta tables. They track which files have been processed and only load new ones. The key difference: Auto Loader is a streaming solution that runs continuously, while COPY INTO is a batch SQL command that runs on-demand.

Exam Tip: The exam frequently tests when to choose Auto Loader vs COPY INTO. Auto Loader is preferred for most production use cases due to better scalability and schema evolution support.

Auto Loader (cloudFiles)

Auto Loader uses Structured Streaming with the cloudFiles format. It processes new files as they arrive and can run continuously or in triggered batches. Two file detection modes: - Directory listing: Scans the directory for new files (simpler, works everywhere) - File notification: Uses cloud event notifications to detect new files (faster for large directories, requires cloud setup) Auto Loader supports schema inference and schema evolution natively.

spark.readStream.format('cloudFiles') \
  .option('cloudFiles.format', 'json') \
  .option('cloudFiles.schemaLocation', '/checkpoints/schema') \
  .option('cloudFiles.schemaEvolutionMode', 'addNewColumns') \
  .load('/data/landing/') \
  .writeStream \
  .option('checkpointLocation', '/checkpoints/data') \
  .trigger(availableNow=True) \
  .toTable('bronze_table')
Exam Tip: Auto Loader requires a schema location (for schema tracking) and a checkpoint location (for file tracking). Both are required — the exam tests this.

COPY INTO

COPY INTO is a SQL command for incremental batch loading. It's idempotent — running it multiple times only loads files that haven't been processed yet. Best for: - Simple batch loading scenarios - When you want SQL-only pipelines - Small to medium file volumes - Ad-hoc data loading

COPY INTO my_table
FROM '/data/landing/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

When to Use Which

Use Auto Loader when: - You have millions of files or files arriving continuously - You need schema evolution (new columns in source data) - You want streaming semantics (exactly-once guarantees) - You're building production pipelines Use COPY INTO when: - You have a small number of files (thousands, not millions) - You want a simple SQL command - You're doing ad-hoc or infrequent loads - You don't need streaming semantics

Exam Tip: The exam's default recommendation is Auto Loader for production workloads. COPY INTO is simpler but doesn't scale as well for large file volumes. If the question mentions 'millions of files' or 'production pipeline', choose Auto Loader.