Overview
Both Auto Loader and COPY INTO are used to incrementally load new files from cloud storage into Delta tables. They track which files have been processed and only load new ones. The key difference: Auto Loader is a streaming solution that runs continuously, while COPY INTO is a batch SQL command that runs on-demand.
Auto Loader (cloudFiles)
Auto Loader uses Structured Streaming with the cloudFiles format. It processes new files as they arrive and can run continuously or in triggered batches. Two file detection modes: - Directory listing: Scans the directory for new files (simpler, works everywhere) - File notification: Uses cloud event notifications to detect new files (faster for large directories, requires cloud setup) Auto Loader supports schema inference and schema evolution natively.
spark.readStream.format('cloudFiles') \
.option('cloudFiles.format', 'json') \
.option('cloudFiles.schemaLocation', '/checkpoints/schema') \
.option('cloudFiles.schemaEvolutionMode', 'addNewColumns') \
.load('/data/landing/') \
.writeStream \
.option('checkpointLocation', '/checkpoints/data') \
.trigger(availableNow=True) \
.toTable('bronze_table')COPY INTO
COPY INTO is a SQL command for incremental batch loading. It's idempotent — running it multiple times only loads files that haven't been processed yet. Best for: - Simple batch loading scenarios - When you want SQL-only pipelines - Small to medium file volumes - Ad-hoc data loading
COPY INTO my_table
FROM '/data/landing/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');When to Use Which
Use Auto Loader when: - You have millions of files or files arriving continuously - You need schema evolution (new columns in source data) - You want streaming semantics (exactly-once guarantees) - You're building production pipelines Use COPY INTO when: - You have a small number of files (thousands, not millions) - You want a simple SQL command - You're doing ad-hoc or infrequent loads - You don't need streaming semantics