SnowflakeDEA-C025 domains

DEA-C02 Exam Notes

Last-minute traps, must-know facts, and scenario tips for the SnowPro Advanced: Data Engineer exam.

General Exam Tips

1.Read ALL answer options before selecting — many questions have a 'most correct' answer, not just a correct one. Snowflake scenarios often have two plausible options; the distinction is usually cost efficiency or operational overhead.
2.Time budget: 65 questions in 115 minutes = ~1.75 min/question. Flag complex scenario questions and return to them. Never spend more than 3 minutes on one question on the first pass.
3.No penalty for wrong answers — always answer every question, even guesses. Leave nothing blank.
4.Scenario questions describe a business constraint (latency, cost, simplicity). Map the constraint to the correct Snowflake feature before reading the options. This prevents the options from anchoring you to wrong choices.
5.When two answers both seem right, ask: which is lower operational overhead? Which is more cost-efficient? The exam prefers native Snowflake features over external solutions when Snowflake has a built-in equivalent.
6.Multi-select questions (select two or three) are harder — partial credit is not given. If unsure, use elimination to narrow to the most defensible combination.
7.DEA-C02 was updated in February 2025. If you studied for DEA-C01, note that Domain 4 is now 'Data Governance' (not 'Security') and Iceberg tables, unstructured data ingestion, and Cortex LLM functions received increased coverage.
8.About 30% of questions test SQL syntax directly. The rest test architectural judgment about which Snowflake pattern fits a scenario. Practice both; do not over-index on memorizing syntax alone.
9.When a question describes spill-to-disk in the query profile, the answer is always 'scale up the warehouse'. When a question describes query queuing, the answer is always 'multi-cluster warehouse (scale out)'. These patterns are reliable.
10.SnowPro Core is a prerequisite. If Core concepts feel shaky, review micro-partitions, warehouses, roles, and basic SQL before going deep on Advanced topics.

Quick Navigation

Data Movement Data Transformation Performance Optimization Storage and Data Protection Data Governance

Domain 126% of exam

Data Movement

Must-Know Facts

COPY INTO uses VIRTUAL WAREHOUSE credits (your compute). Snowpipe uses SERVERLESS compute credits (Snowflake-managed, separate billing). This cost distinction appears on the exam explicitly.
Snowpipe delivery semantics: AT-LEAST-ONCE. Duplicate rows are possible if a file notification is replayed. Downstream deduplication is required for exactly-once guarantees.
Snowpipe Streaming (SDK-based) delivers EXACTLY-ONCE via offset token tracking per channel. It requires NO staging files and achieves sub-second latency. Classic Snowpipe and Snowpipe Streaming have fundamentally different delivery contracts.
Snowpipe Streaming uses CHANNELS (not stages, not SQS). Each channel is an ordered sequence of rows into a table. On recovery, call getLatestCommittedOffsetToken() to resume without duplicates.
Stream offset advances ONLY on a successful COMMIT of the consuming transaction. A task that starts but fails (without committing) does NOT advance the stream offset. The unchanged offset means the next run will see the same data.
SYSTEM$STREAM_HAS_DATA('stream_name') returns TRUE if there are unconsumed changes. Used in task WHEN clauses to avoid wasted compute running tasks against empty streams.
Stream types: standard (INSERT + UPDATE + DELETE), append-only (INSERT only, more efficient for insert-only pipelines), insert-only (for external tables only — no DML tracking, just new files).
Dynamic Table lag target is a MAXIMUM STALENESS threshold, not a fixed schedule. Snowflake chooses when to refresh. DOWNSTREAM lag propagates staleness from consuming dynamic tables up the pipeline.
Dynamic Table refresh modes: INCREMENTAL (default for supported SQL patterns), FULL (forced when SQL cannot support incremental refresh, e.g., certain window functions or multiple GROUP BYs). AUTO lets Snowflake decide.
External stages use STORAGE INTEGRATION objects for credential management — not hardcoded access keys. Storage integrations require configuration on both the Snowflake side and the cloud provider IAM side.
Kafka Connector for Snowflake can use either classic Snowpipe (file-based, at-least-once) or Snowpipe Streaming mode (row-level, exactly-once). The streaming mode is preferred for lower latency.
Data unloading COPY INTO exports to a stage with FILE_FORMAT, MAX_FILE_SIZE, HEADER, and SINGLE options. SINGLE=FALSE (default) writes multiple parallel files; SINGLE=TRUE writes one file.
Snowflake Secure Data Sharing: no data is copied or moved. Consumers query the provider's actual storage with their own warehouse. Changes to the source are immediately visible to consumers.

Common Traps

TrapSnowpipe is free / costs nothing because it is serverless

RealitySnowpipe uses Snowflake-managed serverless compute, which accrues separate compute credits. 'Serverless' means no virtual warehouse is needed, not that it is free. Budget for Snowpipe credits in cost analysis.

TrapBoth Snowpipe variants (file-based and Streaming) provide the same delivery guarantee

RealityClassic file-based Snowpipe provides AT-LEAST-ONCE delivery — duplicates are possible on notification replay. Snowpipe Streaming provides EXACTLY-ONCE delivery via per-channel offset token tracking. They are architecturally different, not interchangeable.

TrapA task that executes and reads from a stream advances the stream offset

RealityThe stream offset advances only when the consuming transaction COMMITS successfully. If the task reads the stream but fails before committing, the offset stays put. The next task run will re-read all unconsumed changes.

TrapDynamic Tables refresh on a fixed cron schedule like Tasks

RealityDynamic Tables use a LAG TARGET (maximum acceptable staleness), not a cron schedule. Snowflake's refresh service determines the actual refresh timing to meet the lag threshold. You cannot force a specific clock-time refresh without using DOWNSTREAM lag or manual REFRESH.

TrapDynamic Tables can use non-deterministic functions like CURRENT_TIMESTAMP() in their SQL definition

RealityDynamic Tables do not support non-deterministic functions in their defining query. CURRENT_TIMESTAMP(), RANDOM(), CURRENT_USER() will cause errors or unexpected behavior. Use deterministic expressions only.

TrapExternal tables have the same query performance as internal tables

RealityExternal tables read directly from cloud storage on every query with no caching and limited micro-partition pruning metadata. Performance is significantly slower than internal tables. Use external tables for query-in-place on data you cannot or do not want to load.

TrapStale Snowpipe pipelines automatically restart themselves

RealitySnowpipe pipelines can become stale (paused) due to errors or inactivity. Monitoring via SYSTEM$PIPE_STATUS() and COPY_HISTORY is required to detect stale pipelines. Restarting requires explicit ALTER PIPE REFRESH or investigation of the root cause.

Confusing Pairs

COPY INTOSnowpipe (file-based)

COPY INTO = batch load triggered manually/by schedule/Airflow, uses virtual warehouse credits, good for large periodic bulk loads. Snowpipe = continuous file-triggered load via SQS/Event Grid/Pub-Sub, uses serverless credits, good for near-real-time file ingestion. Key: COPY INTO is explicit and batched; Snowpipe is event-driven and continuous.

Snowpipe (file-based)Snowpipe Streaming (SDK-based)

File-based Snowpipe: triggered by file arrival, stages files, 1-5 min latency, at-least-once delivery. Snowpipe Streaming: SDK inserts rows directly (no files, no SQS), uses channels with offset tokens, sub-second latency, exactly-once delivery. Key difference: file vs row-level; at-least-once vs exactly-once.

Streams + TasksDynamic Tables

Streams + Tasks = procedural: you write the logic, scheduling, and error handling. Supports Snowpark, multi-target writes, custom exception handling, and non-SQL transformations. Dynamic Tables = declarative: define a SQL query and a lag target, Snowflake manages everything else. Prefer Dynamic Tables for pure SQL pipelines. Prefer Streams + Tasks when you need procedural control.

External TablesIceberg Tables

External tables = Snowflake-proprietary metadata overlay on cloud storage files; no open format; Snowflake-only compatible. Iceberg tables = Apache Iceberg open table format; supports schema evolution, partition evolution, time travel at format level; interoperable with Spark, Trino, Flink. Key: external tables are Snowflake-specific; Iceberg is an open, portable format.

Append-Only StreamStandard Stream

Append-only stream: captures INSERT operations only. More efficient (lower overhead) for insert-only source tables. Standard stream: captures INSERT, UPDATE, and DELETE. Required when you need to propagate deletes or upserts downstream. Key: use append-only when you know the source only ever inserts new rows.

Scenario Tips

If the question asks about:

Need to ingest files from S3 within 5 minutes of arrival with lowest operational overhead

Answer:

Snowpipe with AUTO_INGEST = TRUE and S3 event notifications to SQS. It triggers automatically on file arrival, uses serverless compute, and requires no scheduling code.

Distractor to avoid:

A scheduled Task running COPY INTO every 5 minutes — this requires a virtual warehouse to be running and may queue, missing the SLA during periods of warehouse contention.

If the question asks about:

Ingest CDC events from a Java application at sub-second latency with exactly-once delivery

Answer:

Snowpipe Streaming SDK with channels and offset token tracking. It bypasses file staging and SQS entirely, achieving sub-second latency with exactly-once semantics.

Distractor to avoid:

Classic Snowpipe — wrong because classic Snowpipe requires staging files and provides only at-least-once delivery. The scenario specifies exactly-once, which only Snowpipe Streaming provides.

If the question asks about:

Build a daily aggregation pipeline that processes only new orders added since the last run

Answer:

Append-only stream on the ORDERS table + a Task consuming the stream with SYSTEM$STREAM_HAS_DATA() in the WHEN clause. Append-only stream is more efficient than standard stream for insert-only tables.

Distractor to avoid:

Standard stream — works but is heavier overhead when only INSERTs need to be captured. If the question specifies insert-only pattern and asks for efficiency, append-only is preferred.

If the question asks about:

A company wants to share live product catalog data with 5 partners. Partners must see updates within minutes and cannot copy the data.

Answer:

Snowflake Secure Data Sharing via CREATE SHARE. Consumer queries the provider's actual storage in real time. No data movement occurs. Share secure views (not base tables) to prevent CTAS copying.

Distractor to avoid:

Database replication — creates actual copies that partners could then read from (and potentially extract). Replication also lags by the refresh interval, not real-time.

Last-Minute Facts

1Snowpipe auto-ingest for S3 uses SQS (not SNS directly). Azure uses Storage Queue notifications. GCS uses Pub/Sub.

2COPY INTO load history is maintained for 64 days — files loaded within this period are NOT reloaded unless FORCE = TRUE is specified.

3PURGE = TRUE in COPY INTO removes staged files after a successful load. Default is FALSE (files remain in stage).

4Stream stale condition: if a stream is not consumed within the source table's TIME_TRAVEL retention period, the stream becomes STALE and must be recreated.

5Dynamic Tables cannot be defined with non-deterministic functions (CURRENT_TIMESTAMP, RANDOM, UUID_STRING, etc.).

6Task DAGs: only the ROOT task has a SCHEDULE. Child tasks have AFTER clauses and NO SCHEDULE. Resume root task LAST when enabling a DAG.

7Kafka Connector with Snowpipe Streaming mode achieves exactly-once; with classic Snowpipe mode it achieves at-least-once.

8INSERT-ONLY streams are specifically for EXTERNAL TABLES (not regular tables). They capture only new file arrivals, not DML.

Domain 225% of exam

Data Transformation

Must-Know Facts

Snowpark uses LAZY EVALUATION — DataFrame transformations build a logical plan and execute nothing until an action is called (collect(), show(), count(), write()). This is the same model as Spark.
Pushdown optimization: Snowpark DataFrame operations (filter, select, join, groupBy, agg) compile to Snowflake SQL and execute INSIDE Snowflake's engine. The compute stays in Snowflake, data does not move to the client.
Pushdown BREAKS when you call a plain Python function (not part of the Snowpark API) directly inside a DataFrame transformation. That function runs client-side and Snowflake must transfer rows to the client.
Vectorized (batch) Python UDFs: use a pandas Series as input/output. Process multiple rows per function call instead of row-by-row. Significantly faster for large-volume Python transformations. Defined with @vectorized decorator or 'compute' method in a class.
EXECUTE AS OWNER stored procedures: run with the PROCEDURE OWNER's privileges, not the caller's. Enables controlled privilege escalation — a caller without direct table access can run a procedure that touches privileged tables. EXECUTE AS CALLER runs with the caller's own privileges (caller needs direct access).
Python UDF packages: only packages available in the Snowflake Anaconda channel can be referenced via PACKAGES = (...). Arbitrary pip packages require uploading to a stage and referencing via IMPORTS.
FLATTEN exam judgment: use FLATTEN (LATERAL) to explode ARRAY elements into separate rows. Use dot/bracket notation (v:field::TYPE) to read fields within an object without adding rows. A common trap is applying FLATTEN to a VARIANT object expecting field access — FLATTEN operates on arrays, not objects. If the question asks about reading nested object fields, dot notation is the answer; if it asks about one-row-per-element from an array, FLATTEN is the answer.
VARIANT column maximum value size is 16 MB. Documents exceeding 16 MB cannot be stored in a single VARIANT cell and must be chunked or split before loading.
External functions: HTTPS endpoint integration requiring an API integration object. Called row-by-row (Snowflake batches rows, but network round-trips add significant latency). Use for ML model endpoints or third-party lookups where data volume is moderate.
Cortex SENTIMENT returns a FLOAT between -1.0 (most negative) and 1.0 (most positive) — not a string label like 'positive'. EMBED_TEXT returns a VECTOR type, not an ARRAY. These syntax details appear as distractors.
Git repository integration: CREATE OR REPLACE GIT REPOSITORY ... ORIGIN = '<repo_url>'. Changes in the remote repo are NOT automatically pulled — a SYSTEM$GIT_REPOSITORY_FETCH() call or ALTER REPOSITORY FETCH is required.

Common Traps

TrapSnowpark is client-side Python that connects to Snowflake like a regular database driver

RealitySnowpark's key value proposition is server-side execution via pushdown optimization. DataFrame operations become Snowflake SQL plans that execute inside Snowflake's distributed engine. Snowpark is NOT just a Python connector — it brings the compute to the data.

TrapAny Python code inside a Snowpark DataFrame transformation benefits from pushdown

RealityOnly Snowpark API operations (filter, select, join, groupBy, etc.) push down. If you call a custom Python function that is not part of the Snowpark API inside a transformation, that code runs CLIENT-SIDE. Data rows are transferred to the Python client for that operation, eliminating the performance benefit.

TrapMaterialized views can use UDFs, CTEs, and CURRENT_ROLE() for masking logic

RealityMaterialized views have strict restrictions: no UDFs, no CTEs (WITH clause), no LATERAL joins, no UNION, no non-deterministic functions (including CURRENT_ROLE(), CURRENT_TIMESTAMP()). Trying to reference CURRENT_ROLE() in a materialized view will error or produce incorrect results. Use Dynamic Tables or regular views for role-based logic.

TrapEXECUTE AS CALLER stored procedures can access objects the caller cannot directly query

RealityEXECUTE AS CALLER runs with the CALLER's privileges — the caller must have direct access to any object the procedure touches. EXECUTE AS OWNER is the mechanism for controlled privilege escalation. This distinction is tested explicitly.

TrapCortex SENTIMENT returns a string label ('positive', 'negative', 'neutral')

RealitySENTIMENT returns a FLOAT between -1.0 and 1.0. To get a label, apply a CASE WHEN threshold (e.g., > 0.2 = 'positive', < -0.2 = 'negative'). CLASSIFY_TEXT returns string labels for user-defined categories.

TrapA UDF defined in JavaScript can use external HTTP calls or third-party libraries

RealityJavaScript UDFs run in a sandboxed Snowflake runtime with no network access and no external library imports. For external calls, use external functions (API Gateway integration). For third-party libraries, use Python UDFs with Snowflake Anaconda packages.

TrapClustering information: high average_depth means good clustering

RealityHigh average_depth means POOR clustering. average_depth near 1.0 is IDEAL — it means each value spans very few overlapping micro-partitions, so queries prune aggressively. The deprecated SYSTEM$CLUSTERING_RATIO returned 0–100 where higher=better, but SYSTEM$CLUSTERING_INFORMATION returns average_overlaps and average_depth (both: lower = better). These directions are counterintuitive and frequently reversed on the exam.

Confusing Pairs

Vectorized Python UDFRow-by-row Python UDF

Row-by-row UDF: called once per row, input is a scalar Python value. Slow for large tables due to Python overhead per row. Vectorized UDF: called in batches, input is a pandas Series, output is a pandas Series. Much faster for large-scale transforms. Use vectorized whenever the function can operate on pandas batch. Key: row-by-row = simplicity; vectorized = performance at scale.

External FunctionsSnowpark Python UDFs

External functions: call an external HTTPS endpoint (API Gateway + Lambda/Cloud Function). Per-batch network latency. Good when the logic MUST live outside Snowflake (e.g., proprietary ML model). Snowpark Python UDFs: run inside Snowflake's compute. No network round-trip. Good for custom Python logic using pandas/numpy that runs server-side. Key: external = off-platform execution; Snowpark UDF = on-platform execution.

Stored Procedure (EXECUTE AS OWNER)Stored Procedure (EXECUTE AS CALLER)

EXECUTE AS OWNER: procedure runs with owner's privileges. Caller does NOT need direct object access. Used for privilege escalation patterns (e.g., developers calling a procedure that writes to PII tables they cannot directly touch). EXECUTE AS CALLER: procedure runs with caller's own privileges. Caller must have direct access to all objects. Key: OWNER = controlled privilege escalation; CALLER = caller bears their own access requirements.

SUMMARIZEEXTRACT_ANSWER

SUMMARIZE: produces a concise summary of long-form text. No question needed — just compress content. EXTRACT_ANSWER: answers a specific question about a provided text passage (question-answering, not summarization). Key: SUMMARIZE = condense text; EXTRACT_ANSWER = find a specific answer within text given a question.

Snowpark ML (custom model training)Cortex LLM Functions (pre-built models)

Snowpark ML: train CUSTOM models inside Snowflake using sklearn-compatible API, Feature Store, Model Registry. You supply training data and model algorithm. Cortex: use PRE-BUILT large language models serverlessly via SQL functions (COMPLETE, SUMMARIZE, SENTIMENT, etc.). No training data required. Key: Snowpark ML = train your own; Cortex = use Snowflake's pre-built AI functions.

Scenario Tips

If the question asks about:

Apply a Python NLP classification function to 500 million rows using a pandas-compatible library

Answer:

Vectorized Snowpark Python UDF. It processes rows as pandas Series batches inside Snowflake's engine — no data movement, no per-row function call overhead.

Distractor to avoid:

External function calling an AWS Lambda — adds network latency for every row batch and requires maintaining Lambda infrastructure outside Snowflake. Only appropriate if the logic cannot run inside Snowflake.

If the question asks about:

A developer without direct access to the PII_CUSTOMERS table needs to run a GDPR deletion procedure on it

Answer:

Store procedure with EXECUTE AS OWNER where the owner has privilege on PII_CUSTOMERS. The developer calls the procedure; it runs with the owner's privileges. The developer never gets direct table access.

Distractor to avoid:

EXECUTE AS CALLER procedure — this would fail because the developer does not have direct access to PII_CUSTOMERS. CALLER context means the caller's privileges are used.

If the question asks about:

Summarize 10,000 customer support tickets and then classify their sentiment in SQL

Answer:

SELECT SNOWFLAKE.CORTEX.SENTIMENT(SNOWFLAKE.CORTEX.SUMMARIZE(ticket_text)) — chain SUMMARIZE (condense long text) then SENTIMENT (return float score -1.0 to 1.0).

Distractor to avoid:

EMBED_TEXT then CLASSIFY_TEXT — EMBED_TEXT produces vectors for semantic similarity search, not for sentiment. CLASSIFY_TEXT categorizes into user-defined buckets, not sentiment floats.

If the question asks about:

A team wants to version-control Snowpark Python scripts in GitHub and deploy directly into Snowflake

Answer:

Snowflake Git Repository integration: CREATE OR REPLACE GIT REPOSITORY linking the GitHub repo, then execute scripts with EXECUTE IMMEDIATE FROM @repo/path/script.sql. Requires a FETCH to pull latest changes.

Distractor to avoid:

Uploading scripts to an internal stage manually — works but has no automatic version-control linkage and requires manual uploads for every change.

Last-Minute Facts

1Snowpark collect() is an ACTION that triggers execution and returns data to the client. show() and count() also trigger execution. Simply defining .filter() or .select() does NOT run any SQL.

2VARIANT columns store up to 16 MB per value. Use FLATTEN for arrays; use dot/bracket notation for object field access.

3Cortex SENTIMENT output: FLOAT (-1.0 to 1.0). CLASSIFY_TEXT output: STRING label from the categories you provide. EMBED_TEXT output: VECTOR type (not ARRAY).

4Cortex billing trap: Cortex LLM functions consume SERVERLESS credits, not virtual warehouse credits. Running Cortex COMPLETE/SENTIMENT/SUMMARIZE does NOT consume credits on the virtual warehouse running the query — it bills separately to serverless. Cost analysis questions that ask 'which component drives cost for a Cortex workload?' must answer serverless credits, not warehouse size.

5Python UDF packages must be in the Snowflake Anaconda channel. Non-channel packages require IMPORTS from a stage.

6EXECUTE AS OWNER = privilege escalation (caller's access irrelevant). EXECUTE AS CALLER = caller must have direct object access.

7External functions require an API_INTEGRATION object in Snowflake and matching IAM configuration on the cloud provider side.

8Git repo integration does NOT auto-sync. You must run ALTER GIT REPOSITORY ... FETCH or SYSTEM$GIT_REPOSITORY_FETCH() to pull new commits.

Domain 321% of exam

Performance Optimization

Must-Know Facts

Result cache is ACCOUNT-LEVEL (shared across all users), free (zero compute credits), and valid for 24 hours per unique query hash. Each reuse resets the 24-hour window, up to a maximum of 31 days total.
Result cache is invalidated ONLY by DML on source tables (INSERT/UPDATE/DELETE/MERGE) or 24-hour expiry. Warehouse resize does NOT invalidate result cache. Warehouse suspend does NOT invalidate result cache.
Local disk cache (data cache) is WAREHOUSE-SPECIFIC SSD cache for recently scanned micro-partitions. Cleared when the warehouse SUSPENDS or is RESIZED. Warming up after suspend/resize takes 1-2 queries.
Cache exam judgment: the result cache survives warehouse suspend/resize; local disk cache does NOT (cleared on both). When a question asks 'what is cleared when the warehouse is suspended?' the answer is local disk cache, NOT result cache. The two user-visible cache types are result cache (account-level, Cloud Services layer) and local disk/data cache (warehouse SSD). A deeper FDN/remote data cache layer exists internally but is not directly testable.
Scale UP (larger warehouse size) = more memory and CPU per node = faster complex queries, fixes spill to disk, helps large sorts and joins. Scale OUT (multi-cluster warehouse) = more concurrent clusters = fewer queuing users. These solve different problems.
Spill to local storage: warehouse memory exceeded by intermediate data. Fix: scale up the warehouse. Spill to remote storage: local SSD also exhausted. Fix: scale up more, or optimize the query (avoid cartesian products).
Query profile key metrics: Partitions Scanned vs Total Partitions (pruning ratio), Bytes Spilled to Local Storage, Bytes Spilled to Remote Storage, join type (nested loop = no join key), row estimate accuracy.
average_depth near 1 = IDEAL (micro-partitions barely overlap). High average_depth = POOR clustering. average_overlaps: lower = better. The deprecated SYSTEM$CLUSTERING_RATIO returned 0–100 (higher = better); SYSTEM$CLUSTERING_INFORMATION uses average_depth and average_overlaps (both: lower = better). These directions are counterintuitive.
Search optimization service: creates a persistent search access path for equality predicates (col = val), IN predicates, ARRAY_CONTAINS, and LIKE with leading wildcards (SUBSTRING method). Best for point lookups returning very few rows.
Query acceleration service: offloads portions of LARGE SCAN queries to additional serverless compute. Best for ad-hoc analytics queries that scan many partitions. Controlled by QUERY_ACCELERATION_MAX_SCALE_FACTOR.
Automatic clustering has an ongoing credit cost — the background clustering service continuously reorganizes micro-partitions. Must weigh query performance improvement against clustering maintenance credits.

Common Traps

TrapResizing or suspending a warehouse clears the result cache

RealityResult cache is account-level, stored in cloud services — NOT in the warehouse. Warehouse resize or suspend clears the LOCAL DISK CACHE (warehouse SSD). Result cache survives warehouse lifecycle events and is shared across all users for identical queries.

TrapScaling up the warehouse fixes concurrency queuing when many users are waiting

RealityScale UP increases per-query compute but does NOT add concurrency slots. 200 users still queue behind each other on a larger single-cluster warehouse. Multi-cluster warehouse (scale OUT) adds additional clusters to handle concurrent query load. Use scale-out for concurrency problems.

TrapDefining any clustering key on a large table improves performance

RealityClustering keys only improve pruning when the clustered column is used in WHERE filters and has appropriate cardinality. UUID or surrogate key columns (too many distinct values) provide minimal pruning benefit while adding ongoing automatic clustering credit cost. High-cardinality date/status columns used in WHERE filters are good choices.

TrapSearch optimization and query acceleration solve the same performance problem

RealitySearch optimization is for HIGH-SELECTIVITY queries (equality/IN predicates returning very few rows — point lookups). Query acceleration is for LOW-SELECTIVITY queries (large scans returning many rows — ad-hoc analytics). They target opposite ends of the selectivity spectrum and can coexist on the same table.

TrapMaterialized view refreshes use the virtual warehouse assigned to the view

RealityMaterialized view background refresh uses Snowflake-managed SERVERLESS compute, not a virtual warehouse. Separate from query execution which uses a virtual warehouse. Materialized view maintenance costs appear in serverless credit usage.

TrapResult cache caches results even when masking policies are applied to columns

RealityMasking policies BYPASS result cache sharing across roles with different masking outcomes. User A (unmasked) and User B (masked) running the same query do NOT share cached results — each query executes independently to prevent data leakage through cache.

TrapHigh average_depth and high average_overlaps both indicate good clustering

RealityBoth average_depth and average_overlaps: LOW values = GOOD clustering; HIGH values = POOR clustering. average_depth near 1.0 is ideal (micro-partitions barely overlap). average_overlaps near 0 is ideal. The deprecated SYSTEM$CLUSTERING_RATIO used 0–100 with higher=better — do not mix up the direction with average_depth/average_overlaps.

Confusing Pairs

Scale Up (Larger Warehouse)Scale Out (Multi-cluster Warehouse)

Scale Up = run a larger warehouse size (X-Small → Small → Medium → etc.). More memory and compute per query. Fixes: spills to disk, slow complex sorts/joins, single large query performance. Scale Out = multi-cluster warehouse adds clusters when queries queue. Fixes: many concurrent users waiting in queue. Key question: is ONE query slow, or are MANY queries waiting?

Search Optimization ServiceQuery Acceleration Service

Search Optimization: builds a persistent access path for EQUALITY and IN predicates. Returns FEW rows (high selectivity). Good for lookups like 'find customer with email X'. Query Acceleration: offloads portions of queries that scan LARGE portions of a table to serverless compute. Returns MANY rows (low selectivity). Good for ad-hoc analytics. Key: Search = few matching rows; Acceleration = many rows scanned.

Result CacheLocal Disk Cache

Result cache: stores the OUTPUT of a completed query. Account-level, free, 24-hour TTL (extendable to 31 days by reuse), survives warehouse events. Local disk cache: stores recently scanned MICRO-PARTITIONS (raw data blocks). Warehouse-specific SSD, cleared on suspend/resize, improves repeat scan speed but still runs the query. Key: Result cache = no query execution; Local disk cache = query runs but reads from SSD instead of cloud storage.

Automatic ClusteringManual RECLUSTER

Automatic clustering: enabled by ALTER TABLE ... CLUSTER BY (col). Snowflake's background service continuously maintains clustering as DML disrupts it. Ongoing credit cost. Manual RECLUSTER: ALTER TABLE ... RECLUSTER (deprecated, replaced by automatic clustering). Automatic is now the standard approach. Key: just define the clustering key — automatic clustering handles maintenance.

Scenario Tips

If the question asks about:

Query profile shows 97% of partitions scanned on a 10TB table with a WHERE filter on ORDER_DATE

Answer:

Define a clustering key on ORDER_DATE. Poor pruning (97% scanned) for a date filter means micro-partitions are not organized by date. Clustering reorganizes data so date-range queries skip irrelevant partitions.

Distractor to avoid:

Enable search optimization — search optimization helps with equality predicates (ORDER_DATE = '2026-01-01'), NOT range scans (ORDER_DATE BETWEEN ...). Date range filters benefit from clustering, not search optimization.

If the question asks about:

200 concurrent analysts see 5+ minute queue times on an XL warehouse. Individual queries are fast when they eventually run.

Answer:

Enable multi-cluster warehouse (scale out) with auto-scale. The bottleneck is concurrency (too many queries queuing), not single-query performance. Adding clusters reduces queue depth.

Distractor to avoid:

Scale up to 2XL — this makes each query slightly faster but does nothing to reduce the number of concurrent queries waiting in the queue.

If the question asks about:

The same complex aggregation query on SALES runs 50 times per day by different analysts. The underlying data changes only once per day at 2 AM.

Answer:

Result cache serves repeated identical queries for free. Since the SALES table only changes at 2 AM, the result cache remains valid for the entire business day after the data load. Zero compute credits for the repeated executions.

Distractor to avoid:

Create a materialized view — also correct but adds storage and maintenance overhead. Result cache is free and requires no setup, making it the more cost-efficient answer when the exam asks about eliminating redundant compute cost.

If the question asks about:

A customer lookup query WHERE email = 'user@example.com' scans the entire 5TB CUSTOMERS table on every execution

Answer:

Enable search optimization with EQUALITY(email). This creates a persistent access path for equality predicates on the email column, allowing the lookup to skip irrelevant partitions without reading the full table.

Distractor to avoid:

Define a clustering key on email — clustering on email (high cardinality) is a poor choice. Each customer has a unique email, so micro-partitions still have overlapping ranges after clustering. Search optimization handles high-cardinality equality lookups more efficiently.

Last-Minute Facts

1All warehouse sizes bill a MINIMUM of 60 seconds per start/resume. After 60 seconds, billing is per-second.

2Result cache TTL: 24 hours per unique query + parameters. Each reuse resets the 24-hour clock. Maximum lifetime: 31 days total.

3Result cache is INVALIDATED by DML on any source table queried. Warehouse resize and suspend do NOT invalidate result cache.

4Local disk cache is cleared when the warehouse SUSPENDS or is RESIZED. It persists while the warehouse is running.

5Spill to local storage = scale up (out of memory). Spill to remote storage = scale up significantly more (both memory and local SSD exhausted).

6SYSTEM$CLUSTERING_INFORMATION(table, '(col1, col2)') — returns average_depth and average_overlaps (both: lower = better; 1.0 average_depth = ideal). SYSTEM$CLUSTERING_RATIO is deprecated (returned 0–100, higher = better). Exam may reference either; know both directions.

7Search optimization adds storage overhead AND background maintenance credits. Both are ongoing costs. Not free.

8Query acceleration controlled by QUERY_ACCELERATION_MAX_SCALE_FACTOR. Set to 0 to disable. Default scale factor is 8x.

Domain 414% of exam

Storage and Data Protection

Must-Know Facts

Time Travel retention: Standard edition = 0 to 1 day max. Enterprise edition = 0 to 90 days, default is 1 day per object. Set with DATA_RETENTION_TIME_IN_DAYS.
Fail-safe: ALWAYS 7 days for permanent tables. Fixed, non-configurable. Begins immediately after Time Travel expires. NOT accessible by users via SQL — only Snowflake Support can recover data from Fail-safe.
Time Travel and Fail-safe are SEQUENTIAL, not overlapping. Data is in Time Travel (user-accessible) for up to 90 days, then Fail-safe (Snowflake-managed, not user-accessible) for 7 days, then permanently deleted.
Reducing DATA_RETENTION_TIME_IN_DAYS to 0 disables Time Travel for an object. Fail-safe still runs its 7 days for permanent tables regardless.
Transient tables: NO Fail-safe, maximum 1 day Time Travel. Temporary tables: NO Fail-safe, max 1 day Time Travel, exist only for the session. Use transient/temporary for staging data that does not need recovery.
Time Travel AT/BEFORE syntax: AT(TIMESTAMP => ...), AT(OFFSET => -3600), AT(STATEMENT => '<qid>'), BEFORE(STATEMENT => '<qid>'). OFFSET is in SECONDS and must be negative.
UNDROP restores DROPPED objects (tables, schemas, databases) within their Time Travel period. UNDROP does NOT work for DML changes (INSERT/UPDATE/DELETE) — use AT/BEFORE clauses and INSERT...SELECT for row recovery.
Zero-copy clone: shares physical micro-partitions via metadata pointers. Near-zero initial storage cost. Storage increases only as clone and source diverge (copy-on-write on modification). Clone does NOT inherit grants from the source.
Cloning a database does NOT include virtual warehouses, resource monitors, users, or roles — only data objects. Grants are NOT transferred to the clone.
Database replication: secondary is READ-ONLY. Changes must be pulled manually with ALTER DATABASE ... REFRESH. Replication does NOT auto-sync continuously unless a replication group with scheduled refresh is configured.
Failover groups (business continuity): atomically replicate multiple objects (databases, shares, tasks, resource monitors). Connection objects provide a stable endpoint URL that can be redirected during failover without client config changes.

Common Traps

TrapSetting DATA_RETENTION_TIME_IN_DAYS = 0 eliminates both Time Travel and Fail-safe

RealitySetting retention to 0 disables Time Travel (no historical SQL queries possible) but does NOT affect Fail-safe. Fail-safe always runs its 7 days for permanent tables regardless of the retention setting. Transient tables are the only way to avoid Fail-safe entirely.

TrapA cloned table inherits all grants and privileges from the source table

RealityZero-copy clones do NOT inherit privileges from the source. Grants must be re-applied explicitly on the clone. This is a common operational surprise when cloning production to dev — devs may have access to prod (via the clone) unless privileges are properly scoped.

TrapUNDROP TABLE restores a table that had rows accidentally deleted

RealityUNDROP TABLE only restores DROPPED (DDL) objects — tables that were fully removed with DROP TABLE. It does NOT address DML changes like DELETE or UPDATE. For DML recovery, use Time Travel with AT(BEFORE(STATEMENT => '<query_id>')) and INSERT...SELECT.

TrapA zero-copy clone of a database immediately costs 50% of the source storage because the data is split

RealityA fresh clone costs nearly zero additional storage because it shares all physical micro-partitions with the source via metadata pointers. Storage increases only as modified tables in the clone or source write new micro-partitions (copy-on-write). The clone pays only for diverged micro-partitions.

TrapDatabase replication continuously synchronizes with the source automatically

RealityDatabase replication is NOT automatic continuous sync. The secondary is a read-only snapshot that must be explicitly refreshed with ALTER DATABASE <secondary> REFRESH. Automated refresh requires setting up a replication group with a scheduled refresh task.

TrapTime Travel and Fail-safe overlap — a table is accessible via Time Travel during the Fail-safe period

RealityTime Travel and Fail-safe are strictly sequential. After the Time Travel retention period expires, data enters Fail-safe. During Fail-safe, data is NOT accessible via SQL by users. Only Snowflake Support can access it, and only for catastrophic platform failures.

Confusing Pairs

Time TravelFail-safe

Time Travel: USER-accessible via SQL (AT/BEFORE clauses), configurable 0-90 days (Enterprise), user error recovery, supports UNDROP. Fail-safe: SNOWFLAKE-managed, 7 days fixed, NOT accessible via SQL, activated only by Snowflake for catastrophic infrastructure failures. Key: Time Travel = user recovery; Fail-safe = Snowflake-only disaster recovery.

Permanent TableTransient Table

Permanent table: full Time Travel (up to 90 days Enterprise) AND 7-day Fail-safe. Higher storage cost. Default table type. Transient table: max 1-day Time Travel, NO Fail-safe. Lower storage cost. Use for staging, intermediate ETL tables, or data that can be recreated. Key: transient = no Fail-safe = lower cost but no deep recovery.

UNDROP TABLETime Travel AT/BEFORE

UNDROP TABLE: recovers objects that were DROPPED with DDL (DROP TABLE). The entire table (schema, grants, data) is restored within retention period. Time Travel AT/BEFORE: queries or inserts FROM historical data state. Used for DML recovery (deleted/updated rows) or point-in-time analysis. Key: UNDROP = dropped table; AT/BEFORE = DML-affected data.

Database ReplicationFailover Groups

Database replication: single database replicated to secondary. Secondary is read-only, must be manually refreshed. Good for DR for a single database. Failover groups (business continuity): atomically replicate MULTIPLE objects (databases, tasks, shares, resource monitors). Supports automated failover with connection objects. Key: replication = single database; failover groups = atomic multi-object DR with switchover support.

Scenario Tips

If the question asks about:

A developer accidentally ran DELETE FROM ORDERS WHERE ORDER_DATE < '2026-01-01'. The query ID is available. The table has 7-day retention.

Answer:

INSERT INTO ORDERS SELECT * FROM ORDERS BEFORE(STATEMENT => '<delete_query_id>') EXCEPT SELECT * FROM ORDERS; — uses Time Travel BEFORE to view data before the delete, then inserts only the missing rows.

Distractor to avoid:

UNDROP TABLE ORDERS — wrong because ORDERS was not dropped, it just had DML executed against it. UNDROP only works for DDL drops.

If the question asks about:

Clone the PRODUCTION database for a QA environment. The QA team needs to run destructive tests without affecting production.

Answer:

CREATE DATABASE qa_clone CLONE production; — instant, near-zero storage cost at creation. QA modifications write new micro-partitions (copy-on-write) without touching production. Remember: clone does NOT inherit production grants — re-apply privileges to QA team on the clone.

Distractor to avoid:

Replicate the database to the QA account — replication creates a read-only secondary that cannot be written to. QA needs a writable environment, so cloning is correct.

If the question asks about:

A company needs disaster recovery for their Snowflake account with sub-hour RPO and automatic client failover without reconfiguring connection strings

Answer:

Configure a Failover Group (business continuity group) with scheduled refresh, and use a Connection Object as the client endpoint. On failover, the connection object is redirected to the new primary without requiring client changes.

Distractor to avoid:

Database replication alone — handles data replication but requires manual REFRESH, manual failover promotion, and client reconfiguration after failover. Failover Groups with connection objects are the complete DR solution.

Last-Minute Facts

1Permanent table storage = Time Travel data + Fail-safe data + active data. All three contribute to storage billing.

2Transient table storage = Time Travel data (max 1 day) + active data. NO Fail-safe storage cost.

3Temporary table storage = active data + up to 1-day Time Travel (default 1 day, same as transient), NO Fail-safe. However, the table is purged when the session ends — so the effective retention is whichever comes first: 1 day or session end.

4Time Travel maximum: Standard = 1 day. Enterprise = 90 days. Business Critical = 90 days.

5Fail-safe duration: always 7 days for permanent tables. Zero for transient and temporary tables.

6ALTER TABLE t SET DATA_RETENTION_TIME_IN_DAYS = 0; — disables Time Travel. Fail-safe still runs 7 days.

7Zero-copy clone: grants are NOT inherited. Must re-apply all privileges on the clone explicitly.

8Cloning AT(TIMESTAMP => ...) clones the historical state but clone's OWN Time Travel starts from clone creation time, not from the historical timestamp.

9Secondary database in replication: READ-ONLY. Run ALTER DATABASE <secondary> REFRESH to pull latest state from primary.

Domain 514% of exam

Data Governance

Must-Know Facts

Masking policies are evaluated at QUERY TIME, not storage time. Data is stored unmasked in Snowflake storage. The mask is applied when results are returned to the querying user based on their role.
Tag-based masking: assign a masking policy to a TAG (not directly to each column). Any column tagged with that tag automatically receives the masking policy. New columns tagged in the future also auto-receive the policy without manual reassignment.
Automated data classification (SYSTEM$CLASSIFY) SUGGESTS tags — it does NOT automatically apply masking policies. A data steward must review the classification results and explicitly apply both the tag AND the masking policy to the relevant columns.
Row access policies return BOOLEAN. A row evaluating to FALSE is simply not returned — no error is thrown. A misconfigured policy that always returns FALSE silently returns zero rows, which can be very difficult to debug.
A table can have at most ONE row access policy. A column can have at most ONE masking policy (either directly applied or via tag). If both a direct column masking policy and a tag-based policy exist, the direct assignment takes precedence.
Masking policies bypass result cache sharing across roles with different masking outcomes. User A (full data) and User B (masked data) do not share cached results even for identical SQL queries.
EXECUTE AS OWNER stored procedures enable controlled privilege escalation — common governance pattern for sensitive tables (PII, financial). The procedure owner holds the elevated privilege; the developer only has EXECUTE on the procedure.
Data Clean Rooms use the Snowflake Native App Framework. Neither the provider nor consumer sees the other's raw data — only approved aggregate query results leave the clean room enclave. Fundamentally different from Secure Data Sharing (which exposes row-level data to the consumer).
ACCOUNTADMIN should NOT be used for daily operations. Custom roles should inherit from SYSADMIN for object creation. SECURITYADMIN manages grants. USERADMIN manages users and roles. ACCOUNTADMIN is for account-level admin only.
ACCESS_HISTORY view in ACCOUNT_USAGE has a 2-3 hour latency. Queries do not appear in ACCESS_HISTORY immediately. Plan audit reports accordingly.
Managed access schemas: object owners (SYSADMIN) cannot grant privileges on objects in managed access schemas. Only the schema owner (or SECURITYADMIN) grants privileges. Enables centralized access control — prevents object owners from sharing access without security team oversight.

Common Traps

TrapAutomated data classification automatically applies masking policies to detected PII columns

RealitySYSTEM$CLASSIFY only SUGGESTS privacy and semantic category tags. It does nothing to restrict data access. A governance workflow must: review the suggestions, apply the tags, and separately configure/apply masking policies or row access policies. The automation stops at tag suggestion.

TrapA row access policy returning FALSE throws a permission error for unauthorized users

RealityRow access policies silently filter rows. A policy returning FALSE for a user means that user sees zero rows with no error message. This silent behavior is a common debugging trap — a misconfigured policy looks like an empty table, not a permissions error.

TrapMasking policy results are shared through the result cache across all roles

RealityMasking policies bypass result cache sharing. If one role sees full values and another sees masked values, their cached results cannot be shared. Each role computes the query independently to prevent the cache from leaking unmasked data to masked-view users.

TrapData Clean Rooms are equivalent to Secure Data Sharing but with extra privacy settings

RealityData Clean Rooms and Secure Data Sharing serve fundamentally different purposes. Data Clean Rooms: neither party sees the other's raw data — only aggregate results leave the enclave. Secure Data Sharing: the consumer CAN see row-level data from the provider's tables. For regulated multi-party analytics where raw data cannot be shared, Data Clean Rooms are required, not Secure Data Sharing.

TrapCustom roles should be granted directly to ACCOUNTADMIN for proper hierarchy

RealityCustom roles should be granted (roll up) to SYSADMIN, not ACCOUNTADMIN. ACCOUNTADMIN already inherits all roles, but custom object-creation roles belong in the SYSADMIN hierarchy. Using ACCOUNTADMIN for daily operations is an anti-pattern — it bypasses intended role separation.

TrapIS_ROLE_IN_SESSION('ANALYST') in a masking policy works the same as IS_GRANTED_TO_INVOKER_ROLE('ANALYST') in a managed access schema

RealityIS_ROLE_IN_SESSION returns TRUE if the role is CURRENTLY ACTIVE in the session. IS_GRANTED_TO_INVOKER_ROLE checks if the role is part of the invoker's role hierarchy. In managed access schemas where stored procedure logic runs with EXECUTE AS OWNER, IS_ROLE_IN_SESSION may not behave as expected because the session context belongs to the owner, not the caller. Use IS_GRANTED_TO_INVOKER_ROLE in such contexts.

Confusing Pairs

Dynamic Data Masking (Column-level)Row Access Policies (Row-level)

Dynamic data masking = applied to COLUMNS. Hides the VALUE of specific columns for unauthorized roles. All rows are visible; column content is masked. Row access policies = applied to TABLES, filter at ROW level. Unauthorized roles see fewer rows (filtered out), not masked values. Both can be applied simultaneously to the same table. Key: masking = hide column values; row access = hide entire rows.

Direct Column Masking PolicyTag-based Masking Policy

Direct: ALTER TABLE t MODIFY COLUMN email SET MASKING POLICY mask_email. Applied per column explicitly. Must be done for every new PII column. Tag-based: assign masking policy to a TAG. Any column tagged with the PII tag automatically receives the policy. Scales governance — new columns auto-governed on tagging. Key: direct = explicit per-column; tag-based = scalable governance via tagging.

Data Clean RoomsSecure Data Sharing

Secure Data Sharing: provider shares LIVE TABLE DATA with consumers. Consumer CAN query row-level data using their own warehouse. No data movement — consumer reads provider's storage directly. Data Clean Rooms: provider packages APPROVED ANALYSES (not raw data) as a Native App. Consumer runs analyses; only aggregate results exit the clean room. Neither party sees the other's raw data. Key: Sharing = raw data access; Clean Rooms = aggregate-only joint analysis.

SECURITYADMINSYSADMIN

SECURITYADMIN: manages GRANTS (who has access to what objects). Does NOT own objects or create databases. SYSADMIN: manages OBJECTS (creates databases, schemas, warehouses, tables). Does NOT manage grants except on objects it owns. Custom roles for object creation belong under SYSADMIN. Security policies, grants, and user role assignments belong under SECURITYADMIN. Key: SECURITYADMIN = access control; SYSADMIN = object ownership.

Scenario Tips

If the question asks about:

A healthcare org needs PII columns auto-governed as new tables are added. The governance team cannot manually track every new column.

Answer:

Tag-based masking: create a PII tag, associate the masking policy with the tag (ALTER TAG pii SET MASKING POLICY mask). Data engineers tag PII columns during table creation. The masking policy applies automatically without the governance team needing to assign it per column.

Distractor to avoid:

Run SYSTEM$CLASSIFY periodically and apply masking policies based on its output — this requires a human review step and creates lag between column creation and governance coverage. Tag-based masking provides immediate coverage on tagging.

If the question asks about:

An analyst queries a table with a row access policy and sees zero results. Data definitely exists in the table.

Answer:

Row access policy is misconfigured — it evaluates to FALSE for this analyst's role/user. Row access policies silently return zero rows (no error). Debug by reviewing the policy logic: check IS_ROLE_IN_SESSION(), CURRENT_USER(), and the mapping table (if any) to find why the policy evaluates to FALSE for this analyst.

Distractor to avoid:

The table is empty or the analyst is querying the wrong table — plausible but less likely given the stated scenario. Always check row access policy logic first when zero rows return unexpectedly for a known non-empty table.

If the question asks about:

Two competitors want to compare customer overlap statistics without revealing their actual customer lists to each other

Answer:

Data Clean Rooms built on the Snowflake Native App Framework. Both parties' data stays inside the clean room enclave. Only aggregate overlap statistics (not individual customer records) are returned to either party.

Distractor to avoid:

Secure Data Sharing — would expose raw customer records to the other party, violating the business requirement. Secure Data Sharing gives row-level access to the consumer.

If the question asks about:

A developer needs access to run a procedure that writes to a PII table, but the developer must NOT have direct SELECT/INSERT on the PII table

Answer:

Create a stored procedure with EXECUTE AS OWNER where the owner has INSERT privilege on the PII table. Grant EXECUTE on the procedure to the developer role. The developer can call the procedure but cannot directly access the PII table — all access is mediated through the procedure's defined logic.

Distractor to avoid:

Grant SELECT and INSERT on the PII table to the developer role — this gives the developer unrestricted direct access, violating the requirement for mediated access only.

Last-Minute Facts

1SYSTEM$CLASSIFY suggests tags only. It does NOT apply masking policies. Manual governance action is required after classification.

2A row access policy returning FALSE returns ZERO ROWS — no error. Silent filtering is the expected behavior.

3One row access policy per table maximum. One masking policy per column maximum. Direct assignment takes precedence over tag-based assignment.

4Masking policies bypass result cache sharing across roles. Users with different masking outcomes each compute independently.

5ACCESS_HISTORY in ACCOUNT_USAGE has 2-3 hour lag. Not real-time. Plan scheduled audit reports accordingly.

6ACCOUNTADMIN is NOT for daily use. Custom object-management roles roll up to SYSADMIN. Security grants roll up to SECURITYADMIN.

7Managed access schema: ONLY schema owner or SECURITYADMIN can grant privileges on objects within the schema. Object owners cannot self-grant to others.

8Data Clean Rooms: provider controls what analyses are allowed. Consumer cannot run arbitrary SQL on the provider's raw data — only pre-approved analysis functions defined in the Native App.

9Tag-based masking scales governance: tag new PII columns and the masking policy auto-applies. Far more scalable than per-column policy assignment in large schemas.

Feeling confident?

Put your knowledge to the test with a timed DEA-C02 mock exam.