CertPrepNow
MicrosoftDP-8003 domains

DP-800 Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Microsoft Certified: SQL AI Developer Associate exam.

General Exam Tips

  • 1.Read ALL answer options before selecting — many DP-800 questions have a plausible-sounding wrong answer that is subtly different from the correct one
  • 2.Budget roughly 2 minutes per standard question and 4-6 minutes per case study question. Flag hard questions and return to them rather than getting stuck
  • 3.Case studies give you a scenario upfront — read it carefully, underline scale requirements (e.g., 10 million rows) and SLAs, because the correct answer often hinges on those constraints
  • 4.The exam is developer-perspective, not DBA-perspective. When a question asks about 'the best approach', prefer schema-as-code, CI/CD pipelines, and API generation over manual scripts and direct database changes
  • 5.No penalty for wrong answers — guess on any question you cannot resolve; never leave an answer blank
  • 6.When two answers both seem correct, ask which one the exam is really testing. The more specific, newer, or SQL-native option is usually preferred over a generic Azure workaround
  • 7.Domain 1 and Domain 2 together are 74% of the exam — do not over-invest study time in the AI domain at the expense of foundational T-SQL and security topics
  • 8.The AI capabilities domain (Domain 3) contains the most novel material. Even experienced SQL developers fail here — treat it as a separate learning area requiring hands-on practice
  • 9.Scenario keyword triggers: 'tamper-evident' = ledger table, 'history of row versions' = temporal table, 'hide values from non-privileged users' = Dynamic Data Masking, 'meaningful similarity' = vector search, 'current enterprise data + LLM answer' = RAG pattern
  • 10.Interactive lab-style questions may appear — you may be placed inside SSMS or Azure Data Studio. Practice writing the functions from memory, not just recognition
Domain 137% of exam

Design and Develop Database Solutions

Must-Know Facts

  • Specialized table types: temporal (system-versioned history, FOR SYSTEM_TIME), ledger (cryptographic tamper detection, database digests), in-memory/MEMORY_OPTIMIZED (ultra-low-latency OLTP), external (reference remote data without importing), graph (node + edge tables with MATCH operator)
  • JSON functions: JSON_OBJECT and JSON_ARRAY construct JSON from scalars; JSON_ARRAYAGG aggregates multiple result rows into a JSON array; OPENJSON shreds JSON into rows; JSON_VALUE extracts a scalar; JSON_CONTAINS checks membership
  • Window functions: ROW_NUMBER (unique sequential), RANK (gaps on ties), DENSE_RANK (no gaps on ties), NTILE (buckets), LAG/LEAD (previous/next row), SUM/AVG/COUNT with OVER(PARTITION BY ... ORDER BY ...)
  • Fuzzy matching: EDIT_DISTANCE returns an integer Levenshtein edit count; EDIT_DISTANCE_SIMILARITY returns a 0-100 percentage; JARO_WINKLER_DISTANCE returns a 0-1 similarity score weighted toward prefix matches
  • Regular expression functions: REGEXP_LIKE (boolean match test), REGEXP_REPLACE (substitution), REGEXP_SUBSTR (extract match), REGEXP_INSTR (position), REGEXP_COUNT (occurrence count), REGEXP_MATCHES (all matches as rows), REGEXP_SPLIT_TO_TABLE (split on pattern)
  • Graph queries use CREATE TABLE ... AS NODE and CREATE TABLE ... AS EDGE, then MATCH syntax such as MATCH(person-(knows)->friend) for relationship traversal
  • CTEs support recursion with an anchor member and recursive member joined by UNION ALL; recursive CTEs are the standard way to query hierarchical data (org charts, bill of materials)
  • GitHub Copilot instruction files live at .github/copilot-instructions.md — they customize Copilot behavior for your project but do not replace code review or security validation
  • MCP (Model Context Protocol) server endpoints let GitHub Copilot connect directly to SQL Server or Fabric lakehouse context; know that MCP is a protocol, not a Microsoft-only product
  • Columnstore indexes store data column-by-column for analytical queries; rowstore (B-tree) indexes are for transactional point lookups — choosing the wrong type tanks performance

Common Traps

TrapLedger tables and temporal tables both track data history, so candidates pick whichever sounds familiar
RealityTemporal tables track row history with timestamps, letting you query 'what was the value at time T'. Ledger tables cryptographically hash the data so tampering is detectable. Temporal = point-in-time travel. Ledger = tamper-proof audit. The keyword 'cryptographically' or 'tamper-evident' always points to ledger.
TrapJSON_ARRAY and JSON_ARRAYAGG look similar and both produce arrays, so candidates mix them up
RealityJSON_ARRAY builds an array from individual scalar arguments in a single expression: JSON_ARRAY(1, 2, 3). JSON_ARRAYAGG is an aggregate function that collapses many rows into one JSON array, like GROUP_CONCAT for JSON. Use JSON_ARRAYAGG when your source is a multi-row result set.
TrapGraph MATCH queries seem direction-agnostic, like regular JOIN syntax
RealityEdge direction matters in MATCH. MATCH(a-(edge)->b) traverses from a to b; the arrow direction must match how the edge was inserted. Reversing the arrow in a query changes what you find — or returns nothing.
TrapTemporal tables require only one table — the system just adds timestamp columns
RealityThe SQL Server engine creates TWO physical tables: the main current table and a separate history table. You cannot directly INSERT, UPDATE, or DELETE the history table — the engine manages it. Questions sometimes ask who maintains the history table: the answer is always the system, not application code.
TrapGitHub Copilot for SQL automatically validates security — you just accept suggestions
RealityAI-generated T-SQL still requires security review. Copilot can suggest code with SQL injection risks, excess permissions, or incorrect logic. The instruction file customizes style, not security policy.
TrapLAG and LEAD seem interchangeable for accessing nearby rows
RealityLAG accesses the previous row (earlier in ORDER BY). LEAD accesses the next row (later in ORDER BY). If the question asks for the 'previous employee salary', LAG is correct. If it asks for the 'next period value', LEAD is correct. Mixing them up gives results that are off by two rows.

Confusing Pairs

RANK()DENSE_RANK()

RANK() leaves gaps after ties: two rows tied at rank 2 both get rank 2, and the next row gets rank 4. DENSE_RANK() never gaps: two rows tied at rank 2 both get rank 2, and the next row gets rank 3. Use DENSE_RANK when continuous numbering matters (e.g., top-N reporting). Use RANK when the absolute position within a full sorted list matters.

Updatable Ledger TableAppend-Only Ledger Table

Updatable ledger tables allow INSERT, UPDATE, and DELETE but record every change cryptographically. Append-only ledger tables only allow INSERT — no updates or deletes are permitted. Use append-only for financial audit logs where records must never be modified. Use updatable where rows change but tampering must still be detectable.

EDIT_DISTANCEEDIT_DISTANCE_SIMILARITY

EDIT_DISTANCE returns a raw integer count of edits needed to transform one string into another (lower = more similar). EDIT_DISTANCE_SIMILARITY returns an integer from 0-100 representing percentage similarity (higher = more similar). When the question provides a threshold like 'find names that are at least 80% similar', use EDIT_DISTANCE_SIMILARITY, not EDIT_DISTANCE.

Recursive CTESubquery

Recursive CTEs are for hierarchical/graph traversal where the depth is unknown at query time (org chart, folder tree, bill of materials). Subqueries are for inline filtering with known depth. If the question mentions 'all levels' or 'any depth', the answer is recursive CTE.

MCP Server EndpointREST Endpoint (DAB)

MCP (Model Context Protocol) endpoints expose database context to AI assistants like GitHub Copilot so the AI can understand your schema and generate better suggestions. REST endpoints in Data API builder expose data to application consumers for CRUD operations. MCP is for AI tooling. REST/GraphQL is for app integration.

Scenario Tips

If the question asks about:

The question asks for a table type to store financial transactions that must be permanently verifiable against tampering, even by DBAs

Answer:

Append-only ledger table — it prevents modification of existing records and generates cryptographic database digests that prove the data has not changed since insertion

Distractor to avoid:

Temporal table sounds like an audit feature but only tracks time-based history, not cryptographic integrity. A motivated DBA could delete or modify temporal history records.

If the question asks about:

A query must return each salesperson's monthly revenue and the revenue from the previous month within the same region, ordered by month

Answer:

Use LAG(revenue, 1) OVER (PARTITION BY region ORDER BY month_date) — LAG retrieves the prior row's value without a self-join

Distractor to avoid:

LEAD retrieves the next row's value, not the previous. A self-join works but is far less efficient for this pattern.

If the question asks about:

The question asks you to aggregate all product tags from a query result into a single JSON array per product

Answer:

JSON_ARRAYAGG(tag) — it collapses multiple rows into one JSON array per group, used with GROUP BY product_id

Distractor to avoid:

JSON_ARRAY(tag1, tag2, ...) only works when you have a fixed number of scalar values in hand, not when aggregating across multiple result rows.

If the question asks about:

The question asks which index type to use for a table that supports heavy nightly reporting queries scanning millions of rows and aggregating sales figures

Answer:

Columnstore index — it compresses data column-by-column and batch-processes analytical aggregations efficiently

Distractor to avoid:

Nonclustered B-tree index is optimal for OLTP point lookups and equality filters, not for full-scan analytical aggregations over wide tables.

Last-Minute Facts

1Temporal tables need SysStartTime and SysEndTime DATETIME2 columns with GENERATED ALWAYS AS ROW START/END in the DDL
2The MATCH operator in graph queries only works in the FROM clause, not in WHERE or JOIN clauses
3NTILE(n) divides rows into n equal-sized buckets; if rows cannot divide evenly, earlier buckets get the extra row
4REGEXP_LIKE returns a boolean (1/0), making it ideal for a WHERE clause filter; REGEXP_MATCHES returns a table of all matches
5JSON_CONTAINS(json_column, '"value"') requires the search value to be a valid JSON fragment including quotes for strings
6Copilot instruction files use markdown format and can restrict Copilot to project-specific conventions, naming patterns, and security rules
Domain 237% of exam

Secure, Optimize, and Deploy Database Solutions

Must-Know Facts

  • Always Encrypted: two key types — column master key (CMK) stored in Azure Key Vault or Windows certificate store, held by the CLIENT; column encryption key (CEK) stored encrypted in the database. The SQL engine never sees plaintext. Deterministic encryption permits equality comparison; randomized encryption permits no comparisons at all
  • Dynamic Data Masking: masking functions are default (XXXX), email (aXXX@XXXX.com), random (random number in range), and custom string (prefix + padding + suffix). Users with UNMASK permission bypass all masking. DBAs always see real data — masking is not a security boundary
  • Row-Level Security: implemented as inline table-valued functions (security predicates) bound to a security policy. Filter predicates restrict SELECT results. Block predicates restrict INSERT/UPDATE/DELETE. Both types can be on the same table simultaneously
  • SQL Database Projects: SDK-style .sqlproj file represents the full schema as source code. Build with 'dotnet build', deploy via dacpac or SqlPackage. Schema drift detection compares live database against the project; it detects divergence but does not auto-fix it
  • CI/CD pipeline components: branching policies (control who merges to main), pull request workflows, approval gates (manual sign-off before production deploy), code owners (required reviewers per file path), secrets management (never hardcode credentials), and deployment triggers
  • Data API builder (DAB): single JSON config file generates REST, GraphQL, AND MCP endpoints simultaneously. Entity sections map to tables, views, or stored procedures. Supports pagination, OData-style filtering, field-level projection, and relationship linking for GraphQL
  • Query Store: captures plan history and runtime stats. Use 'Top Resource Consuming Queries' to find problem queries, 'Plan Comparison' to see plan regressions, and 'Force Plan' to lock a query to a good plan. Must be enabled — it is off by default on some platforms
  • Transaction isolation levels: READ COMMITTED (default) prevents dirty reads; SNAPSHOT prevents dirty, phantom, and non-repeatable reads without blocking; SERIALIZABLE prevents all anomalies but blocks most concurrency; READ UNCOMMITTED allows dirty reads (fastest, no locks)
  • Passwordless access: Managed Identity eliminates passwords in connection strings. System-assigned identity is tied to the resource lifecycle. User-assigned identity is independent and can be shared across resources
  • Change patterns: CDC captures full before/after row images in change tables — use when you need the actual changed values. Change Tracking captures only which rows changed by primary key — use for lightweight sync when you only need to know what to re-query. Change Event Streaming (CES) streams changes as events — lower latency than polling

Common Traps

TrapAlways Encrypted and Dynamic Data Masking both 'protect' sensitive data, so candidates treat them as alternatives for the same threat model
RealityAlways Encrypted protects data from everyone including the DBA and SQL engine — the engine processes ciphertext only, keys stay with the client. Dynamic Data Masking only changes what a query returns to non-privileged users; the underlying data is stored and processed in plaintext. If the threat is a rogue DBA or database breach, Always Encrypted is required. If the threat is a junior analyst seeing live PII in a query result, masking is sufficient.
TrapDeterministic Always Encrypted allows equality searches, so candidates assume it also supports LIKE or range queries
RealityDeterministic encryption maps the same plaintext to the same ciphertext every time — this allows = comparisons. But it does NOT support LIKE, >, <, BETWEEN, or ORDER BY. Randomized encryption does not support ANY search predicates. This distinction appears directly in exam scenarios about searchable encrypted columns.
TrapRow-Level Security and Dynamic Data Masking both control data visibility, so candidates pick either one for 'hiding sensitive data'
RealityRLS hides entire ROWS — if the predicate fails, the row does not exist from the user's perspective (the row count changes). DDM hides COLUMN VALUES — all rows are visible but sensitive values are replaced with masked equivalents. RLS is about access to records; DDM is about exposure of field content within visible records.
TrapSchema drift detection in SQL Database Projects automatically fixes the live database to match source control
RealitySchema drift detection is read-only diagnosis. It identifies that the live database has diverged from the project definition. A developer must then decide whether to update the project to reflect the live change, or redeploy the project to overwrite the live change. The tool does not auto-resolve the conflict.
TrapQuery Store is enabled by default and always collecting data
RealityQuery Store must be explicitly enabled. It is on by default in Azure SQL Database, but it is OFF by default in SQL Server on-premises. In an exam scenario about 'a SQL Server where Query Store data is unavailable', the answer is that it needs to be enabled first.
TrapData API builder requires choosing between REST and GraphQL endpoints
RealityDAB generates REST, GraphQL, AND MCP endpoints from the same configuration file simultaneously. You configure one entity definition and all endpoint types are exposed. The exam tests this 'simultaneously from one config' characteristic.
TrapCDC and Change Tracking both track changes, so they are interchangeable for embedding maintenance
RealityCDC captures the WHAT — full before and after row images stored in change tables, allowing you to see exactly what values changed. Change Tracking captures the WHICH — only the primary key and the type of change (insert/update/delete), without row data. For embedding regeneration, CDC is required when you need to know what content changed. Change Tracking only tells you which rows to re-fetch.

Confusing Pairs

Always Encrypted (deterministic)Always Encrypted (randomized)

Deterministic: same plaintext always produces same ciphertext. Allows equality (=) comparisons and indexing. Risk: frequency analysis attacks on low-cardinality columns. Randomized: same plaintext produces different ciphertext each time. No equality, no range, no LIKE, no ORDER BY on encrypted column. Use deterministic for searchable columns (SSN lookup by exact value). Use randomized for non-searchable sensitive data (credit card numbers not used in WHERE clauses).

System-Assigned Managed IdentityUser-Assigned Managed Identity

System-assigned: created by and tied to one specific Azure resource. Deleted when the resource is deleted. Cannot be shared. User-assigned: created independently as a standalone identity resource. Can be assigned to multiple resources. Survives resource deletion. Use system-assigned for single-resource access. Use user-assigned when multiple resources need the same identity or when you need identity continuity across resource recreation.

SQL Database Projects (build validation)SQL Database Projects (schema drift detection)

Build validation happens at CI time: the project compiles against a known baseline to catch syntax errors, missing object references, and broken dependencies before deployment. Schema drift detection happens at deploy time or on demand: it compares the live database state against the project and surfaces unauthorized changes. Both are part of the SQL Database Projects workflow but serve different phases of the pipeline.

DAB REST endpointDAB GraphQL endpoint

REST: resource-oriented URLs, each entity gets its own path (e.g., /api/customers). Simple GET/POST/PUT/DELETE. Easy to cache via HTTP. One request per entity or relationship. GraphQL: query-oriented, single endpoint (/graphql). Client specifies exact fields and relationships in the query. Reduces round trips for related data. Use REST for simple CRUD and broad client compatibility. Use GraphQL when clients need to fetch nested relationships in a single request.

SNAPSHOT isolationSERIALIZABLE isolation

SNAPSHOT isolation: readers see a consistent snapshot of data as of their transaction start time. Writers do not block readers and readers do not block writers. No dirty reads, no non-repeatable reads, no phantom reads. Relies on the version store in tempdb. SERIALIZABLE: the most restrictive level — uses range locks to prevent all anomalies. Readers block writers and writers block readers. Eliminates phantom reads but dramatically reduces concurrency. SNAPSHOT is the recommended choice for high-concurrency OLTP when you need repeatable reads without blocking.

Scenario Tips

If the question asks about:

The question asks how to encrypt a credit card column so that database administrators running SELECT queries cannot see the actual card numbers

Answer:

Always Encrypted with randomized encryption — the SQL engine stores and processes only ciphertext; only the application with the column master key can decrypt. DBAs running SSMS queries get ciphertext back.

Distractor to avoid:

Transparent Data Encryption (TDE) protects data at rest on disk but decrypts in memory for query processing, so DBAs can still see plaintext values. Dynamic Data Masking only hides values in query results and is bypassed by users with UNMASK permission.

If the question asks about:

A developer reports that a query that ran in 200ms last week now takes 8 seconds. No schema changes were made. How do you diagnose this?

Answer:

Query Store — use the 'Regressed Queries' report to find queries whose performance degraded between time periods, compare old vs new execution plans, and force the previously-good plan while investigating the root cause.

Distractor to avoid:

DMVs like sys.dm_exec_query_stats show current stats but lose history after a SQL Server restart. Query Store persists historical plan data across restarts and lets you compare plans across time periods.

If the question asks about:

A company needs to expose their Products and Orders tables as APIs to a mobile app with filtering, pagination, and nested relationship queries, without writing a custom API layer

Answer:

Data API builder (DAB) — configure entities for Products and Orders in the JSON config, define the GraphQL relationship between them, and DAB auto-generates REST and GraphQL endpoints with pagination and OData filtering built in.

Distractor to avoid:

Azure API Management manages and protects existing APIs but does not generate endpoints from a database. Azure Functions with SQL bindings require writing function code for each operation.

If the question asks about:

The question describes a CI/CD pipeline where a developer modified the production database schema directly, bypassing source control, and the team needs to detect and prevent this

Answer:

Schema drift detection in SQL Database Projects identifies the divergence between live database and project source. Combine with deployment gates that compare the project against live before any new deployment, and enforce branching policies so all changes must go through a PR.

Distractor to avoid:

SQL Server auditing logs who ran which commands but does not compare schemas or enforce source control alignment. It is reactive, not preventive.

If the question asks about:

A scenario asks which isolation level to use for a high-concurrency read workload where readers should not block writers and writers should not block readers, but each reader must see a consistent view throughout their transaction

Answer:

SNAPSHOT isolation — readers see data as of their transaction start time, writers do not block readers, and the version store provides consistency without locking.

Distractor to avoid:

REPEATABLE READ prevents non-repeatable reads but uses shared locks that block concurrent writers. SERIALIZABLE is even more restrictive and would cause significant contention in a high-concurrency scenario.

If the question asks about:

A vector embedding pipeline needs to know the exact values of changed rows (before and after update) to decide whether to regenerate the embedding

Answer:

Change Data Capture (CDC) — it records full before and after row images in change tables, giving you the actual changed column values to compare.

Distractor to avoid:

Change Tracking only records which rows changed by primary key and the type of operation. It does not capture the old or new values, so you cannot determine if the semantically meaningful columns actually changed.

Last-Minute Facts

1Always Encrypted CEK (column encryption key) is stored in the database encrypted with the CMK. CMK is never stored in the database — it lives in Azure Key Vault or a certificate store
2Dynamic Data Masking default masking function replaces all characters with X for strings and 0 for numbers. Email masking shows the first letter and replaces the rest with XXX@XXXX.com
3UNMASK permission overrides all Dynamic Data Masking rules for that user — they see the real values
4Query Store max size default is 100MB in SQL Server; 1GB in Azure SQL Database
5SQL Database Projects use the .sqlproj file extension; the SDK-style format (not legacy .sqlproj) is what the exam focuses on for CI/CD with dotnet build
6DAB configuration file is named dab-config.json by default; entity permissions use the 'actions' array with values: create, read, update, delete, execute (for stored procedures)
7SNAPSHOT isolation requires the database option ALLOW_SNAPSHOT_ISOLATION to be set ON before any session can use it
8Azure SQL Database has Query Store enabled by default; SQL Server on-premises does NOT
9Transparent Data Encryption (TDE) encrypts data files at the page level on disk but provides NO protection against privileged users — the engine decrypts in memory for all queries
Domain 326% of exam

Implement AI Capabilities in Database Solutions

Must-Know Facts

  • Two ways to generate embeddings in T-SQL: AI_GENERATE_EMBEDDINGS(text USE MODEL ExternalModelName) is the high-level native function that calls a pre-registered CREATE EXTERNAL MODEL — this is the exam-preferred concise approach. sp_invoke_external_rest_endpoint is the lower-level raw REST call to any HTTP endpoint, used for both embedding generation and LLM completions when you need full control over the request payload
  • VECTOR data type: declared as VECTOR(dimensions) where dimensions matches the embedding model output (e.g., VECTOR(1536) for Ada-002, VECTOR(3072) for text-embedding-3-large). Stores embeddings natively in SQL Server 2025, Azure SQL, and Fabric SQL
  • VECTOR_DISTANCE(metric, vector1, vector2): exact nearest neighbor (ENN) — scans ALL rows, no index used, always accurate. Three metrics: 'cosine' (angle between vectors, default for text/NLP), 'euclidean' (straight-line distance, numeric/spatial data), 'dot' (requires pre-normalized vectors, fastest when normalized) — the metric string is 'dot' not 'dot_product'
  • VECTOR_SEARCH: approximate nearest neighbor (ANN) using DiskANN vector indexes. Faster than VECTOR_DISTANCE at scale, returns approximate results. Currently in public preview — requires PREVIEW_FEATURES = ON in SQL Server 2025
  • VECTOR_NORMALIZE(vector, norm_type): returns a unit-length normalized vector. norm1 = sum of absolute values, norm2 = Euclidean norm (default for normalization), norminf = max absolute component. Use before dot product similarity — dot product distance assumes pre-normalized vectors
  • VECTORPROPERTY(vector, property): returns metadata. VECTORPROPERTY(col, 'Dimensions') returns the integer count of dimensions. VECTORPROPERTY(col, 'BaseType') returns the underlying numeric type name
  • Hybrid search combines full-text search (keyword/BM25) with vector search (semantic) and merges ranked results using Reciprocal Rank Fusion (RRF). RRF formula: score = sum(1/(k + rank_i)) where k is a constant (typically 60). Higher RRF score = better combined rank
  • RAG workflow: (1) receive user query text, (2) generate query embedding via AI_GENERATE_EMBEDDINGS or sp_invoke_external_rest_endpoint, (3) search for top-K relevant chunks using VECTOR_SEARCH or VECTOR_DISTANCE, (4) fetch the text of matching chunks, (5) build a prompt combining user query + retrieved context formatted as JSON, (6) call LLM via sp_invoke_external_rest_endpoint, (7) extract and return the answer from the JSON response
  • Embedding maintenance methods — choose based on update latency needs: triggers (immediate, synchronous, risks blocking writes), CDC (near-real-time, captures full row data), Change Tracking (lightweight sync, captures only PKs), Change Event Streaming/CES (event-driven, Fabric-native), Azure Functions with SQL trigger binding (async, scalable), Azure Logic Apps (workflow-based, low-code), Microsoft Foundry (managed AI pipeline)
  • Chunking strategy: split large documents into segments with overlap (e.g., 512-token chunks with 50-token overlap) so context is not lost at chunk boundaries. Chunk size affects recall quality — too small loses context, too large dilutes relevance

Common Traps

TrapVECTOR_SEARCH and VECTOR_DISTANCE both perform nearest-neighbor search, so candidates assume they are interchangeable based on preference
RealityVECTOR_DISTANCE is exact (scans all vectors, no index, always returns the true nearest neighbors). VECTOR_SEARCH is approximate (uses DiskANN index, much faster at scale, may miss some true nearest neighbors). At 10 million rows with a sub-100ms SLA, VECTOR_DISTANCE will time out; VECTOR_SEARCH is required. For small datasets where perfect accuracy is critical, VECTOR_DISTANCE is appropriate.
TrapCosine, Euclidean, and dot product are just different names for the same concept and can be used interchangeably
RealityThese are fundamentally different distance metrics for different data types. Cosine measures the ANGLE between vectors (direction similarity) — best for text/NLP where word frequency scale should not affect similarity. Euclidean measures STRAIGHT-LINE DISTANCE — best for numeric/spatial data where magnitude differences are meaningful. Dot product is fastest but REQUIRES pre-normalized unit vectors to be mathematically equivalent to cosine. Using dot product on unnormalized vectors gives wrong similarity scores.
TrapRAG fine-tunes or retrains the underlying language model with your data
RealityRAG (Retrieval-Augmented Generation) never modifies the model. It augments the PROMPT at query time by injecting retrieved context from your database. The model weights are unchanged. Fine-tuning adjusts model weights using training examples and is a completely separate and far more expensive process. The exam explicitly distinguishes these — when the requirement is 'answer based on current enterprise data', RAG is the answer, not fine-tuning.
TrapAI_GENERATE_EMBEDDINGS can be called directly without any setup
RealityAI_GENERATE_EMBEDDINGS requires a pre-configured CREATE EXTERNAL MODEL registration. The external model defines the endpoint URL, authentication credentials, and model provider. Without CREATE EXTERNAL MODEL, you cannot call AI_GENERATE_EMBEDDINGS — you must fall back to sp_invoke_external_rest_endpoint with manual HTTP request construction.
TrapVECTOR_NORMALIZE modifies the stored vector in place
RealityVECTOR_NORMALIZE is a pure function — it returns a new normalized vector but does NOT modify the stored value. To persist normalized vectors, you must run an UPDATE or INSERT that explicitly stores the VECTOR_NORMALIZE result. If you call VECTOR_NORMALIZE on a vector column in a SELECT, the stored value remains unchanged.
TrapFull-text search and vector search cover the same use cases, so you only need one
RealityFull-text search (CONTAINS, FREETEXT, CONTAINSTABLE) matches keywords, stems, and linguistic variations — it finds documents containing the word 'run', 'running', 'ran'. Vector search matches semantic meaning — it can find documents about 'jogging' when you search for 'running' even without the word. Hybrid search uses both together via RRF for best coverage. The choice depends on whether the requirement emphasizes keyword precision or conceptual similarity.
TrapEmbedding maintenance via triggers is the best approach because it is immediate
RealityTriggers are synchronous and run inside the same transaction as the write. On a high-write-volume table, embedding generation (which requires an external REST call) inside a trigger will block the transaction, increase latency, and risk timeouts. Async approaches like CDC + Azure Functions, Change Tracking + batch job, or CES provide better write performance at the cost of eventual consistency. The exam asks you to match the maintenance method to the update frequency and write-throughput requirements.

Confusing Pairs

AI_GENERATE_EMBEDDINGSsp_invoke_external_rest_endpoint

AI_GENERATE_EMBEDDINGS is the high-level native T-SQL function: SELECT AI_GENERATE_EMBEDDINGS(description USE MODEL MyEmbeddingModel). The keyword is USE MODEL, not USING. Requires CREATE EXTERNAL MODEL registration. Concise, exam-preferred for embedding generation. sp_invoke_external_rest_endpoint is the low-level raw REST caller: EXEC sp_invoke_external_rest_endpoint @url=..., @method='POST', @payload=.... Used for both embeddings AND LLM completions. More verbose, requires you to construct and parse JSON manually. Use AI_GENERATE_EMBEDDINGS for embeddings when an external model is registered. Use sp_invoke_external_rest_endpoint when calling an LLM for RAG completion answers or when no external model registration exists.

VECTOR_DISTANCE (ENN)VECTOR_SEARCH (ANN)

VECTOR_DISTANCE = Exact Nearest Neighbor. Computes distance against every row. No index required or used. Perfect accuracy. Scales O(n) — gets progressively slower as data grows. Appropriate for small datasets or when accuracy is critical. VECTOR_SEARCH = Approximate Nearest Neighbor. Uses DiskANN vector index. Sub-linear scaling. May miss some true nearest neighbors (recall < 100%). Required for production-scale embedding search over millions of rows with latency requirements. Modern VECTOR_SEARCH syntax uses 'SELECT TOP (N) WITH APPROXIMATE ... FROM VECTOR_SEARCH(TABLE = ..., COLUMN = ..., SIMILAR_TO = @vec, METRIC = ''cosine'') ORDER BY distance' — the legacy TOP_N parameter is deprecated.

Full-Text SearchSemantic Vector Search

Full-Text Search: finds documents containing specific words or linguistic variants. Uses inverted indexes. Deterministic and exact. Best for: 'find all products with the word solar'. Semantic Vector Search: finds documents conceptually similar to a query embedding. Uses vector distance. Can find relevant results with zero keyword overlap. Best for: 'find products similar in purpose to renewable energy devices'. Hybrid search combines both and is the exam answer when both precision and recall matter.

CDC (Change Data Capture)Change Tracking

CDC: captures full before-AND-after row images. Stored in dedicated change tables. Higher overhead. Use when you need actual data values (e.g., to check if the product description column changed before regenerating embeddings). Change Tracking: captures only primary keys of changed rows plus operation type. Minimal overhead. Use when you just need to know which rows to re-fetch (you will query the current row yourself). For embedding maintenance: CDC when you want to avoid unnecessary regeneration by checking if semantic columns changed. Change Tracking when you will simply re-embed every changed row regardless.

Cosine DistanceDot Product Distance

Cosine distance = 1 minus the cosine of the angle between vectors. Invariant to vector magnitude — a text appearing 10 times is not 10x more relevant than one appearing once. The standard for text embedding similarity. In T-SQL: VECTOR_DISTANCE('cosine', v1, v2). Dot product distance = negative of the dot product. Sensitive to vector magnitude. Mathematically identical to cosine when both vectors are unit-normalized. In T-SQL: VECTOR_DISTANCE('dot', v1, v2) — the metric name is 'dot', NOT 'dot_product' (that spelling is only used in CREATE VECTOR INDEX). Use cosine for raw embeddings. Use dot only after calling VECTOR_NORMALIZE — the exam tests this pre-normalization requirement.

Scenario Tips

If the question asks about:

A retail database has 50 million product embeddings and must return the top 10 most semantically similar products to a user query in under 200ms

Answer:

VECTOR_SEARCH with a DiskANN index — approximate nearest neighbor search uses the index to avoid scanning all 50 million vectors, achieving sub-200ms latency. Use the modern syntax: SELECT TOP (50) WITH APPROXIMATE ... FROM VECTOR_SEARCH(TABLE = dbo.Products AS t, COLUMN = embedding, SIMILAR_TO = @qvec, METRIC = 'cosine') AS r ORDER BY r.distance

Distractor to avoid:

VECTOR_DISTANCE scans all 50 million rows to compute exact distances. Even with optimized hardware, this will vastly exceed 200ms at 50 million scale.

If the question asks about:

A question asks which T-SQL function to use to generate an embedding for a new product description, given that the team has already registered an Azure OpenAI embedding model using CREATE EXTERNAL MODEL

Answer:

AI_GENERATE_EMBEDDINGS(description USE MODEL AzureOpenAIEmbeddings) — the native function calls the registered model in one line without manually constructing JSON payloads

Distractor to avoid:

sp_invoke_external_rest_endpoint works but requires you to manually build the JSON request body, handle authentication, and parse the response array — the exam favors AI_GENERATE_EMBEDDINGS when an external model is registered.

If the question asks about:

A RAG scenario asks how product documentation text is retrieved to give context to an LLM before generating an answer to a user question

Answer:

Generate a query embedding for the user's question, search for the top-K most semantically relevant document chunks using VECTOR_SEARCH or VECTOR_DISTANCE, then pass those chunks as context in the prompt sent via sp_invoke_external_rest_endpoint to Azure OpenAI

Distractor to avoid:

RAG does NOT call the LLM to search the database — the database search happens first in SQL, and only the retrieved context plus user question travels to the LLM.

If the question asks about:

The question describes a scenario where product descriptions are updated frequently throughout the day and embeddings must remain synchronized with a maximum 5-minute lag

Answer:

Change Data Capture (CDC) feeding an Azure Functions trigger — CDC captures changes as they occur; the function processes them asynchronously in near-real-time without blocking writes

Distractor to avoid:

Table triggers are synchronous and would make every product update wait for an embedding API call to complete, degrading write performance. A nightly batch job would miss the 5-minute lag requirement.

If the question asks about:

A search feature must find products matching both a user's specific brand keyword AND products semantically related to their use-case description, ranked by combined relevance

Answer:

Hybrid search with Reciprocal Rank Fusion (RRF) — run CONTAINSTABLE for keyword ranking and VECTOR_SEARCH for semantic ranking, then merge the ranked lists using the RRF formula

Distractor to avoid:

Vector search alone would miss the brand keyword requirement. Full-text search alone would miss semantically related products that don't use the exact same words.

If the question asks about:

A developer needs to verify that a vector column stores 1536-dimensional float values, checking programmatically without inspecting the raw data

Answer:

VECTORPROPERTY(vector_column, 'Dimensions') returns 1536 and VECTORPROPERTY(vector_column, 'BaseType') returns the numeric type name

Distractor to avoid:

sys.columns provides general column metadata but does not expose vector-specific properties like dimension count. VECTORPROPERTY is the dedicated T-SQL function for this.

If the question asks about:

The question asks whether to use dot product or cosine distance when calling VECTOR_DISTANCE on raw embeddings from Azure OpenAI that have NOT been normalized

Answer:

Cosine distance — it is magnitude-invariant and works correctly on raw unnormalized embeddings. Dot product distance only gives equivalent results to cosine when both vectors are unit-normalized first.

Distractor to avoid:

Dot product is tempting because it is described as 'faster', but on unnormalized vectors it produces incorrect similarity rankings weighted by magnitude rather than direction.

Last-Minute Facts

1DiskANN vector indexes are in public preview and require PREVIEW_FEATURES = ON in SQL Server 2025 before they can be created
2VECTOR_DISTANCE metric names are lowercase strings: 'cosine', 'euclidean', 'dot' — note: 'dot_product' (with underscore) is used only for CREATE VECTOR INDEX, NOT for VECTOR_DISTANCE or VECTOR_SEARCH
3AI_GENERATE_EMBEDDINGS syntax: AI_GENERATE_EMBEDDINGS(column_or_expression USE MODEL ExternalModelName) — the keyword is USE MODEL, not USING
4sp_invoke_external_rest_endpoint key parameters: @url (required), @method (GET/POST), @headers (JSON), @payload (JSON body), @response (OUTPUT JSON)
5Reciprocal Rank Fusion formula: score = 1/(k + rank) summed across result sets; k=60 is the standard constant; higher total RRF score = higher combined rank
6Ada-002 (text-embedding-ada-002) produces 1536-dimensional vectors; text-embedding-3-small produces 1536; text-embedding-3-large produces 3072 — declare vector column dimensions to match your model exactly
7RAG does NOT fine-tune the model — model weights never change; only the prompt context changes at inference time
8VECTOR_NORMALIZE norm types: norm1 = L1 (sum of absolute values), norm2 = L2 / Euclidean (most common), norminf = L-infinity (max absolute value)
9Full-text indexes require a full-text catalog and a full-text index created separately from regular indexes — CONTAINS and FREETEXT only work on full-text indexed columns
10Change Event Streaming (CES) is the Fabric-native change propagation mechanism and is the preferred embedding maintenance method for SQL databases in Microsoft Fabric

Feeling confident?

Put your knowledge to the test with a timed DP-800 mock exam.