General Exam Tips
- 1.Read ALL answer options before selecting — many DP-800 questions have a plausible-sounding wrong answer that is subtly different from the correct one
- 2.Budget roughly 2 minutes per standard question and 4-6 minutes per case study question. Flag hard questions and return to them rather than getting stuck
- 3.Case studies give you a scenario upfront — read it carefully, underline scale requirements (e.g., 10 million rows) and SLAs, because the correct answer often hinges on those constraints
- 4.The exam is developer-perspective, not DBA-perspective. When a question asks about 'the best approach', prefer schema-as-code, CI/CD pipelines, and API generation over manual scripts and direct database changes
- 5.No penalty for wrong answers — guess on any question you cannot resolve; never leave an answer blank
- 6.When two answers both seem correct, ask which one the exam is really testing. The more specific, newer, or SQL-native option is usually preferred over a generic Azure workaround
- 7.Domain 1 and Domain 2 together are 74% of the exam — do not over-invest study time in the AI domain at the expense of foundational T-SQL and security topics
- 8.The AI capabilities domain (Domain 3) contains the most novel material. Even experienced SQL developers fail here — treat it as a separate learning area requiring hands-on practice
- 9.Scenario keyword triggers: 'tamper-evident' = ledger table, 'history of row versions' = temporal table, 'hide values from non-privileged users' = Dynamic Data Masking, 'meaningful similarity' = vector search, 'current enterprise data + LLM answer' = RAG pattern
- 10.Interactive lab-style questions may appear — you may be placed inside SSMS or Azure Data Studio. Practice writing the functions from memory, not just recognition
Quick Navigation
Design and Develop Database Solutions
Must-Know Facts
- Specialized table types: temporal (system-versioned history, FOR SYSTEM_TIME), ledger (cryptographic tamper detection, database digests), in-memory/MEMORY_OPTIMIZED (ultra-low-latency OLTP), external (reference remote data without importing), graph (node + edge tables with MATCH operator)
- JSON functions: JSON_OBJECT and JSON_ARRAY construct JSON from scalars; JSON_ARRAYAGG aggregates multiple result rows into a JSON array; OPENJSON shreds JSON into rows; JSON_VALUE extracts a scalar; JSON_CONTAINS checks membership
- Window functions: ROW_NUMBER (unique sequential), RANK (gaps on ties), DENSE_RANK (no gaps on ties), NTILE (buckets), LAG/LEAD (previous/next row), SUM/AVG/COUNT with OVER(PARTITION BY ... ORDER BY ...)
- Fuzzy matching: EDIT_DISTANCE returns an integer Levenshtein edit count; EDIT_DISTANCE_SIMILARITY returns a 0-100 percentage; JARO_WINKLER_DISTANCE returns a 0-1 similarity score weighted toward prefix matches
- Regular expression functions: REGEXP_LIKE (boolean match test), REGEXP_REPLACE (substitution), REGEXP_SUBSTR (extract match), REGEXP_INSTR (position), REGEXP_COUNT (occurrence count), REGEXP_MATCHES (all matches as rows), REGEXP_SPLIT_TO_TABLE (split on pattern)
- Graph queries use CREATE TABLE ... AS NODE and CREATE TABLE ... AS EDGE, then MATCH syntax such as MATCH(person-(knows)->friend) for relationship traversal
- CTEs support recursion with an anchor member and recursive member joined by UNION ALL; recursive CTEs are the standard way to query hierarchical data (org charts, bill of materials)
- GitHub Copilot instruction files live at .github/copilot-instructions.md — they customize Copilot behavior for your project but do not replace code review or security validation
- MCP (Model Context Protocol) server endpoints let GitHub Copilot connect directly to SQL Server or Fabric lakehouse context; know that MCP is a protocol, not a Microsoft-only product
- Columnstore indexes store data column-by-column for analytical queries; rowstore (B-tree) indexes are for transactional point lookups — choosing the wrong type tanks performance
Common Traps
Confusing Pairs
Scenario Tips
The question asks for a table type to store financial transactions that must be permanently verifiable against tampering, even by DBAs
Append-only ledger table — it prevents modification of existing records and generates cryptographic database digests that prove the data has not changed since insertion
Temporal table sounds like an audit feature but only tracks time-based history, not cryptographic integrity. A motivated DBA could delete or modify temporal history records.
A query must return each salesperson's monthly revenue and the revenue from the previous month within the same region, ordered by month
Use LAG(revenue, 1) OVER (PARTITION BY region ORDER BY month_date) — LAG retrieves the prior row's value without a self-join
LEAD retrieves the next row's value, not the previous. A self-join works but is far less efficient for this pattern.
The question asks you to aggregate all product tags from a query result into a single JSON array per product
JSON_ARRAYAGG(tag) — it collapses multiple rows into one JSON array per group, used with GROUP BY product_id
JSON_ARRAY(tag1, tag2, ...) only works when you have a fixed number of scalar values in hand, not when aggregating across multiple result rows.
The question asks which index type to use for a table that supports heavy nightly reporting queries scanning millions of rows and aggregating sales figures
Columnstore index — it compresses data column-by-column and batch-processes analytical aggregations efficiently
Nonclustered B-tree index is optimal for OLTP point lookups and equality filters, not for full-scan analytical aggregations over wide tables.
Last-Minute Facts
Secure, Optimize, and Deploy Database Solutions
Must-Know Facts
- Always Encrypted: two key types — column master key (CMK) stored in Azure Key Vault or Windows certificate store, held by the CLIENT; column encryption key (CEK) stored encrypted in the database. The SQL engine never sees plaintext. Deterministic encryption permits equality comparison; randomized encryption permits no comparisons at all
- Dynamic Data Masking: masking functions are default (XXXX), email (aXXX@XXXX.com), random (random number in range), and custom string (prefix + padding + suffix). Users with UNMASK permission bypass all masking. DBAs always see real data — masking is not a security boundary
- Row-Level Security: implemented as inline table-valued functions (security predicates) bound to a security policy. Filter predicates restrict SELECT results. Block predicates restrict INSERT/UPDATE/DELETE. Both types can be on the same table simultaneously
- SQL Database Projects: SDK-style .sqlproj file represents the full schema as source code. Build with 'dotnet build', deploy via dacpac or SqlPackage. Schema drift detection compares live database against the project; it detects divergence but does not auto-fix it
- CI/CD pipeline components: branching policies (control who merges to main), pull request workflows, approval gates (manual sign-off before production deploy), code owners (required reviewers per file path), secrets management (never hardcode credentials), and deployment triggers
- Data API builder (DAB): single JSON config file generates REST, GraphQL, AND MCP endpoints simultaneously. Entity sections map to tables, views, or stored procedures. Supports pagination, OData-style filtering, field-level projection, and relationship linking for GraphQL
- Query Store: captures plan history and runtime stats. Use 'Top Resource Consuming Queries' to find problem queries, 'Plan Comparison' to see plan regressions, and 'Force Plan' to lock a query to a good plan. Must be enabled — it is off by default on some platforms
- Transaction isolation levels: READ COMMITTED (default) prevents dirty reads; SNAPSHOT prevents dirty, phantom, and non-repeatable reads without blocking; SERIALIZABLE prevents all anomalies but blocks most concurrency; READ UNCOMMITTED allows dirty reads (fastest, no locks)
- Passwordless access: Managed Identity eliminates passwords in connection strings. System-assigned identity is tied to the resource lifecycle. User-assigned identity is independent and can be shared across resources
- Change patterns: CDC captures full before/after row images in change tables — use when you need the actual changed values. Change Tracking captures only which rows changed by primary key — use for lightweight sync when you only need to know what to re-query. Change Event Streaming (CES) streams changes as events — lower latency than polling
Common Traps
Confusing Pairs
Scenario Tips
The question asks how to encrypt a credit card column so that database administrators running SELECT queries cannot see the actual card numbers
Always Encrypted with randomized encryption — the SQL engine stores and processes only ciphertext; only the application with the column master key can decrypt. DBAs running SSMS queries get ciphertext back.
Transparent Data Encryption (TDE) protects data at rest on disk but decrypts in memory for query processing, so DBAs can still see plaintext values. Dynamic Data Masking only hides values in query results and is bypassed by users with UNMASK permission.
A developer reports that a query that ran in 200ms last week now takes 8 seconds. No schema changes were made. How do you diagnose this?
Query Store — use the 'Regressed Queries' report to find queries whose performance degraded between time periods, compare old vs new execution plans, and force the previously-good plan while investigating the root cause.
DMVs like sys.dm_exec_query_stats show current stats but lose history after a SQL Server restart. Query Store persists historical plan data across restarts and lets you compare plans across time periods.
A company needs to expose their Products and Orders tables as APIs to a mobile app with filtering, pagination, and nested relationship queries, without writing a custom API layer
Data API builder (DAB) — configure entities for Products and Orders in the JSON config, define the GraphQL relationship between them, and DAB auto-generates REST and GraphQL endpoints with pagination and OData filtering built in.
Azure API Management manages and protects existing APIs but does not generate endpoints from a database. Azure Functions with SQL bindings require writing function code for each operation.
The question describes a CI/CD pipeline where a developer modified the production database schema directly, bypassing source control, and the team needs to detect and prevent this
Schema drift detection in SQL Database Projects identifies the divergence between live database and project source. Combine with deployment gates that compare the project against live before any new deployment, and enforce branching policies so all changes must go through a PR.
SQL Server auditing logs who ran which commands but does not compare schemas or enforce source control alignment. It is reactive, not preventive.
A scenario asks which isolation level to use for a high-concurrency read workload where readers should not block writers and writers should not block readers, but each reader must see a consistent view throughout their transaction
SNAPSHOT isolation — readers see data as of their transaction start time, writers do not block readers, and the version store provides consistency without locking.
REPEATABLE READ prevents non-repeatable reads but uses shared locks that block concurrent writers. SERIALIZABLE is even more restrictive and would cause significant contention in a high-concurrency scenario.
A vector embedding pipeline needs to know the exact values of changed rows (before and after update) to decide whether to regenerate the embedding
Change Data Capture (CDC) — it records full before and after row images in change tables, giving you the actual changed column values to compare.
Change Tracking only records which rows changed by primary key and the type of operation. It does not capture the old or new values, so you cannot determine if the semantically meaningful columns actually changed.
Last-Minute Facts
Implement AI Capabilities in Database Solutions
Must-Know Facts
- Two ways to generate embeddings in T-SQL: AI_GENERATE_EMBEDDINGS(text USE MODEL ExternalModelName) is the high-level native function that calls a pre-registered CREATE EXTERNAL MODEL — this is the exam-preferred concise approach. sp_invoke_external_rest_endpoint is the lower-level raw REST call to any HTTP endpoint, used for both embedding generation and LLM completions when you need full control over the request payload
- VECTOR data type: declared as VECTOR(dimensions) where dimensions matches the embedding model output (e.g., VECTOR(1536) for Ada-002, VECTOR(3072) for text-embedding-3-large). Stores embeddings natively in SQL Server 2025, Azure SQL, and Fabric SQL
- VECTOR_DISTANCE(metric, vector1, vector2): exact nearest neighbor (ENN) — scans ALL rows, no index used, always accurate. Three metrics: 'cosine' (angle between vectors, default for text/NLP), 'euclidean' (straight-line distance, numeric/spatial data), 'dot' (requires pre-normalized vectors, fastest when normalized) — the metric string is 'dot' not 'dot_product'
- VECTOR_SEARCH: approximate nearest neighbor (ANN) using DiskANN vector indexes. Faster than VECTOR_DISTANCE at scale, returns approximate results. Currently in public preview — requires PREVIEW_FEATURES = ON in SQL Server 2025
- VECTOR_NORMALIZE(vector, norm_type): returns a unit-length normalized vector. norm1 = sum of absolute values, norm2 = Euclidean norm (default for normalization), norminf = max absolute component. Use before dot product similarity — dot product distance assumes pre-normalized vectors
- VECTORPROPERTY(vector, property): returns metadata. VECTORPROPERTY(col, 'Dimensions') returns the integer count of dimensions. VECTORPROPERTY(col, 'BaseType') returns the underlying numeric type name
- Hybrid search combines full-text search (keyword/BM25) with vector search (semantic) and merges ranked results using Reciprocal Rank Fusion (RRF). RRF formula: score = sum(1/(k + rank_i)) where k is a constant (typically 60). Higher RRF score = better combined rank
- RAG workflow: (1) receive user query text, (2) generate query embedding via AI_GENERATE_EMBEDDINGS or sp_invoke_external_rest_endpoint, (3) search for top-K relevant chunks using VECTOR_SEARCH or VECTOR_DISTANCE, (4) fetch the text of matching chunks, (5) build a prompt combining user query + retrieved context formatted as JSON, (6) call LLM via sp_invoke_external_rest_endpoint, (7) extract and return the answer from the JSON response
- Embedding maintenance methods — choose based on update latency needs: triggers (immediate, synchronous, risks blocking writes), CDC (near-real-time, captures full row data), Change Tracking (lightweight sync, captures only PKs), Change Event Streaming/CES (event-driven, Fabric-native), Azure Functions with SQL trigger binding (async, scalable), Azure Logic Apps (workflow-based, low-code), Microsoft Foundry (managed AI pipeline)
- Chunking strategy: split large documents into segments with overlap (e.g., 512-token chunks with 50-token overlap) so context is not lost at chunk boundaries. Chunk size affects recall quality — too small loses context, too large dilutes relevance
Common Traps
Confusing Pairs
Scenario Tips
A retail database has 50 million product embeddings and must return the top 10 most semantically similar products to a user query in under 200ms
VECTOR_SEARCH with a DiskANN index — approximate nearest neighbor search uses the index to avoid scanning all 50 million vectors, achieving sub-200ms latency. Use the modern syntax: SELECT TOP (50) WITH APPROXIMATE ... FROM VECTOR_SEARCH(TABLE = dbo.Products AS t, COLUMN = embedding, SIMILAR_TO = @qvec, METRIC = 'cosine') AS r ORDER BY r.distance
VECTOR_DISTANCE scans all 50 million rows to compute exact distances. Even with optimized hardware, this will vastly exceed 200ms at 50 million scale.
A question asks which T-SQL function to use to generate an embedding for a new product description, given that the team has already registered an Azure OpenAI embedding model using CREATE EXTERNAL MODEL
AI_GENERATE_EMBEDDINGS(description USE MODEL AzureOpenAIEmbeddings) — the native function calls the registered model in one line without manually constructing JSON payloads
sp_invoke_external_rest_endpoint works but requires you to manually build the JSON request body, handle authentication, and parse the response array — the exam favors AI_GENERATE_EMBEDDINGS when an external model is registered.
A RAG scenario asks how product documentation text is retrieved to give context to an LLM before generating an answer to a user question
Generate a query embedding for the user's question, search for the top-K most semantically relevant document chunks using VECTOR_SEARCH or VECTOR_DISTANCE, then pass those chunks as context in the prompt sent via sp_invoke_external_rest_endpoint to Azure OpenAI
RAG does NOT call the LLM to search the database — the database search happens first in SQL, and only the retrieved context plus user question travels to the LLM.
The question describes a scenario where product descriptions are updated frequently throughout the day and embeddings must remain synchronized with a maximum 5-minute lag
Change Data Capture (CDC) feeding an Azure Functions trigger — CDC captures changes as they occur; the function processes them asynchronously in near-real-time without blocking writes
Table triggers are synchronous and would make every product update wait for an embedding API call to complete, degrading write performance. A nightly batch job would miss the 5-minute lag requirement.
A search feature must find products matching both a user's specific brand keyword AND products semantically related to their use-case description, ranked by combined relevance
Hybrid search with Reciprocal Rank Fusion (RRF) — run CONTAINSTABLE for keyword ranking and VECTOR_SEARCH for semantic ranking, then merge the ranked lists using the RRF formula
Vector search alone would miss the brand keyword requirement. Full-text search alone would miss semantically related products that don't use the exact same words.
A developer needs to verify that a vector column stores 1536-dimensional float values, checking programmatically without inspecting the raw data
VECTORPROPERTY(vector_column, 'Dimensions') returns 1536 and VECTORPROPERTY(vector_column, 'BaseType') returns the numeric type name
sys.columns provides general column metadata but does not expose vector-specific properties like dimension count. VECTORPROPERTY is the dedicated T-SQL function for this.
The question asks whether to use dot product or cosine distance when calling VECTOR_DISTANCE on raw embeddings from Azure OpenAI that have NOT been normalized
Cosine distance — it is magnitude-invariant and works correctly on raw unnormalized embeddings. Dot product distance only gives equivalent results to cosine when both vectors are unit-normalized first.
Dot product is tempting because it is described as 'faster', but on unnormalized vectors it produces incorrect similarity rankings weighted by magnitude rather than direction.