MicrosoftDP-800106 concepts

DP-800 Cheat Sheet

Quick reference for the Microsoft Certified: SQL AI Developer Associate exam.

Quick Navigation

Advanced T-SQL: Window Functions and CTEs Advanced T-SQL: JSON, Regex, and Fuzzy Matching Table Types and Schema Design Security: Always Encrypted, RLS, and DDM Performance: Query Store, DMVs, and Execution Plans SQL Database Projects and CI/CD Data API Builder (DAB)Vector Data Type and Distance Functions Embedding Generation and External Models Vector Search and Hybrid Search RAG Workflow and Embedding Maintenance Change Patterns and Integration Programmability Objects and Partitioning AI-Assisted Development Tools

Advanced T-SQL: Window Functions and CTEs

ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC): Assigns a unique sequential integer to each row within a partition, restarting at 1 for each new partition value.
LAG(salary, 1) OVER (PARTITION BY dept ORDER BY hire_date): Returns the value from the previous row within the partition — use for comparing current row to the prior row.
LEAD(salary, 1) OVER (PARTITION BY dept ORDER BY hire_date): Returns the value from the next row within the partition — LAG looks back, LEAD looks forward.
SUM(amount) OVER (PARTITION BY dept ORDER BY date ROWS UNBOUNDED PRECEDING): Calculates a running total within each partition from the first row to the current row.
WITH cte AS (SELECT ...) SELECT * FROM cte: Common Table Expression — a temporary named result set valid for one query; supports recursion via WITH RECURSIVE.
WITH rcte AS (SELECT 1 AS n UNION ALL SELECT n+1 FROM rcte WHERE n < 10) SELECT * FROM rcte: Recursive CTE — anchor member union-all with recursive member; used for hierarchical data like org charts or bill-of-materials.
DENSE_RANK() vs RANK(): DENSE_RANK has no gaps after ties; RANK skips numbers after ties — e.g., two rows tied at rank 2 give DENSE_RANK 3 next vs RANK 4 next.
JARO_WINKLER_DISTANCE(str1, str2): Returns a 0.0-1.0 similarity score that heavily weights matching prefixes — use for person name matching where first-character agreement is more significant than EDIT_DISTANCE.

Advanced T-SQL: JSON, Regex, and Fuzzy Matching

JSON_OBJECT('name', name, 'age', age): Constructs a JSON object from key-value pairs using scalar expressions — available in SQL Server 2022+.
JSON_ARRAYAGG(product_name ORDER BY product_name): Aggregate function that collects multiple rows into a single JSON array — unlike JSON_ARRAY which takes scalar values, not rows.
JSON_CONTAINS(json_col, '"Manager"', '$.roles'): Returns 1 if the specified path in a JSON column contains the target value — useful for filtering semi-structured data.
OPENJSON(json_col) WITH (id INT '$.id', name NVARCHAR(100) '$.name'): Parses a JSON string and returns rows/columns using a typed WITH clause — the primary way to shred JSON into relational form.
REGEXP_LIKE(email, '^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$'): Returns 1 if the string matches the regular expression pattern — SQL Server 2025 native regex, no CLR required.
REGEXP_REPLACE(phone, '[^0-9]', ''): Replaces all substrings matching the pattern with the replacement string — use to normalize dirty string data.
EDIT_DISTANCE(str1, str2): Returns the Levenshtein edit distance (minimum character edits) between two strings — lower = more similar.
EDIT_DISTANCE_SIMILARITY(str1, str2): Returns a 0-100 integer percentage similarity score based on edit distance — 100 = identical strings.

Table Types and Schema Design

CREATE TABLE Employees (...) WITH (SYSTEM_VERSIONING = ON, HISTORY_TABLE = dbo.EmployeesHistory): Temporal (system-versioned) table — SQL Server automatically maintains a history table tracking all row versions with valid-from/valid-to timestamps.
SELECT * FROM Employees FOR SYSTEM_TIME AS OF '2025-01-01': Queries the temporal table to see how the data looked at a specific point in time — reads from the history table transparently.
CREATE TABLE Transactions (...) WITH (LEDGER = ON): Ledger table — provides tamper-evident cryptographic hashing of row changes; cannot be silently modified without detection.
CREATE TABLE Person AS NODE; CREATE TABLE Knows AS EDGE;: Graph node and edge tables for modeling many-to-many relationships — queried using the MATCH operator.
SELECT p1.name, p2.name FROM Person p1, Knows, Person p2 WHERE MATCH(p1-(Knows)->p2): Graph MATCH query traversing edges from node to node — arrow direction matters for directional relationships.
CREATE TABLE HotData (...) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA): In-memory OLTP table stored entirely in RAM with optional durability — for ultra-high-throughput transactional workloads.
Updatable vs Append-Only Ledger: Updatable ledger tables allow INSERT/UPDATE/DELETE with full history; append-only ledger tables allow only INSERT — use append-only for immutable audit logs.
CREATE SEQUENCE dbo.OrderSeq AS INT START WITH 1 INCREMENT BY 1; SELECT NEXT VALUE FOR dbo.OrderSeq: SEQUENCE generates ordered numeric values independently of any table — unlike IDENTITY, one sequence can be shared across multiple tables and can be reset or cycled.

Security: Always Encrypted, RLS, and DDM

Always Encrypted — Deterministic vs Randomized: Deterministic encryption allows equality lookups (WHERE ssn = ?) but not range queries; randomized encryption prevents any comparison — choose based on query patterns.
Always Encrypted Key Hierarchy: Column Master Key (CMK) stored in client key store (Azure Key Vault, Windows cert store) encrypts the Column Encryption Key (CEK) which encrypts column data — the database engine never holds plaintext keys.
CREATE SECURITY POLICY SalesFilter ADD FILTER PREDICATE fn_securitypredicate(SalesRep) ON Sales.Orders WITH (STATE = ON): Row-Level Security — filter predicate silently hides rows; block predicate prevents writes to rows the user cannot see.
ALTER TABLE Customers ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()'): Dynamic Data Masking — masks email as aXXX@XXXX.com for non-privileged users; original data is unchanged in storage.
DDM Masking Functions: default() masks all characters; email() shows first letter and domain; random(min, max) returns random number; partial(prefix, padding, suffix) for custom formats.
GRANT UNMASK ON dbo.Customers TO UserName: Users with UNMASK permission bypass Dynamic Data Masking and see real values — DDM is NOT a security boundary against privileged users.
RLS hides ROWS; DDM hides COLUMN VALUES: RLS makes non-visible rows appear nonexistent to the user; DDM shows all rows but replaces sensitive column values with masked substitutes.
Column-Level Encryption: OPEN SYMMETRIC KEY vs Always Encrypted: Column-level encryption with symmetric keys stores the key in the database (DBAs can decrypt); Always Encrypted stores keys with the client application so the engine never sees plaintext — different threat models.

Performance: Query Store, DMVs, and Execution Plans

ALTER DATABASE MyDB SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE): Enables Query Store on the database — captures query text, execution plans, and runtime statistics automatically over time.
sys.query_store_query + sys.query_store_plan + sys.query_store_runtime_stats: Core Query Store DMV tables — join these to identify regressed queries by comparing average duration across time intervals.
EXEC sp_query_store_force_plan @query_id = 1, @plan_id = 5: Forces Query Store to use a specific execution plan for a query — use after identifying that a previous plan was faster.
SELECT * FROM sys.dm_exec_query_stats CROSS APPLY sys.dm_exec_sql_text(sql_handle): DMV query to find top resource-consuming queries by total CPU, elapsed time, or logical reads since last SQL Server restart.
Table Scan vs Index Seek in Execution Plans: Table Scan reads all rows (no useful index); Index Seek navigates the B-tree directly to matching rows — always prefer Seek for large tables.
Key Lookup operator: Occurs when a nonclustered index seek must then look up non-indexed columns in the clustered index — indicates a INCLUDE column is needed on the nonclustered index.
Transaction Isolation Level Comparison: READ UNCOMMITTED: dirty reads allowed. READ COMMITTED (default): no dirty reads. REPEATABLE READ: no non-repeatable reads. SERIALIZABLE: no phantoms. SNAPSHOT: optimistic, readers don't block writers.
SELECT blocking_session_id, session_id, wait_type, wait_time FROM sys.dm_exec_requests WHERE blocking_session_id > 0: Identifies blocked sessions and the blocking session ID — use to diagnose lock chains; deadlocks are logged automatically to the system_health extended events session.

SQL Database Projects and CI/CD

dotnet new sqlproject -n MyDatabase: Creates an SDK-style SQL Database Project (.sqlproj) — stores the entire database schema as source-controlled .sql files.
dotnet build MyDatabase.sqlproj: Validates and compiles the SQL project into a .dacpac artifact — catches syntax errors and object reference issues before deployment.
SqlPackage /Action:Publish /SourceFile:MyDB.dacpac /TargetConnectionString:"...": Deploys the .dacpac to a target database — generates a migration script and applies only the differences (schema compare).
Schema Drift Detection: SQL Database Projects compare the live database against the source-controlled .sqlproj to detect unauthorized out-of-band schema changes — detects drift but does NOT auto-fix it.
CI/CD Pipeline Stages: Source control → branch/PR → build validation (dotnet build) → schema drift check → automated tests → deployment trigger → approval gate → publish to target.
Secrets Management in Pipelines: Store connection strings and credentials in Azure Key Vault or pipeline variable groups — never commit credentials to the .sqlproj or source control.
SDK-style .sqlproj vs Legacy SSDT: SDK-style projects use the MSBuild SDK and support cross-platform dotnet CLI builds; legacy SSDT projects require Visual Studio and Windows — DP-800 focuses on SDK-style.

Data API Builder (DAB)

dab init --database-type mssql --connection-string "...": Initializes a Data API builder project and creates the dab-config.json configuration file.
dab add Product --source dbo.Products --permissions "anonymous:read": Adds an entity to DAB configuration, exposing the Products table as REST and GraphQL endpoints with the specified permissions.
dab start: Starts the DAB runtime server locally, generating REST endpoints at /api/{entity} and GraphQL at /graphql — both from one config file.
DAB entity — REST vs GraphQL: One DAB entity configuration generates /api/Product (REST) AND a GraphQL query/mutation simultaneously — you do not choose one; both are always available.
DAB Pagination: REST uses $first and $after (cursor-based) query parameters; GraphQL uses first and after arguments on connections — both prevent unbounded result sets.
DAB Caching: Configure cache.enabled and cache.ttl-seconds per entity in dab-config.json to cache GET responses — reduces database load for frequently read data.
DAB MCP Endpoint: Data API builder can expose a Model Context Protocol (MCP) endpoint alongside REST and GraphQL — allows AI tools like GitHub Copilot to query the database directly.
DAB Stored Procedure Exposure: Add stored procedures as DAB entities with source.type set to stored-procedure — they become callable via REST POST or GraphQL mutation.

Vector Data Type and Distance Functions

CREATE TABLE Docs (id INT PRIMARY KEY, content NVARCHAR(MAX), embedding VECTOR(1536)): Declares a vector column with 1536 dimensions (matching text-embedding-3-small output) — SQL Server 2025 native VECTOR data type.
VECTOR_DISTANCE('cosine', v1, v2): Computes exact cosine distance between two vectors — values 0 (identical) to 2 (opposite). Use for text and NLP embeddings.
VECTOR_DISTANCE('euclidean', v1, v2): Computes Euclidean (L2) straight-line distance between vectors — use for numeric/spatial data where magnitude differences matter.
VECTOR_DISTANCE('dot', v1, v2): Computes dot product distance — requires pre-normalized vectors; using unnormalized vectors with dot product gives inaccurate similarity scores.
VECTOR_NORMALIZE(embedding, 'norm2'): Returns a unit-length vector using L2 norm (norm2), norm1 (sum of absolute values), or norminf (max component) — does NOT modify stored data, returns a new vector.
VECTORPROPERTY(embedding, 'Dimensions'): Returns the number of dimensions in a vector column as INT — also supports 'BaseType' to return the underlying numeric type name.
VECTOR_DISTANCE scans ALL rows (ENN); VECTOR_SEARCH uses index (ANN): VECTOR_DISTANCE = exact nearest neighbor, no index needed, slow at scale. VECTOR_SEARCH = approximate nearest neighbor via DiskANN index, fast but approximate.

Embedding Generation and External Models

CREATE EXTERNAL MODEL AzureOpenAIEmbedding WITH (LOCATION = 'https://...openai.azure.com/...', CREDENTIAL = AzureOpenAICred, API_TYPE = 'azure_openai', API_FORMAT = 'embedding'): Registers an external AI model (e.g., Azure OpenAI embedding model) in the database so it can be referenced by AI_GENERATE_EMBEDDINGS.
AI_GENERATE_EMBEDDINGS(content USING AzureOpenAIEmbedding): Native T-SQL function that calls a registered CREATE EXTERNAL MODEL to generate a VECTOR embedding from a text expression — the exam-preferred embedding method.
EXEC sp_invoke_external_rest_endpoint @url = 'https://...openai.azure.com/openai/deployments/embedding/embeddings?api-version=2024-02-01', @payload = '{"input":"text"}', @response = @result OUTPUT: Calls an external REST API directly from T-SQL — lower-level alternative to AI_GENERATE_EMBEDDINGS; used for completions and chat endpoints too.
AI_GENERATE_EMBEDDINGS vs sp_invoke_external_rest_endpoint: AI_GENERATE_EMBEDDINGS is the high-level function requiring CREATE EXTERNAL MODEL registration; sp_invoke_external_rest_endpoint is the low-level raw REST call — both produce embeddings but through different abstraction layers.
Chunking for Embeddings: Split large documents into chunks before embedding — too small loses context, too large dilutes relevance; use overlapping chunks (10-20%) to preserve context across boundaries.
CREATE EXTERNAL MODEL with Managed Identity: Use CREDENTIAL pointing to a Managed Identity so no connection string passwords are stored in the database — the server authenticates to Azure OpenAI automatically.

Vector Search and Hybrid Search

CREATE VECTOR INDEX idx_docs_embedding ON Docs(embedding) WITH (METRIC = 'cosine'): Creates a DiskANN vector index on the embedding column — enables VECTOR_SEARCH for approximate nearest neighbor (ANN) queries.
SELECT TOP 5 id, content, VECTOR_DISTANCE('cosine', embedding, @qvec) AS dist FROM Docs ORDER BY dist: Exact nearest neighbor (ENN) search using VECTOR_DISTANCE — scans all vectors, accurate but O(n) — appropriate for small datasets only.
SELECT id, content FROM VECTOR_SEARCH(TABLE Docs, COLUMN embedding, @qvec, TOP_N => 5, METRIC => 'cosine'): Approximate nearest neighbor (ANN) search using DiskANN index — much faster on large datasets but may miss some true nearest neighbors.
Hybrid Search = Full-Text + Vector Search merged with RRF: Combine CONTAINSTABLE (keyword relevance) with VECTOR_SEARCH (semantic relevance) results; merge the two ranked lists using Reciprocal Rank Fusion to improve overall relevance.
Reciprocal Rank Fusion (RRF) formula: RRF score = sum(1 / (k + rank_i)) for each ranked list; k is typically 60 — higher-ranked results get more weight; no normalization of original scores needed.
Full-Text Search: CONTAINS vs FREETEXT: CONTAINS matches exact words and inflections with Boolean logic (AND, OR, NEAR); FREETEXT uses word breaking and thesaurus for natural language matching — less precise but more recall.
CREATE FULLTEXT INDEX ON Docs(content) KEY INDEX PK_Docs ON ft_catalog: Creates a full-text index required for CONTAINS and FREETEXT predicates — a full-text catalog must exist before creating the index.

RAG Workflow and Embedding Maintenance

RAG Workflow Steps: 1) Receive user query. 2) Generate query embedding with AI_GENERATE_EMBEDDINGS. 3) Search for context with VECTOR_SEARCH. 4) Build prompt with retrieved chunks. 5) Call LLM via sp_invoke_external_rest_endpoint. 6) Extract and return answer.
FOR JSON PATH in RAG prompts: Convert retrieved rows into JSON using SELECT ... FOR JSON PATH before embedding in the LLM prompt — models parse structured JSON better than unformatted row output.
RAG does NOT retrain the model: RAG augments the LLM prompt with retrieved context at query time — model weights are never changed; this is distinct from fine-tuning.
Embedding Maintenance: Triggers: AFTER INSERT/UPDATE trigger regenerates embeddings synchronously when source rows change — simplest approach but adds latency to every write operation.
Embedding Maintenance: CDC (Change Data Capture): CDC records full before/after row images in change tables — downstream process reads changes and regenerates embeddings asynchronously; captures WHAT changed.
Embedding Maintenance: Change Tracking: Tracks only WHICH rows changed (primary key only) — lighter weight than CDC; use when you just need to know what to re-embed, not the old values.
CDC vs Change Tracking distinction: CDC = full row data (before and after images). Change Tracking = primary keys of changed rows only. Choose CDC when you need old/new values; Change Tracking for lightweight sync.
Azure Functions with SQL trigger binding: Azure Function triggered automatically when rows change in a SQL table — reads new/changed rows and regenerates embeddings without polling or CDC setup.

Change Patterns and Integration

EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'Products', @role_name = NULL: Enables Change Data Capture on a table — creates change tables that record full row before/after images for INSERT, UPDATE, DELETE.
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all'): Queries all CDC changes between two LSN (Log Sequence Number) values — returns operation type (__$operation: 1=delete, 2=insert, 3=before-update, 4=after-update).
ALTER TABLE Products ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON): Enables Change Tracking on a table — records only primary keys of changed rows and which columns changed, not actual old/new values.
CHANGETABLE(CHANGES Products, @last_sync_version) AS CT: Returns rows changed since the specified sync version — use with INNER JOIN to the source table to get current column values.
Change Event Streaming (CES): Newer SQL Server streaming mechanism for database change events — an alternative to CDC and Change Tracking for near-real-time embedding maintenance pipelines.
Azure Logic Apps with SQL connector: No-code approach to trigger workflows when rows are inserted or updated in SQL — routes changed data to Azure OpenAI for embedding regeneration without custom code.
Query Performance Insight (Azure SQL portal): Azure SQL portal blade that surfaces top resource-consuming queries by CPU, duration, or I/O — no T-SQL needed; wraps Query Store data in a visual dashboard for quick identification of regressions.
CREATE SERVER AUDIT AuditName TO URL = 'https://<storage>.blob.core.windows.net/audit' WITH (STATE = ON): Database auditing captures DML and DDL events to Azure Blob Storage or Log Analytics — enables compliance evidence and unauthorized access detection.

Programmability Objects and Partitioning

CREATE VIEW dbo.ActiveCustomers AS SELECT * FROM Customers WHERE IsActive = 1: Views encapsulate SELECT logic and can be used to limit column/row exposure — not a security boundary on their own unless combined with DENY on the underlying table.
CREATE FUNCTION dbo.GetTax(@price DECIMAL) RETURNS DECIMAL AS BEGIN RETURN @price * 0.1 END: Scalar user-defined function — returns a single value; use inline table-valued functions (ITVF) instead when performance matters, as scalar UDFs inhibit parallelism.
CREATE PROCEDURE dbo.UpsertProduct @id INT, @name NVARCHAR(100) AS BEGIN MERGE ...: Stored procedures encapsulate DML logic, support parameters, and can use MERGE for upsert patterns — grant EXECUTE permission to avoid granting direct table access.
CREATE TRIGGER trg_AfterInsert ON Orders AFTER INSERT AS BEGIN ... END: AFTER trigger fires after the DML completes — use the inserted and deleted pseudo-tables to access new/old row values. Use cautiously: triggers add latency to every DML operation.
CREATE PARTITION FUNCTION pf_Monthly(DATE) AS RANGE RIGHT FOR VALUES ('2024-01-01','2024-02-01','2024-03-01'): Partition function defines the boundary values and range direction (LEFT or RIGHT) that determine which partition a row belongs to.
CREATE PARTITION SCHEME ps_Monthly AS PARTITION pf_Monthly ALL TO ([PRIMARY]): Partition scheme maps each partition defined by the partition function to a filegroup — separating hot/cold partitions to different filegroups enables archiving and targeted I/O.
CREATE EXTERNAL TABLE ext_Sales (...) WITH (DATA_SOURCE = AzureBlobDS, LOCATION = 'sales/', FILE_FORMAT = CsvFormat): External table references data in Azure Blob Storage or another database without importing it — enables querying remote data in T-SQL without ETL.

AI-Assisted Development Tools

.github/copilot-instructions.md: GitHub Copilot instruction file placed in the repo root — customizes Copilot behavior for the project (coding conventions, naming standards, SQL style rules).
GitHub Copilot for SQL: Provides T-SQL code suggestions, query completion, and optimization recommendations within VS Code and Azure Data Studio — AI-generated code still requires security review.
MCP (Model Context Protocol) Server: Standard protocol that allows AI tools like GitHub Copilot to connect to and query SQL Server or Fabric lakehouse directly as a data source — exposed by Data API builder.
Copilot in Microsoft Fabric: AI-assisted experience within Fabric that helps write SQL queries, analyze data, and generate reports — enabled per workspace and operates on Fabric SQL databases.
AI-generated code security review: Copilot instruction files do NOT replace manual code review — always audit AI-generated T-SQL for SQL injection risks, permission escalation, and logic errors before deploying.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs ON Sales(OrderDate, Amount, Region): Columnstore index stores data column-by-column for analytics workloads — dramatically faster for aggregation queries over large tables compared to rowstore (B-tree) indexes.
Rowstore (B-tree) vs Columnstore index: Rowstore indexes suit OLTP: point lookups, narrow range scans, frequent INSERT/UPDATE/DELETE. Columnstore suits analytics: full or large-range scans with aggregations — choose based on the dominant query pattern.
BEGIN TRY ... END TRY BEGIN CATCH THROW; END CATCH: TRY/CATCH block catches runtime errors in T-SQL — use THROW to re-raise the caught error, or construct a custom message with RAISERROR. Essential for transactional stored procedures.

Ready to test yourself?

Start a timed DP-800 mock exam or review practice questions by domain.