Quick Navigation
Advanced T-SQL: Window Functions and CTEsAdvanced T-SQL: JSON, Regex, and Fuzzy MatchingTable Types and Schema DesignSecurity: Always Encrypted, RLS, and DDMPerformance: Query Store, DMVs, and Execution PlansSQL Database Projects and CI/CDData API Builder (DAB)Vector Data Type and Distance FunctionsEmbedding Generation and External ModelsVector Search and Hybrid SearchRAG Workflow and Embedding MaintenanceChange Patterns and IntegrationProgrammability Objects and PartitioningAI-Assisted Development Tools
Advanced T-SQL: Window Functions and CTEs
- ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)
- Assigns a unique sequential integer to each row within a partition, restarting at 1 for each new partition value.
- LAG(salary, 1) OVER (PARTITION BY dept ORDER BY hire_date)
- Returns the value from the previous row within the partition — use for comparing current row to the prior row.
- LEAD(salary, 1) OVER (PARTITION BY dept ORDER BY hire_date)
- Returns the value from the next row within the partition — LAG looks back, LEAD looks forward.
- SUM(amount) OVER (PARTITION BY dept ORDER BY date ROWS UNBOUNDED PRECEDING)
- Calculates a running total within each partition from the first row to the current row.
- WITH cte AS (SELECT ...) SELECT * FROM cte
- Common Table Expression — a temporary named result set valid for one query; supports recursion via WITH RECURSIVE.
- WITH rcte AS (SELECT 1 AS n UNION ALL SELECT n+1 FROM rcte WHERE n < 10) SELECT * FROM rcte
- Recursive CTE — anchor member union-all with recursive member; used for hierarchical data like org charts or bill-of-materials.
- DENSE_RANK() vs RANK()
- DENSE_RANK has no gaps after ties; RANK skips numbers after ties — e.g., two rows tied at rank 2 give DENSE_RANK 3 next vs RANK 4 next.
- JARO_WINKLER_DISTANCE(str1, str2)
- Returns a 0.0-1.0 similarity score that heavily weights matching prefixes — use for person name matching where first-character agreement is more significant than EDIT_DISTANCE.
Advanced T-SQL: JSON, Regex, and Fuzzy Matching
- JSON_OBJECT('name', name, 'age', age)
- Constructs a JSON object from key-value pairs using scalar expressions — available in SQL Server 2022+.
- JSON_ARRAYAGG(product_name ORDER BY product_name)
- Aggregate function that collects multiple rows into a single JSON array — unlike JSON_ARRAY which takes scalar values, not rows.
- JSON_CONTAINS(json_col, '"Manager"', '$.roles')
- Returns 1 if the specified path in a JSON column contains the target value — useful for filtering semi-structured data.
- OPENJSON(json_col) WITH (id INT '$.id', name NVARCHAR(100) '$.name')
- Parses a JSON string and returns rows/columns using a typed WITH clause — the primary way to shred JSON into relational form.
- REGEXP_LIKE(email, '^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$')
- Returns 1 if the string matches the regular expression pattern — SQL Server 2025 native regex, no CLR required.
- REGEXP_REPLACE(phone, '[^0-9]', '')
- Replaces all substrings matching the pattern with the replacement string — use to normalize dirty string data.
- EDIT_DISTANCE(str1, str2)
- Returns the Levenshtein edit distance (minimum character edits) between two strings — lower = more similar.
- EDIT_DISTANCE_SIMILARITY(str1, str2)
- Returns a 0-100 integer percentage similarity score based on edit distance — 100 = identical strings.
Table Types and Schema Design
- CREATE TABLE Employees (...) WITH (SYSTEM_VERSIONING = ON, HISTORY_TABLE = dbo.EmployeesHistory)
- Temporal (system-versioned) table — SQL Server automatically maintains a history table tracking all row versions with valid-from/valid-to timestamps.
- SELECT * FROM Employees FOR SYSTEM_TIME AS OF '2025-01-01'
- Queries the temporal table to see how the data looked at a specific point in time — reads from the history table transparently.
- CREATE TABLE Transactions (...) WITH (LEDGER = ON)
- Ledger table — provides tamper-evident cryptographic hashing of row changes; cannot be silently modified without detection.
- CREATE TABLE Person AS NODE; CREATE TABLE Knows AS EDGE;
- Graph node and edge tables for modeling many-to-many relationships — queried using the MATCH operator.
- SELECT p1.name, p2.name FROM Person p1, Knows, Person p2 WHERE MATCH(p1-(Knows)->p2)
- Graph MATCH query traversing edges from node to node — arrow direction matters for directional relationships.
- CREATE TABLE HotData (...) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)
- In-memory OLTP table stored entirely in RAM with optional durability — for ultra-high-throughput transactional workloads.
- Updatable vs Append-Only Ledger
- Updatable ledger tables allow INSERT/UPDATE/DELETE with full history; append-only ledger tables allow only INSERT — use append-only for immutable audit logs.
- CREATE SEQUENCE dbo.OrderSeq AS INT START WITH 1 INCREMENT BY 1; SELECT NEXT VALUE FOR dbo.OrderSeq
- SEQUENCE generates ordered numeric values independently of any table — unlike IDENTITY, one sequence can be shared across multiple tables and can be reset or cycled.
Security: Always Encrypted, RLS, and DDM
- Always Encrypted — Deterministic vs Randomized
- Deterministic encryption allows equality lookups (WHERE ssn = ?) but not range queries; randomized encryption prevents any comparison — choose based on query patterns.
- Always Encrypted Key Hierarchy
- Column Master Key (CMK) stored in client key store (Azure Key Vault, Windows cert store) encrypts the Column Encryption Key (CEK) which encrypts column data — the database engine never holds plaintext keys.
- CREATE SECURITY POLICY SalesFilter ADD FILTER PREDICATE fn_securitypredicate(SalesRep) ON Sales.Orders WITH (STATE = ON)
- Row-Level Security — filter predicate silently hides rows; block predicate prevents writes to rows the user cannot see.
- ALTER TABLE Customers ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()')
- Dynamic Data Masking — masks email as aXXX@XXXX.com for non-privileged users; original data is unchanged in storage.
- DDM Masking Functions
- default() masks all characters; email() shows first letter and domain; random(min, max) returns random number; partial(prefix, padding, suffix) for custom formats.
- GRANT UNMASK ON dbo.Customers TO UserName
- Users with UNMASK permission bypass Dynamic Data Masking and see real values — DDM is NOT a security boundary against privileged users.
- RLS hides ROWS; DDM hides COLUMN VALUES
- RLS makes non-visible rows appear nonexistent to the user; DDM shows all rows but replaces sensitive column values with masked substitutes.
- Column-Level Encryption: OPEN SYMMETRIC KEY vs Always Encrypted
- Column-level encryption with symmetric keys stores the key in the database (DBAs can decrypt); Always Encrypted stores keys with the client application so the engine never sees plaintext — different threat models.
Performance: Query Store, DMVs, and Execution Plans
- ALTER DATABASE MyDB SET QUERY_STORE = ON (OPERATION_MODE = READ_WRITE)
- Enables Query Store on the database — captures query text, execution plans, and runtime statistics automatically over time.
- sys.query_store_query + sys.query_store_plan + sys.query_store_runtime_stats
- Core Query Store DMV tables — join these to identify regressed queries by comparing average duration across time intervals.
- EXEC sp_query_store_force_plan @query_id = 1, @plan_id = 5
- Forces Query Store to use a specific execution plan for a query — use after identifying that a previous plan was faster.
- SELECT * FROM sys.dm_exec_query_stats CROSS APPLY sys.dm_exec_sql_text(sql_handle)
- DMV query to find top resource-consuming queries by total CPU, elapsed time, or logical reads since last SQL Server restart.
- Table Scan vs Index Seek in Execution Plans
- Table Scan reads all rows (no useful index); Index Seek navigates the B-tree directly to matching rows — always prefer Seek for large tables.
- Key Lookup operator
- Occurs when a nonclustered index seek must then look up non-indexed columns in the clustered index — indicates a INCLUDE column is needed on the nonclustered index.
- Transaction Isolation Level Comparison
- READ UNCOMMITTED: dirty reads allowed. READ COMMITTED (default): no dirty reads. REPEATABLE READ: no non-repeatable reads. SERIALIZABLE: no phantoms. SNAPSHOT: optimistic, readers don't block writers.
- SELECT blocking_session_id, session_id, wait_type, wait_time FROM sys.dm_exec_requests WHERE blocking_session_id > 0
- Identifies blocked sessions and the blocking session ID — use to diagnose lock chains; deadlocks are logged automatically to the system_health extended events session.
SQL Database Projects and CI/CD
- dotnet new sqlproject -n MyDatabase
- Creates an SDK-style SQL Database Project (.sqlproj) — stores the entire database schema as source-controlled .sql files.
- dotnet build MyDatabase.sqlproj
- Validates and compiles the SQL project into a .dacpac artifact — catches syntax errors and object reference issues before deployment.
- SqlPackage /Action:Publish /SourceFile:MyDB.dacpac /TargetConnectionString:"..."
- Deploys the .dacpac to a target database — generates a migration script and applies only the differences (schema compare).
- Schema Drift Detection
- SQL Database Projects compare the live database against the source-controlled .sqlproj to detect unauthorized out-of-band schema changes — detects drift but does NOT auto-fix it.
- CI/CD Pipeline Stages
- Source control → branch/PR → build validation (dotnet build) → schema drift check → automated tests → deployment trigger → approval gate → publish to target.
- Secrets Management in Pipelines
- Store connection strings and credentials in Azure Key Vault or pipeline variable groups — never commit credentials to the .sqlproj or source control.
- SDK-style .sqlproj vs Legacy SSDT
- SDK-style projects use the MSBuild SDK and support cross-platform dotnet CLI builds; legacy SSDT projects require Visual Studio and Windows — DP-800 focuses on SDK-style.
Data API Builder (DAB)
- dab init --database-type mssql --connection-string "..."
- Initializes a Data API builder project and creates the dab-config.json configuration file.
- dab add Product --source dbo.Products --permissions "anonymous:read"
- Adds an entity to DAB configuration, exposing the Products table as REST and GraphQL endpoints with the specified permissions.
- dab start
- Starts the DAB runtime server locally, generating REST endpoints at /api/{entity} and GraphQL at /graphql — both from one config file.
- DAB entity — REST vs GraphQL
- One DAB entity configuration generates /api/Product (REST) AND a GraphQL query/mutation simultaneously — you do not choose one; both are always available.
- DAB Pagination
- REST uses $first and $after (cursor-based) query parameters; GraphQL uses first and after arguments on connections — both prevent unbounded result sets.
- DAB Caching
- Configure cache.enabled and cache.ttl-seconds per entity in dab-config.json to cache GET responses — reduces database load for frequently read data.
- DAB MCP Endpoint
- Data API builder can expose a Model Context Protocol (MCP) endpoint alongside REST and GraphQL — allows AI tools like GitHub Copilot to query the database directly.
- DAB Stored Procedure Exposure
- Add stored procedures as DAB entities with source.type set to stored-procedure — they become callable via REST POST or GraphQL mutation.
Vector Data Type and Distance Functions
- CREATE TABLE Docs (id INT PRIMARY KEY, content NVARCHAR(MAX), embedding VECTOR(1536))
- Declares a vector column with 1536 dimensions (matching text-embedding-3-small output) — SQL Server 2025 native VECTOR data type.
- VECTOR_DISTANCE('cosine', v1, v2)
- Computes exact cosine distance between two vectors — values 0 (identical) to 2 (opposite). Use for text and NLP embeddings.
- VECTOR_DISTANCE('euclidean', v1, v2)
- Computes Euclidean (L2) straight-line distance between vectors — use for numeric/spatial data where magnitude differences matter.
- VECTOR_DISTANCE('dot', v1, v2)
- Computes dot product distance — requires pre-normalized vectors; using unnormalized vectors with dot product gives inaccurate similarity scores.
- VECTOR_NORMALIZE(embedding, 'norm2')
- Returns a unit-length vector using L2 norm (norm2), norm1 (sum of absolute values), or norminf (max component) — does NOT modify stored data, returns a new vector.
- VECTORPROPERTY(embedding, 'Dimensions')
- Returns the number of dimensions in a vector column as INT — also supports 'BaseType' to return the underlying numeric type name.
- VECTOR_DISTANCE scans ALL rows (ENN); VECTOR_SEARCH uses index (ANN)
- VECTOR_DISTANCE = exact nearest neighbor, no index needed, slow at scale. VECTOR_SEARCH = approximate nearest neighbor via DiskANN index, fast but approximate.
Embedding Generation and External Models
- CREATE EXTERNAL MODEL AzureOpenAIEmbedding WITH (LOCATION = 'https://...openai.azure.com/...', CREDENTIAL = AzureOpenAICred, API_TYPE = 'azure_openai', API_FORMAT = 'embedding')
- Registers an external AI model (e.g., Azure OpenAI embedding model) in the database so it can be referenced by AI_GENERATE_EMBEDDINGS.
- AI_GENERATE_EMBEDDINGS(content USING AzureOpenAIEmbedding)
- Native T-SQL function that calls a registered CREATE EXTERNAL MODEL to generate a VECTOR embedding from a text expression — the exam-preferred embedding method.
- EXEC sp_invoke_external_rest_endpoint @url = 'https://...openai.azure.com/openai/deployments/embedding/embeddings?api-version=2024-02-01', @payload = '{"input":"text"}', @response = @result OUTPUT
- Calls an external REST API directly from T-SQL — lower-level alternative to AI_GENERATE_EMBEDDINGS; used for completions and chat endpoints too.
- AI_GENERATE_EMBEDDINGS vs sp_invoke_external_rest_endpoint
- AI_GENERATE_EMBEDDINGS is the high-level function requiring CREATE EXTERNAL MODEL registration; sp_invoke_external_rest_endpoint is the low-level raw REST call — both produce embeddings but through different abstraction layers.
- Chunking for Embeddings
- Split large documents into chunks before embedding — too small loses context, too large dilutes relevance; use overlapping chunks (10-20%) to preserve context across boundaries.
- CREATE EXTERNAL MODEL with Managed Identity
- Use CREDENTIAL pointing to a Managed Identity so no connection string passwords are stored in the database — the server authenticates to Azure OpenAI automatically.
Vector Search and Hybrid Search
- CREATE VECTOR INDEX idx_docs_embedding ON Docs(embedding) WITH (METRIC = 'cosine')
- Creates a DiskANN vector index on the embedding column — enables VECTOR_SEARCH for approximate nearest neighbor (ANN) queries.
- SELECT TOP 5 id, content, VECTOR_DISTANCE('cosine', embedding, @qvec) AS dist FROM Docs ORDER BY dist
- Exact nearest neighbor (ENN) search using VECTOR_DISTANCE — scans all vectors, accurate but O(n) — appropriate for small datasets only.
- SELECT id, content FROM VECTOR_SEARCH(TABLE Docs, COLUMN embedding, @qvec, TOP_N => 5, METRIC => 'cosine')
- Approximate nearest neighbor (ANN) search using DiskANN index — much faster on large datasets but may miss some true nearest neighbors.
- Hybrid Search = Full-Text + Vector Search merged with RRF
- Combine CONTAINSTABLE (keyword relevance) with VECTOR_SEARCH (semantic relevance) results; merge the two ranked lists using Reciprocal Rank Fusion to improve overall relevance.
- Reciprocal Rank Fusion (RRF) formula
- RRF score = sum(1 / (k + rank_i)) for each ranked list; k is typically 60 — higher-ranked results get more weight; no normalization of original scores needed.
- Full-Text Search: CONTAINS vs FREETEXT
- CONTAINS matches exact words and inflections with Boolean logic (AND, OR, NEAR); FREETEXT uses word breaking and thesaurus for natural language matching — less precise but more recall.
- CREATE FULLTEXT INDEX ON Docs(content) KEY INDEX PK_Docs ON ft_catalog
- Creates a full-text index required for CONTAINS and FREETEXT predicates — a full-text catalog must exist before creating the index.
RAG Workflow and Embedding Maintenance
- RAG Workflow Steps
- 1) Receive user query. 2) Generate query embedding with AI_GENERATE_EMBEDDINGS. 3) Search for context with VECTOR_SEARCH. 4) Build prompt with retrieved chunks. 5) Call LLM via sp_invoke_external_rest_endpoint. 6) Extract and return answer.
- FOR JSON PATH in RAG prompts
- Convert retrieved rows into JSON using SELECT ... FOR JSON PATH before embedding in the LLM prompt — models parse structured JSON better than unformatted row output.
- RAG does NOT retrain the model
- RAG augments the LLM prompt with retrieved context at query time — model weights are never changed; this is distinct from fine-tuning.
- Embedding Maintenance: Triggers
- AFTER INSERT/UPDATE trigger regenerates embeddings synchronously when source rows change — simplest approach but adds latency to every write operation.
- Embedding Maintenance: CDC (Change Data Capture)
- CDC records full before/after row images in change tables — downstream process reads changes and regenerates embeddings asynchronously; captures WHAT changed.
- Embedding Maintenance: Change Tracking
- Tracks only WHICH rows changed (primary key only) — lighter weight than CDC; use when you just need to know what to re-embed, not the old values.
- CDC vs Change Tracking distinction
- CDC = full row data (before and after images). Change Tracking = primary keys of changed rows only. Choose CDC when you need old/new values; Change Tracking for lightweight sync.
- Azure Functions with SQL trigger binding
- Azure Function triggered automatically when rows change in a SQL table — reads new/changed rows and regenerates embeddings without polling or CDC setup.
Change Patterns and Integration
- EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'Products', @role_name = NULL
- Enables Change Data Capture on a table — creates change tables that record full row before/after images for INSERT, UPDATE, DELETE.
- SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all')
- Queries all CDC changes between two LSN (Log Sequence Number) values — returns operation type (__$operation: 1=delete, 2=insert, 3=before-update, 4=after-update).
- ALTER TABLE Products ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON)
- Enables Change Tracking on a table — records only primary keys of changed rows and which columns changed, not actual old/new values.
- CHANGETABLE(CHANGES Products, @last_sync_version) AS CT
- Returns rows changed since the specified sync version — use with INNER JOIN to the source table to get current column values.
- Change Event Streaming (CES)
- Newer SQL Server streaming mechanism for database change events — an alternative to CDC and Change Tracking for near-real-time embedding maintenance pipelines.
- Azure Logic Apps with SQL connector
- No-code approach to trigger workflows when rows are inserted or updated in SQL — routes changed data to Azure OpenAI for embedding regeneration without custom code.
- Query Performance Insight (Azure SQL portal)
- Azure SQL portal blade that surfaces top resource-consuming queries by CPU, duration, or I/O — no T-SQL needed; wraps Query Store data in a visual dashboard for quick identification of regressions.
- CREATE SERVER AUDIT AuditName TO URL = 'https://<storage>.blob.core.windows.net/audit' WITH (STATE = ON)
- Database auditing captures DML and DDL events to Azure Blob Storage or Log Analytics — enables compliance evidence and unauthorized access detection.
Programmability Objects and Partitioning
- CREATE VIEW dbo.ActiveCustomers AS SELECT * FROM Customers WHERE IsActive = 1
- Views encapsulate SELECT logic and can be used to limit column/row exposure — not a security boundary on their own unless combined with DENY on the underlying table.
- CREATE FUNCTION dbo.GetTax(@price DECIMAL) RETURNS DECIMAL AS BEGIN RETURN @price * 0.1 END
- Scalar user-defined function — returns a single value; use inline table-valued functions (ITVF) instead when performance matters, as scalar UDFs inhibit parallelism.
- CREATE PROCEDURE dbo.UpsertProduct @id INT, @name NVARCHAR(100) AS BEGIN MERGE ...
- Stored procedures encapsulate DML logic, support parameters, and can use MERGE for upsert patterns — grant EXECUTE permission to avoid granting direct table access.
- CREATE TRIGGER trg_AfterInsert ON Orders AFTER INSERT AS BEGIN ... END
- AFTER trigger fires after the DML completes — use the inserted and deleted pseudo-tables to access new/old row values. Use cautiously: triggers add latency to every DML operation.
- CREATE PARTITION FUNCTION pf_Monthly(DATE) AS RANGE RIGHT FOR VALUES ('2024-01-01','2024-02-01','2024-03-01')
- Partition function defines the boundary values and range direction (LEFT or RIGHT) that determine which partition a row belongs to.
- CREATE PARTITION SCHEME ps_Monthly AS PARTITION pf_Monthly ALL TO ([PRIMARY])
- Partition scheme maps each partition defined by the partition function to a filegroup — separating hot/cold partitions to different filegroups enables archiving and targeted I/O.
- CREATE EXTERNAL TABLE ext_Sales (...) WITH (DATA_SOURCE = AzureBlobDS, LOCATION = 'sales/', FILE_FORMAT = CsvFormat)
- External table references data in Azure Blob Storage or another database without importing it — enables querying remote data in T-SQL without ETL.
AI-Assisted Development Tools
- .github/copilot-instructions.md
- GitHub Copilot instruction file placed in the repo root — customizes Copilot behavior for the project (coding conventions, naming standards, SQL style rules).
- GitHub Copilot for SQL
- Provides T-SQL code suggestions, query completion, and optimization recommendations within VS Code and Azure Data Studio — AI-generated code still requires security review.
- MCP (Model Context Protocol) Server
- Standard protocol that allows AI tools like GitHub Copilot to connect to and query SQL Server or Fabric lakehouse directly as a data source — exposed by Data API builder.
- Copilot in Microsoft Fabric
- AI-assisted experience within Fabric that helps write SQL queries, analyze data, and generate reports — enabled per workspace and operates on Fabric SQL databases.
- AI-generated code security review
- Copilot instruction files do NOT replace manual code review — always audit AI-generated T-SQL for SQL injection risks, permission escalation, and logic errors before deploying.
- CREATE NONCLUSTERED COLUMNSTORE INDEX idx_cs ON Sales(OrderDate, Amount, Region)
- Columnstore index stores data column-by-column for analytics workloads — dramatically faster for aggregation queries over large tables compared to rowstore (B-tree) indexes.
- Rowstore (B-tree) vs Columnstore index
- Rowstore indexes suit OLTP: point lookups, narrow range scans, frequent INSERT/UPDATE/DELETE. Columnstore suits analytics: full or large-range scans with aggregations — choose based on the dominant query pattern.
- BEGIN TRY ... END TRY BEGIN CATCH THROW; END CATCH
- TRY/CATCH block catches runtime errors in T-SQL — use THROW to re-raise the caught error, or construct a custom message with RAISERROR. Essential for transactional stored procedures.