DatabricksGenAI Engineer Associate6 domains

GenAI Engineer Associate Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Generative AI Engineer Associate exam.

General Exam Tips

1.Read ALL answer options before selecting — many wrong answers are plausible Databricks tools that solve a different problem than what the question asks.
2.The exam has ~45 scored questions plus a small number of unscored pilot items. Never skip a question — wrong answers carry no penalty.
3.Mark uncertain questions and revisit them. Scenario questions are long; a careful second read often resolves ambiguity.
4.Application Development (30%) and Assembling/Deploying (22%) together make up over half the exam. Weight your study time accordingly.
5.Most questions are scenario-based, not definition-based. Ask yourself: given these constraints, what is the BEST choice? Often two answers are correct in isolation but one better fits the stated constraint (latency, cost, update frequency, etc.).
6.When the question names a specific Databricks feature (AI Gateway, Inference Tables, Usage Tables, Agent Bricks), it is always intentional. Match the tool to its exact purpose, not a similar-sounding alternative.
7.Governance (8%) is the smallest domain — don't over-invest here, but don't skip it either. PII masking and licensing questions are direct and scorable.
8.The exam was significantly updated in March 2026. If using study materials from before that date, verify coverage of Agent Bricks, MCP servers, MLflow 3 Tracing and Scorers, Prompt Registry, and Databricks Apps.

Quick Navigation

Design Applications Data Preparation Application Development Assembling and Deploying Applications Governance Evaluation and Monitoring

Domain 114% of exam

Design Applications

Must-Know Facts

The three Agent Bricks types and exactly when to use each: Knowledge Assistant = RAG Q&A returning natural language; Information Extraction = structured JSON from unstructured text; Multiagent Supervisor = routing tasks across specialized sub-agents.
Agents are NOT always better than chains. A RAG chain (retriever → prompt → LLM) is more predictable and easier to test. Reserve agents (non-deterministic tool loops) for tasks that genuinely require dynamic decision-making across multiple tools.
Prompt structure for enforcing output format: include the exact JSON schema AND a filled example in the system prompt. Just instructing 'respond in JSON' produces inconsistent results without a schema example.
Tool descriptions are read by the LLM to decide when to invoke a tool. Vague or overlapping tool descriptions cause incorrect tool selection — write them as precise, unambiguous specifications.
Chain-of-thought prompting improves reasoning accuracy on multi-step problems; it does NOT enforce output format or structure. Use output format instructions separately.
Converting a business goal to a pipeline means identifying: input source, required context (RAG vs structured data), output format, latency class (real-time vs batch), and quality constraints — before selecting any technology.

Common Traps

TrapInformation Extraction vs Knowledge Assistant Agent Bricks look similar because both process unstructured text.

RealityKnowledge Assistant returns a free-text natural language answer for human consumption. Information Extraction returns structured JSON conforming to a Pydantic schema for programmatic use. If the output goes into a database table, use Information Extraction.

TrapMultiagent Supervisor is the right choice any time multiple documents are involved.

RealityMultiagent Supervisor is for routing across DIFFERENT EXPERTISE DOMAINS (e.g., billing agent vs. support agent vs. logistics agent). A single RAG Knowledge Assistant handles multiple documents within one domain. Don't use a Supervisor just because there are many documents.

TrapIncreasing temperature improves output quality for structured format tasks.

RealityHigher temperature increases randomness, making structured output (JSON, specific formats) LESS reliable. For structured output, use low temperature (0–0.3) and explicit schema in the prompt.

TrapAgent Bricks replace MLflow Agent Framework for all agent use cases.

RealityAgent Bricks are pre-built patterns for common use cases. For highly custom agent logic, non-standard tool patterns, or applications that don't fit the three Bricks types, build directly with MLflow Agent Framework.

Confusing Pairs

Agent (MLflow Agent Framework)Chain (LangChain LCEL)

Chain = fixed execution path (deterministic, testable, predictable). Agent = dynamic tool-calling loop (non-deterministic, flexible, harder to debug). Use a chain for RAG Q&A; use an agent when the solution requires choosing from multiple tools at runtime.

Chain-of-Thought PromptingFew-Shot Prompting

Chain-of-thought = instruct the model to reason step-by-step before answering (improves accuracy on multi-step tasks). Few-shot = provide example input-output pairs to show the desired format/behavior (improves format consistency). They solve different problems and can be combined.

Genie SpacesKnowledge Assistant (Agent Bricks)

Genie Spaces = natural language queries over STRUCTURED Delta tables (generates and runs SQL). Knowledge Assistant = natural language Q&A over UNSTRUCTURED documents (RAG with Vector Search). If the data is in a Delta table with rows and columns, use Genie.

Scenario Tips

If the question asks about:

Question asks which Agent Bricks type to use when the goal is to 'extract contract terms (parties, dates, amounts) and write them to a database table'...

Answer:

Information Extraction — the output is structured data for programmatic consumption.

Distractor to avoid:

Knowledge Assistant sounds plausible since it handles documents, but it returns natural language answers, not structured JSON for a database.

If the question asks about:

Question asks how to enforce JSON output with specific fields from an LLM...

Answer:

Include the exact JSON schema definition AND a filled example in the system prompt. This is the only reliable way to get consistent structured output.

Distractor to avoid:

Chain-of-thought prompting is a common wrong answer — it improves reasoning but does nothing for output format consistency.

Last-Minute Facts

1Agent Bricks: 3 types — Knowledge Assistant (RAG Q&A), Information Extraction (structured JSON), Multiagent Supervisor (orchestration/routing).

2Chain-of-thought = for reasoning accuracy. Few-shot = for format consistency. Output format instructions = for structured output.

3Temperature 0 = deterministic. Temperature 1+ = highly random/creative.

4Genie Spaces = SQL over Delta tables. Vector Search = semantic search over unstructured documents.

Domain 214% of exam

Data Preparation

Must-Know Facts

Chunking strategy selection logic: fixed-size for uniform content (transcripts, code); sentence/paragraph for general prose; recursive for hierarchical documents (technical manuals with chapters/sections); semantic for heterogeneous content where topic boundaries are unclear.
Chunk overlap (10–20% of chunk size) prevents context loss at boundaries. Missing overlap = retrieval gaps at every chunk boundary.
The embedding model used at INDEXING time must be the SAME model used at QUERY time. They must operate in the same vector space. Swapping models after indexing requires full re-embedding and index rebuild.
Extraneous content to strip before chunking: navigation menus, footers, page numbers, cookie banners, ads, boilerplate disclaimers. These pollute chunk embeddings with irrelevant text.
Delta Lake chunks table required columns: chunk_id (unique identifier), source_document (provenance), chunk_text (the content to embed), metadata (optional but useful for filtering). This table is the source for a Delta Sync Vector Search index.
Retrieval evaluation: Precision@k measures quality (are top-k results relevant?). Recall@k measures coverage (are all relevant chunks found?). MRR measures rank (how high is the first relevant result?). Low Recall → increase k or use hybrid search. Low Precision → use re-ranking or smaller/better-aligned chunks.
Re-ranking uses a cross-encoder model to re-score already-retrieved chunks. It improves precision (ordering quality) but NOT recall (it can't surface chunks that weren't retrieved). It adds latency — inappropriate for real-time sub-second applications.

Common Traps

TrapRe-ranking improves retrieval recall.

RealityRe-ranking only re-orders already-retrieved chunks. It cannot surface chunks that were missed in the initial vector search. To improve recall, increase k or add hybrid search (keyword + vector).

TrapSmaller chunks always improve retrieval quality.

RealityVery small chunks may lack enough context to represent the concept being embedded. The chunk must be large enough to carry semantic meaning. Balance chunk size against embedding model context length and retrieval quality metrics.

TrapRemoving all extra content from documents improves retrieval.

RealityRemoving truly extraneous content (ads, navigation) helps. But removing section headings and structural context that provides meaning DEGRADES retrieval. Only strip elements that have no informational value.

TrapThe Delta Sync Vector Search index can use any Delta table as a source.

RealityThe source Delta table must have specific schema: a designated text column for embedding and an ID column. Schema must be set correctly before index creation — changing it later requires dropping and recreating the index.

Confusing Pairs

Precision@kRecall@k

Precision@k = fraction of retrieved chunks that are relevant (quality of what was returned). Recall@k = fraction of ALL relevant chunks that appear in top-k (coverage). Low Precision → re-rank. Low Recall → increase k or use hybrid search.

Recursive ChunkingSemantic Chunking

Recursive = splits using document structure hierarchy (headings → paragraphs → sentences). Best for documents that already have a structure like manuals. Semantic = clusters sentences by topic similarity regardless of structure. Best for free-form heterogeneous text where structure is absent.

Delta Sync Vector Search IndexDirect Vector Access Index

Delta Sync = Databricks auto-syncs from a Delta table source; fully managed; requires Delta table with correct schema. Direct Vector Access = you manage upserts/deletes via SDK; no Delta table required as source; use for custom embedding pipelines or non-Delta data sources.

Scenario Tips

If the question asks about:

Question describes a technical manual with chapters, sections, and subsections where initial retrieval quality is poor because chunks lack context...

Answer:

Recursive chunking that respects document hierarchy. The structure is already present — use it.

Distractor to avoid:

Semantic chunking is a wrong answer here because it ignores existing structure and is more appropriate for unstructured/free-form text.

If the question asks about:

Question says Recall@5 is 0.45 (only 45% of relevant chunks appear in top-5). What improves this?

Answer:

Increase k (retrieve more candidates) and/or add hybrid search (keyword + vector). More candidates = more chance to include relevant chunks.

Distractor to avoid:

Re-ranking is the most common wrong answer — it improves precision/ordering of existing results, not recall of missed chunks.

If the question asks about:

Question says HTML articles contain navigation menus, footer links, and cookie banners alongside main content. What to do before chunking?

Answer:

Remove the extraneous content (navigation, footer, ads, banners) before chunking. Only the main article text should be chunked and embedded.

Distractor to avoid:

Chunking raw HTML is wrong — the navigation/footer noise pollutes chunk embeddings.

Last-Minute Facts

1Embedding model must be IDENTICAL at indexing and query time. Different models = incompatible vector spaces.

2Chunk overlap: 10–20% of chunk size to prevent boundary context loss.

3Retrieval metrics: Precision@k (quality), Recall@k (coverage), MRR (rank of first relevant result).

4Re-ranking improves precision, NOT recall. Latency cost: cross-encoder is slower than vector search alone.

5Hybrid search = dense vector (semantic) + sparse keyword (BM25). Improves recall for exact terminology queries.

Domain 330% of exam

Application Development

Must-Know Facts

LangChain LCEL pipe syntax: retriever | prompt | llm | output_parser. Each component is a Runnable. ChatPromptTemplate takes {context} and {question} variables filled at runtime.
The embedding model context length must be >= the largest chunk size. If a 600-token chunk is embedded by a 512-token model, the embedding only represents the first 512 tokens — the remaining 88 tokens are silently truncated, producing a degraded embedding.
LLM context window limits total prompt length (system prompt + retrieved context chunks + user query). If retrieved chunks fill the window, the model cannot process the full query. Size chunks with the LLM context window in mind.
Guardrail type selection by threat: topic classifier for off-topic/competitor queries; prompt injection detector for adversarial inputs; PII masking for personal data; output validator for hallucination/policy violations. Each threat has a specific appropriate guardrail.
LLM selection attributes: task type (instruction following, code generation, multi-step reasoning, classification), context window length, latency SLA, cost per token, multilingual support, and license (commercial vs. non-commercial).
mlflow.evaluate() core parameters: model (URI or callable), data (eval DataFrame), targets (ground truth column name for metrics that need it), evaluators or extra_metrics (list of metrics to compute). Results auto-logged to the MLflow run.
LLM-judge metrics do NOT require ground truth: faithfulness, answer_relevance, harmfulness, coherence. Ground truth metrics DO require labeled reference answers: answer_correctness, exact_match, ROUGE, BLEU.
Faithfulness specifically measures whether the response is SUPPORTED BY THE RETRIEVED CONTEXT (not whether it is factually true in general). A response can be faithful to wrong context.
MLflow experiment lifecycle: log experiment with metrics/params → compare runs → register best model to Unity Catalog → deploy to serving endpoint. Track which prompt version + model + eval data produced each result.

Common Traps

TrapFaithfulness measures whether the answer is factually correct.

RealityFaithfulness measures whether the answer is grounded in the RETRIEVED CONTEXT, not universal factual correctness. If the context contains a wrong fact and the model accurately repeats it, faithfulness is HIGH but correctness is LOW. These are different metrics.

TrapA topic classifier guardrail prevents prompt injection.

RealityA topic classifier blocks queries about off-topic subjects (competitors, irrelevant domains). Prompt injection is a separate threat — it requires a dedicated injection detector that inspects whether user input is attempting to override system instructions.

TrapLarger embedding model context length is always better.

RealityA larger context length has higher compute cost and is only needed if chunks are large. Match the embedding model's context length to your actual maximum chunk size. Larger than needed wastes resources without quality benefit.

TrapMLflow tracing and MLflow evaluate() serve the same purpose.

RealityMLflow Tracing is observability — it records WHAT happened during execution (inputs, outputs, latency at each step). mlflow.evaluate() scores WHETHER it was good (quality metrics). Tracing = debugging. evaluate() = quality measurement.

TrapChanging the prompt does not require re-evaluation.

RealityPrompts directly affect output quality. Any prompt change — even small wording adjustments — requires re-running evaluation to confirm quality has not regressed. Prompt changes should go through the same CI/CD gate as code changes.

Trapmlflow.evaluate() automatically uses the newest registered model version.

Realitymlflow.evaluate() takes an explicit model URI or a predictions DataFrame — it does NOT auto-select the latest model version. You must specify the exact model URI (e.g., 'models:/catalog.schema.model/3') or pass pre-computed predictions. Forgetting to update the model URI means you silently evaluate the wrong version.

Confusing Pairs

faithfulnessanswer_relevance

faithfulness = is the response SUPPORTED BY the retrieved context? (no ground truth needed; judge checks response vs. context). answer_relevance = does the response ADDRESS the user's question? (no ground truth needed; judge checks response vs. question). Both are LLM-judge metrics but measure completely different dimensions.

faithfulness (LLM-judge)answer_correctness (ground truth)

faithfulness = no labeled data needed; scores grounding in context. answer_correctness = REQUIRES a reference answer; scores accuracy against known-correct answer. Use faithfulness when you have no labeled answers. Use answer_correctness when you have a curated QA dataset.

Foundation Model APIsExternal Model Endpoints (via AI Gateway)

Foundation Model APIs = Databricks-HOSTED models (DBRX, Llama, Mixtral). No infrastructure required; billed within Databricks. External Model Endpoints = THIRD-PARTY models (OpenAI GPT-4, Anthropic Claude) proxied through AI Gateway. Use when a specific external model capability is required — unified governance still applies.

ReAct PatternChain-of-Thought

Chain-of-Thought = single-pass reasoning in the prompt (Think step-by-step before answering). ReAct = iterative loop: Reason about which tool to use → Act (call tool) → Observe result → repeat. CoT stays in one LLM call; ReAct spans multiple calls with external tool execution.

Prompt Registry (MLflow 3)Unity Catalog Model Registry

Prompt Registry = manages PROMPT TEMPLATE versions with aliases (dev/staging/prod); code references an alias so prompt changes don't require code deployments. Unity Catalog Model Registry = manages MODEL artifact versions (weights, serialized chain). Prompt Registry is for text templates; Model Registry is for model binaries. Both use alias-based promotion but manage different artifact types.

Custom Databricks ScorerLLM-judge metric (built-in)

Built-in LLM-judge metrics (faithfulness, answer_relevance, harmfulness) are Databricks-provided and cover general quality dimensions — no extra code needed. Custom Scorers are user-defined Python functions registered in MLflow for domain-specific criteria (e.g., regulatory phrasing, medical terminology) that built-in metrics cannot capture. Use custom only when built-in metrics are insufficient.

Scenario Tips

If the question asks about:

Question says the chatbot must never discuss competitor products. Which guardrail technique?

Answer:

Topic classifier guardrail on the INPUT that detects competitor-related queries and rejects them before reaching the LLM.

Distractor to avoid:

PII masking is the most common wrong answer — it handles personal data, not topic scope.

If the question asks about:

Question says a 600-token chunk is embedded with a 512-token context window embedding model, and retrieval results are unexpected...

Answer:

The chunk exceeds the model's context length and is silently truncated at 512 tokens. Fix: reduce chunk size to ≤512 tokens OR choose an embedding model with a larger context window.

Distractor to avoid:

Wrong answer: rebuild the vector index with a different similarity metric — this doesn't address the truncation problem.

If the question asks about:

Question asks for a metric that measures whether LLM answers are supported by retrieved context, without requiring labeled answers...

Answer:

faithfulness — LLM-judge metric, no ground truth needed, checks response vs. context.

Distractor to avoid:

answer_correctness requires labeled reference answers. ROUGE-L also requires a reference. These are wrong.

If the question asks about:

Question asks which guardrail prevents users from sending their SSN in a chat message to the application...

Answer:

Input PII masking — detects and redacts PII in user inputs before they reach the LLM.

Distractor to avoid:

Output PII masking only catches PII in the LLM's response — it doesn't sanitize what the user sends in.

If the question asks about:

Question says prompt wording was updated and the team wants to test new prompt in staging without changing application code. What mechanism enables this?

Answer:

Prompt Registry with aliases — create a new prompt version, assign the 'staging' alias to it, and code that references the alias automatically uses the new version. No code deployment required.

Distractor to avoid:

Registering a new MLflow model version — wrong; that is for model binaries, not prompt template text. Prompt Registry is the correct tool.

If the question asks about:

Question describes a scenario where the team evaluates whether the chatbot response is relevant to the user's question. No labeled reference answers exist. Which metric?

Answer:

answer_relevance — LLM-judge metric that scores whether the response addresses the user's question without requiring a reference answer.

Distractor to avoid:

answer_correctness requires a labeled reference answer and is wrong when no ground truth is available. faithfulness checks grounding in context, not question relevance.

Last-Minute Facts

1faithfulness: no ground truth needed. answer_correctness: needs ground truth. answer_relevance: no ground truth needed. exact_match, ROUGE, BLEU: all need ground truth.

2Embedding model context length must be >= maximum chunk size. Overflow is silently truncated.

3LLM context window = system prompt + retrieved chunks + user query. If chunks are too large, the query gets cut off.

4Topic classifier guardrail = blocks off-topic queries. Injection detector = blocks adversarial system prompt override attempts.

5MLflow Tracing = debugging (what happened). mlflow.evaluate() = quality scoring (was it good).

6BLEU = machine translation quality metric. ROUGE = summarization quality metric. Don't swap these.

7Prompt Registry aliases: code references alias name (e.g., 'prod'), not version number. Assigning a new version to the alias promotes it — no code change required.

8Custom Scorer = Python function, registered in MLflow, for domain-specific quality criteria. Built-in LLM-judge metrics cover general quality dimensions — use custom only when built-ins are insufficient.

Domain 422% of exam

Assembling and Deploying Applications

Must-Know Facts

Two-step MLflow deployment: mlflow.log_model() logs the artifact to a run. mlflow.register_model() creates a versioned entry in Unity Catalog. BOTH steps are required before deploying to Model Serving. Logging alone is not enough.
Model serving endpoints run with a SERVICE PRINCIPAL's identity, not the developer's identity. The service principal must be explicitly granted Unity Catalog permissions to Vector Search indexes, Delta tables, and secrets. A notebook that works for the developer will fail in a serving endpoint if permissions are not propagated.
ai_query() is for BATCH SQL workloads: SELECT ai_query('catalog.schema.endpoint', prompt_col) FROM delta_table. It is NOT suitable for real-time interactive applications due to SQL query latency overhead.
pyfunc models require manually specifying: model signature (input/output schema), Python dependencies (conda/pip), and a predict(context, model_input) method. The langchain flavor handles these automatically for standard LangChain chains — use pyfunc only when custom pre/post-processing is needed.
Vector Search query time parameters: specify k (number of results), similarity metric (must match the metric the index was created with), and optional metadata filters. Changing the similarity metric requires rebuilding the index.
Prompt Registry lifecycle (MLflow 3): author in Playground → commit version → assign aliases (dev, staging, prod) → code references alias, not version number → promote alias to move prompt to next stage. This is analogous to Unity Catalog model lifecycle stages.
CI/CD for GenAI pipelines includes automating: Delta table schema validation, Vector Search index sync triggers, prompt version promotion gates, model registration and endpoint deployment, and integration tests for each component.
MCP server types: Managed = Databricks-hosted, uses Unity Catalog functions as tools (simplest). External = third-party tool providers (adds external dependency). Custom = user-implemented Python server (most development effort, most flexibility).

Common Traps

Trapmlflow.log_model() is all that's needed before deployment.

Realitylog_model() creates an artifact in the MLflow run. To deploy to Model Serving, the model must ALSO be registered to Unity Catalog via mlflow.register_model(). These are two completely separate operations. Forgetting register_model() is the #1 deployment mistake.

TrapAn agent that works in a notebook will work in a serving endpoint.

RealityNotebooks run with the USER's permissions. Serving endpoints run with a SERVICE PRINCIPAL's permissions. The service principal needs explicit GRANT statements on all downstream resources (Vector Search, Delta tables, secrets). Always test endpoint permissions separately.

Trapai_query() can replace a real-time LLM endpoint for interactive apps.

Realityai_query() runs SQL queries synchronously. For large tables or expensive LLM calls, it can be very slow and costly. For interactive use cases requiring sub-second latency, use the Model Serving REST API directly.

TrapChanging the similarity metric on an existing Vector Search index updates the index.

RealitySimilarity metric (cosine vs. dot product) is set at index creation time. Changing it requires dropping and recreating the index with full re-embedding. There is no in-place migration.

Confusing Pairs

mlflow.log_model()mlflow.register_model()

log_model() = saves the model artifact to the current MLflow run (local to that experiment). register_model() = creates a versioned model entry in Unity Catalog that can be deployed to serving endpoints. log_model() is run-scoped; register_model() is organization-scoped.

ai_query() (batch SQL)Model Serving REST API (real-time)

ai_query() = Databricks SQL function for batch inference on Delta table columns. Best for overnight enrichment jobs on millions of rows. Model Serving REST = HTTP endpoint for interactive/real-time LLM calls with low latency requirements. They serve fundamentally different latency profiles.

pyfunc model flavorlangchain model flavor

langchain flavor = use for standard LangChain chains; MLflow handles signature, dependencies, and serialization automatically. pyfunc flavor = use when custom pre/post-processing logic doesn't fit any specific framework; requires manually implementing PythonModel.predict() and specifying signature and dependencies.

Evaluation (pre-deployment)Monitoring (post-deployment)

Evaluation = offline quality assessment BEFORE deployment using a test dataset. Monitoring = live quality tracking AFTER deployment using Inference Tables and Agent Monitoring. Both are needed: evaluation gates deployment; monitoring catches drift and regression in production.

Scenario Tips

If the question asks about:

Question says engineer ran mlflow.langchain.log_model() and now wants to deploy it as a REST endpoint. What is the next required step?

Answer:

mlflow.register_model() to register to Unity Catalog. Without this step, the model cannot be deployed to Model Serving.

Distractor to avoid:

Rebuild the Vector Search index — wrong, this is not a deployment prerequisite.

If the question asks about:

Question says a deployed RAG agent returns an access error on the Vector Search index but works in a notebook. Most likely cause?

Answer:

The model serving endpoint's service principal lacks Unity Catalog permissions to query the Vector Search index. The developer's personal credentials work in a notebook but service principals need explicit grants.

Distractor to avoid:

Wrong similarity metric — would cause poor results, not an access error.

If the question asks about:

Question asks for the best approach to run LLM classification on 10 million rows in a Delta table overnight without writing a Python pipeline...

Answer:

ai_query() in a SQL query. It calls the model serving endpoint for each row directly from SQL.

Distractor to avoid:

MLflow Tracing — this is for observability, not inference.

If the question asks about:

Question asks how to integrate Unity Catalog functions as agent tools using MCP...

Answer:

Managed MCP servers — Databricks-hosted, uses UC functions directly, minimal configuration required.

Distractor to avoid:

Custom MCP servers require the most implementation effort and are for tools not available as UC functions.

Last-Minute Facts

1log_model() → artifact in run. register_model() → versioned entry in Unity Catalog. BOTH required for deployment.

2Serving endpoint = service principal identity. Grant UC permissions explicitly. Notebook identity does NOT carry over.

3ai_query() syntax: SELECT ai_query('catalog.schema.endpoint', column) FROM table

4MCP types: Managed (UC functions) = easiest. External = third-party tools. Custom = build your own.

5pyfunc needs: PythonModel class + predict() method + manual signature + dependencies. langchain flavor handles these automatically.

6Evaluation = offline, pre-deployment. Monitoring = live, post-deployment. Not the same thing.

7Prompt Registry aliases (dev/staging/prod) let code reference the alias, not a specific version number — enables promotion without code changes.

Domain 58% of exam

Governance

Must-Know Facts

PII masking applies to BOTH inputs and outputs. Users can include PII in their queries (input PII masking required). LLMs can pull PII from retrieved context and include it in responses (output PII masking required). Both directions need protection.
CC-BY-NC (Creative Commons Attribution Non-Commercial) prohibits use in any commercial product, regardless of public availability. Using CC-BY-NC data in a commercial SaaS product violates the license. Always audit RAG source licenses before production deployment.
Guardrail technique by threat: PII masking → PII in inputs/outputs. Topic classifier → off-topic or restricted subject queries. Prompt injection detector → adversarial instruction override attempts. Output validator → hallucination, policy violation, harmful content in generated responses.
Unity Catalog permissions for GenAI: model serving endpoints need EXECUTE on UC functions, SELECT on Delta tables, USE on schemas. Grant to the endpoint's service principal. Follow least-privilege — endpoints should only access resources they actually query.
Handling problematic RAG source text: exclude the document, replace with a curated/filtered alternative, or apply post-processing to sanitize. Do not include harmful or legally restricted content just because it is technically accessible.

Common Traps

TrapPublicly available data can always be used in a commercial RAG application.

RealityPublic availability is not the same as commercial use rights. CC-BY-NC, many scraped web content, and copyrighted materials prohibit commercial use even if freely accessible. Always verify the license, not just accessibility.

TrapOutput PII masking is sufficient to protect PII in a GenAI application.

RealityOutput masking prevents PII from reaching end users in responses. But input PII masking is also needed to prevent PII from being stored in logs, used in prompts that are cached, or exposed to the LLM provider. Both directions must be protected.

TrapAdding attribution to responses resolves CC-BY-NC licensing issues.

RealityAttribution satisfies the BY (attribution) clause of Creative Commons. The NC (non-commercial) clause is a SEPARATE restriction that is not resolved by attribution. A CC-BY-NC work still cannot be used commercially even with full attribution.

Confusing Pairs

Input GuardrailsOutput Guardrails

Input guardrails validate user messages BEFORE they reach the LLM — blocks injection, PII in queries, harmful requests, off-topic queries. Output guardrails validate LLM responses BEFORE returning to users — catches PII leakage from context, hallucinations, policy violations. They protect against different threat vectors and both are needed in production.

CC-BY (permissive)CC-BY-NC (non-commercial)

CC-BY = allows commercial use with attribution. CC-BY-NC = allows non-commercial use only, even with attribution. For commercial products, CC-BY-NC sources must be excluded or replaced. Exam questions test this licensing distinction explicitly.

Scenario Tips

If the question asks about:

Question says healthcare chatbot must never return patient identifiers (HIPAA compliance). Which guardrail?

Answer:

Output PII masking — detects and redacts PHI in LLM responses before they reach users.

Distractor to avoid:

Input PII masking handles what users send in, not what the LLM outputs from retrieved data.

If the question asks about:

Question says team found CC-BY-NC articles in their RAG knowledge base for a commercial SaaS product. What should they do?

Answer:

Exclude CC-BY-NC articles and find alternatives with commercially compatible licenses (CC-BY, CC0, or custom commercial licenses).

Distractor to avoid:

Include with attribution — wrong; attribution doesn't satisfy the NC restriction.

Last-Minute Facts

1PII masking: BOTH input AND output. Input = protect LLM/logs from user PII. Output = protect users from LLM-generated PII.

2CC-BY = commercial OK with attribution. CC-BY-NC = commercial PROHIBITED. CC0 = public domain, fully free.

3Guardrail matching: PII masking ≠ injection protection ≠ topic filtering. Each threat needs its specific tool.

4Model serving endpoint permissions: grant to the SERVICE PRINCIPAL, not the developer. Use least privilege.

Domain 612% of exam

Evaluation and Monitoring

Must-Know Facts

mlflow.evaluate() key parameters: model or predictions, data (eval dataset), targets (column name for ground truth, required for answer_correctness/exact_match), extra_metrics or evaluators. Results auto-logged as MLflow run artifacts.
LLM-judge metrics (no ground truth needed): faithfulness (response vs. context), answer_relevance (response vs. question), harmfulness, coherence, fluency. The judge LLM must be accessible from the evaluation environment.
Ground truth metrics (require labeled data): answer_correctness, exact_match, ROUGE-1, ROUGE-2, ROUGE-L, BLEU. The exam trap: teams often choose BLEU for RAG quality because it sounds rigorous, but BLEU is a machine translation metric — it is a wrong fit for RAG. Use ROUGE or LLM-judge metrics for RAG evaluation.
Inference Tables: auto-log ALL request/response payloads to a Delta table. Enabled per endpoint. Purpose: quality monitoring, drift detection, offline evaluation, debugging bad responses. Adds slight latency and storage cost.
Usage Tables (AI Gateway): log TOKEN CONSUMPTION and COST per request. The exam trap: questions describe a cost overrun scenario and list Inference Tables as an answer choice — Inference Tables record PAYLOADS, not costs. Only Usage Tables give token counts and cost estimates. Separate tables, separate purposes.
Agent Monitoring (Lakehouse Monitoring): analyzes inference table data over time. Tracks quality metrics, latency distributions, error rates, token usage trends. Post-deployment, historical, NOT real-time alerting by default.
Custom Databricks Scorers: user-defined Python evaluation functions registered in MLflow. Use when built-in metrics don't cover domain-specific quality criteria (e.g., correct medical terminology, industry jargon accuracy, regulatory compliance language).
SME feedback loop: collect expert ratings on agent responses → identify systematic prompt weaknesses → update prompts and/or RAG source content → re-evaluate. SME feedback improves the application layer, not the base model weights (unless used for fine-tuning).

Common Traps

TrapInference Tables and Usage Tables contain the same information.

RealityInference Tables = full request/response PAYLOADS (what was said). Usage Tables = TOKEN COUNTS and COST ESTIMATES (what it cost). They are separate tables with completely different schemas. Use Inference Tables for quality monitoring; use Usage Tables for cost management.

TrapAgent Monitoring provides real-time alerts on deployed endpoints.

RealityAgent Monitoring analyzes historical inference data after the fact. It does NOT provide built-in real-time alerting — you must configure separate alert rules based on Lakehouse Monitoring metric outputs.

Trapfaithfulness requires the model's answer and a reference answer.

Realityfaithfulness requires the model's RESPONSE and the RETRIEVED CONTEXT. It scores whether the response is supported by the context — no reference answer needed. answer_correctness is what requires a reference answer.

TrapBLEU can measure RAG summarization quality.

RealityBLEU is specifically designed for MACHINE TRANSLATION (comparing translated output to reference translation). ROUGE is for SUMMARIZATION. Using BLEU for summarization tasks is a metric mismatch. ROUGE-L or LLM-judge metrics are appropriate for RAG evaluation.

Confusing Pairs

Inference TablesUsage Tables

Inference Tables = WHAT the model said (full request/response payload logged to Delta). Usage Tables = WHAT IT COST (token counts and cost estimates via AI Gateway). Inference = quality and debugging. Usage = cost attribution and budget control.

MLflow Tracingmlflow.evaluate()

MLflow Tracing = observability during execution; captures real-time chain/agent execution steps, latencies, intermediate outputs. mlflow.evaluate() = offline quality assessment; runs metrics on a batch evaluation dataset. Tracing = what happened. evaluate() = was it good.

LLM-judge metric (faithfulness)Ground truth metric (answer_correctness)

faithfulness = LLM scores whether the response matches the context — no human-labeled answers required. answer_correctness = compare response against a known-correct reference answer — requires labeled QA dataset. Use faithfulness when you have no labels; use answer_correctness when you do.

Scenario Tips

If the question asks about:

Team wants to measure whether RAG responses are supported by context. They have no labeled reference answers. Which metric?

Answer:

faithfulness — LLM-judge metric, no ground truth needed, specifically measures grounding in retrieved context.

Distractor to avoid:

answer_correctness requires ground truth reference answers and is wrong here.

If the question asks about:

Team notices LLM costs are much higher than projected. Which Databricks feature to identify the most expensive queries?

Answer:

AI Gateway Usage Tables — logs token consumption per request, enables identifying which query patterns consume the most tokens.

Distractor to avoid:

Inference Tables log payloads (content), not token costs. MLflow Tracing inspects individual runs, not aggregate cost patterns.

If the question asks about:

Team needs to evaluate whether responses use correct domain-specific medical terminology. Built-in MLflow metrics don't cover this. What to use?

Answer:

Custom Databricks Scorer — a user-defined Python evaluation function registered in MLflow with a medical terminology rubric.

Distractor to avoid:

Switching to a larger LLM model — this might improve terminology knowledge but does not EVALUATE the current model's output quality.

Last-Minute Facts

1Metrics needing ground truth: answer_correctness, exact_match, ROUGE, BLEU.

2Metrics NOT needing ground truth: faithfulness, answer_relevance, harmfulness, coherence.

3faithfulness inputs: (response, retrieved_context). answer_relevance inputs: (response, question). answer_correctness inputs: (response, reference_answer).

4Inference Tables = payload logging (quality). Usage Tables = token/cost logging (cost management).

5BLEU = translation. ROUGE = summarization. Don't swap them in RAG evaluation contexts.

6Agent Monitoring = historical, post-deployment analysis. NOT real-time alerts by default.

7Custom Scorers = for domain-specific quality criteria beyond built-in MLflow metrics.

Feeling confident?

Put your knowledge to the test with a timed GenAI Engineer Associate mock exam.