General Exam Tips
- 1. Read ALL answer options before selecting: many wrong answers are plausible Databricks tools that solve a different problem than what the question asks.
- 2. The exam has ~45 scored questions plus a small number of unscored pilot items. Never skip a question; wrong answers carry no penalty.
- 3. Mark uncertain questions and revisit them. Scenario questions are long; a careful second read often resolves ambiguity.
- 4. Application Development (30%) and Assembling/Deploying (22%) together make up over half the exam. Weight your study time accordingly.
- 5. Most questions are scenario-based, not definition-based. Ask yourself: given these constraints, what is the BEST choice? Often two answers are correct in isolation but one better fits the stated constraint (latency, cost, update frequency, etc.).
- 6. When the question names a specific Databricks feature (AI Gateway, Inference Tables, Usage Tables, Agent Bricks), it is always intentional. Match the tool to its exact purpose, not a similar-sounding alternative.
- 7. Governance (8%) is the smallest domain. Don't over-invest here, but don't skip it either. PII masking and licensing questions are direct and scorable.
- 8. The exam was significantly updated in March 2026. If using study materials from before that date, verify coverage of Agent Bricks, MCP servers, MLflow 3 Tracing and Scorers, Prompt Registry, and Databricks Apps.
Design Applications
Must-Know Facts
- The three Agent Bricks types and exactly when to use each: Knowledge Assistant = RAG Q&A returning natural language; Information Extraction = structured JSON from unstructured text; Multiagent Supervisor = routing tasks across specialized sub-agents.
- Agents are NOT always better than chains. A RAG chain (retriever → prompt → LLM) is more predictable and easier to test. Reserve agents (non-deterministic tool loops) for tasks that genuinely require dynamic decision-making across multiple tools.
- Prompt structure for enforcing output format: include the exact JSON schema AND a filled example in the system prompt. Just instructing 'respond in JSON' produces inconsistent results without a schema example.
- Tool descriptions are read by the LLM to decide when to invoke a tool. Vague or overlapping tool descriptions cause incorrect tool selection — write them as precise, unambiguous specifications.
- Chain-of-thought prompting improves reasoning accuracy on multi-step problems; it does NOT enforce output format or structure. Use output format instructions separately.
- Converting a business goal to a pipeline means identifying: input source, required context (RAG vs structured data), output format, latency class (real-time vs batch), and quality constraints — before selecting any technology.
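The prompt-structure fact above (schema plus a filled example in the system prompt) can be sketched as follows. This is a minimal illustration, not a Databricks API: the schema, the field names, and the helper functions are all hypothetical.

```python
import json

# Hypothetical schema for illustration; the field names are assumptions.
CONTRACT_SCHEMA = {
    "type": "object",
    "properties": {
        "parties": {"type": "array", "items": {"type": "string"}},
        "effective_date": {"type": "string", "description": "ISO 8601 date"},
        "amount_usd": {"type": "number"},
    },
    "required": ["parties", "effective_date", "amount_usd"],
}

# A filled example that conforms to the schema above.
FILLED_EXAMPLE = {
    "parties": ["Acme Corp", "Globex LLC"],
    "effective_date": "2025-01-15",
    "amount_usd": 120000.0,
}


def build_system_prompt(schema: dict, example: dict) -> str:
    """Embed both the schema and a filled example in the system prompt."""
    return (
        "Extract the contract terms from the user's text.\n"
        "Respond with a single JSON object and nothing else.\n\n"
        "JSON schema:\n" + json.dumps(schema, indent=2) + "\n\n"
        "Example of a valid response:\n" + json.dumps(example, indent=2)
    )


def missing_required_fields(raw_response: str, schema: dict) -> list:
    """Return the required fields that are absent from the model's JSON response."""
    parsed = json.loads(raw_response)
    return [field for field in schema["required"] if field not in parsed]


system_prompt = build_system_prompt(CONTRACT_SCHEMA, FILLED_EXAMPLE)
```

Checking the response with something like `missing_required_fields` is a separate validation step; the prompt only raises the odds of well-formed output.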
Common Traps
Confusing Pairs
Scenario Tips
Question asks which Agent Bricks type to use when the goal is to 'extract contract terms (parties, dates, amounts) and write them to a database table'...
Information Extraction — the output is structured data for programmatic consumption.
Knowledge Assistant sounds plausible since it handles documents, but it returns natural language answers, not structured JSON for a database.
Question asks how to enforce JSON output with specific fields from an LLM...
Include the exact JSON schema definition AND a filled example in the system prompt. Of the prompt-level options, this is the most reliable way to get consistent structured output.
Chain-of-thought prompting is a common wrong answer — it improves reasoning but does nothing for output format consistency.
Last-Minute Facts
Data Preparation
Must-Know Facts
- Chunking strategy selection logic: fixed-size for uniform content (transcripts, code); sentence/paragraph for general prose; recursive for hierarchical documents (technical manuals with chapters/sections); semantic for heterogeneous content where topic boundaries are unclear.
- Chunk overlap (10–20% of chunk size) prevents context loss at boundaries. Missing overlap = retrieval gaps at every chunk boundary.
- The embedding model used at INDEXING time must be the SAME model used at QUERY time. They must operate in the same vector space. Swapping models after indexing requires full re-embedding and index rebuild.
- Extraneous content to strip before chunking: navigation menus, footers, page numbers, cookie banners, ads, boilerplate disclaimers. These pollute chunk embeddings with irrelevant text.
- Delta Lake chunks table required columns: chunk_id (unique identifier), source_document (provenance), chunk_text (the content to embed), metadata (optional but useful for filtering). This table is the source for a Delta Sync Vector Search index.
- Retrieval evaluation: Precision@k measures quality (are top-k results relevant?). Recall@k measures coverage (are all relevant chunks found?). MRR measures rank (how high is the first relevant result?). Low Recall → increase k or use hybrid search. Low Precision → use re-ranking or smaller/better-aligned chunks.
- Re-ranking uses a cross-encoder model to re-score already-retrieved chunks. It improves precision (ordering quality) but NOT recall (it can't surface chunks that weren't retrieved). It adds latency — inappropriate for real-time sub-second applications.
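The chunk-overlap fact above can be sketched with a fixed-size chunker. This is a minimal illustration that treats a list of words as "tokens"; a real pipeline would count tokens with the embedding model's tokenizer. The function name and parameters are assumptions.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
# overlap=1 with chunk_size=5 is a 20% overlap.
chunks = chunk_with_overlap(words, chunk_size=5, overlap=1)
# chunks[0] ends with "w4" and chunks[1] starts with "w4",
# so text at the boundary appears in both chunks.
```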
Common Traps
Confusing Pairs
Scenario Tips
Question describes a technical manual with chapters, sections, and subsections where initial retrieval quality is poor because chunks lack context...
Recursive chunking that respects document hierarchy. The structure is already present — use it.
Semantic chunking is a wrong answer here because it ignores existing structure and is more appropriate for unstructured/free-form text.
Question says Recall@5 is 0.45 (only 45% of relevant chunks appear in top-5). What improves this?
Increase k (retrieve more candidates) and/or add hybrid search (keyword + vector). More candidates = more chance to include relevant chunks.
Re-ranking is the most common wrong answer — it improves precision/ordering of existing results, not recall of missed chunks.
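Why re-ranking cannot fix recall is easier to see with the metrics written out. The sketch below uses the standard definitions of Precision@k, Recall@k, and reciprocal rank (MRR is the mean of reciprocal rank over queries); the chunk IDs are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


relevant = {"c2", "c4", "c5", "c8"}           # hypothetical ground truth
retrieved = ["c7", "c2", "c9", "c4", "c1"]    # initial retrieval order
reranked = ["c2", "c4", "c7", "c9", "c1"]     # same five chunks, re-ordered

# Re-ranking only re-orders the retrieved set:
# recall@5 stays at 0.5, while precision@2 and reciprocal rank improve.
```

Chunks "c5" and "c8" were never retrieved, so no re-ordering can bring them into the top 5; only a larger k or a different retrieval method can.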
Question says HTML articles contain navigation menus, footer links, and cookie banners alongside main content. What to do before chunking?
Remove the extraneous content (navigation, footer, ads, banners) before chunking. Only the main article text should be chunked and embedded.
Chunking raw HTML is wrong — the navigation/footer noise pollutes chunk embeddings.
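One way to strip extraneous content before chunking is sketched below, using only Python's standard `html.parser`. It assumes the noise is wrapped in semantic tags such as `<nav>`, `<footer>`, or `<aside>`; cookie banners marked only by a CSS class or id would need extra rules that this sketch does not include.

```python
from html.parser import HTMLParser

# Tags whose entire contents are dropped. This list is an assumption;
# adjust it to the markup of the actual source pages.
SKIP_TAGS = {"nav", "footer", "header", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the skipped tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Main body text.</p></article>"
    "<aside>We use cookies.</aside><footer>Copyright</footer></body></html>"
)
clean_text = extract_main_text(page)
```

Only `clean_text` would then be passed to the chunker and the embedding model.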
Last-Minute Facts
Application Development
Must-Know Facts
- LangChain LCEL pipe syntax: retriever | prompt | llm | output_parser. Each component is a Runnable. ChatPromptTemplate takes {context} and {question} variables filled at runtime.
- The embedding model context length must be >= the largest chunk size. If a 600-token chunk is embedded by a 512-token model, the embedding only represents the first 512 tokens — the remaining 88 tokens are silently truncated, producing a degraded embedding.
- LLM context window limits total prompt length (system prompt + retrieved context chunks + user query), and the response tokens must fit in the same budget. If retrieved chunks fill the window, part of the prompt is truncated or the request is rejected. Size chunks and k with the LLM context window in mind.
- Guardrail type selection by threat: topic classifier for off-topic/competitor queries; prompt injection detector for adversarial inputs; PII masking for personal data; output validator for hallucination/policy violations. Each threat has a specific appropriate guardrail.
- LLM selection attributes: task type (instruction following, code generation, multi-step reasoning, classification), context window length, latency SLA, cost per token, multilingual support, and license (commercial vs. non-commercial).
- mlflow.evaluate() core parameters: model (URI or callable), data (eval DataFrame), targets (ground truth column name for metrics that need it), evaluators or extra_metrics (list of metrics to compute). Results auto-logged to the MLflow run.
- LLM-judge metrics do NOT require ground truth: faithfulness, answer_relevance, harmfulness, coherence. Ground truth metrics DO require labeled reference answers: answer_correctness, exact_match, ROUGE, BLEU.
- Faithfulness specifically measures whether the response is SUPPORTED BY THE RETRIEVED CONTEXT (not whether it is factually true in general). A response can be faithful to wrong context.
- MLflow experiment lifecycle: log experiment with metrics/params → compare runs → register best model to Unity Catalog → deploy to serving endpoint. Track which prompt version + model + eval data produced each result.
Common Traps
Confusing Pairs
Scenario Tips
Question says the chatbot must never discuss competitor products. Which guardrail technique?
Topic classifier guardrail on the INPUT that detects competitor-related queries and rejects them before reaching the LLM.
PII masking is the most common wrong answer — it handles personal data, not topic scope.
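The control flow of an input-side topic guardrail can be sketched as below. The keyword match is a deliberately simple stand-in for a real topic classifier, and the competitor names, refusal text, and `llm` callable are all hypothetical.

```python
import re

# Hypothetical competitor names; a production guardrail would use a
# trained topic classifier rather than a keyword list.
BLOCKED_TERMS = ("globex", "initech")
REFUSAL = "I can only help with questions about our own products."


def is_competitor_query(user_message: str) -> bool:
    """Stand-in topic classifier: flag messages that mention a blocked term."""
    words = set(re.findall(r"[a-z0-9]+", user_message.lower()))
    return any(term in words for term in BLOCKED_TERMS)


def guarded_answer(user_message: str, llm) -> str:
    """Reject off-limits topics BEFORE the message reaches the LLM."""
    if is_competitor_query(user_message):
        return REFUSAL
    return llm(user_message)
```

The key point for the exam scenario is placement: the check runs on the input, so a blocked query never reaches the model.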
Question says a 600-token chunk is embedded with a 512-token context window embedding model, and retrieval results are unexpected...
The chunk exceeds the model's context length and is silently truncated at 512 tokens. Fix: reduce chunk size to ≤512 tokens OR choose an embedding model with a larger context window.
Wrong answer: rebuild the vector index with a different similarity metric — this doesn't address the truncation problem.
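A pre-indexing check for this failure mode might look like the sketch below. Whitespace-separated words stand in for tokens here, which is an assumption; real counts should come from the embedding model's own tokenizer.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def find_oversized_chunks(chunks, max_tokens):
    """Return (index, token_count) for every chunk the model would truncate."""
    return [
        (index, count_tokens(chunk))
        for index, chunk in enumerate(chunks)
        if count_tokens(chunk) > max_tokens
    ]


chunks = [
    " ".join(["short"] * 400),  # fits in a 512-token model
    " ".join(["long"] * 600),   # exceeds it by 88 tokens
]
oversized = find_oversized_chunks(chunks, max_tokens=512)
tokens_lost = [count - 512 for _, count in oversized]
```

Running a check like this before embedding surfaces the problem at indexing time instead of as unexplained retrieval results later.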
Question asks for a metric that measures whether LLM answers are supported by retrieved context, without requiring labeled answers...
faithfulness — LLM-judge metric, no ground truth needed, checks response vs. context.
answer_correctness requires labeled reference answers. ROUGE-L also requires a reference. These are wrong.
Question asks which guardrail prevents users from sending their SSN in a chat message to the application...
Input PII masking — detects and redacts PII in user inputs before they reach the LLM.
Output PII masking only catches PII in the LLM's response — it doesn't sanitize what the user sends in.
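Input PII masking can be illustrated with a single pattern. The sketch below only covers US SSNs written as `NNN-NN-NNNN`; real PII detection covers many more formats and entity types, and the placeholder text is an assumption.

```python
import re

# Matches SSNs in the common NNN-NN-NNNN format only.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(text: str) -> str:
    """Redact SSN-shaped strings before the text is sent to the LLM."""
    return SSN_PATTERN.sub("[REDACTED_SSN]", text)


user_message = "My SSN is 123-45-6789. Can you check my account?"
safe_message = mask_ssn(user_message)
# safe_message == "My SSN is [REDACTED_SSN]. Can you check my account?"
```

Applying the same function to the model's response would give output masking; the two directions are separate guardrails.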
Question says prompt wording was updated and the team wants to test new prompt in staging without changing application code. What mechanism enables this?
Prompt Registry with aliases — create a new prompt version, assign the 'staging' alias to it, and code that references the alias automatically uses the new version. No code deployment required.
Registering a new MLflow model version — wrong; that is for model binaries, not prompt template text. Prompt Registry is the correct tool.
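The alias indirection is the part worth internalizing. The toy in-memory registry below is NOT the MLflow Prompt Registry API; the class and method names are invented purely to show why moving an alias changes what the application loads without a code change.

```python
class ToyPromptRegistry:
    """In-memory stand-in for a prompt registry with versions and aliases."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of templates (version 1 first)
        self._aliases = {}   # (prompt name, alias) -> version number

    def register(self, name: str, template: str) -> int:
        """Store a new immutable version and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def set_alias(self, name: str, alias: str, version: int) -> None:
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise ValueError(f"unknown version {version} for prompt {name!r}")
        self._aliases[(name, alias)] = version

    def load(self, name: str, alias: str) -> str:
        """Application code calls this with an alias, never a version number."""
        return self._versions[name][self._aliases[(name, alias)] - 1]


registry = ToyPromptRegistry()
v1 = registry.register("support_bot", "Answer briefly: {question}")
registry.set_alias("support_bot", "staging", v1)
before = registry.load("support_bot", "staging")

v2 = registry.register("support_bot", "Answer briefly and cite sources: {question}")
registry.set_alias("support_bot", "staging", v2)  # promotion step
after = registry.load("support_bot", "staging")   # same call, new prompt
```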
Question describes a scenario where the team evaluates whether the chatbot response is relevant to the user's question. No labeled reference answers exist. Which metric?
answer_relevance — LLM-judge metric that scores whether the response addresses the user's question without requiring a reference answer.
answer_correctness requires a labeled reference answer and is wrong when no ground truth is available. faithfulness checks grounding in context, not question relevance.
Last-Minute Facts
Assembling and Deploying Applications
Must-Know Facts
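- Two-step MLflow deployment: mlflow.<flavor>.log_model() (for example, mlflow.langchain.log_model() or mlflow.pyfunc.log_model()) logs the artifact to a run. mlflow.register_model() creates a versioned entry in Unity Catalog. BOTH steps are required before deploying to Model Serving. Logging alone is not enough. (Passing registered_model_name to log_model() performs both steps in one call.)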
- Model serving endpoints run with a SERVICE PRINCIPAL's identity, not the developer's identity. The service principal must be explicitly granted Unity Catalog permissions to Vector Search indexes, Delta tables, and secrets. A notebook that works for the developer will fail in a serving endpoint if permissions are not propagated.
- ai_query() is for BATCH SQL workloads: SELECT ai_query('<endpoint-name>', prompt_col) FROM delta_table. The first argument is the name of a model serving endpoint, not a three-level Unity Catalog object name. It is NOT suitable for real-time interactive applications due to SQL query latency overhead.
- pyfunc models require manually specifying: model signature (input/output schema), Python dependencies (conda/pip), and a predict(context, model_input) method. The langchain flavor handles these automatically for standard LangChain chains — use pyfunc only when custom pre/post-processing is needed.
- Vector Search query time parameters: specify k (number of results), similarity metric (must match the metric the index was created with), and optional metadata filters. Changing the similarity metric requires rebuilding the index.
- Prompt Registry lifecycle (MLflow 3): author in Playground → commit version → assign aliases (dev, staging, prod) → code references alias, not version number → promote alias to move prompt to next stage. This is analogous to model aliases in Unity Catalog, which replaced the fixed stages of the older workspace model registry.
- CI/CD for GenAI pipelines includes automating: Delta table schema validation, Vector Search index sync triggers, prompt version promotion gates, model registration and endpoint deployment, and integration tests for each component.
- MCP server types: Managed = Databricks-hosted, uses Unity Catalog functions as tools (simplest). External = third-party tool providers (adds external dependency). Custom = user-implemented Python server (most development effort, most flexibility).
Common Traps
Confusing Pairs
Scenario Tips
Question says engineer ran mlflow.langchain.log_model() and now wants to deploy it as a REST endpoint. What is the next required step?
mlflow.register_model() to register to Unity Catalog. Without this step, the model cannot be deployed to Model Serving.
Rebuild the Vector Search index — wrong, this is not a deployment prerequisite.
Question says a deployed RAG agent returns an access error on the Vector Search index but works in a notebook. Most likely cause?
The model serving endpoint's service principal lacks Unity Catalog permissions to query the Vector Search index. The developer's personal credentials work in a notebook but service principals need explicit grants.
Wrong similarity metric — would cause poor results, not an access error.
Question asks for the best approach to run LLM classification on 10 million rows in a Delta table overnight without writing a Python pipeline...
ai_query() in a SQL query. It calls the model serving endpoint for each row directly from SQL.
MLflow Tracing — this is for observability, not inference.
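The batch pattern can be sketched by assembling the SQL statement in Python. The endpoint name, table, column, and instruction below are hypothetical, and the helper is an illustration only; the statement itself uses the two-argument form `ai_query(endpoint, request)`.

```python
import re

# Accepts dotted identifiers such as catalog.schema.table.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_batch_sql(endpoint_name, source_table, text_column, instruction):
    """Build a SELECT that calls ai_query() once per row of the source table."""
    if "'" in endpoint_name or "'" in instruction:
        raise ValueError("single quotes are not supported in this sketch")
    if not _IDENTIFIER.match(source_table) or not _IDENTIFIER.match(text_column):
        raise ValueError("unexpected table or column name")
    return (
        f"SELECT {text_column},\n"
        f"       ai_query('{endpoint_name}', CONCAT('{instruction}', {text_column})) AS label\n"
        f"FROM {source_table}"
    )


sql = build_batch_sql(
    endpoint_name="my-llm-endpoint",       # hypothetical serving endpoint
    source_table="main.support.tickets",   # hypothetical Delta table
    text_column="ticket_text",
    instruction="Classify this ticket as billing, technical, or other: ",
)
# In a Databricks notebook or job, this string could be passed to spark.sql(sql),
# or the statement could be run directly in a SQL editor.
```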
Question asks how to integrate Unity Catalog functions as agent tools using MCP...
Managed MCP servers — Databricks-hosted, uses UC functions directly, minimal configuration required.
Custom MCP servers require the most implementation effort and are for tools not available as UC functions.
Last-Minute Facts
Governance
Must-Know Facts
- PII masking applies to BOTH inputs and outputs. Users can include PII in their queries (input PII masking required). LLMs can pull PII from retrieved context and include it in responses (output PII masking required). Both directions need protection.
- CC-BY-NC (Creative Commons Attribution Non-Commercial) prohibits use in any commercial product, regardless of public availability. Using CC-BY-NC data in a commercial SaaS product violates the license. Always audit RAG source licenses before production deployment.
- Guardrail technique by threat: PII masking → PII in inputs/outputs. Topic classifier → off-topic or restricted subject queries. Prompt injection detector → adversarial instruction override attempts. Output validator → hallucination, policy violation, harmful content in generated responses.
- Unity Catalog permissions for GenAI: model serving endpoints need EXECUTE on UC functions, SELECT on Delta tables, and USE CATALOG / USE SCHEMA on the parent catalog and schema. Grant these to the endpoint's service principal. Follow least privilege: endpoints should only access resources they actually query.
- Handling problematic RAG source text: exclude the document, replace with a curated/filtered alternative, or apply post-processing to sanitize. Do not include harmful or legally restricted content just because it is technically accessible.
Common Traps
Confusing Pairs
Scenario Tips
Question says healthcare chatbot must never return patient identifiers (HIPAA compliance). Which guardrail?
Output PII masking — detects and redacts PHI in LLM responses before they reach users.
Input PII masking handles what users send in, not what the LLM outputs from retrieved data.
Question says team found CC-BY-NC articles in their RAG knowledge base for a commercial SaaS product. What should they do?
Exclude CC-BY-NC articles and find alternatives with commercially compatible licenses (CC-BY, CC0, or custom commercial licenses).
Include with attribution — wrong; attribution doesn't satisfy the NC restriction.
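A license audit of the knowledge base can be sketched as a simple filter. The document records, the `license` field, and the allowlist below are assumptions for illustration; the actual allowlist is a legal decision, not a technical one.

```python
# Licenses treated as commercially usable in this sketch (an assumption).
COMMERCIAL_OK = {"CC-BY", "CC0", "COMMERCIAL"}


def split_by_license(documents):
    """Separate documents into (allowed, excluded) by their license tag."""
    allowed, excluded = [], []
    for doc in documents:
        # Documents with a missing or unknown license are excluded by default.
        if doc.get("license") in COMMERCIAL_OK:
            allowed.append(doc)
        else:
            excluded.append(doc)
    return allowed, excluded


knowledge_base = [
    {"id": "a1", "license": "CC-BY"},
    {"id": "a2", "license": "CC-BY-NC"},
    {"id": "a3", "license": "CC0"},
    {"id": "a4"},  # no license recorded
]
allowed, excluded = split_by_license(knowledge_base)
```

Excluding unknown licenses by default follows the same caution the scenario calls for: content is not usable just because it is accessible.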
Last-Minute Facts
Evaluation and Monitoring
Must-Know Facts
- mlflow.evaluate() key parameters: model or predictions, data (eval dataset), targets (column name for ground truth, required for answer_correctness/exact_match), extra_metrics or evaluators. Results auto-logged as MLflow run artifacts.
- LLM-judge metrics (no ground truth needed): faithfulness (response vs. context), answer_relevance (response vs. question), harmfulness, coherence, fluency. The judge LLM must be accessible from the evaluation environment.
- Ground truth metrics (require labeled data): answer_correctness, exact_match, ROUGE-1, ROUGE-2, ROUGE-L, BLEU. The exam trap: teams often choose BLEU for RAG quality because it sounds rigorous, but BLEU is a machine translation metric — it is a wrong fit for RAG. Use ROUGE or LLM-judge metrics for RAG evaluation.
- Inference Tables: auto-log ALL request/response payloads to a Delta table. Enabled per endpoint. Purpose: quality monitoring, drift detection, offline evaluation, debugging bad responses. Adds slight latency and storage cost.
- Usage Tables (AI Gateway): log TOKEN CONSUMPTION and COST per request. The exam trap: questions describe a cost overrun scenario and list Inference Tables as an answer choice — Inference Tables record PAYLOADS, not costs. Only Usage Tables give token counts and cost estimates. Separate tables, separate purposes.
- Agent Monitoring (Lakehouse Monitoring): analyzes inference table data over time. Tracks quality metrics, latency distributions, error rates, token usage trends. Post-deployment, historical, NOT real-time alerting by default.
- Custom Databricks Scorers: user-defined Python evaluation functions registered in MLflow. Use when built-in metrics don't cover domain-specific quality criteria (e.g., correct medical terminology, industry jargon accuracy, regulatory compliance language).
- SME feedback loop: collect expert ratings on agent responses → identify systematic prompt weaknesses → update prompts and/or RAG source content → re-evaluate. SME feedback improves the application layer, not the base model weights (unless used for fine-tuning).
Common Traps
Confusing Pairs
Scenario Tips
Team wants to measure whether RAG responses are supported by context. They have no labeled reference answers. Which metric?
faithfulness — LLM-judge metric, no ground truth needed, specifically measures grounding in retrieved context.
answer_correctness requires ground truth reference answers and is wrong here.
Team notices LLM costs are much higher than projected. Which Databricks feature identifies the most expensive queries?
AI Gateway Usage Tables — logs token consumption per request, enables identifying which query patterns consume the most tokens.
Inference Tables log payloads (content), not token costs. MLflow Tracing inspects individual runs, not aggregate cost patterns.
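The kind of analysis that usage data enables can be sketched as below. The row layout (`query_pattern`, `input_tokens`, `output_tokens`) is a hypothetical simplification, not the actual usage table schema.

```python
from collections import defaultdict


def tokens_by_pattern(usage_rows):
    """Total tokens per query pattern, most expensive first."""
    totals = defaultdict(int)
    for row in usage_rows:
        totals[row["query_pattern"]] += row["input_tokens"] + row["output_tokens"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-request usage records.
usage_rows = [
    {"query_pattern": "summarize_contract", "input_tokens": 4000, "output_tokens": 600},
    {"query_pattern": "faq_lookup", "input_tokens": 300, "output_tokens": 80},
    {"query_pattern": "summarize_contract", "input_tokens": 5200, "output_tokens": 700},
    {"query_pattern": "faq_lookup", "input_tokens": 280, "output_tokens": 90},
]
ranking = tokens_by_pattern(usage_rows)
# ranking == [("summarize_contract", 10500), ("faq_lookup", 750)]
```

Payload logs alone would not support this ranking, because they record what was said rather than how many tokens each request consumed.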
Team needs to evaluate whether responses use correct domain-specific medical terminology. Built-in MLflow metrics don't cover this. What to use?
Custom Databricks Scorer — a user-defined Python evaluation function registered in MLflow with a medical terminology rubric.
Switching to a larger LLM model — this might improve terminology knowledge but does not EVALUATE the current model's output quality.
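The scoring logic inside such a scorer could be as simple as the sketch below. It is a plain Python function; how it is wrapped and registered with MLflow is not shown, and the term lists and return shape are assumptions.

```python
import re


def terminology_score(response, preferred_terms, discouraged_terms):
    """Check a response for required clinical terms and lay-language substitutes."""
    text = response.lower()

    def contains(term):
        # Whole-phrase, case-insensitive match.
        return re.search(r"\b" + re.escape(term.lower()) + r"\b", text) is not None

    found = [term for term in preferred_terms if contains(term)]
    flagged = [term for term in discouraged_terms if contains(term)]
    coverage = len(found) / len(preferred_terms) if preferred_terms else 1.0
    return {
        "coverage": coverage,
        "flagged": flagged,
        "passed": coverage == 1.0 and not flagged,
    }


good = terminology_score(
    "The patient presented with myocardial infarction and hypertension.",
    preferred_terms=["myocardial infarction", "hypertension"],
    discouraged_terms=["heart attack"],
)
bad = terminology_score(
    "The patient had a heart attack.",
    preferred_terms=["myocardial infarction", "hypertension"],
    discouraged_terms=["heart attack"],
)
```

A rubric like this evaluates the current output; it says nothing about whether a different model would do better, which is why swapping models is the wrong answer.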