Quick Navigation
Prompt Engineering Fundamentals
- System Prompt
- Instruction block passed to the LLM that defines behavior, role, and constraints. Processed before user input. Controls tone, format, and safety boundaries.
- Few-Shot Prompting
- Include example input-output pairs in the prompt to guide the LLM toward the desired response format without fine-tuning the model weights.
- Chain-of-Thought Prompting
- Instruct the model to reason step-by-step before giving the final answer. Improves accuracy on multi-step reasoning tasks.
- Output Formatting Instructions
- Direct the LLM to respond in JSON, Markdown, bullet lists, or other structured formats by explicitly specifying the schema in the prompt.
- Temperature Parameter
- Controls randomness. Low temperature (0.0-0.3): deterministic, factual. High temperature (0.7-1.0): creative, varied. Use low for RAG, higher for generation.
- Prompt Injection Risk
- Malicious user input that overrides system instructions. Mitigate with input validation, guardrails, and separating trusted instructions from user content.
RAG (Retrieval-Augmented Generation)
- RAG Pipeline Components
- Source documents → chunking → embedding → vector store → retrieval → prompt augmentation → LLM → response. Each stage can be optimized independently.
- Chunking Strategies
- Fixed-size: simple, predictable. Sentence/paragraph: preserves semantic units. Recursive: splits by hierarchy (paragraph → sentence → word). Choose based on document structure.
- Chunk Overlap
- Include overlapping tokens between adjacent chunks to prevent context loss at chunk boundaries. Typical overlap: 10-20% of chunk size.
- Embedding Models
- Transform text chunks into dense vector representations. Choose context length based on average chunk size. Longer context models handle larger chunks.
- Mosaic AI Vector Search
- Databricks-managed vector database integrated with Unity Catalog. Supports Delta Sync (auto-sync from Delta table) and Direct Vector Access modes.
- Vector Search Index Types
- Delta Sync Index: auto-updated from a Delta table source, fully managed. Direct Vector Access Index: you manage upserts/deletes directly. Choose based on update frequency.
- Similarity Search
- Query the vector index with an embedded query vector. Returns top-k most similar chunks by cosine similarity or dot product. Tune k based on retrieval quality.
- Re-ranking
- Post-retrieval step that re-scores retrieved chunks using a cross-encoder model. Improves relevance precision after initial vector search recall. Adds latency but improves quality.
Data Preparation for GenAI
- Document Extraction Libraries
- PyPDF2/pdfplumber for PDFs, python-docx for Word files, BeautifulSoup for HTML, unstructured for mixed document types. Choose based on source format.
- Extraneous Content Removal
- Strip headers, footers, page numbers, navigation menus, boilerplate disclaimers, and formatting artifacts before chunking. They degrade retrieval relevance.
- Writing Chunks to Delta Lake
- Store chunked text in a Delta Lake table in Unity Catalog. Include columns: chunk_id, source_doc, chunk_text, metadata. Used as source for Vector Search sync.
- Unity Catalog for RAG Data
- Govern embedding tables, vector search indexes, and source document tables under Unity Catalog. Enables lineage tracking and access control for GenAI data assets.
- Retrieval Evaluation Metrics
- Precision@k: fraction of retrieved chunks that are relevant. Recall@k: fraction of relevant chunks retrieved. MRR: mean reciprocal rank of first relevant result.
- Chunk Size Trade-offs
- Small chunks: higher retrieval precision, may lack context. Large chunks: more context, lower precision, higher embedding cost. Balance based on eval metrics.
Application Development with LangChain and MLflow
- LangChain LCEL (LangChain Expression Language)
- Pipe-based syntax to compose chains: retriever | prompt | llm | parser. Each component is a Runnable. Enables easy composition and streaming.
- ChatPromptTemplate
- LangChain template combining system message, optional examples, and human message with {variable} placeholders filled at runtime from user input.
- LLM Guardrails
- Input/output validation layers preventing harmful content, PII exposure, topic drift, or prompt injection. Implement with Llama Guard, custom classifiers, or moderation APIs.
- Foundation Model APIs
- Databricks-hosted LLM endpoints (DBRX, Llama, Mixtral, etc.) accessible via standard OpenAI-compatible API. No infrastructure management required.
- MLflow AI Gateway
- Unified proxy for LLM calls supporting both Databricks-hosted and external (OpenAI, Anthropic) models. Provides rate limiting, Inference Tables logging, and Usage Tables for cost tracking.
- MLflow Tracing
- Automatic instrumentation capturing the full execution trace of an LLM chain: inputs, outputs, latencies, and intermediate steps. Essential for debugging multi-step agents.
- pyfunc Model Flavor
- Generic MLflow model type using Python function interface. Use for RAG chains with pre/post-processing logic that does not fit a specific framework flavor.
- mlflow.langchain.log_model()
- Log a LangChain chain or agent to MLflow as a langchain flavor model. Captures the entire runnable including prompts, retriever config, and LLM endpoint.
Assembling and Deploying Applications
- Model Registration to Unity Catalog
- mlflow.register_model(model_uri, 'catalog.schema.model_name') — registers a logged model to Unity Catalog. Enables governance, versioning, and lineage.
- Model Serving Endpoints
- Deploy registered MLflow models as REST API endpoints via Databricks Model Serving. Auto-scaling, serverless option available. Accessed via standard REST or Databricks SDK.
- Model Serving — Resource Access
- Grant endpoints access to external resources (Vector Search, Delta tables, secrets) using Databricks service principals. Endpoints run with the service principal's permissions.
- Serving Endpoint Environment Variables
- Pass API keys, tokens, and config values as environment variables or Databricks Secrets to model serving endpoints. Never hardcode credentials in model artifacts.
- ai_query() Function
- Databricks SQL function that calls a model serving endpoint directly from a SQL query. Enables batch inference on Delta tables without writing Python pipeline code.
- Batch Inference Pattern
- Read source Delta table → apply ai_query() or spark.udf with LLM call → write results to output Delta table. Suitable for offline enrichment at scale.
- Prompt Version Control
- Track prompt templates as versioned MLflow artifacts or Unity Catalog assets. Enables rollback, A/B comparison, and promotion between environments (dev → staging → prod).
- CI/CD for GenAI Apps
- Automate: Vector Search index updates, prompt version promotion, model registration, endpoint deployment. Use Databricks Asset Bundles or GitHub Actions.
Agentic Systems and Multi-Agent Patterns
- MLflow Agent Framework
- Databricks framework for building, evaluating, and deploying agentic systems. Provides tool calling, state management, and MLflow tracing integration out of the box.
- Tool Definition
- Agents use tools (Python functions, SQL queries, API calls, Vector Search) to take actions. Each tool has a name, description, and typed input/output schema for LLM selection.
- ReAct Pattern (Reason + Act)
- Agent loop: Think (LLM decides which tool to use) → Act (execute tool) → Observe (process result) → Repeat until goal is achieved or max iterations reached.
- Agent Bricks — Knowledge Assistant
- Pre-built agent type for Q&A over documents using RAG. Configurable with a Vector Search index and LLM endpoint. Minimal custom code required.
- Agent Bricks — Multiagent Supervisor
- Orchestrator agent that routes subtasks to specialized sub-agents. Use when tasks require different expertise domains (e.g., SQL agent + document agent).
- Agent Bricks — Information Extraction
- Pre-built agent for extracting structured data from unstructured text into a defined schema. Outputs JSON conforming to a user-specified Pydantic model.
- Genie Spaces
- Databricks feature enabling natural language querying of structured data (Delta tables, SQL warehouses) via a conversational interface. Enables multi-agent data access.
- Multi-Agent Communication
- Agents communicate via function calls or conversational APIs. The supervisor agent passes context and receives results from sub-agents to compose a final response.
Governance and Guardrails
- Input Guardrails
- Validate and filter user inputs before sending to the LLM. Detect prompt injection, harmful content, PII, and off-topic queries. Block or sanitize before processing.
- Output Guardrails
- Validate LLM outputs before returning to users. Check for hallucinations, PII leakage, harmful content, or policy violations. Can invoke a secondary LLM as judge.
- PII Masking
- Detect and replace PII (names, emails, SSNs, phone numbers) in inputs and outputs using NER models or regex patterns. Prevents accidental PII exposure via LLM.
- Data Source Licensing
- Verify licenses of training/RAG documents (CC-BY, CC-BY-SA, commercial restrictions). Some licenses prohibit use in commercial GenAI applications.
- Unity Catalog Permissions for GenAI
- Grant EXECUTE on functions, SELECT on tables, and USE on schemas to model serving service principals. Follows standard Unity Catalog privilege model.
- Problematic Text Mitigation
- Replace harmful or biased text in RAG sources with: filtered datasets, curated alternatives, or content policy flagging rather than using raw data.
Evaluation and Monitoring
- MLflow evaluate()
- mlflow.evaluate(model, data, targets, evaluators=[...]) — runs automated evaluation of LLM/agent outputs against metrics. Logs results as MLflow run artifacts.
- LLM-as-Judge Metrics
- Score responses with a powerful LLM (no ground truth needed): faithfulness (response supported by context?), answer_relevance (addresses the question?), harmfulness.
- Ground Truth Metrics
- Metrics requiring labeled reference answers: exact_match, ROUGE, BLEU, answer_correctness. Use when a curated QA dataset with known correct answers is available.
- Faithfulness (Groundedness)
- Measures whether the LLM response is supported by the retrieved context. High faithfulness = no hallucination. Scored by LLM judge — no ground truth required.
- Inference Tables
- Auto-logging of every request and response to a Delta table for a model serving endpoint. Enables offline analysis, drift detection, and quality monitoring over time.
- Agent Monitoring (Lakehouse Monitoring)
- Databricks feature that monitors deployed agent endpoints using inference table data. Tracks latency, token usage, error rates, and LLM-scored quality metrics.
- Usage Tables (AI Gateway)
- Log token consumption per LLM request routed through AI Gateway. Use for cost attribution, budget enforcement, and identifying expensive query patterns.
- Databricks Scorers
- Custom evaluation functions registered in MLflow that score model outputs on domain-specific criteria. Extend built-in metrics with business-logic quality checks.