Quick Navigation
Foundry SDK — Client Setup and AuthenticationAzure OpenAI Service — Model Deployment and PromptingRAG Architecture — Ingestion and Retrieval PipelineFoundry Agent Service — Agents and Tool CallingMulti-Agent Orchestration and Foundry IQComputer Vision — Image Generation and Multimodal UnderstandingText Analysis — Language, Speech, and TranslationInformation Extraction — Document Intelligence and IndexingResponsible AI — Safety, Guardrails, and EvaluationPlan and Manage — Security, Monitoring, and CI/CD
Foundry SDK — Client Setup and Authentication
- pip install azure-ai-projects azure-identity
- Install the Microsoft Foundry SDK (v2.2.0+) and Azure identity library required for keyless Entra ID authentication.
- from azure.ai.projects import AIProjectClient from azure.identity import DefaultAzureCredential with ( DefaultAzureCredential() as credential, AIProjectClient( endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], credential=credential ) as project_client, ):
- Initialize AIProjectClient using managed identity (DefaultAzureCredential) — the recommended keyless pattern for production deployments.
- with project_client.get_openai_client() as openai_client: response = openai_client.responses.create( model=os.environ["FOUNDRY_MODEL_NAME"], input="Your prompt here", ) print(response.output_text)
- Get an authenticated OpenAI client from the project client to run Responses, Conversations, Evaluations, and Fine-Tuning operations.
- FOUNDRY_PROJECT_ENDPOINT=https://<ai-services-name>.services.ai.azure.com/api/projects/<project-name>
- Environment variable format for the Foundry project endpoint — find it on the Microsoft Foundry Project home page.
- Managed Identity vs. API Keys
- Managed identity authenticates automatically without storing secrets in code; use it for all production deployments. API keys are acceptable only for local development.
- DefaultAzureCredential() resolution order
- Tries: environment variables → workload identity → managed identity → Azure CLI → Visual Studio Code — ensuring seamless auth in both local dev and deployed environments.
Azure OpenAI Service — Model Deployment and Prompting
- Deployment types: Standard, Provisioned-Managed, Global-Standard
- Standard uses shared capacity (pay-per-token); Provisioned-Managed reserves dedicated throughput (PTUs); Global-Standard routes globally for highest availability.
- temperature (0.0–1.0)
- Controls response randomness — use 0.0–0.3 for factual/RAG tasks, 0.7–1.0 for creative generation.
- top_p, frequency_penalty, presence_penalty, max_tokens
- top_p limits token sampling pool; frequency_penalty reduces repetition of frequent tokens; presence_penalty encourages new topics; max_tokens caps response length.
- Zero-shot / Few-shot / Chain-of-Thought
- Zero-shot: no examples; few-shot: include input-output examples in the prompt; chain-of-thought: instruct the model to reason step-by-step before answering.
- System prompt
- Instruction block passed before user input that defines the model's role, constraints, format, and safety boundaries — processed with higher priority than the user turn.
- LLM vs. SLM selection
- Use LLMs (GPT-4o, o1) for broad reasoning and multimodal tasks; use SLMs (Phi family) for cost-efficient, latency-sensitive, or edge deployment scenarios.
- Content filtering (Azure OpenAI)
- Configured at the Azure OpenAI resource level — not per-prompt — and applies hate, violence, sexual, and self-harm category filters with configurable severity thresholds.
RAG Architecture — Ingestion and Retrieval Pipeline
- RAG flow: documents → chunking → embedding → index → retrieval → prompt augmentation → LLM → response
- Each stage is independently configurable: chunking strategy affects retrieval precision, embedding model determines semantic accuracy, retrieval config controls grounding quality.
- Chunking strategies: fixed-size, sentence/paragraph, recursive
- Fixed-size is simple and predictable; sentence/paragraph preserves semantic units; recursive splits by hierarchy (paragraph → sentence → word) for complex documents.
- RAG vs. Fine-tuning
- RAG retrieves data at inference time without modifying model weights — use it for dynamic, frequently updated knowledge; fine-tuning modifies model weights and requires retraining.
- Hybrid search (Azure AI Search)
- Runs keyword (BM25) and vector search in parallel; results are merged using Reciprocal Rank Fusion (RRF) — typically outperforms either method alone for RAG retrieval.
- from azure.search.documents import SearchClient from azure.search.documents.models import VectorizedQuery vector_query = VectorizedQuery( vector=query_vector, k_nearest_neighbors=10, fields="DescriptionVector" ) results = client.search( search_text="your keyword query", vector_queries=[vector_query], top=10 )
- Python SDK pattern for hybrid search combining keyword and vector queries against Azure AI Search.
- Semantic ranking (queryType: semantic)
- AI-powered reranking step applied AFTER hybrid retrieval that rescores results by meaning — set k=50 when combining with semantic ranking to provide sufficient input documents.
- Vector search vs. semantic search
- Vector search finds conceptually similar content using embedding similarity; semantic search reranks keyword results using AI understanding — hybrid search combines both approaches.
- Embedding model selection for RAG
- Use text-embedding-3-large for highest retrieval accuracy; text-embedding-3-small for cost/speed tradeoffs — the embedding model must be consistent across ingestion and query time.
Foundry Agent Service — Agents and Tool Calling
- Agent = role + goals + memory + tools + constraints
- An agent is defined by its assigned role, the goals it pursues, its conversation memory, the tool schemas it can call, and the behavioral constraints/approval workflows applied to it.
- from azure.ai.projects.models import FunctionTool def get_order_status(order_id: str) -> str: return f"Order {order_id}: Shipped" tool = FunctionTool(functions={get_order_status})
- Define a function tool using FunctionTool from azure.ai.projects.models — the agent calls this function when it determines the tool is needed to satisfy user intent.
- Function calling vs. prompt injection
- Function calling is an authorized, structured mechanism for agents to invoke external APIs; prompt injection is a malicious attack where user input overrides agent instructions — they are not the same.
- Agent tools: Azure AI Search, Code Interpreter, File Search, OpenAPI, Bing Grounding, Azure Functions, MCP
- Foundry Agent Service supports a broad tool catalog — agents autonomously select which tool to invoke based on the task, unlike Prompt Flow where the developer defines the sequence.
- Foundry Agent Service vs. Prompt Flow
- Agents are autonomous — they decide which tools to use and when; Prompt Flow pipelines are deterministic with developer-defined sequences — use agents for open-ended tasks, Prompt Flow for repeatable workflows.
- Conversation memory
- Tracks dialogue history across turns so the agent maintains context — memory is per-agent and must be explicitly shared via a coordination layer in multi-agent orchestration.
- Autonomous vs. semi-autonomous (approval workflows)
- Fully autonomous agents act without human approval — only appropriate when risk is low and safeguards are in place; semi-autonomous agents pause for human-in-the-loop approval on high-risk actions.
Multi-Agent Orchestration and Foundry IQ
- Multi-agent orchestration pattern
- A routing/supervisor agent directs user requests to specialized sub-agents (e.g., sales agent, support agent, billing agent) and coordinates shared context between them.
- Agent-to-Agent (A2A) protocol
- Preview capability in Foundry Agent Service enabling agents to invoke other agents as tools — provides structured inter-agent communication for complex orchestration workflows.
- Foundry IQ (knowledge layer)
- A managed knowledge layer BUILT ON TOP OF Azure AI Search that adds agentic retrieval, permission-aware multi-source knowledge bases, and automated chunking and embedding for agents — not the same as Azure AI Search itself.
- Foundry IQ vs. Azure AI Search
- Azure AI Search is the underlying retrieval infrastructure; Foundry IQ is the higher-level abstraction that wraps it with agentic retrieval, multi-source knowledge bases, and permission-aware responses for agents.
- ReAct loop: Think → Act → Observe → Repeat
- Standard agent reasoning pattern: the LLM decides which tool to use (Think), executes it (Act), processes the result (Observe), and repeats until the goal is achieved or max iterations are reached.
- Shared context in multi-agent systems
- Each agent has its own memory; a coordination layer is required to pass information between agents — without it, sub-agents cannot access each other's conversation history.
Computer Vision — Image Generation and Multimodal Understanding
- DALL-E: text-to-image, inpainting, mask-based editing
- DALL-E generates images from text prompts; inpainting fills in missing/selected areas; mask-based editing uses an explicit mask to target specific regions — inpainting and mask-based editing are distinct, not interchangeable.
- Video generation from text prompts and reference media
- Generate video clips from text descriptions or reference images using Azure AI video generation models — distinct from video analysis, which processes existing video content.
- Caption generation: concise vs. detailed captions
- Azure AI Vision supports concise captions (one sentence) and detailed captions (dense description) for single or multiple images — use detailed captions for accessibility-focused or RAG grounding scenarios.
- Content Understanding: single-task (standard) mode vs. pro mode
- Single-task mode supports ALL content types (documents, images, audio, video) with lower cost and latency; pro mode is documents-ONLY and adds multi-step reasoning, multi-input document support, and cross-file analysis.
- Alt-text generation (accessibility)
- Extended image descriptions for accessibility must follow WCAG guidelines — not just describe the image but convey meaning and context appropriate for screen readers.
- Indirect prompt injection via image text
- Malicious instructions can be embedded as text inside user-uploaded images — scan image text content for injected instructions before passing it to the model.
- GPT-4 Vision / multimodal models
- Multimodal models accept image and text inputs simultaneously — use for visual question answering, image captioning, and analyzing visual content to ground AI responses.
- Video analysis: Content Understanding pipeline vs. Azure Video Indexer
- Use Content Understanding pipelines for agentic video processing (transcription, segment extraction, structured output); use Azure Video Indexer for pre-built video insight extraction (faces, topics, keyframes) without custom pipeline configuration.
Text Analysis — Language, Speech, and Translation
- Azure AI Language (Foundry Tool): entity extraction, sentiment, key phrases, language detection
- Use Foundry Tools for high-volume, standardized text analysis tasks — more cost-effective than LLM-based analysis at scale for predictable extraction workloads.
- LLM-based text analysis vs. Foundry Tools
- LLM-based analysis is more flexible and handles complex nuanced tasks but is significantly more expensive; use Foundry Tools for high-volume standardized extraction at scale.
- Structured JSON output from LLMs
- Requires explicit schema definition in the prompt or API call — models do not automatically produce structured output without guidance specifying the expected JSON format.
- Azure AI Speech: STT (speech-to-text) + TTS (text-to-speech)
- A voice-enabled agent requires BOTH speech-to-text for input AND text-to-speech for output — STT alone does not create a complete voice interaction.
- Custom speech models
- Require training with domain-specific audio and text pair data — they are not simple configuration changes and are used for specialized vocabulary or accent handling.
- Azure Translator vs. LLM-powered translation
- Azure Translator provides deterministic, high-quality translation with custom terminology support across 100+ languages; LLM translation is more flexible but less consistent for standardized terminology.
- Speech translation (Azure AI Speech)
- Converts spoken audio directly into translated text or speech in another language — combines speech-to-text and translation in a single pipeline, distinct from text-only Azure Translator workflows.
Information Extraction — Document Intelligence and Indexing
- Document Intelligence prebuilt models: invoice, receipt, ID, business card, W-2, health insurance card
- Prebuilt models handle standard document types (invoices, receipts) without training; custom models are needed for proprietary document formats with unique layouts.
- Document Intelligence: prebuilt vs. custom vs. composed models
- Prebuilt: common document types out of the box; Custom: trained on your specific layouts; Composed: chains multiple custom models to handle varied document types in a single API call.
- RAG ingestion for scanned PDFs: OCR → layout analysis → table extraction → embedding → index
- Scanned PDFs require OCR to extract text — without it, the indexer cannot read image-based content; layout analysis preserves structure for tables and multi-column documents.
- Content Understanding output: structured JSON vs. markdown
- Configure the analyzer schema to produce structured JSON for typed field extraction or markdown output for downstream LLM reasoning — the output format depends on analyzer configuration.
- Vector search requires pre-computed embeddings
- You cannot perform vector search on raw text — documents must be converted to embedding vectors during ingestion before they can be queried by semantic similarity.
- Connect Azure AI Search index as agent tool
- Register the search index as an agent tool so the agent can dynamically retrieve relevant information during conversations — do not embed all content in the system prompt (exceeds token limits).
- Enrichment skills: run at indexing time, not query time
- Enrichment skills (OCR, language detection, entity extraction) execute during the indexing pipeline — for real-time processing of new content, a separate streaming pipeline is required.
Responsible AI — Safety, Guardrails, and Evaluation
- Safety filters (input side) vs. guardrails (output side)
- Safety filters inspect and block harmful prompts BEFORE they reach the model; guardrails constrain and validate model outputs AFTER generation — they operate on opposite sides of the model.
- Content moderation configuration scope
- Safety filters are configured at the Azure OpenAI resource level, not at the individual prompt level — a single resource can have multiple deployments each with different content filtering policies.
- Foundry evaluators: fabrication, relevance, quality, safety
- Run evaluators on RAG outputs to measure hallucination rate (fabrication), whether the response addresses the query (relevance), overall quality, and safety compliance.
- Fabrication detection vs. guardrails
- Fabrication detection (hallucination checking) is an evaluation step that measures quality after generation; guardrails are constraints that actively filter or modify outputs — they serve different purposes.
- Trace logging and provenance metadata
- Capture full execution traces (inputs, outputs, tool calls, latencies) and provenance metadata (which documents grounded each response) for auditability and debugging.
- Agent governance: oversight modes and tool-access controls
- Configure agent oversight mode (autonomous vs. semi-autonomous), restrict which tools agents can access, and define behavioral constraints to limit the scope of autonomous actions.
- project_client.beta.red_teams.create(...)
- Run automated adversarial (red team) scans against your generative AI application to identify safety risks and policy violations before production deployment.
Plan and Manage — Security, Monitoring, and CI/CD
- RBAC roles for Foundry: Azure AI Developer, Cognitive Services User, Search Index Data Reader, Search Index Data Contributor
- Assign the minimum required RBAC role — never use owner/contributor for application identities; use Azure AI Developer for Foundry project access with managed identity.
- Private networking: private endpoints + VNet integration
- Isolate Azure OpenAI, AI Search, and Foundry resources behind private endpoints to prevent public internet access — required for enterprise security and compliance deployments.
- Foundry observability: tracing + token analytics + safety signals + latency breakdowns
- Configure all four observability dimensions in Foundry for complete visibility — monitoring only Azure OpenAI metrics misses agent behavior, search quality, and safety signals.
- Grounding quality monitoring vs. model performance monitoring
- Grounding quality measures whether retrieved documents are relevant to the query; model performance measures generation accuracy — these are distinct metrics requiring separate monitoring.
- Quota management: token quotas, rate limits, PTU scaling
- Manage TPM (tokens per minute) and RPM (requests per minute) quotas per deployment; use provisioned throughput (PTUs) for predictable workloads requiring guaranteed capacity.
- CI/CD integration with Foundry projects
- CI/CD pipelines must connect at the Foundry project level — not just the individual service — to orchestrate model version promotion, prompt updates, and agent deployment across environments.
- Model deployment options: serverless, managed compute, provisioned throughput
- Serverless: pay-per-token with shared capacity; Managed compute: dedicated container instances; Provisioned throughput: reserved PTU capacity for predictable high-volume workloads.
- Azure Key Vault for secret storage
- Store API keys and connection strings in Key Vault rather than in code or environment files — reference them via Key Vault references in app configuration, not by reading the secret value at deploy time.