General Exam Tips
1. Read ALL answer options before selecting. Many questions have two plausible answers where one is specifically better for the given constraint (cost, latency, compliance).
2. The exam is operations-first: always ask yourself 'which option makes this AI system most reliable, observable, and automated in production?' not 'which option trains the best model?'
3. Time management: budget about 2 minutes per question. Case studies may take 4-5 minutes each. Flag and return to uncertain questions.
4. There is no penalty for wrong answers, so answer every question even if you must guess.
5. When two answers both seem correct, the one that uses native Azure ML or Foundry features over custom code or external tools is almost always preferred.
6. Scenario questions often hinge on a single decisive constraint (e.g., 'must not manage GPU infrastructure') that eliminates most options, and they may surround it with red-herring details. Identify the real constraint first.
7. Domain 2 (28%) and Domain 3 (24%) together make up more than half the exam (52%). Do not under-prepare these.
8. This exam is in beta, so questions may feel less polished. If a question seems ambiguous, go with the most 'Azure-native MLOps' answer.
Design and Implement an MLOps Infrastructure
Must-Know Facts
- Datastores define the CONNECTION to Azure storage services (Blob, ADLS Gen2, SQL). Data assets are versioned REFERENCES to specific data within a datastore. They are separate objects.
- Compute targets: Compute Instances = interactive development/notebooks. Compute Clusters = parallel/distributed training jobs. Serverless Compute = on-demand, no cluster management. Inference Compute = endpoint deployments.
- Azure ML Registries share models, environments, components, and data assets ACROSS multiple workspaces. A workspace-level model registry is LOCAL to one workspace only.
- Environments encapsulate Python packages and Docker images for reproducible runs — they are VERSIONED assets, not config files.
- Components are REUSABLE, versioned pipeline steps with defined inputs, outputs, code, and an environment. They are the building block of ML pipelines.
- Bicep templates are DECLARATIVE IaC (define desired state). GitHub Actions is CI/CD AUTOMATION that executes those templates and other operations.
- Network security options for ML workspaces: private endpoints, VNet integration, managed network isolation. Know when each is appropriate.
- Managed identities eliminate credential management for workspace-to-storage access — preferred over service principal secrets.
Scenario Tips
Scenario: A team needs to share a trained model from Workspace A with teams using Workspace B in a different region.
Correct: Use an Azure ML Registry. It is designed for cross-workspace, cross-region sharing of ML assets including models, environments, and components.
Trap: Azure Blob Storage (shares raw files but not versioned ML assets) or copying the workspace (not a real option). Both miss the point of managed ML asset sharing.
Scenario: The question asks for consistent, repeatable deployment of ML workspaces across dev/staging/prod.
Correct: Bicep templates deployed via GitHub Actions workflows. Bicep ensures idempotent, repeatable deployments; GitHub Actions provides the CI/CD trigger.
Trap: Python SDK scripts or manual configuration in the Azure Portal. Neither provides repeatable IaC deployments at scale.
Scenario: A training script needs to access Azure Data Lake Storage without storing credentials.
Correct: Configure a managed identity on the workspace and grant it RBAC access to the storage account. Create a datastore that uses the managed identity.
Trap: Storing a storage account key or SAS token in the training script or an environment variable. This is a security anti-pattern the exam will never accept.
Implement Machine Learning Model Lifecycle and Operations
Must-Know Facts
- MLflow is natively integrated into Azure ML — every workspace exposes an MLflow tracking URI. You log experiments with mlflow.log_metric() and register models with mlflow.register_model().
- Progressive rollout on managed online endpoints works by TRAFFIC SPLITTING between deployments — you route a percentage (e.g., 10%) of requests to the new version, not by deploying to a subset of instances.
- Managed online endpoints support BLUE-GREEN deployment: both old and new deployments exist simultaneously with configurable traffic weights.
- Archiving a model in the registry does NOT delete it. It marks it deprecated while preserving it for compliance and audit trails.
- Data drift monitors INPUT distribution changes vs. training data. Prediction drift monitors OUTPUT distribution changes. Both are monitoring signals, but they detect different problems.
- Data collection for monitoring is AUTOMATIC with managed online endpoints. For batch endpoints, it requires MANUAL configuration.
- AutoML automates algorithm selection, hyperparameter tuning, and standard featurization (such as imputation and scaling), but it does NOT fix poor-quality data. Data quality still matters.
- Hyperparameter tuning uses sweep jobs with: search spaces (discrete or continuous), sampling methods (random, grid, Bayesian), and early termination policies (Bandit, Median, Truncation).
- Retraining can be triggered by monitoring alerts via Azure Event Hubs, Azure Functions, or Azure Logic Apps — NOT directly from Azure Monitor alerts.
- Feature retrieval specifications can be packaged with the MLflow model artifact to describe how to fetch features at inference time.
Scenario Tips
Scenario: A new model version must be tested in production with minimal risk, and rollback must be instant if error rates spike.
Correct: Deploy the new version to a second deployment slot on the managed online endpoint. Set traffic weight to 10% (new) / 90% (old). Monitor error rates. If issues arise, set traffic back to 100% old instantly.
Trap: Blue-green with a new endpoint (loses rollback speed) or a batch endpoint canary (batch endpoints don't support traffic splitting).
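The traffic-splitting idea in this scenario can be illustrated with a small simulation. This is only a sketch of weighted routing in plain Python, not the Azure ML SDK; the deployment names `blue` and `green` and the helper functions are hypothetical.

```python
import random

def route(weights, rng):
    """Pick a deployment with probability proportional to its traffic weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def simulate(weights, n_requests, seed=0):
    """Count how many simulated requests each deployment receives."""
    rng = random.Random(seed)
    counts = {name: 0 for name in weights}
    for _ in range(n_requests):
        counts[route(weights, rng)] += 1
    return counts

# Hypothetical names: "blue" is the current deployment, "green" the candidate.
canary = {"blue": 90, "green": 10}
counts = simulate(canary, 10_000)
green_share = counts["green"] / 10_000  # close to 0.10, not exactly 0.10

# Rollback is only a weight change; both deployments stay in place.
rollback = {"blue": 100, "green": 0}
after_rollback = simulate(rollback, 1_000)
```

The point of the sketch is that a canary is a property of request routing, so rolling back needs no redeployment, which is why it is fast.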
Scenario: Production model accuracy has dropped. Azure ML monitoring shows input feature distributions have diverged significantly from training data.
Correct: This is DATA DRIFT. Configure an alert trigger connected to Azure Event Hubs or Logic Apps to automatically initiate a retraining pipeline with fresh data.
Trap: Prediction drift (monitors OUTPUT, not input) or data quality (a structural issue, not a distribution shift).
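To make "input distributions have diverged" concrete, here is a minimal sketch of one common drift statistic, the Population Stability Index (PSI). It is an illustration chosen for this note, not a statement of which metric a given Azure ML monitor uses; the 0.1 and 0.25 cut-offs are a widely quoted rule of thumb, and the sample data is synthetic.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a new sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
training = [rng.gauss(50, 10) for _ in range(5000)]    # baseline feature values
production_ok = [rng.gauss(50, 10) for _ in range(5000)]
production_shifted = [rng.gauss(65, 10) for _ in range(5000)]

stable_score = psi(training, production_ok)        # small: same distribution
drift_score = psi(training, production_shifted)    # large: mean has moved
# Rule of thumb (an assumption here): < 0.1 stable, > 0.25 significant drift.
```

Running the same comparison on model outputs instead of input features would be a prediction-drift check, which is the distinction this scenario tests.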
Scenario: A training job needs to try 100+ hyperparameter combinations efficiently, stopping poor configurations early.
Correct: Configure a sweep job with Bayesian sampling (uses prior results) plus an early termination policy (Bandit or Median). Bayesian is most efficient for large search spaces.
Trap: Grid sampling (only for tiny discrete spaces) or random sampling without early termination (wastes compute on clearly bad configs).
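The early termination idea can be sketched without any SDK. The function below follows the slack-factor rule as it is commonly described for a Bandit policy on a metric being maximized: a trial stops when its metric falls below best / (1 + slack_factor). The trial names and accuracy values are hypothetical, and the exact rule should be checked against the current Azure ML documentation.

```python
def bandit_should_stop(job_metric, best_metric, slack_factor):
    """Return True if a trial falls outside the allowed slack of the best trial.

    Assumes a positive metric that is being maximized (e.g., accuracy).
    """
    return job_metric < best_metric / (1 + slack_factor)

# Hypothetical accuracies reported at the same evaluation interval.
trials = {"trial_a": 0.92, "trial_b": 0.85, "trial_c": 0.70}
best = max(trials.values())

stopped = sorted(
    name for name, acc in trials.items()
    if bandit_should_stop(acc, best, slack_factor=0.1)
)
# With slack_factor=0.1 the cut-off is 0.92 / 1.1, roughly 0.836,
# so only trial_c is stopped.
```

A smaller slack factor stops more trials earlier and saves more compute, at the risk of cutting off a configuration that would have improved later.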
Scenario: The question asks when to use batch endpoints instead of online endpoints.
Correct: Use batch endpoints when you are processing large static datasets, have no latency requirements, want to avoid always-on compute costs, or need parallel inference at scale.
Trap: Choosing batch when the scenario mentions real-time, interactive, low-latency, or chatbot requirements. Those always need online endpoints.
Design and Implement a GenAIOps Infrastructure
Must-Know Facts
- Microsoft Foundry is the current platform name — it replaced Azure AI Studio and Azure AI Foundry (classic). The exam uses 'Microsoft Foundry' as the canonical name.
- Foundry uses a hub-and-project architecture: Hubs are shared infrastructure (compute, networking, connections). Projects are team workspaces within a hub.
- Two deployment options for foundation models: Serverless API (MaaS — no GPU management, pay-as-you-go) vs Managed Compute (dedicated VMs, more control, higher commitment).
- Serverless API deployment scopes: Global Standard (worldwide routing), Data Zone (geographic boundary), Regional (specific Azure region for compliance).
- Provisioned Throughput Units (PTUs) are pre-purchased capacity on serverless endpoints — they guarantee consistent performance for high-volume production workloads.
- Prompt versioning uses Git repositories — track prompt changes with commits, create variants as branches, compare performance across versions.
- Managed identities + RBAC is the recommended security pattern for Foundry resources — eliminates credential management.
- Bicep templates can deploy Foundry resources (hubs, projects, model deployments) just as they deploy Azure ML resources.
Scenario Tips
Scenario: A chatbot needs to handle 50,000 requests/hour with guaranteed sub-2-second response times.
Correct: Deploy the foundation model on a Serverless API endpoint with Provisioned Throughput Units (PTUs). PTUs guarantee consistent throughput; pay-as-you-go serverless does not.
Trap: Plain pay-as-you-go serverless. It seems simpler but cannot guarantee performance at high volume. Managed compute is also valid unless the question adds a 'no GPU management' constraint.
Scenario: A healthcare company must ensure all AI inference data stays within EU boundaries.
Correct: Deploy on a Serverless API endpoint with Data Zone scope. This restricts all routing to the EU geographic boundary, meeting data residency requirements without full regional pinning.
Trap: Global Standard (no data residency control) or Regional (would work but unnecessarily restricts to one region, reducing availability).
Scenario: A team wants to track system prompt changes, compare response quality across different prompt versions, and collaborate on prompt development.
Correct: Implement prompt versioning with a Git repository in Microsoft Foundry. Git tracks changes, branches enable variants, and comparison tools evaluate performance across versions.
Trap: Storing prompts in Azure Blob Storage (no version comparison) or using the model registry (which holds models, not prompts).
Scenario: An organization's security policy prohibits storing credentials in code or config files for Foundry access.
Correct: Configure managed identities on the Foundry resources and grant RBAC permissions. Managed identities authenticate without any stored credentials.
Trap: A service principal with its secret in Key Vault. This still involves credential management; managed identities eliminate it entirely.
Implement Generative AI Quality Assurance and Observability
Must-Know Facts
- Four AI quality metrics: Groundedness (factually supported by SOURCE DATA), Relevance (response addresses the query), Coherence (logical flow and consistency), Fluency (natural language quality).
- Groundedness is about the SOURCE DOCUMENTS — a fluent, relevant, coherent response can still be UNGROUNDED if it includes facts not present in the retrieved context.
- Safety evaluations are SEPARATE from quality evaluations. A response can be high-quality and unsafe, or low-quality and safe.
- Automated evaluation workflows can run on both TEST DATASETS (pre-deployment gate) and PRODUCTION TRAFFIC (continuous monitoring).
- Distributed tracing is the correct tool for identifying which step in a multi-step GenAI pipeline (embedding, retrieval, LLM inference) is the latency bottleneck.
- Token consumption has TWO components: input tokens (prompt + context) and output tokens (generated response). Cost optimization may target either.
- Continuous monitoring in Foundry: dashboards track groundedness trends, latency, token usage, and safety violations over time.
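The two-component nature of token cost noted in the list above can be shown with simple arithmetic. This is a minimal sketch; the per-1,000-token prices and token counts are invented for illustration and are not real Azure pricing.

```python
def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one request, with input and output tokens priced separately."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

# Hypothetical prices per 1,000 tokens (not real pricing).
INPUT_PRICE = 0.01
OUTPUT_PRICE = 0.03

# A RAG request: 3,000 input tokens (prompt + retrieved context), 500 output tokens.
baseline = request_cost(3000, 500, INPUT_PRICE, OUTPUT_PRICE)

# Same request after trimming retrieved context to 1,500 input tokens.
trimmed_context = request_cost(1500, 500, INPUT_PRICE, OUTPUT_PRICE)

# Same request after capping the response at 250 output tokens instead.
capped_output = request_cost(3000, 250, INPUT_PRICE, OUTPUT_PRICE)

savings_from_context = baseline - trimmed_context
savings_from_output = baseline - capped_output
```

Which lever saves more depends on the ratio of input to output tokens and on their relative prices, so it is worth measuring both before optimizing.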
Scenario Tips
Scenario: A GenAI response is grammatically correct, well-structured, and directly addresses the question, but cites facts not found in the retrieved documents.
Correct: Low GROUNDEDNESS. The response is fluent, coherent, and relevant, but contains unsupported content relative to the source data. This is the classic hallucination scenario.
Trap: Fluency, Coherence, or Relevance. These are all high in this scenario. The specific problem is the gap between the response and the source documents.
Scenario: An agent pipeline with 5 steps (intent detection, embedding, vector search, prompt assembly, LLM call) shows high latency but you cannot identify which step is slow.
Correct: Configure DISTRIBUTED TRACING. It captures timing and status for each step individually, allowing you to pinpoint the bottleneck.
Trap: Token consumption monitoring (cost, not latency) or endpoint latency monitoring (total only, not per-step).
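What per-step tracing buys you can be sketched with a tiny span recorder. This is a stand-in for a real tracing library such as OpenTelemetry, not a Foundry API; the `Trace` class is hypothetical and the step durations are simulated with `time.sleep`.

```python
import time
from contextlib import contextmanager

class Trace:
    """Records one span (name, duration in seconds, status) per pipeline step."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            self.spans.append((name, time.perf_counter() - start, status))

    def slowest(self):
        """Name of the step with the longest recorded duration."""
        return max(self.spans, key=lambda s: s[1])[0]

# Simulated step durations in seconds (hypothetical values).
simulated_steps = {
    "intent_detection": 0.001,
    "embedding": 0.002,
    "vector_search": 0.05,
    "prompt_assembly": 0.001,
    "llm_call": 0.005,
}

trace = Trace()
for step, seconds in simulated_steps.items():
    with trace.span(step):
        time.sleep(seconds)  # stand-in for the real work

bottleneck = trace.slowest()
```

An endpoint-level latency metric would only report the sum of these durations, which is why it cannot answer the question in this scenario.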
Scenario: A team wants to prevent unsafe or harmful responses from reaching production, in addition to quality checks.
Correct: Configure SEPARATE safety evaluations and quality evaluations. Safety evaluations check for harmful content, bias, and policy violations. They are not captured by groundedness/relevance/coherence/fluency alone.
Trap: Adding groundedness as the safety gate. High groundedness does not prevent harmful content if the source documents themselves contain harmful information.
Optimize Generative AI Systems and Model Performance
Must-Know Facts
- RAG vs fine-tuning: RAG adds KNOWLEDGE at query time without changing the model. Fine-tuning changes MODEL BEHAVIOR (style, format, task specialization) permanently by updating weights.
- If you need the model to know new facts from private or frequently-changing data — use RAG. If you need the model to respond differently (tone, format, persona) — use fine-tuning.
- Chunk size trade-off: smaller chunks (256-512 tokens) = more precise retrieval. Larger chunks (1024+ tokens) = more context but diluted relevance. The optimal size depends on query patterns.
- Chunk OVERLAP (e.g., 10-20% overlap between consecutive chunks) prevents information loss at chunk boundaries.
- Hybrid search = semantic search (vector/embedding-based) PLUS keyword search (BM25-based). Almost always outperforms either approach alone.
- Semantic search handles conceptual similarity. Keyword search handles exact term matching (proper nouns, codes, IDs). Hybrid captures both.
- Similarity threshold controls how many documents are retrieved — lower threshold = more documents (higher recall, lower precision), higher threshold = fewer documents (lower recall, higher precision).
- Synthetic data generation: when real labeled data is scarce, use an LLM to generate diverse training examples from a small seed set for fine-tuning.
- A/B testing for RAG: hold the LLM constant, vary ONE retrieval parameter at a time to isolate its impact.
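The chunk size and overlap facts above can be made concrete with a short sliding-window chunker. This is a minimal sketch that works on a pre-tokenized list; the `chunk_tokens` helper and the fake tokens are hypothetical, and a real pipeline would use the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Stand-in for tokenizer output: 1,000 fake tokens.
tokens = [f"t{i}" for i in range(1000)]

# 256-token chunks with a 32-token overlap (12.5%, inside the 10-20% range).
chunks = chunk_tokens(tokens, chunk_size=256, overlap=32)
shared = chunks[0][-32:] == chunks[1][:32]  # the boundary tokens appear in both chunks
```

Because the last 32 tokens of each chunk are repeated at the start of the next, a sentence that straddles a boundary is still fully contained in at least one chunk when it is shorter than the overlap.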
Scenario Tips
Scenario: A medical knowledge base RAG system misses exact ICD codes and drug names even though conceptual queries work well.
Correct: Add keyword (BM25) search alongside semantic search, that is, implement HYBRID SEARCH. Keyword search catches exact medical codes and proper nouns that embeddings generalize over.
Trap: Increasing chunk size (adds context but doesn't fix exact-match retrieval) or using a larger embedding model (still may not fix precise term matching).
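One way to combine a keyword ranking with a semantic ranking is Reciprocal Rank Fusion (RRF). The sketch below is a generic illustration of that technique, with made-up document IDs and pre-computed rankings; `k=60` is the constant conventionally used with RRF, and none of this is a specific search service's API.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for a query that mentions an exact ICD code.
semantic_ranking = ["doc_diabetes_overview", "doc_icd_e11_9", "doc_insulin_guide"]
keyword_ranking = ["doc_icd_e11_9", "doc_billing_codes", "doc_diabetes_overview"]

fused = rrf_fuse([semantic_ranking, keyword_ranking])
top_result = fused[0]
```

The document containing the exact code ranks first after fusion because both retrievers return it near the top, while documents found by only one retriever still appear lower in the list.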
Scenario: A foundation model produces correct content but in the wrong format: responses are always in bullet points but the use case requires paragraphs.
Correct: FINE-TUNING on examples in the desired format. RAG cannot change how the model structures its output. Fine-tuning modifies output behavior.
Trap: RAG (adds knowledge, not format behavior) or prompt engineering alone (may not be reliable enough for consistent format enforcement in production).
Scenario: A fine-tuning project has only 50 real customer support examples but needs at least 500 for effective training.
Correct: Use SYNTHETIC DATA GENERATION. Prompt an LLM with the 50 real examples to generate diverse synthetic training examples in the same style and domain.
Trap: Using RAG instead of fine-tuning (doesn't deliver the behavior change the fine-tuning is meant to achieve) or training with only the 50 examples (insufficient for reliable fine-tuning).
Scenario: A RAG system retrieves documents but the LLM includes facts not present in those documents.
Correct: This is a HALLUCINATION issue at the generation layer, not a retrieval issue. Fine-tune the LLM to improve groundedness, or add explicit prompt instructions to only use retrieved context. Groundedness evaluation will flag this.
Trap: Increasing chunk size or adding more documents (misidentifies the problem as a retrieval issue when it's a generation issue).