Microsoft AI-300 (5 domains)

AI-300 Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Microsoft Certified: Machine Learning Operations Engineer Associate exam.

General Exam Tips

1. Read ALL answer options before selecting. Many questions have two plausible answers, one of which is specifically better for the given constraint (cost, latency, compliance).
2. The exam is operations-first: always ask yourself 'which option makes this AI system most reliable, observable, and automated in production?' not 'which option trains the best model?'
3. Time management: budget about 2 minutes per question. Case studies may take 4-5 minutes each. Flag and return to uncertain questions.
4. There is no penalty for wrong answers, so answer every question even if you must guess.
5. When two answers both seem correct, the one that uses native Azure ML or Foundry features over custom code or external tools is almost always preferred.
6. Scenario questions often include a decisive constraint (e.g., 'must not manage GPU infrastructure') that eliminates most options. Identify that constraint first and set aside red-herring details.
7. Domain 2 (28%) and Domain 3 (24%) together make up more than half the exam. Do not under-prepare these.
8. This exam is in beta, so questions may feel less polished. If a question seems ambiguous, go with the most 'Azure-native MLOps' answer.

Domain 1: 18% of exam

Design and Implement an MLOps Infrastructure

Must-Know Facts

Datastores define the CONNECTION to Azure storage services (Blob, ADLS Gen2, SQL). Data assets are versioned REFERENCES to specific data within a datastore. They are separate objects.
Compute targets: Compute Instances = interactive development/notebooks. Compute Clusters = parallel/distributed training jobs. Serverless Compute = on-demand, no cluster management. Inference Compute = endpoint deployments.
Azure ML Registries share models, environments, components, and data assets ACROSS multiple workspaces. A workspace-level model registry is LOCAL to one workspace only.
Environments encapsulate Python packages and Docker images for reproducible runs — they are VERSIONED assets, not config files.
Components are REUSABLE, versioned pipeline steps with defined inputs, outputs, code, and an environment. They are the building block of ML pipelines.
Bicep templates are DECLARATIVE IaC (define desired state). GitHub Actions is CI/CD AUTOMATION that executes those templates and other operations.
Network security options for ML workspaces: private endpoints, VNet integration, managed network isolation. Know when each is appropriate.
Managed identities eliminate credential management for workspace-to-storage access — preferred over service principal secrets.

Common Traps

Trap: Treating datastores and data assets as the same thing

Reality: A datastore is a connection definition (credentials, endpoint URL). A data asset is a versioned reference to actual data (a folder, file, or table) within a datastore. You create both separately.

Trap: Using Compute Instances for training jobs to save time

Reality: Compute Instances are for interactive development (notebooks, IDEs). Training jobs run on Compute Clusters or Serverless Compute. Using an instance for training is an anti-pattern because it does not scale.

Trap: Thinking Azure ML Registries and the workspace model registry are the same

Reality: The workspace model registry stores models LOCAL to a single workspace. Azure ML Registries are separate top-level resources that share assets ACROSS workspaces and regions.

Trap: Thinking Bicep and GitHub Actions serve the same purpose

Reality: Bicep defines WHAT infrastructure exists (declarative). GitHub Actions defines WHEN and HOW to deploy and run operations (automation). They complement each other: Actions workflows typically call Bicep deployments.

Trap: Assuming service principal credentials are the preferred authentication method

Reality: Managed identities are preferred because they require no credential storage or rotation. Service principals with secrets are a fallback when managed identities are not available.

Confusing Pairs

Compute Instance vs. Compute Cluster

Instance = single-node VM for interactive notebooks and development, always-on or idle. Cluster = multi-node autoscaling pool for batch training jobs, scales to zero when idle. Instance cannot be used for distributed training.

Azure ML Workspace Registry (local) vs. Azure ML Registry (cross-workspace)

Workspace model registry = local to ONE workspace, for managing lifecycle stages of models used in that workspace. Azure ML Registry = top-level resource for SHARING assets across multiple workspaces. Questions about multi-team or multi-region asset sharing always point to the cross-workspace Registry.

Datastore vs. Data Asset

Datastore = connection to storage account (authenticated, no versioning). Data Asset = versioned reference to specific data within a datastore (folder, file, table). You need both: datastore first, then data asset.
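
The datastore/data asset split can be pictured with two small record types. This is only a conceptual sketch in plain Python, not the Azure ML SDK: the class names, fields, example URL, and URI layout are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datastore:
    """Connection definition only: where the storage lives and how to authenticate."""
    name: str
    account_url: str
    auth: str  # e.g. "managed_identity"; hypothetical label, not an SDK enum


@dataclass(frozen=True)
class DataAsset:
    """Versioned pointer to specific data inside a datastore."""
    name: str
    version: int
    datastore: Datastore
    relative_path: str

    def uri(self) -> str:
        # Illustrative layout only; real Azure ML data URIs use a different scheme.
        return f"{self.datastore.account_url}/{self.relative_path}"


# One connection, two versions of the same logical dataset.
lake = Datastore("training_lake", "https://example.dfs.core.windows.net/raw", "managed_identity")
sales_v1 = DataAsset("sales", 1, lake, "sales/2024-01")
sales_v2 = DataAsset("sales", 2, lake, "sales/2024-02")
```

Both asset versions reuse the same connection object, which mirrors the "datastore first, then data asset" ordering described above.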

Bicep Templates vs. GitHub Actions Workflows

Bicep = declarative definition of Azure resources (the WHAT). GitHub Actions = automated pipeline that runs Bicep deployments and ML operations (the WHEN and HOW). Not interchangeable.

Scenario Tips

If the question asks about:

A team needs to share a trained model from Workspace A with teams using Workspace B in a different region

Answer:

Use Azure ML Registry — it is designed for cross-workspace, cross-region sharing of ML assets including models, environments, and components.

Distractor to avoid:

Azure Blob Storage (shares raw files but not versioned ML assets) or copying the workspace (not a real option). Both miss the point of managed ML asset sharing.

If the question asks about:

Consistent, repeatable deployment of ML workspaces across dev/staging/prod

Answer:

Bicep templates deployed via GitHub Actions workflows — Bicep ensures idempotent, repeatable deployments; GitHub Actions provides the CI/CD trigger.

Distractor to avoid:

Python SDK scripts or Azure Portal manual configuration — neither provides repeatable IaC deployments at scale.

If the question asks about:

A training script needs to access Azure Data Lake Storage without storing credentials

Answer:

Configure a Managed Identity on the workspace and grant it RBAC access to the storage account. Create a datastore using the managed identity.

Distractor to avoid:

Storing a storage account key or SAS token in the training script or environment variable — this is a security anti-pattern the exam will never accept.

Last-Minute Facts

1. Compute Instance = 1 node, interactive, always-on or idle. Compute Cluster = N nodes, autoscaling, scales to 0 when idle.

2. Datastores supported: Azure Blob, ADLS Gen2, Azure Files, Azure SQL, Azure Data Explorer.

3. Azure ML Registry is a separate top-level Azure resource, NOT a feature inside a workspace.

4. Network isolation modes: Disabled, Allow Internet Outbound, Allow Only Approved Outbound (strictest).

5. Managed identity types: system-assigned (tied to resource lifecycle) vs. user-assigned (standalone, shareable).

Domain 2: 28% of exam

Implement Machine Learning Model Lifecycle and Operations

Must-Know Facts

MLflow is natively integrated into Azure ML — every workspace exposes an MLflow tracking URI. You log experiments with mlflow.log_metric() and register models with mlflow.register_model().
Progressive rollout on managed online endpoints works by TRAFFIC SPLITTING between deployments — you route a percentage (e.g., 10%) of requests to the new version, not by deploying to a subset of instances.
Managed online endpoints support BLUE-GREEN deployment: both old and new deployments exist simultaneously with configurable traffic weights.
Archiving a model in the registry does NOT delete it. It marks it deprecated while preserving it for compliance and audit trails.
Data drift monitors INPUT distribution changes vs. training data. Prediction drift monitors OUTPUT distribution changes. Both are monitoring signals, but they detect different problems.
Data collection for monitoring is AUTOMATIC with managed online endpoints. For batch endpoints, it requires MANUAL configuration.
AutoML automates algorithm selection and hyperparameter tuning, NOT data preparation. Data quality still matters.
Hyperparameter tuning uses sweep jobs with: search spaces (discrete or continuous), sampling methods (random, grid, Bayesian), and early termination policies (Bandit, Median, Truncation).
Retraining can be triggered by monitoring alerts via Azure Event Hubs, Azure Functions, or Azure Logic Apps — NOT directly from Azure Monitor alerts.
Feature retrieval specifications can be packaged with the MLflow model artifact to describe how to fetch features at inference time.

Common Traps

Trap: Thinking progressive rollout means deploying to a subset of compute nodes

Reality: Progressive rollout uses TRAFFIC SPLITTING. You keep both the old and new deployment alive simultaneously and adjust what percentage of incoming requests goes to each. Traffic weights are configurable from 0-100%.
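
Traffic splitting can be simulated in a few lines. The sketch below is a toy router in plain Python, not the managed online endpoint implementation; the deployment names `blue` and `green` and both function names are hypothetical, and the sum-to-100 rule follows these notes.

```python
import random


def validate_weights(weights: dict[str, int]) -> None:
    """Reject weights outside 0-100 or that do not sum to 100."""
    if any(not 0 <= w <= 100 for w in weights.values()):
        raise ValueError("each weight must be between 0 and 100")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")


def route(weights: dict[str, int], rng: random.Random) -> str:
    """Pick the deployment that serves one request, proportionally to its weight."""
    validate_weights(weights)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)  # seeded so the simulation is repeatable
weights = {"blue": 90, "green": 10}
hits = {"blue": 0, "green": 0}
for _ in range(10_000):
    hits[route(weights, rng)] += 1
```

Every request is routed individually, so both deployments stay fully provisioned and only the share of requests changes.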

Trap: Assuming you need to convert MLflow models to a different format before deploying to managed endpoints

Reality: MLflow models registered in Azure ML can be deployed DIRECTLY to managed online endpoints without format conversion. Azure ML handles the MLflow model flavor natively.

Trap: Conflating data drift (inputs changed) with prediction drift (outputs changed) or data quality (schema/null violations)

Reality: Four distinct monitoring signals: (1) data drift = input distribution vs training, (2) prediction drift = output distribution over time, (3) data quality = schema, nulls, types, (4) feature attribution drift = feature importance changes. Each has its own threshold.

Trap: Expecting Azure Monitor alerts to directly trigger retraining pipelines

Reality: Azure Monitor handles infrastructure alerts. ML monitoring alerts use Event Hubs, Azure Functions, or Logic Apps as event handlers to trigger ML pipeline reruns.

Trap: Assuming archiving removes the model from Azure ML

Reality: Archiving sets the lifecycle stage to 'archived', and the model artifact is preserved. This is important for compliance. True deletion requires an explicit delete operation.

Confusing Pairs

Managed Online Endpoints vs. Batch Endpoints

Online = real-time REST API, always-on, auto-scales, supports traffic splitting for rollout. Batch = offline processing of large datasets, no always-on cost, no traffic splitting. Online for interactive requests; Batch for bulk scoring.

Data Drift vs. Prediction Drift

Data Drift = distribution of INPUT FEATURES changed compared to training baseline. Prediction Drift = distribution of MODEL OUTPUTS changed over time. You can have prediction drift without data drift if concept drift occurs (world changed, inputs look the same but mean something different).

Data Drift vs. Data Quality

Data Drift = statistical distribution of values changed (a valid column with different distribution). Data Quality = structural problems like nulls, type mismatches, schema violations. Drift is about 'values moved.' Quality is about 'data is broken.'
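
To make the "values moved" versus "data is broken" distinction concrete, here is a minimal sketch in plain Python. It uses the Population Stability Index (PSI) as the drift statistic, which is an assumption made for illustration; these notes do not say which statistic a monitor applies. The function names and sample data are hypothetical.

```python
import math


def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def shares(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def null_rate(values: list) -> float:
    """Data-quality style check: share of missing values."""
    return sum(v is None for v in values) / len(values)


baseline = [i / 100 for i in range(1000)]      # spread evenly over [0, 10)
same = [i / 100 for i in range(1000)]          # identical distribution
shifted = [5 + i / 200 for i in range(1000)]   # concentrated in [5, 10)
```

Here `psi(baseline, shifted)` is large even though every value is valid, while `null_rate` would flag a column with missing entries regardless of its distribution.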

Random Sampling vs. Bayesian Sampling

Random = explores search space without using prior results (fast, good baseline). Bayesian = uses prior experiment results to pick next hyperparameters (slower per trial, fewer trials needed). Grid = exhaustive (only for small discrete spaces).
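
The cost difference between grid and random sampling is easy to see in a toy search space. This is a plain-Python sketch rather than an Azure ML sweep job definition; the search space values and function names are hypothetical.

```python
import itertools
import random

space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64, 128]}


def grid_trials(space: dict[str, list]) -> list[dict]:
    """Exhaustive: every combination, so cost grows multiplicatively."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*(space[k] for k in keys))]


def random_trials(space: dict[str, list], n: int, seed: int = 0) -> list[dict]:
    """Random: n independent draws that ignore the results of earlier trials."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


grid = grid_trials(space)          # 3 * 4 = 12 trials
sample = random_trials(space, n=5)  # budget fixed at 5 trials
```

Bayesian sampling is not shown because it needs a surrogate model that learns from completed trials, which is exactly what distinguishes it from the two methods above.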

MLflow Model Registry vs. Azure ML Model Registry

MLflow registry = open-source API-compatible, portable, uses MLflow lifecycle stages. Azure ML registry = Azure-native with Responsible AI dashboard integration, RBAC, and direct endpoint deployment. Both exist in the same workspace and registering via MLflow API also appears in Azure ML registry.

Scenario Tips

If the question asks about:

A new model version must be tested in production with minimal risk — rollback must be instant if error rates spike

Answer:

Deploy the new version to a second deployment slot on the managed online endpoint. Set traffic weight to 10% (new) / 90% (old). Monitor error rates. If issues arise, set traffic back to 100% old instantly.

Distractor to avoid:

Blue-green with a new endpoint (loses rollback speed) or batch endpoint canary (batch doesn't support traffic splitting).
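
The rollout-with-rollback decision in this scenario can be written as a small policy function. This is a hypothetical sketch: the 2% error threshold, the 20-point step, and the deployment labels `old` and `new` are assumptions, not values from Azure ML.

```python
def next_weights(current_new: int, error_rate: float,
                 max_error_rate: float = 0.02, step: int = 20) -> dict[str, int]:
    """Return the next traffic split for a two-deployment rollout.

    Hypothetical policy: roll back to 0% new on an error spike, otherwise
    shift another `step` percent toward the new deployment.
    """
    if error_rate > max_error_rate:
        new = 0  # instant rollback: all traffic returns to the old deployment
    else:
        new = min(100, current_new + step)
    return {"old": 100 - new, "new": new}
```

Because the old deployment is never removed during the rollout, rollback is only a weight change and needs no redeployment.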

If the question asks about:

Production model accuracy has dropped. Azure ML monitoring shows input feature distributions have diverged significantly from training data

Answer:

This is DATA DRIFT. Configure an alert trigger connected to Azure Event Hubs or Logic Apps to automatically initiate a retraining pipeline with fresh data.

Distractor to avoid:

Prediction drift (that monitors OUTPUT, not input) or data quality (structural issue, not distribution shift).

If the question asks about:

A training job needs to try 100+ hyperparameter combinations efficiently, stopping poor configurations early

Answer:

Configure a sweep job with Bayesian sampling (uses prior results) plus an early termination policy (Bandit or Median). Bayesian is most efficient for large search spaces.

Distractor to avoid:

Grid sampling (only for tiny discrete spaces) or random sampling without early termination (wastes compute on clearly bad configs).
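
The two early termination ideas named in the answer can be sketched as simple predicates. This is plain Python for a metric being maximized, not the Azure ML policy classes; the function names are hypothetical, and evaluation intervals and delays are left out.

```python
import statistics


def bandit_should_stop(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """Bandit-style check for a metric that is being maximized.

    A run is stopped when it falls below best / (1 + slack_factor).
    """
    return run_metric < best_metric / (1 + slack_factor)


def median_should_stop(run_avg: float, other_run_avgs: list[float]) -> bool:
    """Median-style check: stop when the run's average trails the median of its peers."""
    return run_avg < statistics.median(other_run_avgs)
```

With a best accuracy of 0.80 and a slack factor of 0.1, the bandit cutoff is roughly 0.727, so a run at 0.70 is stopped while a run at 0.75 continues.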

If the question asks about:

When to use batch endpoints instead of online endpoints

Answer:

Batch endpoints when: processing large static datasets, no latency requirements, want to avoid always-on compute costs, need parallel inference at scale.

Distractor to avoid:

Do not choose batch if the scenario mentions real-time, interactive, low-latency, or chatbot requirements — those always need online endpoints.

Last-Minute Facts

1. Each Azure ML workspace exposes a unique MLflow tracking URI. Retrieve it with ws.get_mlflow_tracking_uri() and pass it to mlflow.set_tracking_uri() before logging any runs.

2. 4 monitoring signal types: data drift, prediction drift, data quality, feature attribution drift

3. 3 sweep sampling methods: random, grid, Bayesian (also SobolQMC for quasi-random)

4. Early termination policies: Bandit (relative slack), Median (compared to median), TruncationSelection (bottom X% cut)

5. Traffic split: online endpoint deployments can have weights 0-100, must sum to 100

6. AutoML supports: classification, regression, time-series forecasting (NOT clustering, which is unsupervised)

7. Model lifecycle stages in MLflow: None, Staging, Production, Archived

Domain 3: 24% of exam

Design and Implement a GenAIOps Infrastructure

Must-Know Facts

Microsoft Foundry is the current platform name — it replaced Azure AI Studio and Azure AI Foundry (classic). The exam uses 'Microsoft Foundry' as the canonical name.
Foundry uses a hub-and-project architecture: Hubs are shared infrastructure (compute, networking, connections). Projects are team workspaces within a hub.
Two deployment options for foundation models: Serverless API (MaaS — no GPU management, pay-as-you-go) vs Managed Compute (dedicated VMs, more control, higher commitment).
Serverless API deployment scopes: Global Standard (worldwide routing), Data Zone (geographic boundary), Regional (specific Azure region for compliance).
Provisioned Throughput Units (PTUs) are pre-purchased capacity on serverless endpoints — they guarantee consistent performance for high-volume production workloads.
Prompt versioning uses Git repositories — track prompt changes with commits, create variants as branches, compare performance across versions.
Managed identities + RBAC is the recommended security pattern for Foundry resources — eliminates credential management.
Bicep templates can deploy Foundry resources (hubs, projects, model deployments) just as they deploy Azure ML resources.

Common Traps

Trap: Thinking pay-as-you-go serverless guarantees performance for high-volume production

Reality: Pay-as-you-go serverless is flexible but does NOT guarantee capacity during demand spikes. PTUs (Provisioned Throughput Units) must be explicitly purchased to guarantee performance at scale.

Trap: Treating Global Standard, Data Zone, and Regional deployment scopes as synonyms

Reality: Global Standard = routes across worldwide Azure infrastructure for best availability. Data Zone = keeps traffic within a geographic region (e.g., EU) for data residency. Regional = pins to a single Azure region for strictest compliance. Each serves a different compliance requirement.

Trap: Confusing prompt versioning with model versioning

Reality: Prompt versioning tracks changes to PROMPT TEXT in Git. Model versioning tracks different foundation model versions in the model catalog/registry. These are separate concerns with separate workflows.

Trap: Thinking managed compute and serverless endpoints are always interchangeable

Reality: Managed compute integrates deeply with the MLOps lifecycle (custom containers, pipeline triggers). Serverless is simpler but less configurable. The exam will give you a constraint (e.g., 'no GPU management' = serverless, or 'custom container needed' = managed compute) that eliminates one option.

Trap: Treating Azure AI Studio as a current tool for exam questions

Reality: Azure AI Studio was rebranded. The exam uses 'Microsoft Foundry' as the current platform. If a question mentions Azure AI Foundry as a legacy portal, the intended answer likely references Microsoft Foundry capabilities.

Confusing Pairs

Serverless API Endpoint (MaaS) vs. Managed Compute Deployment

Serverless MaaS = no GPU infrastructure to manage, pay per token, scale handled by Microsoft, supports PTUs for guaranteed throughput. Managed Compute = you control the VM size and count, supports custom containers, deeper MLOps integration but requires infrastructure planning. Use serverless unless you need custom containers or specific GPU configurations.

PTUs (Provisioned Throughput Units) vs. Pay-as-you-go serverless

Pay-as-you-go = flexible, no commitment, cost scales with usage, no throughput guarantee. PTUs = pre-purchased capacity, guaranteed throughput, predictable cost, required for SLA-sensitive production workloads at high volume.

Global Standard scope vs. Data Zone scope vs. Regional scope

Global Standard = best availability, no data residency control. Data Zone = data stays within geographic boundary (e.g., EU) for GDPR-like requirements. Regional = data pinned to a single Azure region for strictest sovereignty needs. Tighter = less availability, stricter compliance.

Foundry Hub vs. Foundry Project

Hub = shared infrastructure layer (compute, networking, connections, governance). Project = team workspace within a hub for a specific AI application. Multiple projects share one hub's infrastructure. Hubs are created by platform teams; projects are created by AI teams.

Scenario Tips

If the question asks about:

A chatbot needs to handle 50,000 requests/hour with guaranteed sub-2-second response times

Answer:

Deploy the foundation model on a Serverless API endpoint with Provisioned Throughput Units (PTUs). PTUs guarantee consistent throughput; pay-as-you-go serverless does not.

Distractor to avoid:

Plain pay-as-you-go serverless: this seems simpler but cannot guarantee performance at high volume. Managed compute can also meet the requirement, but it is ruled out whenever the scenario adds a 'no GPU management' constraint.

If the question asks about:

A healthcare company must ensure all AI inference data stays within EU boundaries

Answer:

Deploy on a Serverless API endpoint with Data Zone scope. This restricts all routing to the EU geographic boundary, meeting data residency requirements without full regional pinning.

Distractor to avoid:

Global Standard (no data residency control) or Regional (would work but unnecessarily restricts to one region, reducing availability).
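
The scope choice reduces to picking the loosest option that still meets the residency constraint. The helper below is a hypothetical decision sketch in plain Python; the enum, the function name, and its parameters are illustrative and are not part of any Foundry API.

```python
from enum import Enum
from typing import Optional


class Scope(Enum):
    GLOBAL_STANDARD = "Global Standard"
    DATA_ZONE = "Data Zone"
    REGIONAL = "Regional"


def pick_scope(geo_boundary: Optional[str] = None,
               single_region: Optional[str] = None) -> Scope:
    """Choose the loosest scope that still satisfies the residency constraint."""
    if single_region:   # data must stay in one named Azure region
        return Scope.REGIONAL
    if geo_boundary:    # data must stay inside a geography such as the EU
        return Scope.DATA_ZONE
    return Scope.GLOBAL_STANDARD  # no residency constraint: favor availability
```

For the healthcare scenario, `pick_scope(geo_boundary="EU")` returns `Scope.DATA_ZONE`, matching the answer above.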

If the question asks about:

A team wants to track system prompt changes, compare response quality across different prompt versions, and collaborate on prompt development

Answer:

Implement prompt versioning with a Git repository in Microsoft Foundry. Git tracks changes, branches enable variants, and comparison tools evaluate performance across versions.

Distractor to avoid:

Storing prompts in Azure Blob Storage (no version comparison) or using the model registry (models, not prompts).

If the question asks about:

An organization's security policy prohibits storing credentials in code or config files for Foundry access

Answer:

Configure Managed Identities on the Foundry resources and grant RBAC permissions. Managed identities authenticate without any stored credentials.

Distractor to avoid:

Service principal with secret in Key Vault — still involves credential management. Managed identities eliminate this entirely.

Last-Minute Facts

1. Foundry architecture: Hub = shared infra layer. Project = team workspace within a hub.

2. 3 serverless deployment scopes: Global Standard, Data Zone, Regional (strictest compliance = Regional).

3. PTUs = guaranteed throughput. Pay-as-you-go = flexible but no throughput guarantee.

4. Prompt versioning tool: Git repositories (not the model registry, not Blob Storage).

5. Platform name: Microsoft Foundry (replaces Azure AI Studio / Azure AI Foundry classic).

Domain 4: 15% of exam

Implement Generative AI Quality Assurance and Observability

Must-Know Facts

Four AI quality metrics: Groundedness (factually supported by SOURCE DATA), Relevance (response addresses the query), Coherence (logical flow and consistency), Fluency (natural language quality).
Groundedness is about the SOURCE DOCUMENTS — a fluent, relevant, coherent response can still be UNGROUNDED if it includes facts not present in the retrieved context.
Safety evaluations are SEPARATE from quality evaluations. A response can be high-quality and unsafe, or low-quality and safe.
Automated evaluation workflows can run on both TEST DATASETS (pre-deployment gate) and PRODUCTION TRAFFIC (continuous monitoring).
Distributed tracing is the correct tool for identifying which step in a multi-step GenAI pipeline (embedding, retrieval, LLM inference) is the latency bottleneck.
Token consumption has TWO components: input tokens (prompt + context) and output tokens (generated response). Cost optimization may target either.
Continuous monitoring in Foundry: dashboards track groundedness trends, latency, token usage, and safety violations over time.
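
The two token components noted above can be turned into a simple cost calculation. The prices in this sketch are hypothetical numbers picked to keep the arithmetic easy; they are not real Azure rates, and the function name is illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request, billed separately for input and output tokens."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
PRICE_IN, PRICE_OUT = 0.50, 1.50

baseline = request_cost(4000, 500, PRICE_IN, PRICE_OUT)         # long retrieved context
trimmed_context = request_cost(2000, 500, PRICE_IN, PRICE_OUT)  # fewer input tokens
shorter_answer = request_cost(4000, 250, PRICE_IN, PRICE_OUT)   # fewer output tokens
```

In this made-up case trimming the context saves more than shortening the answer, because input tokens dominate the request. With a different ratio of prompt to response the opposite can hold, which is why both components need to be monitored.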

Common Traps

Trap: Thinking groundedness measures general factual accuracy

Reality: Groundedness specifically measures whether the response is supported by the PROVIDED SOURCE DATA (the retrieved documents in a RAG system). A response can be globally accurate but ungrounded if the supporting document was not retrieved. The metric is relative to the sources, not general knowledge.

Trap: Using overall endpoint latency monitoring to debug multi-step pipeline slowness

Reality: Endpoint latency gives you the TOTAL time only. To identify which step (embedding generation, vector search, LLM call) is the bottleneck, you need DISTRIBUTED TRACING which captures timing per step.

Trap: Treating safety evaluations and quality evaluations as the same check

Reality: Quality (groundedness, relevance, coherence, fluency) measures response usefulness. Safety (harmful content, bias, policy violations) measures response appropriateness. Configure both as separate evaluation passes.

Trap: Only evaluating GenAI quality on test datasets before deployment

Reality: Production traffic behaves differently from test datasets. Continuous monitoring on production traffic is essential to catch quality degradation that did not appear in pre-deployment tests.

Confusing Pairs

Groundedness vs. Relevance

Groundedness = Is the response FACTUALLY SUPPORTED by the retrieved source documents? Relevance = Does the response ANSWER the user's question? A response can be relevant but hallucinate details not in sources (high relevance, low groundedness).

Coherence vs. Fluency

Coherence = Does the response have logical flow and internal consistency? (Structure, reasoning) Fluency = Is the language natural and grammatically correct? (Language quality) A response can be logically coherent but awkwardly phrased, or fluently written but logically contradictory.

Quality Evaluation vs. Safety Evaluation

Quality = How useful is the response? (Groundedness, relevance, coherence, fluency — output quality metrics). Safety = How appropriate is the response? (Harmful content, bias, jailbreak detection — risk metrics). Both are necessary; neither replaces the other.

Latency Monitoring vs. Distributed Tracing

Latency monitoring = total request-to-response time (useful for SLA alerting). Distributed tracing = per-step timing and execution details across all pipeline components (useful for debugging bottlenecks). Use tracing when you need to know WHERE the slowness is.

Scenario Tips

If the question asks about:

A GenAI response is grammatically correct, well-structured, directly addresses the question, but cites facts not found in the retrieved documents

Answer:

Low GROUNDEDNESS — the response is fluent, coherent, and relevant, but contains unsupported content relative to the source data. This is the classic hallucination scenario.

Distractor to avoid:

Fluency, Coherence, or Relevance — these are all high in this scenario. The specific problem is the gap between the response and the source documents.

If the question asks about:

An agent pipeline with 5 steps (intent detection, embedding, vector search, prompt assembly, LLM call) shows high latency but you cannot identify which step is slow

Answer:

Configure DISTRIBUTED TRACING — it captures timing and status for each step individually, allowing you to pinpoint the bottleneck.

Distractor to avoid:

Token consumption monitoring (cost, not latency) or endpoint latency monitoring (total only, not per-step).
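
Per-step timing is the core idea behind tracing, and it can be imitated with a small context manager. This is a toy in-process timer in plain Python, not a tracing SDK; the `Tracer` class is hypothetical, and the sleep calls stand in for real pipeline work.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Minimal per-step timer; real tracing would export spans to a backend."""

    def __init__(self) -> None:
        self.spans: dict[str, float] = {}

    @contextmanager
    def span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[name] = time.perf_counter() - start

    def bottleneck(self) -> str:
        """Name of the step with the longest recorded duration."""
        return max(self.spans, key=self.spans.get)


tracer = Tracer()
# Sleep durations are stand-ins for real work; step names mirror the scenario.
for step, seconds in [("intent_detection", 0.001), ("embedding", 0.002),
                      ("vector_search", 0.002), ("prompt_assembly", 0.001),
                      ("llm_call", 0.05)]:
    with tracer.span(step):
        time.sleep(seconds)

total = sum(tracer.spans.values())  # what endpoint latency alone would report
```

`total` alone says the pipeline is slow, while `tracer.bottleneck()` identifies which step is responsible.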

If the question asks about:

A team wants to prevent unsafe or harmful responses from reaching production, in addition to quality checks

Answer:

Configure SEPARATE safety evaluations and quality evaluations. Safety evaluations check for harmful content, bias, and policy violations. They are not captured by groundedness/relevance/coherence/fluency alone.

Distractor to avoid:

Adding groundedness as the safety gate — high groundedness does not prevent harmful content if the source documents themselves contain harmful information.

Last-Minute Facts

1. 4 quality metrics: Groundedness, Relevance, Coherence, Fluency. Know all four and their distinctions.

2. Groundedness = supported by SOURCE DATA specifically (not general world knowledge).

3. 2 token types to monitor: input tokens (prompt) and output tokens (response).

4. Distributed tracing = per-step timing. Latency monitoring = total time only.

5. Evaluation can run on: test datasets (pre-deployment) AND production traffic (post-deployment).

Domain 5: 15% of exam

Optimize Generative AI Systems and Model Performance

Must-Know Facts

RAG vs fine-tuning: RAG adds KNOWLEDGE at query time without changing the model. Fine-tuning changes MODEL BEHAVIOR (style, format, task specialization) permanently by updating weights.
If you need the model to know new facts from private or frequently-changing data — use RAG. If you need the model to respond differently (tone, format, persona) — use fine-tuning.
Chunk size trade-off: smaller chunks (256-512 tokens) = more precise retrieval. Larger chunks (1024+ tokens) = more context but diluted relevance. The optimal size depends on query patterns.
Chunk OVERLAP (e.g., 10-20% overlap between consecutive chunks) prevents information loss at chunk boundaries.
Hybrid search = semantic search (vector/embedding-based) PLUS keyword search (BM25-based). Almost always outperforms either approach alone.
Semantic search handles conceptual similarity. Keyword search handles exact term matching (proper nouns, codes, IDs). Hybrid captures both.
Similarity threshold controls how many documents are retrieved — lower threshold = more documents (higher recall, lower precision), higher threshold = fewer documents (lower recall, higher precision).
Synthetic data generation: when real labeled data is scarce, use an LLM to generate diverse training examples from a small seed set for fine-tuning.
A/B testing for RAG: hold the LLM constant, vary ONE retrieval parameter at a time to isolate its impact.

Common Traps

Trap: Assuming larger chunk sizes always improve RAG quality

Reality: Larger chunks include more surrounding text, which provides more context but also introduces irrelevant content that dilutes the signal. Retrieval precision typically drops. The optimal size is query-dependent, not universally 'bigger is better.'

Trap: Thinking RAG and fine-tuning are alternatives that solve the same problem

Reality: RAG and fine-tuning solve DIFFERENT problems. RAG = what the model KNOWS (injecting knowledge at runtime). Fine-tuning = how the model BEHAVES (changing style, format, task performance). They can and often should be combined.

Trap: Using pure vector search for all retrieval scenarios

Reality: Pure vector/semantic search misses exact term matches such as proper nouns, product codes, model numbers, and ICD codes. Hybrid search adds keyword (BM25) retrieval to catch exact matches that embeddings generalize over.

Trap: Assuming 50 real examples are sufficient for fine-tuning without augmentation

Reality: 50 examples is typically insufficient for effective fine-tuning. Synthetic data generation using an LLM expands the seed set into hundreds or thousands of diverse training examples.

Trap: Lowering the similarity threshold to fix 'not enough documents retrieved' without understanding the precision trade-off

Reality: Lowering the threshold retrieves more documents (higher recall) but includes less relevant ones, which can dilute the context and confuse the LLM. Balance precision and recall; don't just lower the threshold blindly.

Confusing Pairs

RAG (Retrieval-Augmented Generation) vs. Fine-Tuning

RAG = inject external knowledge at query time via retrieval. Model weights unchanged. Best for private, large, or frequently-updated knowledge. Fine-tuning = update model weights using training data. Knowledge baked into weights. Best for style/format changes, specialized task performance. They target different problems.

Semantic Search (Vector) vs. Keyword Search (BM25)

Semantic = understands conceptual meaning, finds related content even with different words. Misses exact unique terms. Keyword = exact term matching with relevance ranking, catches proper nouns and codes, misses synonyms. Hybrid combines both.

Small Chunk Size vs. Large Chunk Size

Small chunks (256-512 tokens): high retrieval precision, each chunk is tightly focused. Risk: cuts off important surrounding context. Large chunks (1024+ tokens): preserves context, but retrieval precision drops. Risk: dilutes the relevant content with surrounding noise. Overlap mitigates boundary losses.
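
Chunking with overlap is a sliding window over the token sequence. The sketch below assumes the text has already been tokenized into a list; a real pipeline would use the embedding model's tokenizer, and the function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

With these toy values, token `t3` ends the first chunk and starts the second, so a sentence that straddles the boundary is still retrievable from at least one chunk.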

Similarity Threshold (high) vs. Similarity Threshold (low)

High threshold: fewer documents retrieved, high precision, potentially missing relevant docs (low recall). Low threshold: more documents retrieved, high recall, potentially many irrelevant docs (low precision). The right balance depends on the use case's tolerance for noise vs. completeness.
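
The precision/recall trade-off can be checked on a handful of made-up scores. Everything in this sketch is hypothetical: the similarity scores, the relevance labels, and the two thresholds exist only to show the direction of the effect.

```python
def retrieve(scores: dict[str, float], threshold: float) -> set[str]:
    """Keep documents whose similarity score meets the threshold."""
    return {doc for doc, s in scores.items() if s >= threshold}


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision = relevant share of what was retrieved; recall = retrieved share of what is relevant."""
    hit = len(retrieved & relevant)
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical similarity scores and relevance labels for one query.
scores = {"a": 0.92, "b": 0.85, "c": 0.70, "d": 0.55, "e": 0.40}
relevant = {"a", "b", "d"}

high = precision_recall(retrieve(scores, 0.80), relevant)  # (1.0, 2/3)
low = precision_recall(retrieve(scores, 0.50), relevant)   # (0.75, 1.0)
```

Raising the threshold drops the irrelevant document `c` but also loses the relevant document `d`, which is the trade-off described above.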

Scenario Tips

If the question asks about:

A medical knowledge base RAG system misses exact ICD codes and drug names even though conceptual queries work well

Answer:

Add keyword (BM25) search alongside semantic search — implement HYBRID SEARCH. Keyword search catches exact medical codes and proper nouns that embeddings generalize over.

Distractor to avoid:

Increase chunk size (adds context but doesn't fix exact-match retrieval) or use a larger embedding model (still may not fix precise term matching).
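
One way to combine the semantic and keyword result lists is Reciprocal Rank Fusion (RRF). RRF is an assumption made for this sketch, since these notes do not name a fusion method. The document ids, the query, and the constant `k=60` are illustrative.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for a query that contains an exact ICD code.
semantic = ["diabetes_overview", "insulin_guide", "icd_e11_9"]
keyword = ["icd_e11_9", "billing_codes"]

fused = rrf_fuse([semantic, keyword])
```

The exact-code document ranks only third semantically, but because the keyword list also returns it, it rises to the top of the fused ranking.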

If the question asks about:

A foundation model produces correct content but in the wrong format — responses are always in bullet points but the use case requires paragraphs

Answer:

FINE-TUNING on examples in the desired format. RAG cannot change how the model structures its output. Fine-tuning modifies output behavior.

Distractor to avoid:

RAG (adds knowledge, not format behavior) or prompt engineering alone (may not be reliable enough for consistent format enforcement in production).

If the question asks about:

A fine-tuning project has only 50 real customer support examples but needs at least 500 for effective training

Answer:

Use SYNTHETIC DATA GENERATION — prompt an LLM with the 50 real examples to generate diverse synthetic training examples in the same style and domain.

Distractor to avoid:

Use RAG instead of fine-tuning (doesn't deliver the behavior change the fine-tuning is meant to achieve) or deploy with 50 examples (insufficient for reliable fine-tuning).

If the question asks about:

A RAG system retrieves documents but the LLM includes facts not present in those documents

Answer:

This is a HALLUCINATION issue at the generation layer, not a retrieval issue. Fine-tune the LLM to improve groundedness, or add explicit prompt instructions to only use retrieved context. Groundedness evaluation will flag this.

Distractor to avoid:

Increasing chunk size or adding more documents (misidentifies the problem as a retrieval issue when it's a generation issue).

Last-Minute Facts

1. RAG = knowledge injection at query time, no weight change. Fine-tuning = weight update, behavior change.

2. Hybrid search = semantic (vector/embedding) + keyword (BM25). Almost always better than either alone.

3. Chunk overlap prevents information loss at chunk boundaries (typical overlap: 10-20% of chunk size).

4. Synthetic data: generate training examples with an LLM when real labeled data is scarce.

5. A/B test RAG changes: vary ONE parameter at a time, keep LLM constant to isolate retrieval impact.

6. Similarity threshold: high = fewer docs, high precision. Low = more docs, high recall.
