Microsoft · AI-300 · 75 concepts

AI-300 Cheat Sheet

Quick reference for the Microsoft Certified: Machine Learning Operations Engineer Associate exam.

Quick Navigation

Azure ML Workspace — Setup and CLI
Infrastructure as Code — Bicep and GitHub Actions
MLflow — Experiment Tracking and Model Registry
Model Training — AutoML, Pipelines, and Hyperparameter Tuning
Model Deployment — Online and Batch Endpoints
Production Monitoring — Drift Detection and Retraining Triggers
Microsoft Foundry — GenAIOps Infrastructure
GenAI Quality Metrics and Evaluation Workflows
GenAI Observability — Latency, Cost, and Tracing
RAG Optimization — Chunking, Search, and Retrieval Tuning
Fine-Tuning Foundation Models
Security and Identity — RBAC, Managed Identities, and Networking

Azure ML Workspace — Setup and CLI

az ml workspace create -n <workspace-name> -g <resource-group>: Create an Azure ML workspace using CLI v2 — the workspace is the top-level resource for all ML assets (compute, data, environments, models, endpoints).
    az extension add -n ml
    az configure --defaults group=<rg> workspace=<ws> location=<loc>
Install the Azure ML CLI v2 extension, then set defaults so you do not have to repeat --resource-group and --workspace-name on every command.
az ml workspace show -n <workspace-name> -g <resource-group>: Display workspace details including MLflow tracking URI, associated storage, key vault, and container registry.
Workspace-level model registry vs. Azure ML Registry: Workspace model registry stores models scoped to ONE workspace; Azure ML Registry shares models, environments, and components ACROSS multiple workspaces organization-wide.
Compute types: Compute Instance / Compute Cluster / Serverless Compute / Inference Compute: Compute Instance is for interactive dev (notebooks); Compute Cluster is for scalable training jobs; Serverless Compute auto-provisions for jobs; Inference Compute backs managed endpoints.
Datastore vs. Data Asset: A Datastore defines the CONNECTION to Azure storage (Blob, ADLS Gen2, Azure Files) and holds the connection details so that scripts never handle credentials directly; a Data Asset is a versioned REFERENCE to specific data, typically a path within a datastore.
az ml data create --name mydata --version 1 --type uri_folder --path azureml://datastores/<ds>/paths/<folder>: Register a versioned data asset pointing to a folder in a registered datastore — data assets are the recommended way to reference training and evaluation data.

Infrastructure as Code — Bicep and GitHub Actions

    resource workspace 'Microsoft.MachineLearningServices/workspaces@2024-04-01' = {
      name: workspaceName
      location: location
      identity: {
        type: 'SystemAssigned'
      }
      properties: {
        storageAccount: storageAccount.id
        keyVault: keyVault.id
        applicationInsights: appInsights.id
        containerRegistry: containerRegistry.id
      }
    }
Bicep resource definition for an Azure ML workspace with a system-assigned managed identity: declarative IaC for reproducible workspace deployments. The storageAccount, keyVault, appInsights, and containerRegistry symbols refer to resources declared elsewhere in the same template.
az deployment group create --resource-group <rg> --template-file main.bicep --parameters @params.json: Deploy a Bicep template to create Azure ML infrastructure — use parameter files to manage environment-specific (dev/staging/prod) configurations.
Bicep vs. GitHub Actions role distinction: Bicep defines WHAT Azure resources to deploy (declarative desired state); GitHub Actions defines WHEN and HOW to execute deployments (CI/CD orchestration) — they work together, not as alternatives.
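    # .github/workflows/deploy-workspace.yml
    on: workflow_dispatch
    permissions:
      id-token: write   # required so the job can request an OIDC token
      contents: read
    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: azure/login@v2
            with:
              client-id: ${{ secrets.AZURE_CLIENT_ID }}
              tenant-id: ${{ secrets.AZURE_TENANT_ID }}
              subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
          - run: az deployment group create --resource-group <rg> --template-file infra/main.bicep
GitHub Actions workflow snippet that authenticates to Azure with OIDC before deploying Bicep templates for ML infrastructure. No client secret is stored; the three values kept as GitHub secrets are identifiers, not credentials. The job needs the id-token: write permission, and the deployment command needs a target resource group.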
GitHub Actions OIDC vs. service principal secret auth: OIDC federated credentials are preferred for GitHub Actions — they eliminate stored secrets and use short-lived tokens, unlike service principal client secrets which must be rotated manually.
Private endpoint + VNet isolation for workspace: Restrict workspace access by deploying private endpoints that route traffic through Azure VNet — public internet access to the workspace is disabled when private endpoints are enabled.

MLflow — Experiment Tracking and Model Registry

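    import mlflow
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    # Look up the workspace's MLflow tracking URI (SDK v2; reads a local config.json)
    ml_client = MLClient.from_config(credential=DefaultAzureCredential())
    workspace = ml_client.workspaces.get(ml_client.workspace_name)
    mlflow.set_tracking_uri(workspace.mlflow_tracking_uri)
    mlflow.set_experiment("my-experiment")

    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_metric("accuracy", 0.95)
        # model is assumed to be an already-fitted scikit-learn estimator
        mlflow.sklearn.log_model(model, artifact_path="model")
Configure the MLflow tracking URI from the Azure ML workspace and log parameters, metrics, and a scikit-learn model artifact, all within a single tracked run. The azureml-mlflow package must be installed for MLflow to resolve the workspace URI. With SDK v1, the equivalent lookup is ws.get_mlflow_tracking_uri() on a Workspace object.
    # run is the object returned by mlflow.start_run() for the training run
    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/model",
        name="my-registered-model",
    )
Register a trained model from a run into the MLflow model registry. When the tracking URI points at an Azure ML workspace, this also registers the model in the Azure ML model registry.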
Model lifecycle stages: None → Staging → Production → Archived: MLflow model registry tracks lifecycle stages — Archived marks the model as deprecated but does NOT delete it; the artifact is retained for compliance and rollback.
mlflow.autolog(): Enable automatic logging of parameters, metrics, and artifacts for supported frameworks (scikit-learn, TensorFlow, PyTorch) — reduces boilerplate MLflow instrumentation code.
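MLflow tracking URI format: azureml://<region>.api.azureml.ms/mlflow/v1.0/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.MachineLearningServices/workspaces/<ws>: Every Azure ML workspace exposes a unique MLflow tracking URI. Point mlflow.set_tracking_uri() to this URI to store experiments directly in the workspace.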
mlflow.evaluate(model_uri, data=test_df, targets="label", model_type="classifier"): Run MLflow model evaluation to compute classification metrics (accuracy, F1, ROC-AUC) against a test dataset — results are logged as run metrics for comparison.
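
To make the classification metrics mentioned for mlflow.evaluate concrete, here is a minimal pure-Python sketch of how accuracy, precision, recall, and F1 are derived from predictions. It is an illustration only, not MLflow's implementation; the function name and the sample labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Hypothetical test-set labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Logging the same metric names for every candidate run is what makes side-by-side comparison in the workspace meaningful.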

Model Training — AutoML, Pipelines, and Hyperparameter Tuning

az ml job create --file sweep-job.yml: Submit a hyperparameter sweep job using CLI v2. The YAML file defines search_space, sampling_algorithm (random, grid, or bayesian), objective, limits, and an optional early_termination policy.
Sweep job sampling methods: random / grid / Bayesian: Random sampling is fastest for broad exploration; Grid exhaustively tests all combinations; Bayesian uses prior results to guide the search — Bayesian is most efficient when evaluation is expensive.
Early termination policies: Bandit / Median Stopping / Truncation Selection: Bandit terminates runs performing below a slack factor of the best run; Median Stopping cancels runs below the median primary metric; Truncation Selection cancels the lowest-performing X% each interval.
AutoML vs. manual hyperparameter sweep: AutoML explores algorithms AND hyperparameters automatically for classification/regression/time-series; a sweep job tunes hyperparameters of ONE fixed algorithm — AutoML does not replace data preparation.
Distributed training: data parallelism vs. model parallelism: Data parallelism splits the DATASET across GPUs and synchronizes gradients — for large datasets with models that fit on one GPU; model parallelism splits the MODEL across GPUs — for models too large for one GPU.
    az ml component create --file train-component.yml
    az ml job create --file pipeline.yml
Register a reusable component, then submit a pipeline job that composes it. CLI v2 has no separate pipeline command; a pipeline is submitted as a job whose YAML has type: pipeline. Components define inputs, outputs, code, and environment; pipelines chain components with defined data flow.
Environment types: curated (Microsoft-maintained) vs. custom: Curated environments come pre-built with common ML frameworks (sklearn, PyTorch, TensorFlow); custom environments let you specify Docker base images and conda dependencies — both are versioned.
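
The Bandit early termination policy listed above can be sketched in a few lines. This is a simplified illustration for a maximized metric only, assuming the cutoff is best_metric / (1 + slack_factor); the function name and the run values are hypothetical, and the real policy also takes an evaluation interval and delay into account.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor):
    """Return True if a run falls outside the allowed slack of the best run.

    Assumes the primary metric is being maximized.
    """
    cutoff = best_metric / (1 + slack_factor)
    return run_metric < cutoff


# Hypothetical metrics reported by three runs at the same evaluation interval
runs = {"run_a": 0.80, "run_b": 0.70, "run_c": 0.60}
best = max(runs.values())

# With slack_factor=0.2 the cutoff is 0.8 / 1.2, roughly 0.667
terminated = [
    name for name, metric in runs.items()
    if bandit_should_terminate(metric, best, slack_factor=0.2)
]
print(terminated)
```

A smaller slack factor terminates runs more aggressively, which saves compute but risks cancelling slow starters.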

Model Deployment — Online and Batch Endpoints

    az ml online-endpoint create --name my-endpoint -g <rg> -w <ws>
    az ml online-deployment create --name blue --endpoint-name my-endpoint --file deployment.yml --all-traffic
Create a managed online endpoint and deploy a model to it with 100% traffic using the --all-traffic flag. The endpoint exposes the stable scoring URL; deployments behind it hold the model and compute.
az ml online-endpoint update --name my-endpoint --traffic "blue=90 green=10": Split traffic between two deployments on the same endpoint for progressive rollout. Gradually shift traffic from the old (blue) to the new (green) deployment while monitoring performance.
az ml online-endpoint update --name my-endpoint --traffic "blue=100": Perform a safe rollback by routing 100% of traffic back to the stable deployment — the new deployment remains in place but receives no traffic until issues are resolved.
Managed online endpoints vs. Batch endpoints: Online endpoints are always-on REST APIs for low-latency real-time inference with auto-scaling and blue-green deployment; batch endpoints run parallel inference on large datasets with no always-on compute cost.
Data collection for monitoring: automatic (online) vs. manual (batch): Online endpoints automatically collect input/output data for monitoring when data collection is enabled; batch endpoints require manual configuration to capture prediction data.
az ml batch-endpoint invoke --name my-batch-endpoint --input azureml:my-data-asset:1: Trigger a batch inference job by invoking the batch endpoint with an input data asset reference. The job runs across compute cluster nodes in parallel.
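
The progressive rollout described for online endpoints can be planned as a sequence of --traffic values. The sketch below only builds the strings in the same "blue=90 green=10" form used above; the helper name and the chosen increments are hypothetical, and it does not call the Azure CLI.

```python
def traffic_steps(stable, candidate, increments):
    """Build --traffic values that shift load from stable to candidate."""
    steps = []
    for pct in increments:
        if not 0 <= pct <= 100:
            raise ValueError(f"traffic percentage out of range: {pct}")
        steps.append(f"{stable}={100 - pct} {candidate}={pct}")
    return steps


# Hypothetical rollout plan: 10% canary, then 50%, then full cutover
steps = traffic_steps("blue", "green", [10, 50, 100])
for value in steps:
    print(f'az ml online-endpoint update --name my-endpoint --traffic "{value}"')
```

Rolling back at any step means re-issuing the update with the stable deployment at 100, as in the rollback entry above.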

Production Monitoring — Drift Detection and Retraining Triggers

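Data drift vs. Prediction drift: Data drift detects changes in the statistical distribution of INPUT features relative to the training data; prediction drift detects changes in the distribution of the model's OUTPUTS. Neither signal catches concept drift: if the relationship between inputs and labels changes while the input distribution stays the same, a fixed model keeps producing the same output distribution, and only performance metrics computed against ground truth reveal the problem.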
Monitoring signals: data drift / prediction drift / data quality / feature attribution drift: Configure all four monitoring signals for comprehensive production visibility — data quality checks for nulls and schema violations; feature attribution drift detects changes in which features drive predictions.
Retraining trigger pipeline: Model Monitor alert → Event Hubs / Logic Apps / Azure Functions → training pipeline: When a monitoring signal exceeds its configured threshold, the monitor raises an alert event (Azure ML publishes events through Azure Event Grid), and a handler such as Event Hubs, Logic Apps, or Azure Functions can use it to launch an automated retraining pipeline.
Responsible AI Dashboard components: fairness / interpretability / error analysis / causal inference: The Responsible AI Dashboard in Azure ML Studio provides a unified view of model fairness, feature explanations, error distributions, and causal impact — use before production deployment.
Feature retrieval specification: A specification packaged with the model artifact that describes how to retrieve features from feature stores at inference time — enables consistent feature engineering between training and serving.
Reference dataset for drift monitoring: Set the training dataset as the reference dataset in model monitoring — all drift calculations compare production data distribution against this reference, not against previous production windows.
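
To illustrate what comparing production data against a reference dataset means numerically, here is a minimal sketch of the Population Stability Index (PSI), one commonly used drift statistic. It is not Azure ML's implementation; the bin count, the epsilon floor, the sample data, and the 0.2 alert threshold are all assumptions chosen for the example.

```python
import math


def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index of production vs. reference for one feature.

    Bin edges come from the reference data; out-of-range production
    values are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    prod_p = proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref_p, prod_p))


# Hypothetical feature values: training data vs. two production windows
reference = [i / 100 for i in range(100)]
stable_window = list(reference)
shifted_window = [x + 0.5 for x in reference]

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb alert level
print(psi(reference, stable_window) > DRIFT_THRESHOLD)
print(psi(reference, shifted_window) > DRIFT_THRESHOLD)
```

Because the bin edges are fixed by the reference data, every production window is scored against the same baseline rather than against the previous window.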

Microsoft Foundry — GenAIOps Infrastructure

Microsoft Foundry hub-and-project architecture: A Foundry hub is the top-level governance resource (shared compute, networking, security); projects are isolated workspaces under the hub for individual teams or applications to deploy models and build GenAI apps.
Serverless API (MaaS) vs. Managed Compute deployment: Serverless API is pay-as-you-go with no GPU management and a selectable deployment scope (Global Standard, Data Zone, or Regional); managed compute provides dedicated GPU infrastructure with more control and full MLOps lifecycle integration.
Serverless deployment scopes: Global Standard / Data Zone / Regional: Global Standard routes requests across worldwide Microsoft infrastructure for highest availability; Data Zone restricts to a geographic boundary for data residency; Regional pins to a specific Azure region for compliance.
Provisioned Throughput Units (PTUs): PTUs reserve a fixed amount of model processing capacity upfront — choose PTUs over pay-as-you-go serverless when you need guaranteed throughput and consistent latency for high-volume production workloads.
    az cognitiveservices account deployment create \
      --name <foundry-resource> \
      --resource-group <rg> \
      --deployment-name my-gpt4o \
      --model-name gpt-4o \
      --model-version 2024-08-06 \
      --model-format OpenAI \
      --sku-capacity 10 \
      --sku-name GlobalStandard
Deploy a foundation model to Microsoft Foundry using Azure CLI. The sku-name specifies the deployment scope (GlobalStandard, DataZoneStandard, or Standard for regional).
Managed identity + RBAC for Foundry resources: Use system-assigned or user-assigned managed identities for credential-free authentication to Foundry resources — assign granular RBAC roles (e.g., Azure AI Developer) rather than owner/contributor.
Prompt versioning with Git repositories: Store prompts in Git repositories within Microsoft Foundry to track prompt changes, create and compare variants, and enable team collaboration — prompt versioning and model versioning are separate concerns.

GenAI Quality Metrics and Evaluation Workflows

Groundedness: response factually supported by the source data: Groundedness measures whether each claim in the response is backed by the provided context documents — a fluent and relevant response can still be ungrounded if it introduces facts not in the source.
Relevance: response directly addresses the user's query: Relevance measures whether the generated response answers what the user asked — a grounded and coherent response can still be irrelevant if it discusses the wrong topic.
Coherence: logical flow and consistency across the response: Coherence measures whether the response is logically consistent and well-structured from sentence to sentence — distinct from fluency which measures grammar and naturalness.
Fluency: grammatically correct and natural-sounding language: Fluency measures the linguistic quality of the response — a fluent response can still be incoherent, irrelevant, or ungrounded; fluency alone does not indicate quality.
Risk and safety evaluations vs. quality evaluations: Safety evaluations detect harmful content, bias, and policy violations in model outputs — they are separate from quality metrics (groundedness, relevance) since a high-quality response can still be unsafe.
Automated evaluation workflow: test dataset → run metrics → compare → gate deployment: Configure automated evaluation in Foundry to run built-in and custom metrics on a test dataset on every deployment — use metric thresholds as quality gates before promoting to production.
Evaluation on test dataset vs. production traffic: Test dataset evaluation runs before deployment and catches regressions; production traffic evaluation monitors drift in quality metrics over time — both are necessary for complete quality assurance.
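
The "metric thresholds as quality gates" idea from the automated evaluation workflow can be sketched as a simple comparison step. The metric names mirror the ones defined above, but the function, the threshold values, and the scores are hypothetical, and the 1 to 5 scoring scale is an assumption for the example.

```python
def evaluate_gate(metrics, thresholds):
    """Return (passed, failures); failures maps metric -> (actual, minimum)."""
    failures = {}
    for name, minimum in thresholds.items():
        actual = metrics.get(name)
        # A missing metric fails the gate, same as a low score
        if actual is None or actual < minimum:
            failures[name] = (actual, minimum)
    return len(failures) == 0, failures


# Hypothetical minimum scores required before promotion
thresholds = {"groundedness": 4.0, "relevance": 4.0, "coherence": 3.5}

# Hypothetical aggregate scores from an evaluation run on the test dataset
candidate = {"groundedness": 4.3, "relevance": 3.7, "coherence": 4.1}

passed, failures = evaluate_gate(candidate, thresholds)
print(passed, failures)
```

In a CI/CD pipeline, a failed gate would typically exit non-zero so the promotion step never runs.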

GenAI Observability — Latency, Cost, and Tracing

Distributed tracing for multi-step GenAI applications: Distributed tracing captures timing and execution details at each pipeline step (embedding, retrieval, LLM inference) — use it to identify which step is the latency bottleneck in RAG or agentic pipelines.
Token consumption: input tokens + output tokens: Monitor both input token count (prompt length) and output token count (response length) separately — cost optimization may target either side, and context window limits apply to the combined total.
Performance metrics: TTFT (time to first token) vs. total response time: Time to first token measures perceived latency in streaming responses; total response time measures complete generation — for streaming UIs, TTFT is the primary user-perceived latency metric.
Throughput: requests per second (RPS) / tokens per minute (TPM): Monitor both RPS and TPM to understand system capacity — PTU limits are defined in TPM, not RPS, so high-context requests consume PTU capacity faster than short requests at the same RPS.
Foundry observability dashboard: latency / throughput / token usage / quality signals / safety signals: Configure all five observability dimensions in the Foundry monitoring dashboard — monitoring only Azure resource metrics misses GenAI-specific quality and safety signals.
Logging and tracing for debugging: full request/response capture: Enable detailed logging to capture the full prompt, retrieved documents, and generated response for each request — essential for debugging quality issues and auditing production behavior.
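
The throughput entry above notes that high-context requests use up token capacity faster than short ones at the same RPS. The arithmetic can be sketched as follows; this is a simplification that weights input and output tokens equally, and the function name and request sizes are hypothetical.

```python
def tokens_per_minute(rps, input_tokens, output_tokens):
    """Estimate TPM consumed at a steady request rate."""
    return rps * 60 * (input_tokens + output_tokens)


# Same request rate, different prompt sizes (hypothetical workloads)
short_prompts = tokens_per_minute(rps=2, input_tokens=200, output_tokens=100)
rag_prompts = tokens_per_minute(rps=2, input_tokens=4000, output_tokens=100)

print(short_prompts)  # 36000
print(rag_prompts)    # 492000
```

At an identical 2 RPS, the retrieval-augmented workload consumes more than thirteen times the token capacity, which is why RPS alone is a poor capacity signal.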

RAG Optimization — Chunking, Search, and Retrieval Tuning

Chunk size strategies: smaller (e.g., 512 tokens) = precision; larger = context: Smaller chunks improve retrieval precision by reducing noise per chunk but may miss cross-chunk context; larger chunks provide more context per result but dilute relevance scores — optimal size depends on query patterns.
Chunk overlap: Overlapping adjacent chunks by 10–20% prevents information at chunk boundaries from being lost — without overlap, sentences that span a boundary are split and may not be retrieved.
Hybrid search = semantic (vector) + keyword (BM25) via Reciprocal Rank Fusion: Hybrid search with RRF merging almost always outperforms pure vector or pure keyword search alone — keyword search captures exact terminology that embeddings may not preserve.
Similarity threshold tuning: precision vs. recall tradeoff: A high similarity threshold filters out loosely related chunks (high precision, lower recall); a low threshold returns more chunks including marginally relevant ones (high recall, lower precision) — tune based on hallucination vs. missed-answer tradeoff.
Embedding model selection for RAG: Choose embedding models optimized for your domain and language — a general-purpose embedding model may not capture domain-specific vocabulary; fine-tuning embeddings on domain data improves retrieval quality.
RAG vs. Fine-tuning decision: RAG adds KNOWLEDGE at inference time without changing model weights — use for dynamic, private, or frequently updated data; fine-tuning changes model BEHAVIOR permanently — use for specialized response style or task format.
A/B testing for RAG parameter optimization: Hold the LLM constant and vary ONE retrieval parameter at a time (chunk size, top-k, threshold) to isolate the impact of each change on end-to-end response quality metrics.
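
Reciprocal Rank Fusion, named in the hybrid search entry above, merges ranked lists using only rank positions. Below is a minimal sketch under the usual formulation, score = sum of 1 / (k + rank), with the conventional constant k = 60; the document IDs and the two result lists are hypothetical, and a real search service may apply additional weighting.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top results from vector search and from BM25 keyword search
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both lists rise to the top, while a document found by only one retriever (such as an exact keyword match) still makes the fused list.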

Fine-Tuning Foundation Models

Fine-tuning methods: supervised / parameter-efficient (LoRA, QLoRA) / instruction tuning: Supervised fine-tuning trains on labeled input-output pairs; LoRA and QLoRA are parameter-efficient methods that train a small set of adapter weights rather than all model parameters; instruction tuning aligns models to follow natural language instructions.
Synthetic data generation for fine-tuning: When real labeled examples are scarce, use an LLM to generate diverse synthetic training examples based on a small seed set — synthetic data must be diverse and representative or it degrades model performance.
Fine-tuning deployment: serverless or managed compute in Microsoft Foundry: Fine-tuned models can be deployed to either serverless API endpoints (pay-as-you-go, less control) or managed compute deployments (dedicated GPU, full MLOps integration) within Microsoft Foundry.
Monitoring fine-tuned vs. base model performance: After deployment, compare fine-tuned model quality metrics against the base model on the same test dataset — fine-tuning can improve task performance but may degrade general capability (catastrophic forgetting).
Fine-tuning does NOT add knowledge — use RAG for that: Fine-tuning changes HOW the model responds (style, format, task specialization) but does not update the model's knowledge; if the goal is grounding responses in new facts, use RAG instead.
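
To show why LoRA counts as parameter-efficient, the sketch below compares the trainable parameters of a full weight matrix with those of a rank-r adapter pair (one matrix of shape d_out x r and one of shape r x d_in). The layer dimensions and rank are hypothetical values chosen for illustration.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical projection layer and adapter rank
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

For this layer the adapter trains under 0.4% of the parameters; QLoRA applies the same adapter idea on top of a quantized base model to cut memory further.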

Security and Identity — RBAC, Managed Identities, and Networking

    az role assignment create \
      --role "AzureML Data Scientist" \
      --assignee <principal-id> \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.MachineLearningServices/workspaces/<ws>
Assign an RBAC role scoped to a specific Azure ML workspace. Use built-in ML roles (AzureML Data Scientist, AzureML Compute Operator) rather than Owner/Contributor for least-privilege access; the role name must match the built-in name exactly.
System-assigned managed identity vs. user-assigned managed identity: System-assigned identity is tied to the resource lifecycle (deleted with the resource); user-assigned identity is standalone and can be shared across multiple resources — use user-assigned for shared access scenarios.
Private endpoint for ML workspace: disables public access, routes through VNet: Deploying a private endpoint to an Azure ML workspace disables public internet access to the workspace and Studio — all access goes through the private IP within the configured VNet.
Managed network isolation modes: Disabled / Allow Internet Outbound / Allow Only Approved Outbound: Workspace managed network isolation controls outbound traffic: Disabled allows all outbound; Allow Internet Outbound allows all outbound plus private endpoints; Allow Only Approved Outbound restricts outbound to configured rules only.
Azure ML Registries: cross-workspace asset sharing with governance: Registries share models, environments, components, and data assets ACROSS workspaces — assets promoted to a registry can be consumed by any workspace in the organization, enabling centralized governance.
