Quick Navigation
- Azure ML Workspace — Setup and CLI
- Infrastructure as Code — Bicep and GitHub Actions
- MLflow — Experiment Tracking and Model Registry
- Model Training — AutoML, Pipelines, and Hyperparameter Tuning
- Model Deployment — Online and Batch Endpoints
- Production Monitoring — Drift Detection and Retraining Triggers
- Microsoft Foundry — GenAIOps Infrastructure
- GenAI Quality Metrics and Evaluation Workflows
- GenAI Observability — Latency, Cost, and Tracing
- RAG Optimization — Chunking, Search, and Retrieval Tuning
- Fine-Tuning Foundation Models
- Security and Identity — RBAC, Managed Identities, and Networking
Azure ML Workspace — Setup and CLI
- az ml workspace create -n <workspace-name> -g <resource-group>
- Create an Azure ML workspace using CLI v2 — the workspace is the top-level resource for all ML assets (compute, data, environments, models, endpoints).
- az extension add -n ml
  az configure --defaults group=<rg> workspace=<ws> location=<loc>
- Install the Azure ML CLI v2 extension, then set defaults so you do not have to repeat --resource-group and --workspace-name on every command.
- az ml workspace show -n <workspace-name> -g <resource-group>
- Display workspace details including MLflow tracking URI, associated storage, key vault, and container registry.
- Workspace-level model registry vs. Azure ML Registry
- Workspace model registry stores models scoped to ONE workspace; Azure ML Registry shares models, environments, and components ACROSS multiple workspaces organization-wide.
- Compute types: Compute Instance / Compute Cluster / Serverless Compute / Inference Compute
- Compute Instance is for interactive dev (notebooks); Compute Cluster is for scalable training jobs; Serverless Compute auto-provisions for jobs; Inference Compute backs managed endpoints.
- Datastore vs. Data Asset
- A Datastore defines the CONNECTION to Azure storage (Blob, ADLS, SQL) without exposing credentials; a Data Asset is a versioned REFERENCE to specific data within that datastore.
- az ml data create --name mydata --version 1 --type uri_folder --path azureml://datastores/<ds>/paths/<folder>
- Register a versioned data asset pointing to a folder in a registered datastore — data assets are the recommended way to reference training and evaluation data.
Infrastructure as Code — Bicep and GitHub Actions
- resource workspace 'Microsoft.MachineLearningServices/workspaces@2024-04-01' = {
    name: workspaceName
    location: location
    identity: {
      type: 'SystemAssigned'
    }
    properties: {
      storageAccount: storageAccount.id
      keyVault: keyVault.id
      applicationInsights: appInsights.id
      containerRegistry: containerRegistry.id
    }
  }
- Bicep resource definition for an Azure ML workspace with system-assigned managed identity — declarative IaC for reproducible workspace deployments.
- az deployment group create --resource-group <rg> --template-file main.bicep --parameters @params.json
- Deploy a Bicep template to create Azure ML infrastructure — use parameter files to manage environment-specific (dev/staging/prod) configurations.
- Bicep vs. GitHub Actions role distinction
- Bicep defines WHAT Azure resources to deploy (declarative desired state); GitHub Actions defines WHEN and HOW to execute deployments (CI/CD orchestration) — they work together, not as alternatives.
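- # .github/workflows/deploy-workspace.yml
  permissions:
    id-token: write   # lets the job request an OIDC token for azure/login
    contents: read
  jobs:
    deploy:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4   # makes infra/main.bicep available to the job
        - uses: azure/login@v2
          with:
            client-id: ${{ secrets.AZURE_CLIENT_ID }}
            tenant-id: ${{ secrets.AZURE_TENANT_ID }}
            subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
        - run: az deployment group create --resource-group <rg> --template-file infra/main.bicep
- GitHub Actions workflow snippet that authenticates to Azure with OIDC before deploying Bicep templates for ML infrastructure. No client secret is stored; the three values above are identifiers, not credentials, and the job needs the id-token: write permission to request the token.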
- GitHub Actions OIDC vs. service principal secret auth
- OIDC federated credentials are preferred for GitHub Actions — they eliminate stored secrets and use short-lived tokens, unlike service principal client secrets which must be rotated manually.
- Private endpoint + VNet isolation for workspace
- Restrict workspace access by deploying private endpoints that route traffic through an Azure VNet, and set the workspace's public network access to disabled so that it is reachable only over the private endpoint.
MLflow — Experiment Tracking and Model Registry
- import mlflow
  # ws is assumed to be an azureml.core.Workspace (SDK v1); model is a fitted scikit-learn estimator
  mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
  mlflow.set_experiment("my-experiment")
  with mlflow.start_run():
      mlflow.log_param("learning_rate", 0.01)
      mlflow.log_metric("accuracy", 0.95)
      mlflow.sklearn.log_model(model, artifact_path="model")
- Configure MLflow tracking URI from the Azure ML workspace and log parameters, metrics, and a scikit-learn model artifact — all within a single tracked run.
- # run is assumed to be the run object returned by mlflow.start_run()
  mlflow.register_model(
      model_uri=f"runs:/{run.info.run_id}/model",
      name="my-registered-model",
  )
- Register a trained model from a run into the MLflow model registry — in Azure ML this simultaneously registers the model in the Azure ML model registry.
- Model lifecycle stages: None → Staging → Production → Archived
- MLflow model registry tracks lifecycle stages — Archived marks the model as deprecated but does NOT delete it; the artifact is retained for compliance and rollback.
- mlflow.autolog()
- Enable automatic logging of parameters, metrics, and artifacts for supported frameworks (scikit-learn, TensorFlow, PyTorch) — reduces boilerplate MLflow instrumentation code.
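- MLflow tracking URI format: azureml://<region>.api.azureml.ms/mlflow/v1.0/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.MachineLearningServices/workspaces/<ws>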
- Every Azure ML workspace exposes a unique MLflow tracking URI — point mlflow.set_tracking_uri() to this URI to store experiments directly in the workspace.
- mlflow.evaluate(model_uri, data=test_df, targets="label", model_type="classifier")
- Run MLflow model evaluation to compute classification metrics (accuracy, F1, ROC-AUC) against a test dataset — results are logged as run metrics for comparison.
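To make concrete what a classifier evaluation reports, here is a minimal pure-Python sketch of accuracy, precision, recall, and F1 for binary labels. It is not how mlflow.evaluate is implemented; the function name binary_metrics and the sample labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical labels, only to show the call shape.
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

Logging the same dictionary for each candidate model is what makes runs comparable side by side.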
Model Training — AutoML, Pipelines, and Hyperparameter Tuning
- az ml job create --file sweep-job.yml   # sweep-job.yml defines: search_space, sampling_algorithm, objective, limits, early_termination
- Submit a hyperparameter sweep job using CLI v2 — the YAML file defines the parameter search space, sampling method (random/grid/Bayesian), and early termination policy.
- Sweep job sampling methods: random / grid / Bayesian
- Random sampling is fastest for broad exploration; Grid exhaustively tests all combinations; Bayesian uses prior results to guide the search — Bayesian is most efficient when evaluation is expensive.
- Early termination policies: Bandit / Median Stopping / Truncation Selection
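- Bandit terminates runs whose primary metric falls outside a slack factor (or slack amount) of the best run; Median Stopping cancels runs whose best metric is worse than the median of the running averages across all runs; Truncation Selection cancels the lowest-performing X% of runs at each evaluation interval.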
- AutoML vs. manual hyperparameter sweep
- AutoML explores algorithms AND hyperparameters automatically for classification/regression/time-series; a sweep job tunes hyperparameters of ONE fixed algorithm — AutoML does not replace data preparation.
- Distributed training: data parallelism vs. model parallelism
- Data parallelism splits the DATASET across GPUs and synchronizes gradients — for large datasets with models that fit on one GPU; model parallelism splits the MODEL across GPUs — for models too large for one GPU.
- az ml component create --file train-component.yml
  az ml job create --file pipeline.yml
- Register a reusable component, then submit a pipeline job that composes it via YAML (pipelines are submitted with az ml job create). Components define inputs, outputs, code, and environment; pipelines chain components with defined data flow.
- Environment types: curated (Microsoft-maintained) vs. custom
- Curated environments come pre-built with common ML frameworks (sklearn, PyTorch, TensorFlow); custom environments let you specify Docker base images and conda dependencies — both are versioned.
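The three early termination policies described above can be sketched as plain selection rules. This is an illustrative sketch, not the service's implementation: it assumes a positive metric that is being maximized, it evaluates a single interval, and the function names are hypothetical. The Bandit threshold is taken as best / (1 + slack_factor).

```python
from statistics import median


def bandit_survivors(metrics, slack_factor):
    """Keep runs within the slack factor of the best run (maximize)."""
    best = max(metrics.values())
    threshold = best / (1 + slack_factor)
    return {run for run, value in metrics.items() if value >= threshold}


def median_stopping_survivors(best_so_far, running_averages):
    """Keep runs whose best metric reaches the median of running averages."""
    cutoff = median(running_averages.values())
    return {run for run, value in best_so_far.items() if value >= cutoff}


def truncation_survivors(metrics, truncation_percentage):
    """Cancel the lowest-performing percentage of runs, keep the rest."""
    ranked = sorted(metrics, key=metrics.get)  # worst run first
    n_cancel = int(len(ranked) * truncation_percentage / 100)
    return set(ranked[n_cancel:])


if __name__ == "__main__":
    # Hypothetical accuracy values for four sweep trials.
    metrics = {"a": 0.80, "b": 0.70, "c": 0.60, "d": 0.50}
    averages = {"a": 0.70, "b": 0.60, "c": 0.50, "d": 0.40}
    print(sorted(bandit_survivors(metrics, slack_factor=0.2)))
    print(sorted(median_stopping_survivors(metrics, averages)))
    print(sorted(truncation_survivors(metrics, truncation_percentage=25)))
```

Bandit is the most aggressive of the three on this sample because it compares every run to the single best one rather than to the group.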
Model Deployment — Online and Batch Endpoints
- az ml online-endpoint create --name my-endpoint -g <rg> -w <ws>
  az ml online-deployment create --name blue --endpoint-name my-endpoint --file deployment.yml --all-traffic -g <rg> -w <ws>
- Create a managed online endpoint and deploy a model to it with 100% traffic using the --all-traffic flag — the endpoint hosts the REST API URL.
- az ml online-endpoint update --name my-endpoint --traffic "blue=90 green=10"
- Split traffic between two deployments on the same endpoint for progressive rollout — gradually shift traffic from old (blue) to new (green) deployment while monitoring performance.
- az ml online-endpoint update --name my-endpoint --traffic "blue=100"
- Perform a safe rollback by routing 100% of traffic back to the stable deployment — the new deployment remains in place but receives no traffic until issues are resolved.
- Managed online endpoints vs. Batch endpoints
- Online endpoints are always-on REST APIs for low-latency real-time inference with auto-scaling and blue-green deployment; batch endpoints run parallel inference on large datasets with no always-on compute cost.
- Data collection for monitoring: automatic (online) vs. manual (batch)
- Online endpoints automatically collect input/output data for monitoring when data collection is enabled; batch endpoints require manual configuration to capture prediction data.
- az ml batch-endpoint invoke --name my-batch-endpoint --input azureml:my-data-asset:1
- Trigger a batch inference job by invoking the batch endpoint with an input data asset reference — the job runs across compute cluster nodes in parallel.
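The progressive rollout and rollback steps above amount to a sequence of traffic allocations. The sketch below generates that sequence and formats each step as the string passed to --traffic. The helper names are hypothetical, and the check that percentages sum to 100 is an assumption of this sketch rather than a statement of the service's validation rules.

```python
def rollout_plan(stable, candidate, candidate_steps):
    """Return one traffic allocation per step of a progressive rollout."""
    plan = []
    for pct in candidate_steps:
        if not 0 <= pct <= 100:
            raise ValueError("each step must be a percentage between 0 and 100")
        plan.append({stable: 100 - pct, candidate: pct})
    return plan


def traffic_arg(allocation):
    """Format an allocation as the value given to --traffic."""
    if any(pct < 0 for pct in allocation.values()):
        raise ValueError("traffic percentages cannot be negative")
    if sum(allocation.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    return " ".join(f"{name}={pct}" for name, pct in allocation.items())


if __name__ == "__main__":
    for step in rollout_plan("blue", "green", [10, 50, 100]):
        print(f'az ml online-endpoint update --name my-endpoint '
              f'--traffic "{traffic_arg(step)}"')
```

Keeping the plan as data makes it easy to pause between steps and check monitoring before applying the next allocation.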
Production Monitoring — Drift Detection and Retraining Triggers
- Data drift vs. Prediction drift
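- Data drift detects changes in the statistical distribution of INPUT features vs. training data; prediction drift detects changes in the OUTPUT distribution. Neither signal directly detects a change in the data-label relationship (concept drift), which can degrade accuracy while both distributions look stable, so also track model performance against ground-truth labels when they are available.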
- Monitoring signals: data drift / prediction drift / data quality / feature attribution drift
- Configure all four monitoring signals for comprehensive production visibility — data quality checks for nulls and schema violations; feature attribution drift detects changes in which features drive predictions.
- Retraining trigger pipeline: Model Monitor alert → Event Hubs / Logic Apps / Azure Functions → training pipeline
- When a monitoring signal exceeds its configured threshold, the alert can trigger Azure Event Hubs, Logic Apps, or Azure Functions to launch an automated retraining pipeline.
- Responsible AI Dashboard components: fairness / interpretability / error analysis / causal inference
- The Responsible AI Dashboard in Azure ML Studio provides a unified view of model fairness, feature explanations, error distributions, and causal impact — use before production deployment.
- Feature retrieval specification
- A specification packaged with the model artifact that describes how to retrieve features from feature stores at inference time — enables consistent feature engineering between training and serving.
- Reference dataset for drift monitoring
- Set the training dataset as the reference dataset in model monitoring — all drift calculations compare production data distribution against this reference, not against previous production windows.
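Comparing production data against a fixed reference can be illustrated with the Population Stability Index (PSI), one common drift statistic. This is a minimal sketch for a single numeric feature, not the monitoring service's implementation; the function names, bin edges, and the 0.2 alert threshold are assumptions chosen for illustration.

```python
import math


def bin_fractions(values, edges):
    """Fraction of values per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def psi(reference, production, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    score = 0.0
    for r, p in zip(reference, production):
        r, p = max(r, eps), max(p, eps)  # avoid log(0) for empty bins
        score += (p - r) * math.log(p / r)
    return score


if __name__ == "__main__":
    edges = [0, 1, 2, 3]
    training = bin_fractions([0.5, 0.5, 1.5, 1.5, 2.5, 2.5], edges)
    window = bin_fractions([2.5, 2.5, 2.5, 2.5, 2.5, 0.5], edges)
    drift = psi(training, window)
    print("drift alert" if drift > 0.2 else "stable", round(drift, 3))
```

Because the reference bins are computed once from training data, every production window is scored against the same baseline.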
Microsoft Foundry — GenAIOps Infrastructure
- Microsoft Foundry hub-and-project architecture
- A Foundry hub is the top-level governance resource (shared compute, networking, security); projects are isolated workspaces under the hub for individual teams or applications to deploy models and build GenAI apps.
- Serverless API (MaaS) vs. Managed Compute deployment
- Serverless API is pay-as-you-go with no GPU management and regional deployment scope; managed compute provides dedicated GPU infrastructure with more control and full MLOps lifecycle integration.
- Serverless deployment scopes: Global Standard / Data Zone / Regional
- Global Standard routes requests across worldwide Microsoft infrastructure for highest availability; Data Zone restricts to a geographic boundary for data residency; Regional pins to a specific Azure region for compliance.
- Provisioned Throughput Units (PTUs)
- PTUs reserve a fixed amount of model processing capacity upfront — choose PTUs over pay-as-you-go serverless when you need guaranteed throughput and consistent latency for high-volume production workloads.
- az cognitiveservices account deployment create \
    --name <foundry-resource> \
    --resource-group <rg> \
    --deployment-name my-gpt4o \
    --model-name gpt-4o \
    --model-version 2024-08-06 \
    --model-format OpenAI \
    --sku-capacity 10 \
    --sku-name GlobalStandard
- Deploy a foundation model to Microsoft Foundry using Azure CLI — sku-name specifies the deployment scope (GlobalStandard, DataZoneStandard, or Standard for regional).
- Managed identity + RBAC for Foundry resources
- Use system-assigned or user-assigned managed identities for credential-free authentication to Foundry resources — assign granular RBAC roles (e.g., Azure AI Developer) rather than owner/contributor.
- Prompt versioning with Git repositories
- Store prompts in Git repositories within Microsoft Foundry to track prompt changes, create and compare variants, and enable team collaboration — prompt versioning and model versioning are separate concerns.
GenAI Quality Metrics and Evaluation Workflows
- Groundedness: response factually supported by the source data
- Groundedness measures whether each claim in the response is backed by the provided context documents — a fluent and relevant response can still be ungrounded if it introduces facts not in the source.
- Relevance: response directly addresses the user's query
- Relevance measures whether the generated response answers what the user asked — a grounded and coherent response can still be irrelevant if it discusses the wrong topic.
- Coherence: logical flow and consistency across the response
- Coherence measures whether the response is logically consistent and well-structured from sentence to sentence — distinct from fluency which measures grammar and naturalness.
- Fluency: grammatically correct and natural-sounding language
- Fluency measures the linguistic quality of the response — a fluent response can still be incoherent, irrelevant, or ungrounded; fluency alone does not indicate quality.
- Risk and safety evaluations vs. quality evaluations
- Safety evaluations detect harmful content, bias, and policy violations in model outputs — they are separate from quality metrics (groundedness, relevance) since a high-quality response can still be unsafe.
- Automated evaluation workflow: test dataset → run metrics → compare → gate deployment
- Configure automated evaluation in Foundry to run built-in and custom metrics on a test dataset on every deployment — use metric thresholds as quality gates before promoting to production.
- Evaluation on test dataset vs. production traffic
- Test dataset evaluation runs before deployment and catches regressions; production traffic evaluation monitors drift in quality metrics over time — both are necessary for complete quality assurance.
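The "gate deployment" step in the workflow above can be sketched as a threshold check over aggregated metric scores. The function name, the metric names as dictionary keys, and the 1-to-5 scores and thresholds are hypothetical; a real pipeline would take the scores from its evaluation run.

```python
def quality_gate(scores, thresholds):
    """Return (passed, failures) for aggregated evaluation scores.

    A metric fails when it is missing from scores or below its minimum.
    Metrics without a threshold are ignored by the gate.
    """
    failures = {}
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None or value < minimum:
            failures[metric] = value
    return (not failures, failures)


if __name__ == "__main__":
    # Hypothetical mean scores from an evaluation run on a test dataset.
    scores = {"groundedness": 4.3, "relevance": 3.6, "fluency": 4.9}
    thresholds = {"groundedness": 4.0, "relevance": 4.0}
    passed, failures = quality_gate(scores, thresholds)
    print("promote" if passed else f"block deployment: {failures}")
```

Treating a missing metric as a failure keeps a broken evaluation step from silently passing the gate.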
GenAI Observability — Latency, Cost, and Tracing
- Distributed tracing for multi-step GenAI applications
- Distributed tracing captures timing and execution details at each pipeline step (embedding, retrieval, LLM inference) — use it to identify which step is the latency bottleneck in RAG or agentic pipelines.
- Token consumption: input tokens + output tokens
- Monitor both input token count (prompt length) and output token count (response length) separately — cost optimization may target either side, and context window limits apply to the combined total.
- Performance metrics: TTFT (time to first token) vs. total response time
- Time to first token measures perceived latency in streaming responses; total response time measures complete generation — for streaming UIs, TTFT is the primary user-perceived latency metric.
- Throughput: requests per second (RPS) / tokens per minute (TPM)
- Monitor both RPS and TPM to understand system capacity — PTU limits are defined in TPM, not RPS, so high-context requests consume PTU capacity faster than short requests at the same RPS.
- Foundry observability dashboard: latency / throughput / token usage / quality signals / safety signals
- Configure all five observability dimensions in the Foundry monitoring dashboard — monitoring only Azure resource metrics misses GenAI-specific quality and safety signals.
- Logging and tracing for debugging: full request/response capture
- Enable detailed logging to capture the full prompt, retrieved documents, and generated response for each request — essential for debugging quality issues and auditing production behavior.
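Two of the measurements above are easy to sketch in isolation: time to first token versus total response time for a streamed reply, and how request size drives tokens per minute at a fixed request rate. The function names are hypothetical, and the stream here is any iterable of text chunks rather than a specific SDK's response object.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume a streamed response and return (ttft, total, text).

    ttft is None when the stream yields no chunks.
    """
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start  # first chunk arrived
        parts.append(chunk)
    total = clock() - start
    return ttft, total, "".join(parts)


def tokens_per_minute(rps, input_tokens, output_tokens):
    """Token throughput implied by a steady request rate."""
    return rps * 60 * (input_tokens + output_tokens)


if __name__ == "__main__":
    ttft, total, text = measure_stream(iter(["Hel", "lo"]))
    print(f"TTFT={ttft:.6f}s total={total:.6f}s text={text!r}")
    # Same request rate, very different token load.
    print(tokens_per_minute(2, 1000, 200), tokens_per_minute(2, 8000, 200))
```

The injectable clock is only there so the timing logic can be checked deterministically; in normal use the default is sufficient.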
RAG Optimization — Chunking, Search, and Retrieval Tuning
- Chunk size strategies: smaller (e.g., 512 tokens) = precision; larger = context
- Smaller chunks improve retrieval precision by reducing noise per chunk but may miss cross-chunk context; larger chunks provide more context per result but dilute relevance scores — optimal size depends on query patterns.
- Chunk overlap
- Overlapping adjacent chunks by 10–20% prevents information at chunk boundaries from being lost — without overlap, sentences that span a boundary are split and may not be retrieved.
- Hybrid search = semantic (vector) + keyword (BM25) via Reciprocal Rank Fusion
- Hybrid search with RRF merging often outperforms pure vector or pure keyword search, because keyword search captures exact terminology (product codes, acronyms, names) that embeddings may not preserve; confirm the gain on your own queries.
- Similarity threshold tuning: precision vs. recall tradeoff
- A high similarity threshold filters out loosely related chunks (high precision, lower recall); a low threshold returns more chunks including marginally relevant ones (high recall, lower precision) — tune based on hallucination vs. missed-answer tradeoff.
- Embedding model selection for RAG
- Choose embedding models optimized for your domain and language — a general-purpose embedding model may not capture domain-specific vocabulary; fine-tuning embeddings on domain data improves retrieval quality.
- RAG vs. Fine-tuning decision
- RAG adds KNOWLEDGE at inference time without changing model weights — use for dynamic, private, or frequently updated data; fine-tuning changes model BEHAVIOR permanently — use for specialized response style or task format.
- A/B testing for RAG parameter optimization
- Hold the LLM constant and vary ONE retrieval parameter at a time (chunk size, top-k, threshold) to isolate the impact of each change on end-to-end response quality metrics.
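Chunk overlap and Reciprocal Rank Fusion from the list above can both be sketched in a few lines. This is an illustrative sketch, not a search service's implementation: tokens are represented as a plain list, the function names are hypothetical, and k=60 is the constant commonly used for RRF, assumed here.

```python
from collections import defaultdict


def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into chunks that share `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
        start += step
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids; score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    print(chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1))
    vector_hits = ["a", "b", "c"]   # hypothetical semantic ranking
    keyword_hits = ["c", "a", "d"]  # hypothetical BM25 ranking
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF uses only rank positions, so the vector and keyword scores never need to be put on a common scale.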
Fine-Tuning Foundation Models
- Fine-tuning methods: supervised / parameter-efficient (LoRA, QLoRA) / instruction tuning
- Supervised fine-tuning trains on labeled input-output pairs; LoRA and QLoRA are parameter-efficient methods that train a small set of adapter weights rather than all model parameters; instruction tuning aligns models to follow natural language instructions.
- Synthetic data generation for fine-tuning
- When real labeled examples are scarce, use an LLM to generate diverse synthetic training examples based on a small seed set — synthetic data must be diverse and representative or it degrades model performance.
- Fine-tuning deployment: serverless or managed compute in Microsoft Foundry
- Fine-tuned models can be deployed to either serverless API endpoints (pay-as-you-go, less control) or managed compute deployments (dedicated GPU, full MLOps integration) within Microsoft Foundry.
- Monitoring fine-tuned vs. base model performance
- After deployment, compare fine-tuned model quality metrics against the base model on the same test dataset — fine-tuning can improve task performance but may degrade general capability (catastrophic forgetting).
- Fine-tuning does NOT add knowledge — use RAG for that
- Fine-tuning changes HOW the model responds (style, format, task specialization) but does not update the model's knowledge; if the goal is grounding responses in new facts, use RAG instead.
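The "small set of adapter weights" in LoRA can be quantified with simple arithmetic: for a weight matrix of shape d_out x d_in, a rank-r adapter trains two matrices of shape d_out x r and r x d_in. The sketch below computes that count for one matrix; the function names and the 4096 x 4096, rank 8 example are hypothetical.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the full weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8
    print(lora_trainable_params(d_out, d_in, rank))   # adapter parameters
    print(d_out * d_in)                               # full matrix parameters
    print(f"{trainable_fraction(d_out, d_in, rank):.4%}")
```

The fraction shrinks as the matrix grows, which is why parameter-efficient methods become more attractive for larger models.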
Security and Identity — RBAC, Managed Identities, and Networking
- az role assignment create \
    --role "AzureML Data Scientist" \
    --assignee <principal-id> \
    --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.MachineLearningServices/workspaces/<ws>
- Assign an RBAC role scoped to a specific Azure ML workspace — use built-in ML roles (AzureML Data Scientist, AzureML Compute Operator) rather than Owner/Contributor for least-privilege access.
- System-assigned managed identity vs. user-assigned managed identity
- System-assigned identity is tied to the resource lifecycle (deleted with the resource); user-assigned identity is standalone and can be shared across multiple resources — use user-assigned for shared access scenarios.
- Private endpoint for ML workspace: disables public access, routes through VNet
- Deploy a private endpoint to the Azure ML workspace and disable its public network access so that the workspace and Studio are reachable only through the private IP within the configured VNet.
- Managed network isolation modes: Disabled / Allow Internet Outbound / Allow Only Approved Outbound
- Workspace managed network isolation controls outbound traffic: Disabled allows all outbound; Allow Internet Outbound allows all outbound plus private endpoints; Allow Only Approved Outbound restricts outbound to configured rules only.
- Azure ML Registries: cross-workspace asset sharing with governance
- Registries share models, environments, components, and data assets ACROSS workspaces — assets promoted to a registry can be consumed by any workspace in the organization, enabling centralized governance.