CertPrepNow
Databricks · ML Professional · Updated 2026-06-05

ML Professional Study Guide

Everything you need to pass the Databricks Certified Machine Learning Professional exam. Structured study plans, key services, common traps, and practice questions.

You Can Pass This Exam For Free

The Databricks Certified Machine Learning Professional exam is passable with free resources, but requires significant hands-on production ML experience (6-12 months minimum):

  • Databricks Academy free learning paths (Machine Learning Professional track)
  • Official Databricks documentation — MLflow, Feature Store, Lakehouse Monitoring, Model Serving
  • Databricks Community Edition for practicing SparkML pipelines, MLflow experiment tracking, and distributed tuning
  • 500+ free practice questions on this site covering all 3 professional-level domains
  • The Big Book of MLOps (free eBook from Databricks) — covers end-to-end MLOps architecture
  • databricks/mlops-stacks GitHub repository for production ML pipeline templates

This is a professional-level exam. Unlike the Associate, most questions present complex production ML scenarios — you need real-world experience with distributed training, drift monitoring, Feature Store pipelines, and model deployment strategies. Book knowledge alone is insufficient.

Choose Your Study Path

You passed the ML Associate exam and have 6-12 months of Databricks ML experience. You need to level up on advanced topics like distributed tuning, Lakehouse Monitoring, Feature Store pipelines, deployment strategies, and MLOps testing patterns.

Week 1: Review the official Professional exam guide. Compare domains to the Associate exam and note what's new: distributed training with Ray/Optuna, Lakehouse Monitoring, Databricks Asset Bundles, blue-green/canary deployments, and ML pipeline testing
Week 2: SparkML deep dive: Pipeline construction with stages/estimators/transformers, distributed feature engineering, CrossValidator with ParamGrid, choosing distributed SparkML vs single-node models, batch and streaming inference patterns
Week 3: Distributed training and tuning: Optuna-MLflow integration, Ray framework for distributed tuning, pandas Function APIs for group-specific model training, vertical vs horizontal scaling, data parallelism vs model parallelism
Week 4: Advanced MLflow and Feature Store: nested runs, PyFunc custom models, point-in-time correctness, online tables for low-latency serving, streaming feature ingestion, on-demand features for training-serving consistency
Week 5: MLOps: Unity Catalog model management, model lifecycle (dev to staging to prod), CI/CD testing (unit tests with pytest, integration tests, end-to-end pipeline tests), Databricks Asset Bundles for multi-environment deployment
Week 6: Lakehouse Monitoring deep dive: drift detection (KS, Chi-squared, Jensen-Shannon), inference tables, data table types (snapshot/time series/inference), custom metrics, alert configuration, baseline vs time-window comparison
Week 7: Model Deployment: blue-green vs canary deployment strategies, Model Serving endpoints, PyFunc model registration in Unity Catalog, REST API querying, MLflow Deployments SDK, traffic splitting for rollout
Week 8: Full practice exams across all 3 domains. Review explanations carefully. Target 80%+ before scheduling the real exam. Focus on scenario-based questions that test judgment, not syntax recall

Exam Overview

Format

59 questions, 120 minutes. Multiple choice (single select and multiple select). Scenario-heavy — most questions present production ML scenarios requiring you to choose the best approach. Covers advanced topics not on the Associate exam including distributed hyperparameter tuning, Lakehouse Monitoring, Databricks Asset Bundles, and deployment strategies.

Scoring

Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure. Questions are weighted equally across all domains. May include unscored items for statistical analysis — these are not identified and do not impact your score.

Domains & Weights

  • Model Development: 44%
  • MLOps: 44%
  • Model Deployment: 12%
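
As a rough illustration of what these weights mean in practice, the arithmetic below converts them into approximate question counts for the 59-question exam. Simple rounding is an assumption here, not an official breakdown, and unscored items may shift the real numbers.

```python
# Approximate question counts per domain, assuming simple rounding of
# weight x total questions (an assumption, not an official breakdown).
TOTAL_QUESTIONS = 59
DOMAIN_WEIGHTS = {
    "Model Development": 0.44,
    "MLOps": 0.44,
    "Model Deployment": 0.12,
}


def approx_question_counts(total, weights):
    """Return {domain: rounded question count} for the given weights."""
    return {domain: round(total * w) for domain, w in weights.items()}


counts = approx_question_counts(TOTAL_QUESTIONS, DOMAIN_WEIGHTS)
for domain, n in counts.items():
    print(f"{domain}: ~{n} questions")
```

Under that assumption the two large domains account for roughly 26 questions each and Model Deployment for about 7.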

Registration

$200 USD per attempt. Available through Kryterion testing centers or online proctored via WebAssessor. Schedule at databricks.com/certification. No formal prerequisites, but Databricks recommends the ML Associate certification and 1+ years of hands-on production ML experience on Databricks. Credential is valid for 2 years.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1 (Must Know): Deep understanding required. These appear across multiple domains and form the foundation of professional-level questions. Know internals, edge cases, and production patterns.
Tier 2 (Should Know): Understand use cases, configuration, and key behaviors. May appear in 3-8 questions each.
Tier 3 (Recognize Only): Know at a high level what it does and when to use it. Rarely more than 1-2 questions each.

Domain 1: 44% of exam

Model Development

The largest domain at 44% (~26 questions). Tests advanced ML pipeline construction with SparkML, distributed training and hyperparameter tuning (Optuna, Ray, pandas Function APIs), advanced MLflow usage (nested runs, PyFunc custom models), and Feature Store concepts (point-in-time correctness, online tables, streaming features, on-demand features). This domain has 22 objectives across 4 sub-sections: SparkML (7), Scaling and Tuning (7), Advanced MLflow (3), and Advanced Feature Store (5).

Key Topics

SparkML Pipelines, Optuna, Ray, MLflow Tracking, MLflow PyFunc, Feature Store, Online Tables, Pandas Function APIs, CrossValidator, VectorAssembler

Must-Know Concepts

  • SparkML Pipeline construction: stages chain estimators and transformers. Pipeline.fit() trains all stages sequentially, PipelineModel.transform() applies them in order
  • Feature transformer selection: StringIndexer (strings to indices), OneHotEncoder (indices to binary vectors), VectorAssembler (columns to single Vector), StandardScaler (standardization to unit variance, optionally zero mean)
  • CrossValidator with ParamGridBuilder for hyperparameter tuning: CrossValidator trains k x n models during the search (folds x parameter combinations). Use with RegressionEvaluator or classification evaluators
  • When to use SparkML vs single-node: SparkML for distributed data (millions+ rows). For single-node models at scale, distribute tuning with SparkTrials or Ray, not SparkML
  • Optuna-MLflow integration: Optuna's define-by-run API with MLflow callback for logging. Supports pruning (early stopping bad trials) and multi-objective optimization
  • Ray for distributed tuning: distributes independent Python functions across a cluster. Better than Spark for embarrassingly parallel compute-bound workloads
  • Pandas Function APIs: applyInPandas() for group-specific model training (e.g., one model per store), mapInPandas() for partition-level distributed inference
  • Vertical vs horizontal scaling: vertical = bigger machines (more RAM/CPU per node), horizontal = more machines. Vertical for memory-bound, horizontal for compute-bound workloads
  • Data parallelism vs model parallelism: data parallelism splits data across workers (each has full model), model parallelism splits the model across workers (each has part of model)
  • Nested MLflow runs: use mlflow.start_run(nested=True) to create parent-child hierarchies for organizing hyperparameter search results under a single parent run
  • PyFunc custom models: wrap custom inference logic (pre/post-processing, feature engineering) so it runs at prediction time. Ensures training-serving consistency for complex pipelines
  • Custom metric/parameter/artifact logging: log_metric() for numeric values, log_param() for configuration, log_artifact() for files (plots, data samples, configs)
  • Point-in-time correctness: Feature Store retrieves features as they existed at the prediction timestamp, preventing future data leakage during historical training
  • Online tables: low-latency feature serving synced from offline Feature Store tables. Required for real-time serving endpoints that need feature lookups during inference
  • On-demand features: computed at request time for features that depend on the prediction request itself (e.g., time since last login). Ensures training-serving consistency
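
The k x n rule for CrossValidator in the list above can be checked with a little arithmetic. The sketch below is plain Python rather than SparkML; the grid values are hypothetical, and the extra "+1" reflects my understanding that Spark refits the best combination once on the full training set after the search.

```python
from itertools import product


def count_cv_fits(num_folds, param_grid, include_final_refit=True):
    """Count model fits for k-fold cross-validation over a parameter grid."""
    combos = list(product(*param_grid.values()))
    fits = num_folds * len(combos)  # k folds x n combinations
    if include_final_refit:
        fits += 1  # assumed single refit of the best combination
    return len(combos), fits


# Hypothetical grid: 3 depths x 2 tree counts = 6 combinations.
grid = {"maxDepth": [3, 5, 7], "numTrees": [50, 100]}
n_combos, n_fits = count_cv_fits(num_folds=5, param_grid=grid)
print(f"{n_combos} combinations, {n_fits} fits in total")
```

Adding one more value to any hyperparameter multiplies the cost, which is why grid size and fold count are worth estimating before launching a search.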

Common Exam Traps

SparkML requires all features in a single Vector column (VectorAssembler). Forgetting this step is the most common pipeline error — models will fail without it
OneHotEncoder requires numeric indices, not strings — apply StringIndexer FIRST. Direct string input to OneHotEncoder raises an error
Hyperopt and Optuna both MINIMIZE by default. Return -accuracy (negative) if maximizing a metric, or in Optuna create the study with direction="maximize"; otherwise the optimizer finds the worst model
Too much parallelism degrades Bayesian optimization — TPE/Optuna need completed trials to guide proposals. With max parallelism, it degenerates to random search
applyInPandas() is for grouped operations (after groupBy), mapInPandas() is for partition-level processing. Confusing the two yields incorrect results for group-specific training
PyFunc models run custom code at prediction time — ensure all dependencies are logged with the model via conda_env or pip_requirements, or serving will fail with import errors
Point-in-time lookups only work if the Feature Store table has a timestamp key. Without it, the latest feature value is always returned regardless of prediction time
Online tables have a sync delay from offline tables — real-time predictions may use slightly stale features. Design tolerance for this latency in serving architecture

Quick Check: Model Development

A data scientist needs to train a separate demand forecasting model for each of 500 retail stores using the same scikit-learn algorithm. The training data is stored in a single Spark DataFrame with a store_id column. What is the most efficient approach?

Domain 2: 44% of exam

ML Operations (MLOps)

Tied for the largest domain at 44% (~26 questions). Tests model lifecycle management (dev to staging to prod), validation testing strategies (unit, integration, end-to-end), environment architectures with Databricks Asset Bundles, automated retraining workflows, and — most heavily — Lakehouse Monitoring for drift detection. This domain has 20 objectives across 5 sub-sections: Model Lifecycle (2), Validation Testing (4), Environment Architectures (2), Automated Retraining (2), and Drift Detection/Lakehouse Monitoring (10).

Key Topics

Lakehouse Monitoring, Unity Catalog Models, Databricks Asset Bundles, pytest, Inference Tables, Databricks Workflows, Champion-Challenger Pattern

Must-Know Concepts

  • Model lifecycle architecture: deploy CODE (not models) across environments. Train in dev, validate in staging, serve in prod. The same pipeline code runs in each environment with different configurations
  • Unity Catalog model aliases replace legacy stage transitions. Assign aliases like 'champion' and 'challenger' to model versions for lifecycle management
  • Unit testing ML code: test individual transformation and feature engineering functions in isolation using pytest. Test data quality assertions, schema validation, and edge cases
  • Integration testing: test component interactions across environments — verify feature pipelines produce expected output types, model training completes, and predictions are within expected ranges
  • End-to-end pipeline testing: validate the full pipeline from feature computation through training, evaluation, and deployment. Use test datasets and temporary catalogs
  • Test organization: separate unit tests (fast, isolated) from integration tests (slower, require infrastructure). Run unit tests on every commit, integration tests on merge to main
  • Databricks Asset Bundles (DABs): define ML resources (jobs, pipelines, serving endpoints) as YAML. Deploy to dev/staging/prod with environment-specific overrides using targets
  • Infrastructure-as-code with DABs: version control ML pipeline configurations alongside code. Enables reproducible deployments and rollbacks across environments
  • Automated retraining triggers: monitor for data drift, prediction drift, or performance degradation. When thresholds are breached, trigger retraining workflows automatically
  • Champion-challenger pattern: train a new model (challenger), compare it against the current model (champion) on held-out data or A/B test in production, promote only if the challenger wins
  • Lakehouse Monitoring statistical tests: Kolmogorov-Smirnov (KS) for numerical drift, Chi-squared for categorical drift, Jensen-Shannon divergence for distribution comparison
  • Three monitoring table types: snapshot (point-in-time data quality), time series (temporal trends), inference (model inputs/outputs/performance)
  • Monitor creation and configuration: create monitors on Delta tables in Unity Catalog, configure refresh schedules, set baseline tables for comparison
  • Custom metrics in Lakehouse Monitoring: define business-specific metrics beyond built-in statistical tests. Use SQL expressions for custom metric computation
  • Feature slicing: analyze drift and performance for specific data segments (e.g., by region, customer type). Identifies localized issues that aggregate metrics miss
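
To make the three drift statistics named above concrete, here is a minimal pure-Python sketch of each. It is not Lakehouse Monitoring's implementation: the chi-squared version is a goodness-of-fit statistic against baseline proportions, the sample data is invented, and no p-values are computed.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def category_probs(values, categories):
    """Relative frequency of each category, in the given category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def chi_squared_statistic(baseline, current):
    """Pearson statistic of current counts against baseline proportions."""
    categories = sorted(set(baseline) | set(current))
    observed = Counter(current)
    stat = 0.0
    for c, p in zip(categories, category_probs(baseline, categories)):
        expected = p * len(current)
        if expected > 0:  # categories unseen in the baseline are skipped
            stat += (observed[c] - expected) ** 2 / expected
    return stat


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1])."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Invented numerical feature (KS) and categorical feature (chi-squared, JS).
baseline_amounts = [10, 20, 30, 40, 50]
current_amounts = [30, 40, 50, 60, 70]
baseline_methods = ["card"] * 6 + ["cash"] * 4
current_methods = ["card"] * 3 + ["cash"] * 7

cats = sorted(set(baseline_methods) | set(current_methods))
ks = ks_statistic(baseline_amounts, current_amounts)
chi2 = chi_squared_statistic(baseline_methods, current_methods)
js = js_divergence(category_probs(baseline_methods, cats),
                   category_probs(current_methods, cats))
print(f"KS={ks:.2f}  chi2={chi2:.2f}  JS={js:.3f}")
```

The numerical feature goes through the CDF-based statistic and the categorical feature through the count-based ones, which mirrors the feature-type rule the exam tests.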

Common Exam Traps

Deploy CODE, not models — the production environment runs the same training pipeline as dev, but with production data and configuration. Exporting a trained model file from dev to prod is an anti-pattern
Unity Catalog aliases are flexible — you can create any alias name (not limited to 'champion'/'challenger'). But the exam typically uses champion/challenger as the standard pattern
Unit tests should NOT test model accuracy — they test deterministic functions (data transformations, feature logic). Model accuracy varies with data and is validated in integration/E2E tests
Lakehouse Monitoring drift metrics are computed on REFRESH, not continuously. If your refresh interval is daily, drift that occurs and resolves within a day may be missed
KS test is for numerical features ONLY. Chi-squared is for categorical features ONLY. Using the wrong test produces meaningless results — the exam tests which test to apply for each feature type
Inference tables log raw request/response data — they are NOT drift metrics themselves. Lakehouse Monitoring analyzes inference tables to compute drift metrics and performance trends
Champion-challenger comparison must use the SAME evaluation dataset and metrics. Comparing models on different data subsets or with different metrics invalidates the comparison
DABs targets inherit from the default configuration — never duplicate the full config per target. Only specify environment-specific overrides (cluster sizes, catalog names, permissions)
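
The inheritance idea behind DABs targets can be pictured as a recursive overlay of target-specific overrides on a shared base. The sketch below is a plain-Python illustration only: real bundles are written in YAML, the config keys are hypothetical, and this is not how the Databricks CLI actually resolves targets.

```python
import copy


def merge_overrides(base, overrides):
    """Recursively overlay target-specific overrides on a shared base config."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical shared settings and per-target overrides (names are illustrative).
base_config = {
    "job": {
        "name": "train_model",
        "cluster": {"num_workers": 2, "node_type": "small"},
    },
    "catalog": "dev_catalog",
}
targets = {
    "dev": {},  # inherits everything
    "prod": {"job": {"cluster": {"num_workers": 8}}, "catalog": "prod_catalog"},
}

resolved = {name: merge_overrides(base_config, o) for name, o in targets.items()}
print(resolved["prod"])
```

The prod target states only what differs (worker count and catalog) and picks up the job name and node type from the base, which is the pattern the trap above describes.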

Quick Check: ML Operations (MLOps)

An ML team notices their fraud detection model's precision has dropped 15% over the past month. Before retraining, they want to identify whether the drop is caused by data drift or concept drift. Which approach is most appropriate?

Domain 3: 12% of exam

Model Deployment

The smallest domain at 12% (~7 questions). Tests deployment strategies (blue-green, canary), custom model serving with PyFunc, REST API integration, and model rollout management. This domain has 5 objectives across 2 sub-sections: Deployment Strategies (2) and Custom Model Serving (3). Despite its low weight, these questions are often the most scenario-heavy and nuanced.

Key Topics

Model Serving Endpoints, PyFunc, REST API, MLflow Deployments SDK, Traffic Splitting, Blue-Green Deployment, Canary Deployment

Must-Know Concepts

  • Blue-green deployment: two identical environments. Traffic switches entirely from old (blue) to new (green) version. Instant rollback by switching back. Higher cost but zero-downtime
  • Canary deployment: gradual traffic routing to new version (e.g., 5% to 25% to 50% to 100%). Monitor metrics at each step. Lower risk but slower full rollout
  • Evaluate deployment strategy suitability: blue-green for high-traffic applications needing instant rollback, canary for gradual validation with real production traffic
  • PyFunc model registration in Unity Catalog: log custom models with mlflow.pyfunc.log_model(), register in Unity Catalog for governance and lineage tracking
  • REST API querying: send prediction requests to model serving endpoints via HTTP POST with JSON payloads. Handle authentication with Databricks personal access tokens
  • MLflow Deployments SDK: programmatic interface for creating, updating, and querying model serving endpoints. Alternative to REST API for Python-based workflows
  • Custom artifact management: log additional files (preprocessing pipelines, lookup tables, configuration) with the model so they are available at serving time
  • Model deployment methods: UI (click-based), REST API (programmatic), MLflow Deployments SDK (Python). Know when to use each based on automation needs
  • Traffic splitting for gradual rollout: configure percentage-based traffic routing between model versions on the same serving endpoint for A/B testing or canary deployment
  • Endpoint scaling and latency: configure auto-scaling, warm-up strategies, and appropriate instance types to meet latency SLAs for real-time serving
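
The REST querying item above can be sketched with the standard library alone. The block builds the HTTP POST request without sending it. The workspace host, endpoint name, token, and feature names are hypothetical, and the `/serving-endpoints/<name>/invocations` path and `dataframe_records` payload key reflect my understanding of the Model Serving scoring API, so verify them against the current documentation.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, records):
    """Build (but do not send) a POST request for a model serving endpoint."""
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # PAT or service principal token
            "Content-Type": "application/json",
        },
    )


request = build_invocation_request(
    host="https://example-workspace.cloud.databricks.com",  # hypothetical host
    endpoint_name="credit-scoring",                          # hypothetical name
    token="<personal-access-token>",                         # placeholder
    records=[{"income": 52000, "age": 41}],                  # hypothetical features
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

In real code the token would come from a secret store or environment variable, never from a literal in source.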

Common Exam Traps

Blue-green deployment requires running TWO full environments simultaneously — double the infrastructure cost during deployment. If cost is a primary concern, canary uses fewer additional resources
Canary deployment does NOT provide instant rollback — rolling back requires gradually shifting traffic back, which takes time. If instant rollback is required, choose blue-green
PyFunc models must include ALL dependencies (conda_env or pip_requirements) when logged. Missing dependencies cause serving endpoint startup failures that are difficult to debug in production
REST API authentication requires a Databricks personal access token or service principal token. Anonymous access to serving endpoints is not supported by default
Traffic splitting percentages must sum to 100% across all model versions on an endpoint. The exam may present configurations that do not sum correctly as distractor answers
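
The sum-to-100 rule in the last trap is easy to check mechanically. The sketch below validates a hypothetical traffic configuration and routes requests by percentage bucket; the deterministic bucketing is an illustration only and is not how a serving endpoint assigns traffic.

```python
def validate_traffic_split(routes):
    """Raise ValueError unless percentages are non-negative ints summing to 100."""
    for name, pct in routes.items():
        if not isinstance(pct, int) or pct < 0:
            raise ValueError(f"{name}: percentage must be a non-negative integer")
    total = sum(routes.values())
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")
    return True


def route_request(request_id, routes):
    """Map a request id to a model version using cumulative percentage buckets."""
    bucket = request_id % 100
    upper = 0
    for name, pct in routes.items():
        upper += pct
        if bucket < upper:
            return name
    raise RuntimeError("unreachable when routes sum to 100")


# Hypothetical canary step: 10% of traffic to the new version.
routes = {"champion": 90, "challenger": 10}
validate_traffic_split(routes)

counts = {name: 0 for name in routes}
for request_id in range(1000):
    counts[route_request(request_id, routes)] += 1
print(counts)
```

A configuration such as 60/30 fails validation immediately, which is the kind of distractor the trap warns about.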

Quick Check: Model Deployment

A financial services company is deploying a new credit scoring model to production. The model serves 50,000 requests per hour and any incorrect predictions could result in regulatory fines. They need the ability to instantly roll back if the new model underperforms. Which deployment strategy is most appropriate?

Key ML Professional Concepts Compared

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

Optuna vs Hyperopt

Use Optuna when…

Modern hyperparameter optimization with native MLflow integration, define-by-run API, pruning support for early stopping unpromising trials, and multi-objective optimization. Preferred for new Databricks workloads.

Use Hyperopt when…

Legacy Bayesian hyperparameter optimization using TPE (Tree of Parzen Estimators) with SparkTrials for Spark-native distributed tuning. Still widely used but being superseded by Optuna on Databricks.

Exam trap

Optuna is the recommended approach on the current exam (September 2025 version). Hyperopt is still tested but Optuna-MLflow integration is the focus for distributed tuning questions. Both minimize by default — return negative values for metrics you want to maximize.
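
The minimize-by-default trap can be shown without either library. In this sketch a stand-in optimizer (an exhaustive search, not Optuna's or Hyperopt's API) minimizes whatever the objective returns; the accuracy curve is invented for illustration.

```python
def minimize(objective, candidates):
    """Stand-in optimizer: return the candidate with the lowest objective value."""
    return min(candidates, key=objective)


def accuracy(depth):
    """Hypothetical validation accuracy that peaks at depth 6."""
    return 0.90 - 0.01 * (depth - 6) ** 2


depths = list(range(1, 11))

# Trap: returning accuracy directly makes the optimizer seek the WORST model.
wrong_depth = minimize(accuracy, depths)

# Fix: return the negated metric so that minimizing it maximizes accuracy.
right_depth = minimize(lambda d: -accuracy(d), depths)

print(f"minimizing accuracy picks depth {wrong_depth}")
print(f"minimizing -accuracy picks depth {right_depth}")
```

With Optuna specifically, creating the study with direction="maximize" is an alternative to flipping the sign by hand.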

Ray vs Spark (SparkTrials / pandas Function APIs)

Use Ray when…

Distributed Python-native computing for embarrassingly parallel workloads: training many independent models, running compute-heavy tuning trials. Works independently of Spark DataFrames. Best for task parallelism (many independent Python functions), which is distinct from splitting one model across workers.

Use Spark (SparkTrials / pandas Function APIs) when…

Data-parallel distributed computing. SparkTrials distributes Hyperopt trials across Spark workers. Pandas Function APIs (applyInPandas/mapInPandas) distribute pandas-based operations across Spark partitions. Best for data parallelism.

Exam trap

Ray and Spark serve different parallelism needs. Ray is for distributing independent Python functions (task parallelism). Spark is for distributing data processing (data parallelism). The exam tests when to choose each based on whether the bottleneck is compute or data volume.

Blue-Green Deployment vs Canary Deployment

Use Blue-Green Deployment when…

Two full environments run simultaneously. Traffic switches entirely from old (blue) to new (green). Instant rollback by switching back. Higher infrastructure cost but zero-downtime, all-or-nothing deployment.

Use Canary Deployment when…

Gradual traffic routing — start with 5-10% traffic to the new model, increase as confidence grows. Lower risk per exposure but slower full rollout. Better for validating with real production traffic before committing.

Exam trap

Blue-green is best when you need instant, full rollback capability (e.g., critical financial models). Canary is best when you want to validate with real traffic incrementally. The exam tests which strategy fits specific production risk profiles.

Unity Catalog Model Aliases vs Legacy Model Registry Stages

Use Unity Catalog Model Aliases when…

Current approach — assign named aliases (e.g., 'champion', 'challenger') to model versions in Unity Catalog. Flexible, supports custom alias names, and integrates with Unity Catalog governance.

Use Legacy Model Registry Stages when…

Legacy approach — model versions transition through fixed stages (None, Staging, Production, Archived). Being deprecated in favor of Unity Catalog aliases.

Exam trap

The September 2025 exam tests Unity Catalog aliases, NOT legacy stage transitions. If a question mentions 'champion/challenger' patterns, think aliases. If it mentions Staging/Production stages, it is referencing the legacy approach — the correct modern answer uses aliases.
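
The alias mechanics can be pictured with a toy registry. This is a stand-in written for illustration, not the MLflow client; in MLflow the corresponding operations are, to my knowledge, MlflowClient.set_registered_model_alias and loading by a "models:/<name>@<alias>" URI. The version numbers and metrics below are hypothetical.

```python
class AliasRegistry:
    """Toy stand-in for the alias-to-version mapping of one registered model."""

    def __init__(self):
        self._aliases = {}

    def set_alias(self, alias, version):
        # An alias points at exactly one version; setting it again moves it.
        self._aliases[alias] = version

    def resolve(self, alias):
        return self._aliases[alias]


def promote_if_better(registry, champion_metric, challenger_metric):
    """Move 'champion' to the challenger's version only if it scores higher."""
    if challenger_metric > champion_metric:
        registry.set_alias("champion", registry.resolve("challenger"))
        return True
    return False


registry = AliasRegistry()
registry.set_alias("champion", 3)    # hypothetical current production version
registry.set_alias("challenger", 4)  # hypothetical newly trained version

# Both metrics are assumed to come from the SAME evaluation dataset.
promoted = promote_if_better(registry, champion_metric=0.91, challenger_metric=0.93)
print(promoted, registry.resolve("champion"))
```

Because consumers load the model by alias, moving the alias changes what they receive without any change to their code.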

Point-in-Time Feature Lookup vs Standard Feature Lookup

Use Point-in-Time Feature Lookup when…

Retrieves feature values as they existed at a specific timestamp, preventing future data from leaking into historical training examples. Essential for time-sensitive ML tasks (fraud detection, demand forecasting).

Use Standard Feature Lookup when…

Retrieves the latest feature values without timestamp awareness. Simpler but risks data leakage — features computed after the prediction timestamp may contaminate training data.

Exam trap

Point-in-time correctness is the #1 Feature Store concept on the exam. Without it, a fraud model trained on historical transactions might use account features computed AFTER the fraud occurred — inflating accuracy during training but failing in production.
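
The fraud example above can be reduced to a small as-of lookup. The sketch is plain Python, not the Feature Store API; the feature history is invented and integer timestamps stand in for real datetimes.

```python
from bisect import bisect_right


def as_of_lookup(feature_history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, else None."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_ts)
    return feature_history[idx - 1][1] if idx > 0 else None


# Hypothetical history of one account's chargeback count: (timestamp, value),
# sorted by timestamp. The jump at t=300 happens AFTER the transaction below.
chargebacks = [(100, 0), (200, 1), (300, 5)]
transaction_ts = 250

point_in_time_value = as_of_lookup(chargebacks, transaction_ts)
latest_value = chargebacks[-1][1]  # what a lookup without a timestamp key returns

print(f"point-in-time: {point_in_time_value}, latest: {latest_value}")
```

Training on the latest value would hand the model information from t=300 for a transaction at t=250, which is exactly the leakage a timestamp key prevents.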

Snapshot Monitoring vs Time Series / Inference Monitoring

Use Snapshot Monitoring when…

Monitors point-in-time snapshots of a Delta table. Compares current data distribution against a baseline. Best for detecting data quality issues in static tables.

Use Time Series / Inference Monitoring when…

Monitors data over time windows. Time series monitoring tracks feature distributions across windows. Inference monitoring tracks model inputs, outputs, and performance metrics. Best for detecting drift in production models.

Exam trap

Lakehouse Monitoring supports three table types and each produces different metrics. Inference tables are specifically for model monitoring — they track prediction distributions, input drift, and performance trends. The exam tests which table type and monitoring approach to use for different scenarios.

Data Parallelism vs Model Parallelism

Use Data Parallelism when…

Distribute training data across workers, each training a copy of the full model on a data subset. Gradients are aggregated across workers. Best for large datasets with models that fit in a single worker's memory.

Use Model Parallelism when…

Distribute model layers or components across workers. Each worker holds part of the model. Best for models too large to fit in a single worker's memory (e.g., large deep learning models).

Exam trap

Most Databricks ML workloads use data parallelism (for example, SparkML training on a distributed DataFrame). Model parallelism is needed only when the model itself exceeds single-node memory, which is rare for classical ML but common for large deep learning models. The exam tests when each is appropriate.
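
Data parallelism rests on a simple identity: with equal-sized shards, the average of per-worker gradients equals the gradient over the full dataset. The sketch below checks this for a one-parameter linear model. The data and worker count are invented, and the "workers" are just list slices processed in sequence.

```python
def gradient(w, batch):
    """Gradient with respect to w of the mean squared error of y ~ w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, data, num_workers):
    """Shard the data equally, compute a gradient per shard, then average."""
    if len(data) % num_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    size = len(data) // num_workers
    shards = [data[i * size:(i + 1) * size] for i in range(num_workers)]
    local_grads = [gradient(w, shard) for shard in shards]  # one per worker
    return sum(local_grads) / num_workers                   # aggregation step


# Invented dataset following y = 3x; every worker holds the full model (just w).
data = [(x, 3 * x) for x in range(1, 9)]
w = 0.0

full = gradient(w, data)
parallel = data_parallel_gradient(w, data, num_workers=4)
print(full, parallel)
```

Model parallelism has no such shortcut: each worker holds only part of the model, so activations have to move between workers during every forward and backward pass.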

Concept Drift vs Data Drift

It is Concept Drift when…

The relationship between input features and the target variable has changed (P(Y|X) shifts). Model performance degrades even though input distributions may look stable. Requires retraining on recent labeled data.

It is Data Drift when…

The distribution of input features has changed (P(X) shifts). Detected by comparing feature distributions against a baseline using KS test (numerical) or Chi-squared (categorical). May or may not affect model performance.

Exam trap

Concept drift can occur even when feature distributions are stable — always monitor prediction quality alongside feature distributions. Data drift may not cause performance issues if the model generalizes well to the shifted distribution.
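
A tiny invented example shows concept drift with no data drift at all: the inputs keep the same distribution, the labeling rule changes, and a frozen model loses accuracy. The model, thresholds, and values are hypothetical.

```python
def model(amount):
    """Frozen model learned on old data: flag transactions above 100."""
    return amount > 100


def accuracy(amounts, labels):
    """Fraction of transactions where the frozen model matches the label."""
    return sum(model(a) == y for a, y in zip(amounts, labels)) / len(amounts)


baseline_amounts = [20, 60, 80, 120, 150, 200]
current_amounts = [150, 20, 200, 80, 60, 120]  # same values, so P(X) is unchanged

baseline_labels = [a > 100 for a in baseline_amounts]  # old relationship P(Y|X)
current_labels = [a > 50 for a in current_amounts]     # fraud threshold moved

inputs_shifted = sorted(baseline_amounts) != sorted(current_amounts)
acc_before = accuracy(baseline_amounts, baseline_labels)
acc_after = accuracy(current_amounts, current_labels)

print(f"inputs shifted: {inputs_shifted}")
print(f"accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
```

A feature-distribution check alone would report nothing here, which is why prediction quality has to be monitored against labels as well.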

Top Mistakes to Avoid

Confusing data drift (input feature distributions change) with concept drift (relationship between features and target changes) — each requires different detection and remediation strategies
Using KS test for categorical features or Chi-squared for numerical features — KS is for numerical distributions only, Chi-squared is for categorical distributions only
Deploying trained models instead of training code across environments — the professional pattern is to deploy the same pipeline code to each environment and train with environment-specific data
Forgetting point-in-time correctness in Feature Store lookups — without it, future data leaks into historical training examples, inflating offline metrics that fail to translate to production performance
Over-parallelizing Bayesian hyperparameter optimization — TPE/Optuna need completed trial results to guide proposals. Excessive parallelism degenerates to random search with no intelligent guidance
Confusing Unity Catalog model aliases with legacy Model Registry stages — the current exam tests the alias-based approach (champion/challenger), not the legacy Staging/Production/Archived stages
Testing model accuracy in unit tests — unit tests should validate deterministic functions (data transformations, feature logic). Model accuracy belongs in integration or end-to-end tests with real data
Using model serving endpoints for batch inference — serving endpoints are designed for real-time, low-latency predictions. Batch scoring should use spark_udf or dedicated batch inference jobs
Not logging dependencies with PyFunc models — missing conda_env or pip_requirements causes serving endpoint startup failures that are difficult to diagnose
Assuming Lakehouse Monitoring runs continuously — drift metrics are computed on scheduled REFRESH, not in real time. Configure refresh frequency based on your monitoring SLA requirements
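
For the unit-testing mistake in the list above, here is a minimal sketch of what a suitable unit test looks like: a deterministic feature function and pytest-style tests for it. The function and its column names are hypothetical. The tests use plain assert and try/except so the block has no third-party imports, although pytest.raises would be the more idiomatic choice in a real suite.

```python
import math


def add_log_amount(rows):
    """Deterministic feature: add log1p(amount); reject negative amounts."""
    out = []
    for row in rows:
        if row["amount"] < 0:
            raise ValueError("amount must be non-negative")
        out.append({**row, "log_amount": math.log1p(row["amount"])})
    return out


def test_adds_expected_column():
    result = add_log_amount([{"amount": 0.0}])
    assert result == [{"amount": 0.0, "log_amount": 0.0}]


def test_does_not_mutate_input():
    rows = [{"amount": 9.0}]
    add_log_amount(rows)
    assert rows == [{"amount": 9.0}]


def test_rejects_negative_amount():
    try:
        add_log_amount([{"amount": -1.0}])
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

None of these tests trains a model or checks accuracy. They pass or fail for the same input every time, which is what keeps them fast enough to run on every commit.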

Exam-Ready Checklist

Can explain all 3 exam domains, their weights (44/44/12), and the sub-sections within each domain
Know SparkML pipeline construction: stages, estimators, transformers, VectorAssembler, StringIndexer, OneHotEncoder, CrossValidator
Understand distributed tuning: Optuna-MLflow integration, Ray vs Spark trade-offs, pandas Function APIs (applyInPandas vs mapInPandas)
Can explain vertical vs horizontal scaling and data parallelism vs model parallelism — when to use each
Know advanced MLflow: nested runs, PyFunc custom models with pre/post-processing, custom metric/parameter/artifact logging
Understand Feature Store deeply: point-in-time correctness, online tables, streaming features, on-demand features, training-serving consistency
Know Lakehouse Monitoring: statistical tests (KS, Chi-squared, Jensen-Shannon), table types (snapshot, time series, inference), custom metrics, feature slicing, alert configuration
Can distinguish data drift vs concept drift vs prediction drift and know which monitoring approach detects each
Understand model lifecycle: deploy code (not models), Unity Catalog aliases (champion/challenger), environment promotion patterns
Know ML testing strategies: unit testing (pytest, isolated functions), integration testing (component interactions), end-to-end testing (full pipeline)
Can configure Databricks Asset Bundles: YAML configuration, targets with environment-specific overrides, infrastructure-as-code for ML resources
Understand automated retraining: drift-triggered retraining, champion-challenger comparison, model selection strategies
Know deployment strategies: blue-green (instant rollback, higher cost) vs canary (gradual rollout, lower risk) — when to use each
Can deploy custom PyFunc models: registration in Unity Catalog, REST API querying, MLflow Deployments SDK, traffic splitting
Scored 80%+ on at least two full practice exams covering all 3 domains
Reviewed all incorrect answers and understand why the right answer is right

Recommended Resources

Free & Official Resources

Paid Courses & Practice Exams

These are recommended if you prefer a structured learning path. They can save time but are not required to pass.

Frequently Asked Questions