CertPrepNow (Free)
Databricks ML Associate (updated 2026-06-05)

ML Associate Study Guide

Everything you need to pass the Databricks Certified Machine Learning Associate exam. Structured study plans, key services, common traps, and practice questions.

You Can Pass This Exam For Free

The Databricks Certified Machine Learning Associate exam is passable with free resources if you study consistently for 2–3 weeks:

  • Databricks Academy free learning paths (Machine Learning Associate track)
  • Official Databricks documentation for MLflow, Spark ML, and Feature Store
  • Databricks Community Edition (free cluster for hands-on ML practice)
  • 500+ free practice questions on this site

Hands-on practice is essential — the exam tests MLflow API calls, Spark ML pipeline syntax, and feature engineering code, not just theory. Spin up a free Community Edition cluster and build real ML pipelines.

Choose Your Study Path

No prior Spark ML or MLflow experience. You may have basic Python and ML theory knowledge. You'll build foundational knowledge from scratch over 3 weeks.

Day 1–2: ML fundamentals review: supervised vs unsupervised learning, classification vs regression, bias-variance tradeoff, overfitting vs underfitting
Day 3–4: Databricks platform basics: workspaces, clusters, ML Runtime, notebooks. Sign up for Databricks Community Edition. Understand the difference between standard and ML Runtime
Day 5–7: Spark ML basics: Estimators, Transformers, Pipelines, VectorAssembler. Build a simple LinearRegression pipeline from scratch in a notebook
Day 8–9: Feature engineering with Spark ML: StringIndexer, OneHotEncoder, StandardScaler, Imputer, Bucketizer. Understand the fit/transform pattern
Day 10–11: MLflow fundamentals: experiment tracking (mlflow.start_run, log_param, log_metric, log_model), autologging, comparing runs in the UI
Day 12–13: Model evaluation: RegressionEvaluator (RMSE, MAE, R²), BinaryClassificationEvaluator (AUC), MulticlassClassificationEvaluator (F1, accuracy). Cross-validation with CrossValidator
Day 14–15: Hyperparameter tuning: ParamGridBuilder, CrossValidator, TrainValidationSplit. Hyperopt with SparkTrials for distributed tuning. Know when to use each
Day 16–17: MLflow Model Registry: registering models, model versioning, stage transitions (Staging/Production/Archived). Batch inference with mlflow.pyfunc.spark_udf
Day 18–19: Feature Store, AutoML, pandas API on Spark, Pandas UDFs. Understand distributed inference patterns with mapInPandas
Day 20: Practice questions across all 5 domains, review explanations carefully
Day 21: Take a full mock exam. Review all wrong answers. Retake if below 75%

Exam Overview

Format

45 questions, 90 minutes. Multiple choice (single select and multiple select). Python-first — questions prefer PySpark and MLflow Python syntax, with some conceptual ML theory questions.

Scoring

Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure.

Domains & Weights

  • Machine Learning Fundamentals: 18%
  • ML Development and Feature Engineering: 27%
  • Model Training and Evaluation: 22%
  • Model Deployment and Management: 18%
  • ML Operations (MLOps): 15%

Registration

$200 USD per attempt. Available through Kryterion testing centers or via online proctoring. Schedule at databricks.com/certification.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1: Must Know. You must understand these deeply: what they do, how to use them, and their exact API syntax. These appear across multiple domains.
Tier 2: Should Know. Understand what they do and key use cases. May appear in 2–5 questions.
Tier 3: Recognize Only. Know what they do at a high level. Rarely more than 1–2 questions each.
Domain 1 (18% of exam)

Machine Learning Fundamentals

This domain covers ~18% and tests foundational ML concepts: supervised vs unsupervised learning, classification vs regression, ensemble methods (bagging vs boosting), bias-variance tradeoff, overfitting, and Databricks platform components for ML (ML Runtime, clusters, notebooks).

Key Topics

Databricks ML Runtime, Spark ML Estimators/Transformers, Clusters, Notebooks

Must-Know Concepts

  • Supervised learning: labeled data, includes classification (predicting categories) and regression (predicting continuous values)
  • Unsupervised learning: unlabeled data, includes clustering (grouping similar items) and dimensionality reduction
  • Spark ML Estimator: implements fit() and produces a Transformer. Transformer: implements transform() to apply learned parameters
  • Bagging (Random Forest): trains models in parallel on random data subsets, reduces variance. Boosting (XGBoost, GBT): trains models sequentially, each correcting previous errors, reduces bias
  • Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex)
  • Databricks ML Runtime: pre-installed ML libraries (MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch) — eliminates manual installation
  • Standard multi-node cluster for distributed ML workloads; SQL Warehouses cannot run Spark ML

Common Exam Traps

Random Forest uses bagging — NOT boosting. XGBoost and Gradient Boosted Trees use boosting — NOT bagging
Gradient boosting is inherently sequential (each tree depends on previous residuals) — it cannot be fully parallelized across iterations, only within individual tree construction
An Estimator is NOT a pre-trained model — it's the algorithm that produces a Transformer (the trained model) when fit() is called
Single Node clusters cannot run distributed Spark ML — use Standard multi-node clusters for distributed training
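
The Estimator/Transformer contract can be illustrated without a cluster. The following is a toy, pure-Python sketch rather than Spark ML code: `MeanImputer` and `MeanImputerModel` are hypothetical stand-ins for an Estimator and the Transformer it produces. It shows that `fit()` learns a statistic and returns a separate object, and that only that object has `transform()`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MeanImputerModel:
    """Toy 'Transformer': holds the learned mean and applies it."""
    mean: float

    def transform(self, values: List[Optional[float]]) -> List[float]:
        # Replace missing entries with the mean learned during fit().
        return [self.mean if v is None else v for v in values]


class MeanImputer:
    """Toy 'Estimator': exposes fit() only, never transform()."""

    def fit(self, values: List[Optional[float]]) -> MeanImputerModel:
        present = [v for v in values if v is not None]
        return MeanImputerModel(mean=sum(present) / len(present))


train = [1.0, None, 3.0]
test = [None, 10.0]

model = MeanImputer().fit(train)   # statistics come from the training data only
print(model.transform(train))      # [1.0, 2.0, 3.0]
print(model.transform(test))       # [2.0, 10.0]
```

Because the mean is learned from `train` alone and then reused on `test`, the same sketch also mirrors the fit-on-training-data-only rule used to avoid data leakage.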

Domain 2 (27% of exam)

ML Development and Feature Engineering

The largest domain at ~27%. Tests feature engineering techniques (encoding, scaling, imputation), Spark ML transformers, data preparation patterns, pandas API on Spark, Pandas UDFs, Feature Store, AutoML, and data leakage prevention. Expect code-heavy questions on VectorAssembler, StringIndexer, OneHotEncoder, and pipeline construction.

Key Topics

VectorAssembler, StringIndexer, OneHotEncoder, StandardScaler, Imputer, Feature Store, AutoML, Pandas UDFs, pandas API on Spark

Must-Know Concepts

  • VectorAssembler: combines multiple feature columns into a single Vector column, which Spark ML estimators expect as their features input
  • StringIndexer → OneHotEncoder chain: strings must be indexed first (StringIndexer), then encoded (OneHotEncoder). Direct string input to OneHotEncoder fails
  • Imputer: fit() learns statistics (mean/median), transform() applies them. Calling transform() without fit() fails
  • Data leakage prevention: fit scaler/encoder on training data only, then transform both train and test. Use Pipeline inside CrossValidator to ensure this
  • Feature Store: centralized feature repository. Features are looked up by primary key during inference — only pass keys, not all features
  • Pandas API on Spark (pyspark.pandas): pandas-compatible API on distributed Spark. Minimal code changes from pandas — just change the import
  • Pandas UDFs (Vectorized UDFs): process data in Arrow batches, which is typically much faster than row-at-a-time Python UDFs (often quoted as 10-100x, depending on the workload)
  • AutoML: handles feature scaling, encoding, tuning automatically. Generates editable notebooks. EDA must be done separately
  • One-hot encoding should be deferred to model pipelines (not feature store) because different models need different encoding strategies

Common Exam Traps

OneHotEncoder expects numeric indices, not strings — apply StringIndexer first. This is the most common Spark ML pipeline error on the exam
Imputer requires fit() before transform(). Spark ML's two-step pattern differs from pandas' fillna() — fit computes statistics, transform applies them
Standardizing the entire dataset BEFORE splitting creates data leakage. Use Pipeline inside CrossValidator, or fit the scaler on training data only
pandas API on Spark DataFrames are distributed (backed by Spark); they are NOT local pandas DataFrames. Avoid pulling large data to the driver, for example by converting to a local pandas DataFrame with to_pandas()
AutoML generates editable notebooks — it's not a black box. But AutoML does NOT perform EDA — that's the data scientist's responsibility
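
The encoding and assembly rules above can be sketched as a single Spark ML pipeline. This is a minimal sketch, not a reference implementation: the function name `build_pipeline`, the derived column names, and the choice of `handleInvalid="keep"` are assumptions made for illustration. The pyspark imports are placed inside the function so the module can be loaded without Spark installed.

```python
from typing import List


def build_pipeline(categorical_col: str, numeric_cols: List[str], label_col: str):
    """Return an unfitted Pipeline: index -> one-hot -> assemble -> scale -> model."""
    # Imported lazily so this file loads even where pyspark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoder,
        StandardScaler,
        StringIndexer,
        VectorAssembler,
    )
    from pyspark.ml.regression import LinearRegression

    idx_col = f"{categorical_col}_idx"   # hypothetical intermediate column names
    ohe_col = f"{categorical_col}_ohe"

    # Strings must become numeric indices before one-hot encoding.
    indexer = StringIndexer(inputCol=categorical_col, outputCol=idx_col,
                            handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=[idx_col], outputCols=[ohe_col])
    # Spark ML models read features from a single Vector column.
    assembler = VectorAssembler(inputCols=numeric_cols + [ohe_col],
                                outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol=label_col)

    return Pipeline(stages=[indexer, encoder, assembler, scaler, lr])


# Usage on a cluster, assuming an existing DataFrame `df` with these columns:
# train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
# model = build_pipeline("color", ["size", "weight"], "price").fit(train_df)
# predictions = model.transform(test_df)
```

Fitting the whole pipeline on `train_df` means the indexer, encoder, and scaler learn their statistics from training rows only, which is the leakage-safe pattern described above.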

Domain 3 (22% of exam)

Model Training and Evaluation

Covers ~22% of the exam. Tests model training, evaluation metrics, hyperparameter tuning, cross-validation, ensemble methods, and distributed training with Spark ML. Expect questions on choosing the right metric, computing cross-validation scores, and configuring Hyperopt for distributed tuning.

Key Topics

CrossValidator, TrainValidationSplit, RegressionEvaluator, BinaryClassificationEvaluator, MulticlassClassificationEvaluator, Hyperopt, ParamGridBuilder

Must-Know Concepts

  • Regression metrics: RMSE (penalizes large errors), MAE (less sensitive to outliers than RMSE), R² (proportion of variance explained). Use RegressionEvaluator
  • Classification metrics: Accuracy (overall correct), Precision (of predicted positives, how many are correct), Recall (of actual positives, how many are found), F1 (harmonic mean of precision and recall)
  • For imbalanced datasets: accuracy is misleading — use F1 score or recall. A model predicting majority class always gets high accuracy but misses minority class
  • Cross-validation score = mean of fold scores. Example: 5 fold scores [2.5, 3.1, 2.8, 3.4, 2.7] → CV score = 2.9
  • Grid search total models = (hyperparameter combinations) × (CV folds). Example: 9 combinations × 3 folds = 27 models
  • Hyperopt uses TPE (Bayesian): learns from past trials. SparkTrials distributes tuning across Spark workers
  • When target is log-transformed: exponentiate predictions BEFORE computing metrics on original scale
  • DataFrame.randomSplit() for train/test splitting in Spark
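
The arithmetic behind several of these points can be checked in plain Python. The sketch below uses the fold scores from the example above plus made-up confusion-matrix counts and prices (all illustrative assumptions); `classification_metrics` is a hypothetical helper, not a Spark or MLflow function.

```python
import math
from statistics import mean


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1) Cross-validation score is the mean of the per-fold scores.
fold_scores = [2.5, 3.1, 2.8, 3.4, 2.7]
cv_score = mean(fold_scores)
print(round(cv_score, 2))  # 2.9

# 2) Imbalanced data: 990 negatives, 10 positives, model always predicts negative.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative["accuracy"], always_negative["recall"])  # 0.99 0.0

# 3) Log-transformed target: exponentiate predictions before scoring in original units.
actual = [100.0, 200.0]
log_predictions = [math.log(110.0), math.log(190.0)]
predictions = [math.exp(p) for p in log_predictions]
rmse = math.sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predictions)]))
print(round(rmse, 2))  # 10.0
```

The second case shows why accuracy misleads on imbalanced data: 99% accuracy coexists with zero recall on the minority class.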

Common Exam Traps

F1 score is ONLY for classification — never use it for regression. The exam will present F1 as a regression metric option to trap you
Hyperopt MINIMIZES by default — return -accuracy (negative) if maximizing. Returning raw accuracy will cause Hyperopt to find the worst model
CrossValidator trains k × n models (folds × combinations). TrainValidationSplit trains only n models. Know the computational cost difference
Spark ML's decision trees use feature binning — they may produce different results than scikit-learn on the same data because they evaluate binned split candidates, not all possible splits
Too much parallelism with Bayesian optimization degrades results — TPE needs to observe past trials to propose better combinations. Reduce parallelism if optimization stalls
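
The sign convention for Hyperopt can be sketched as follows. This is an illustrative setup under stated assumptions: `toy_accuracy` is a made-up stand-in for real model training, `run_search` is a hypothetical wrapper, and the search space and `parallelism` value are arbitrary. The hyperopt imports sit inside `run_search` so the objective can be inspected without a cluster.

```python
def toy_accuracy(max_depth: float) -> float:
    """Hypothetical stand-in for training a model; peaks at max_depth == 6."""
    return 1.0 - 0.02 * abs(max_depth - 6)


def objective(params: dict) -> dict:
    """Hyperopt minimizes 'loss', so return the NEGATED accuracy."""
    accuracy = toy_accuracy(params["max_depth"])
    # "ok" is the value of hyperopt.STATUS_OK.
    return {"loss": -accuracy, "status": "ok"}


def run_search(max_evals: int = 20, parallelism: int = 2) -> dict:
    """Run TPE search with trials distributed over Spark workers."""
    # Imported lazily; requires hyperopt and an active Spark session.
    from hyperopt import SparkTrials, fmin, hp, tpe

    search_space = {"max_depth": hp.quniform("max_depth", 2, 12, 1)}
    # Keep parallelism modest so TPE can learn from completed trials.
    trials = SparkTrials(parallelism=parallelism)
    return fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)


print(objective({"max_depth": 6}))  # {'loss': -1.0, 'status': 'ok'}
```

With this convention the best model has the lowest (most negative) loss; returning the raw accuracy instead would steer the search toward the least accurate configuration.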

Domain 4 (18% of exam)

Model Deployment and Management

Covers ~18% of the exam. Tests model registration, stage transitions in MLflow Model Registry, batch inference patterns, real-time serving, Feature Store integration for inference, distributed inference with Pandas UDFs, and data pipeline considerations for deployment.

Key Topics

MLflow Model Registry, Model Serving Endpoints, Feature Store, spark_udf, mapInPandas, Pandas UDFs

Must-Know Concepts

  • Register a model: mlflow.register_model(f'runs:/{run_id}/artifact_path', 'model_name') — note the 'runs:/' prefix with forward slash
  • Stage transitions: None → Staging → Production → Archived. Performed on the model version details page in the UI or via API
  • Batch inference: mlflow.pyfunc.spark_udf(spark, model_uri) creates a Spark UDF for distributed scoring across partitions
  • mapInPandas(): applies a function to each Spark partition as a pandas DataFrame — ideal for applying scikit-learn models to distributed data
  • Feature Store inference: register model with Feature Store feature sets, then only pass primary key at inference time — features are auto-retrieved
  • Iterator-based inference: model loaded once per executor and reused across batches — avoids repeated model loading overhead

Common Exam Traps

mlflow.register_model() URI format is 'runs:/{run_id}/artifact' — the forward slash after 'runs:' is required. Without it, registration fails
Stage transitions happen on the model VERSION page, not the model overview page. Each version has its own stage
mapInPandas() processes partitions — use it for batch inference. foreachPartition() is for side effects (e.g., writing) and does NOT return results
Feature Store auto-retrieval only works if the model was logged with Feature Store feature sets. Without this setup, all features must be in the input DataFrame
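
The registration URI and a stage transition can be sketched as below. The helper names `run_model_uri` and `register_and_stage`, and the example run ID and model name, are assumptions for illustration. The MLflow calls are imported inside the function so the URI helper can be used on its own. Note that recent MLflow releases mark the stage APIs as deprecated in favor of aliases, so check the version available in your workspace.

```python
def run_model_uri(run_id: str, artifact_path: str) -> str:
    """Build the 'runs:/<run_id>/<artifact_path>' URI expected by register_model."""
    return f"runs:/{run_id}/{artifact_path}"


def register_and_stage(run_id: str, artifact_path: str, name: str,
                       stage: str = "Staging"):
    """Register a logged model, then move the new version to the given stage."""
    # Imported lazily; requires mlflow and a reachable tracking server.
    import mlflow
    from mlflow.tracking import MlflowClient

    version = mlflow.register_model(run_model_uri(run_id, artifact_path), name)
    # Stages belong to a specific model VERSION, not to the model as a whole.
    MlflowClient().transition_model_version_stage(
        name=name, version=version.version, stage=stage
    )
    return version


print(run_model_uri("abc123", "trained_model"))  # runs:/abc123/trained_model
```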

Domain 5 (15% of exam)

ML Operations (MLOps)

Covers ~15% of the exam. Tests MLflow experiment tracking, autologging, reproducibility, run comparison, model monitoring, job orchestration for ML pipelines, and debugging failed ML jobs. Expect questions on mlflow.search_runs(), nested runs, notebook source tracking, and job diagnostics.

Key Topics

MLflow Tracking, MLflow Autologging, Databricks Jobs, mlflow.search_runs(), Nested Runs

Must-Know Concepts

  • MLflow autologging: mlflow.autolog() automatically logs parameters, metrics, and models for supported frameworks (scikit-learn, XGBoost, Spark ML)
  • Nested runs: mlflow.start_run(nested=True) creates parent-child hierarchies — parent for the experiment, children for individual trials
  • mlflow.search_runs(): programmatically filter and sort runs by metrics to find the best model without using the UI
  • Source tracking: MLflow automatically records which notebook/script produced each run — accessible via the 'Source' link on run details page
  • Databricks Jobs matrix view: diagnose multi-task job failures by viewing status and logs for each task individually
  • Git Folders (Repos): create branches for safe ML experimentation without affecting production pipelines

Common Exam Traps

Autologging does NOT create nested runs — it logs flat runs with parameters and metrics. For parent-child hierarchy, you must manually set nested=True
mlflow.search_runs() returns a pandas DataFrame of runs — you can sort and filter by any logged metric or parameter
The 'Source' link on the MLflow run page shows which notebook created the run — not the MLmodel artifact file or the registered model page
Job runs matrix view shows individual task status — don't rerun all tasks to diagnose a single failure. Use the matrix to identify and fix the failing task
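
A parent run with nested child runs, followed by a programmatic lookup of the best run, might look like the sketch below. The function names, the `rmse` metric name, and the run-naming scheme are assumptions for illustration; `train_and_score` is a caller-supplied function that trains one model and returns its RMSE. MLflow is imported inside the functions so the naming helper works without it.

```python
from typing import Callable, Dict, List


def run_name_for(params: Dict[str, object]) -> str:
    """Build a readable child-run name such as 'maxDepth=4,numTrees=50'."""
    return ",".join(f"{key}={params[key]}" for key in sorted(params))


def log_grid_search(param_grid: List[Dict[str, object]],
                    train_and_score: Callable[[Dict[str, object]], float]) -> str:
    """Log one parent run with one nested child run per parameter combination."""
    import mlflow  # imported lazily; requires mlflow

    with mlflow.start_run(run_name="grid_search") as parent:
        for params in param_grid:
            # nested=True attaches this run to the active parent run.
            with mlflow.start_run(run_name=run_name_for(params), nested=True):
                mlflow.log_params(params)
                mlflow.log_metric("rmse", train_and_score(params))
    return parent.info.run_id


def best_child_run(experiment_id: str):
    """Return the run with the lowest logged RMSE as a pandas Series."""
    import mlflow

    # search_runs returns a pandas DataFrame; metric columns are 'metrics.<name>'.
    runs = mlflow.search_runs(experiment_ids=[experiment_id])
    # Runs without an rmse value (such as the parent) sort last as NaN.
    return runs.sort_values("metrics.rmse").iloc[0]


print(run_name_for({"numTrees": 50, "maxDepth": 4}))  # maxDepth=4,numTrees=50
```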

Key ML Concepts Compared

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

CrossValidator vs TrainValidationSplit

Use CrossValidator when…

You need robust model evaluation with k-fold cross-validation. Better for smaller datasets where every data point matters for both training and validation. More computationally expensive (trains k models per hyperparameter combination).

Use TrainValidationSplit when…

You have a large dataset and want faster evaluation. Uses a single random train/validation split. Trains only one model per hyperparameter combination — significantly faster but less robust.

Exam trap

CrossValidator trains k models per hyperparameter combination (e.g., 5-fold × 9 combinations = 45 models). TrainValidationSplit trains 1 model per combination (9 models). If the question mentions 'computational cost' or 'large dataset,' choose TrainValidationSplit.
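
The model counts can be verified with a few lines of plain Python. The grid values and the helper `expand_grid` are illustrative assumptions; the counts cover only the fits made during the search itself, so any final refit of the best configuration on the full training data would be in addition.

```python
from itertools import product
from typing import Dict, List


def expand_grid(grid: Dict[str, list]) -> List[dict]:
    """Expand a dict of value lists into every parameter combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


grid = {"maxDepth": [2, 4, 6], "numTrees": [10, 50, 100]}
combinations = expand_grid(grid)

n = len(combinations)             # hyperparameter combinations
k = 5                             # cross-validation folds

cross_validator_fits = k * n      # one fit per fold per combination
train_validation_fits = n         # one fit per combination

print(n, cross_validator_fits, train_validation_fits)  # 9 45 9
```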

Grid Search vs Random Search / Bayesian (Hyperopt TPE)

Use Grid Search when…

You have a small hyperparameter space and need to evaluate every combination exhaustively. Best for 2–3 parameters with a few discrete values each.

Use Random Search / Bayesian (Hyperopt TPE) when…

You have a large search space or expensive model training. Random search samples randomly; Bayesian (TPE) learns from past trials to propose better combinations. Both are more efficient than grid search for high-dimensional spaces.

Exam trap

Hyperopt uses Tree of Parzen Estimators (TPE) by default — a Bayesian method. It MINIMIZES the objective, so return negative values for metrics you want to maximize (e.g., -accuracy). Grid search does not learn from past results.

Spark ML Pipelines vs scikit-learn Pipelines

Use Spark ML Pipelines when…

Your data is distributed across a Spark cluster (millions+ rows). Spark ML pipelines parallelize feature engineering and model training across nodes. Use VectorAssembler for feature preparation.

Use scikit-learn Pipelines when…

Your data fits in memory on a single machine. scikit-learn pipelines are simpler, more mature, and have more algorithm options. Use with Hyperopt SparkTrials to parallelize tuning across a cluster.

Exam trap

Spark ML requires features in a single Vector column (VectorAssembler). scikit-learn works with numpy arrays and pandas DataFrames directly. A common exam question tests when to use each framework based on data size.

Batch Inference vs Real-time Serving

Use Batch Inference when…

You need to score large datasets (millions of records) periodically. Use mlflow.pyfunc.spark_udf() to distribute inference across Spark workers. Higher throughput, higher latency.

Use Real-time Serving when…

You need low-latency predictions for individual requests (e.g., APIs, user interactions). Use Databricks Model Serving endpoints. Lower throughput, lower latency.

Exam trap

For batch inference, use spark_udf or mapInPandas — do NOT use model serving endpoints for batch workloads. If the question mentions 'large dataset' or 'periodic scoring,' choose batch inference.
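
A batch scoring job built on `mlflow.pyfunc.spark_udf` could be sketched like this. The function name `score_batch`, the `prediction` column name, and the `result_type="double"` choice are assumptions for illustration; the model URI and feature columns come from the caller.

```python
from typing import List


def score_batch(spark, df, model_uri: str, feature_cols: List[str],
                output_col: str = "prediction"):
    """Add a prediction column to a Spark DataFrame using a logged MLflow model."""
    # Imported lazily; requires mlflow and pyspark on the cluster.
    import mlflow.pyfunc
    from pyspark.sql.functions import struct

    # Wrap the model as a Spark UDF so scoring runs in parallel on the workers.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri,
                                      result_type="double")
    return df.withColumn(output_col, predict(struct(*feature_cols)))


# Usage on a cluster, assuming an active SparkSession `spark`, a DataFrame `df`,
# and a registered model named "demand_model" (hypothetical):
# scored = score_batch(spark, df, "models:/demand_model/Production",
#                      ["size", "weight"])
```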

Standard Python UDFs vs Pandas UDFs (Vectorized)

Use Standard Python UDFs when…

Simple row-level transformations where performance is not critical. Processes one row at a time with Python serialization overhead.

Use Pandas UDFs (Vectorized) when…

Performance-critical transformations on large datasets. Processes data in Arrow batches, reducing JVM-Python serialization overhead by 10-100x. Use for feature engineering and batch inference.

Exam trap

Pandas UDFs are faster because they process batches of rows in columnar format using Apache Arrow — not because they use pandas internally. The key advantage is reduced serialization, not the pandas API.
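
The batch-versus-row distinction can be sketched with a small temperature conversion. This is an illustrative example only: `c_to_f` and `make_fahrenheit_udf` are hypothetical names. The arithmetic is written once so it works on a single float or on a whole pandas Series; the Pandas UDF wrapper applies it to Arrow-backed batches.

```python
def c_to_f(celsius):
    """Convert Celsius to Fahrenheit; works on a float or a pandas Series."""
    return celsius * 9.0 / 5.0 + 32.0


def make_fahrenheit_udf():
    """Return a vectorized (Pandas) UDF that converts a whole batch at once."""
    # Imported lazily; requires pandas, pyarrow and pyspark.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_fahrenheit(celsius: pd.Series) -> pd.Series:
        # Called once per Arrow batch, not once per row.
        return c_to_f(celsius)

    return to_fahrenheit


print(c_to_f(100.0))  # 212.0

# Usage on a cluster, assuming a DataFrame `df` with a numeric column "temp_c":
# df = df.withColumn("temp_f", make_fahrenheit_udf()("temp_c"))
```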

MLflow Experiments vs MLflow Model Registry

Use MLflow Experiments when…

You're in the experimentation phase — training models, comparing metrics, tuning hyperparameters. Experiments track runs, parameters, metrics, and artifacts.

Use MLflow Model Registry when…

You've found a good model and need to manage it through deployment stages (Staging → Production → Archived). Registry handles versioning and stage transitions.

Exam trap

Experiments track training runs. Model Registry tracks deployed model versions. To move a model from experiment to production: first register it (mlflow.register_model), then transition its stage.

Top Mistakes to Avoid

Forgetting to apply StringIndexer before OneHotEncoder — Spark ML's OneHotEncoder requires numeric indices, not raw strings
Calling transform() without fit() on Spark ML Estimators — the two-step pattern is required to learn statistics before applying transformations
Standardizing the entire dataset before train/test split — this leaks test data statistics into training. Fit the scaler on training data only
Using accuracy as the primary metric for imbalanced datasets — a model predicting the majority class always achieves high accuracy. Use F1 or recall instead
Forgetting that Hyperopt minimizes by default — return -accuracy (negative) for metrics you want to maximize, or the optimizer will find the worst model
Not including VectorAssembler in the pipeline — Spark ML models require a single Vector column for features, unlike scikit-learn which accepts multiple columns
Confusing the mlflow.register_model() URI format — it's 'runs:/{run_id}/artifact' (with forward slash), not 'runs:{run_id}/artifact'
Confusing MLflow Experiments (tracking runs) with Model Registry (managing deployed versions) — they serve different lifecycle stages

Exam-Ready Checklist

Can explain all 5 exam domains and their relative weights
Know Spark ML pipeline pattern cold: StringIndexer → OneHotEncoder → VectorAssembler → StandardScaler → Model
Can write MLflow tracking code from memory: start_run, log_param, log_metric, log_model, autolog
Understand MLflow Model Registry: register_model URI format, stage transitions, model versioning
Know feature engineering transformers: Imputer (fit then transform), Bucketizer, StandardScaler
Can configure Hyperopt: fmin, hp.uniform/hp.choice, SparkTrials, return negative loss for maximization
Understand CrossValidator vs TrainValidationSplit tradeoffs and computational cost
Know classification metrics: precision, recall, F1, accuracy — and when to use each
Know regression metrics: RMSE, MAE, R² — and which evaluator to use
Understand data leakage: fit on training only, use Pipeline inside CV
Know Feature Store basics: feature registration, lookup by primary key during inference
Scored 80%+ on at least two full practice exams
Reviewed all incorrect answers and understand why the right answer is right
Can complete the exam within time: average 2 minutes per question

Recommended Resources

Free & Official Resources

Paid Courses & Practice Exams

These are recommended if you prefer a structured learning path. They can save time but are not required to pass.
