ML Associate Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Machine Learning Associate exam.

General Exam Tips

  1. Read ALL answer choices before committing. Many traps rely on plausible-looking options that differ by a single character (e.g., 'runs:/' vs 'runs:').
  2. This is a Python-first exam: when the question asks 'how would you do X,' the answer is almost always a specific API call, not a conceptual description.
  3. 45 questions in 90 minutes = 2 minutes per question. Flag and skip questions that require deep calculation, then come back at the end.
  4. If you see 'imbalanced dataset' in a question, immediately think: accuracy is the wrong metric; use F1 or recall.
  5. If you see 'maximize accuracy/AUC' with Hyperopt, the answer involves returning a negative value.
  6. If you see 'large dataset' or 'millions of records' for inference, the answer is batch inference (spark_udf or mapInPandas), not a model serving endpoint.
  7. Unity Catalog models use aliases (not stages). If the question mentions the UC model registry, think 'aliases', not 'Staging/Production'.
  8. For time series data: random splits are WRONG. The answer is always temporal/rolling cross-validation to preserve time order.
  9. No partial credit: every question is binary. Never leave blank answers; always guess when unsure.
Domain 1 (18% of exam)

Machine Learning Fundamentals

Must-Know Facts

  • Estimator.fit() returns a Transformer (the trained model). Transformer.transform() applies learned parameters. An Estimator is NOT a pre-trained model — it is the algorithm class.
  • Bagging = parallel training on random subsets → reduces variance. Random Forest is the canonical bagging algorithm.
  • Boosting = sequential training, each tree corrects previous errors → reduces bias. XGBoost and GBT use boosting.
  • Gradient boosting iterations are inherently sequential — cannot be parallelized across tree iterations (only within single tree construction).
  • High bias = underfitting (model too simple). High variance = overfitting (model too complex). Solutions are opposite.
  • SQL Warehouses are for SQL queries only — they cannot run Spark ML, scikit-learn, or MLflow. Use a Standard multi-node cluster for ML workloads.
  • Single Node clusters have no workers, so Spark runs only on the driver. They suit pandas and scikit-learn work but cannot distribute Spark ML training across machines.
  • Databricks ML Runtime pre-installs MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch. Standard Runtime does NOT include these.
  • Supervised learning requires labeled data. Unsupervised learning (clustering, dimensionality reduction) does NOT require labels.

Common Traps

Trap: Random Forest and boosting algorithms are both 'ensemble' methods, so candidates confuse which one uses bagging.
Reality: Random Forest = bagging (parallel). XGBoost, GBT, AdaBoost = boosting (sequential). They reduce different error types: bagging reduces variance, boosting reduces bias.
Trap: An Estimator already contains a trained model and can be used for predictions directly.
Reality: An Estimator is just the algorithm. Call fit() to produce a Transformer (the trained model). Then call transform() on the Transformer to get predictions.
Trap: Gradient boosting can be parallelized across iterations to speed up training.
Reality: Each boosting iteration depends on the residuals of the previous one, so iterations are sequential. Parallelism is possible within a single tree's node splitting, but not across iterations.
Trap: A Single Node cluster is sufficient for distributed Spark ML training.
Reality: Single Node clusters have no workers. Distributed Spark ML training requires a driver plus at least one worker (Standard multi-node cluster).
Trap: To add ML libraries like XGBoost to a cluster, use an init script.
Reality: Simply select Databricks Runtime for Machine Learning when creating the cluster. It pre-installs all common ML libraries. Init scripts work but are unnecessary for standard ML libraries.

Confusing Pairs

Bagging (Random Forest) vs. Boosting (XGBoost, GBT)

Bagging = train trees independently in parallel on random data subsets, then average results. Reduces variance, good when model overfits. Boosting = train trees sequentially, each correcting previous residuals. Reduces bias, good when model underfits. Key exam cue: 'parallel' → bagging; 'sequential' or 'residuals' → boosting.

Estimator (Spark ML) vs. Transformer (Spark ML)

Estimator implements fit() → produces a Transformer. Transformer implements transform() → applies to data. Think: Estimator = the recipe (needs training), Transformer = the trained result. Example: LinearRegression is an Estimator; LinearRegressionModel is a Transformer.

Databricks ML Runtime vs. Standard Databricks Runtime

ML Runtime = pre-installed MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch, Spark ML. Standard Runtime = only Spark, Delta Lake, no ML libraries. Always choose ML Runtime for exam scenarios involving ML workloads.

Scenario Tips

If the question asks about:

A question asks you to choose the right cluster type for a data science team using Spark ML and MLflow.

Answer:

Standard multi-node cluster with ML Runtime. Provides distributed training (multiple workers) and pre-installed ML libraries.

Distractor to avoid:

SQL Warehouse sounds like a good idea for 'analytics' but it cannot run Python ML code. Single Node works for pandas but not distributed Spark ML.

If the question asks about:

A model has 95% training accuracy but 60% test accuracy. What is happening and what should be done?

Answer:

High variance (overfitting). Fix: simplify the model, add regularization, gather more training data, use dropout.

Distractor to avoid:

Do not confuse with high bias (underfitting) where BOTH training and test accuracy are low.

If the question asks about:

A question asks which ensemble method to use for a problem where the current model underfits the data.

Answer:

Boosting (XGBoost, GBT) — it reduces bias by sequentially correcting model errors.

Distractor to avoid:

Bagging/Random Forest reduces variance, not bias. It helps when you have overfitting, not underfitting.

Last-Minute Facts

1. Random Forest = bagging = parallel = reduces variance.
2. XGBoost/GBT = boosting = sequential = reduces bias.
3. Estimator.fit() → Transformer. Transformer.transform() → predictions.
4. ML Runtime includes: MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch.
5. SQL Warehouse = SQL queries only. Standard cluster = ML workloads.
Domain 2 (27% of exam)

ML Development and Feature Engineering

Must-Know Facts

  • VectorAssembler is ALWAYS required before any Spark ML model — it combines multiple feature columns into a single 'features' Vector column.
  • StringIndexer MUST come before OneHotEncoder. OneHotEncoder expects numeric indices as input, not raw string values.
  • Imputer must call fit() to learn statistics (mean/median) THEN transform() to apply. Calling transform() without fit() throws an error.
  • Data leakage: never fit a scaler or encoder on the full dataset before splitting. Fit only on training data. Best practice: include scaler inside a Pipeline used with CrossValidator.
  • pandas API on Spark (pyspark.pandas) DataFrames ARE distributed — they are backed by Spark DataFrames. Do NOT call collect() on them.
  • Pandas UDFs process data in Apache Arrow columnar batches — 10-100x faster than row-at-a-time Python UDFs due to reduced JVM-Python serialization.
  • mapInPandas() applies a function to each Spark partition as a pandas DataFrame — the standard pattern for distributed scikit-learn inference.
  • AutoML handles feature scaling, encoding, imputation, and hyperparameter tuning automatically. It does NOT perform EDA.
  • AutoML generates fully editable notebooks — it is not a black box. The code uses standard MLflow and scikit-learn.
  • Feature Store: at inference time, only pass the primary key column — features are automatically retrieved from the feature table.
  • Ordinal encoding preserves order (Low/Medium/High → 0/1/2). One-hot encoding treats categories as unordered. Use OHE for nominal, ordinal for ordered categories.
  • For skewed/outlier-heavy features, use median imputation. For roughly normal features, use mean imputation.

Common Traps

Trap: OneHotEncoder can take string column input directly; just point it to the categorical column.
Reality: OneHotEncoder requires numeric indices. You MUST apply StringIndexer first to convert strings to indices. This is the most commonly missed Spark ML step.
Trap: You can call scaler.transform(df) after fitting on the full dataset; just don't transform the test set separately.
Reality: Fitting ANY preprocessing transformer (scaler, encoder, imputer) on data that includes validation/test records leaks statistics. Fit on training data only, then transform both.
Trap: pandas API on Spark is just a convenience wrapper that uses local computation under the hood.
Reality: pyspark.pandas DataFrames are backed by Spark DataFrames, so operations run on the cluster. Pulling the data to the driver (collect() on the underlying Spark DataFrame, or to_pandas()) is unnecessary and forces all data into driver memory.
Trap: Pandas UDFs are faster because they allow you to write pandas code instead of Spark transformations.
Reality: The speed advantage comes from Apache Arrow serialization: data is transferred in columnar batches instead of row by row. The pandas API is a secondary benefit.
Trap: AutoML is a complete solution; once you run it, no further data analysis is needed.
Reality: AutoML does NOT perform EDA. Data understanding, feature selection decisions, and business context interpretation remain the data scientist's responsibility.
Trap: Feature Store features should always be one-hot encoded at storage time for consistency across models.
Reality: Do NOT pre-compute one-hot encoding in Feature Store. Different models (tree-based vs linear) need different encoding strategies. Store raw features and encode in model pipelines.

Confusing Pairs

mapInPandas() vs. foreachPartition()

mapInPandas() applies a function to each partition and RETURNS results as a new DataFrame — use for batch inference. foreachPartition() is for side effects only (e.g., writing to a database) and does NOT return a result DataFrame.

applyInPandas() vs. mapInPandas()

applyInPandas() is used after groupBy() — processes each group as a separate pandas DataFrame (train one model per customer segment). mapInPandas() processes each partition regardless of grouping (global batch inference).

pandas API on Spark (pyspark.pandas) vs. Local pandas DataFrame

pyspark.pandas DataFrames run on Spark cluster (distributed). Local pandas runs in driver memory only. pyspark.pandas uses pandas syntax but requires no collect(). Key cue: if data is millions of rows, use pyspark.pandas.

Standard Python UDFs vs. Pandas UDFs (Vectorized)

Standard UDFs: one row at a time, Python object serialization — slow. Pandas UDFs: batches of rows in Arrow format — 10-100x faster. Use Pandas UDFs for any performance-sensitive custom transformation.

Scenario Tips

If the question asks about:

A Spark ML pipeline with StringIndexer → OneHotEncoder → VectorAssembler → RandomForestClassifier is giving an error during fit().

Answer:

Check that StringIndexer.outputCol matches OneHotEncoder.inputCol exactly, and that OneHotEncoder.outputCol is one of the inputs to VectorAssembler.inputCols.

Distractor to avoid:

A common wrong answer is to swap the order of StringIndexer and OneHotEncoder — but they must stay in this exact order (StringIndexer first).
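
The column wiring described in this scenario can be sketched as follows. This is a minimal illustration, not a reference implementation: the column names (`category`, `amount`, `age`, `label`) and the helper `build_pipeline` are hypothetical, and the PySpark imports sit inside the function so the snippet also loads on a machine without Spark.

```python
# Illustrative column names; each stage's output column feeds the next stage.
CAT_COL, IDX_COL, OHE_COL = "category", "category_idx", "category_ohe"
NUM_COLS = ["amount", "age"]


def build_pipeline(label_col: str = "label"):
    """Return an unfitted Pipeline (an Estimator); fit() yields a PipelineModel."""
    # Imported lazily so this module loads even where pyspark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    # Strings -> numeric indices (must come before OneHotEncoder).
    indexer = StringIndexer(inputCol=CAT_COL, outputCol=IDX_COL, handleInvalid="keep")
    # Indices -> sparse one-hot vectors; input name matches the indexer output.
    encoder = OneHotEncoder(inputCols=[IDX_COL], outputCols=[OHE_COL])
    # All model inputs -> a single 'features' vector column.
    assembler = VectorAssembler(inputCols=NUM_COLS + [OHE_COL], outputCol="features")
    forest = RandomForestClassifier(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[indexer, encoder, assembler, forest])


# Usage on a cluster, assuming train_df and test_df are Spark DataFrames:
# model = build_pipeline().fit(train_df)   # Estimator -> Transformer (PipelineModel)
# predictions = model.transform(test_df)   # adds a 'prediction' column
```

If fit() fails, the first thing to compare is each stage's output column name against the next stage's input column name.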

If the question asks about:

A data scientist wants to apply a trained scikit-learn model to score 50 million records in a Spark DataFrame.

Answer:

Use mapInPandas() or mlflow.pyfunc.spark_udf(). Both distribute inference across partitions. Load the model once per executor using an iterator pattern to avoid repeated deserialization.

Distractor to avoid:

Model Serving endpoint is wrong here — it is for real-time, low-latency single-record requests, not batch scoring of millions of records.

If the question asks about:

The question asks how to prevent data leakage when scaling features inside cross-validation.

Answer:

Wrap the StandardScaler and model into a Pipeline, then pass the Pipeline as the estimator to CrossValidator. This ensures fit() is called only on the training folds.

Distractor to avoid:

A wrong approach is to fit the scaler on the entire dataset before CV — this makes validation fold statistics available during scaler fitting.

If the question asks about:

A categorical feature has ordered values (Low, Medium, High, Critical). Should you use StringIndexer → OneHotEncoder or just StringIndexer alone?

Answer:

StringIndexer alone (ordinal encoding) is sufficient and appropriate for ordered categories. Adding OneHotEncoder would destroy the ordinal relationship.

Distractor to avoid:

OneHotEncoder is not always necessary after StringIndexer — only add it for nominal (unordered) categories.

Last-Minute Facts

1. VectorAssembler: inputCols=['a','b','c'], outputCol='features'. Always last step before the model.
2. StringIndexer first, then OneHotEncoder. Never skip StringIndexer.
3. Imputer: fit() learns mean/median, transform() applies. strategy='mean' or 'median'.
4. pyspark.pandas is distributed; no collect() needed.
5. Pandas UDFs use Arrow batches. They are faster due to serialization reduction, not pandas syntax.
6. mapInPandas returns results. foreachPartition does not.
7. Feature Store at inference: only pass the primary key column.
8. AutoML does NOT do EDA, only model training and tuning.
9. Median imputation for outlier-heavy data; mean for normally distributed data.
Domain 3 (22% of exam)

Model Training and Evaluation

Must-Know Facts

  • RegressionEvaluator metrics: 'rmse', 'mae', 'r2', 'mse'. BinaryClassificationEvaluator: 'areaUnderROC'. MulticlassClassificationEvaluator: 'f1', 'accuracy', 'precisionByLabel', 'recallByLabel'.
  • F1 score is a CLASSIFICATION metric only. Never use F1 for regression — if the exam puts F1 as an option for a regression question, it is wrong.
  • For imbalanced datasets, accuracy is misleading — a model predicting the majority class always achieves high accuracy. Use F1 (if you need balance) or Recall (if catching all positives matters most).
  • CV score = mean of all fold scores. Example: [2.5, 3.1, 2.8, 3.4, 2.7] → (14.5/5) = 2.9.
  • CrossValidator trains (combinations × folds) total models. TrainValidationSplit trains (combinations × 1) models. For 9 param combos + 3 folds: CV = 27 models, TVS = 9 models.
  • Hyperopt fmin() MINIMIZES by default. Return -accuracy for maximization metrics. If you return raw accuracy, Hyperopt finds the worst model.
  • SparkTrials distributes Hyperopt trials across Spark workers. The default Trials class runs trials sequentially on a single machine.
  • Hyperopt uses TPE (Tree of Parzen Estimators) — a Bayesian method that learns from past trials. Too much parallelism degrades it toward random search.
  • Time series data requires temporal/rolling cross-validation, NOT random splits. Random splits leak future information into training.
  • When the target was log-transformed during training, exponentiate predictions BEFORE computing metrics on the original scale.
  • TrainValidationSplit uses a single random split (trainRatio parameter, e.g., 0.8). CrossValidator uses k folds.
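
The fold-averaging fact above can be checked with a few lines of plain Python. The scores are the ones from the example in the list; nothing here depends on Spark.

```python
from statistics import mean

fold_scores = [2.5, 3.1, 2.8, 3.4, 2.7]  # one validation score per fold

# The cross-validation score is the mean of the fold scores (not min, max, or sum).
cv_score = mean(fold_scores)
print(round(cv_score, 2))  # 2.9
```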

Common Traps

Trap: F1 is a good universal metric for any ML task, so use it for both classification and regression problems.
Reality: F1 is strictly a classification metric (harmonic mean of precision and recall). For regression, use RMSE, MAE, or R². The exam presents F1 as a regression option to catch this mistake.
Trap: Hyperopt maximizes the objective function by default, so return raw accuracy and it will find the best model.
Reality: Hyperopt MINIMIZES. Return -accuracy (or any metric you want to maximize as a negative value). Returning positive accuracy causes Hyperopt to converge on the worst model.
Trap: More parallelism in Hyperopt always leads to faster convergence on the best hyperparameters.
Reality: Bayesian optimization (TPE) needs to observe previous trial results to suggest better combinations. High parallelism prevents this and degrades performance to random search. Reduce parallelism if optimization stalls.
Trap: For time series cross-validation, use the same random split approach as tabular data.
Reality: Random splits for time series allow future data to appear in training folds, creating data leakage. Use rolling/walk-forward cross-validation to preserve temporal order.
Trap: A model with 99% accuracy on fraud detection (0.1% fraud rate) is excellent.
Reality: A model predicting 'no fraud' for everything also achieves 99.9% accuracy. For highly imbalanced fraud detection, use Recall (are we catching all fraud?) or F1.
Trap: Grid search is the best approach when you have 10+ hyperparameters to tune.
Reality: Grid search is only practical for small search spaces (2-3 parameters with few values). For large or mixed (continuous + categorical) spaces, use Bayesian optimization (Hyperopt TPE).
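
Rolling (walk-forward) validation from the time series trap can be sketched in plain Python. The helper `walk_forward_splits` is hypothetical, not a Spark or scikit-learn API, and it assumes the rows are already sorted by time.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_rows: int, n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, valid_idx) pairs where training rows always precede validation rows."""
    fold_size = (n_rows - min_train) // n_folds
    if fold_size < 1:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * fold_size  # training window grows each fold
        yield list(range(train_end)), list(range(train_end, train_end + fold_size))


splits = list(walk_forward_splits(n_rows=10, n_folds=3, min_train=4))
for train_idx, valid_idx in splits:
    print(f"train 0..{train_idx[-1]}  validate {valid_idx}")
```

Every validation index is later than every training index, which is exactly the property a random split cannot guarantee.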

Confusing Pairs

CrossValidator vs. TrainValidationSplit

CrossValidator = k-fold CV, trains k models per combination — more robust, higher cost. TrainValidationSplit = single 80/20 split, trains 1 model per combination — faster, less robust. Key cue: 'small dataset' or 'reliable estimate' → CrossValidator. 'Large dataset' or 'compute cost' → TrainValidationSplit.

Precision vs. Recall

Precision = of everything the model flagged as positive, how many actually are. Recall = of everything that is actually positive, how many did the model catch. When false negatives are costly (missed fraud, missed cancer) → optimize Recall. When false positives are costly (false arrest, unnecessary surgery) → optimize Precision.
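
To make the definitions concrete, here is a small sketch with made-up labels at a 0.1% fraud rate and a model that never flags fraud. The helper names are illustrative, not library functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged items that are truly positive
    recall = tp / (tp + fn) if tp + fn else 0.0     # true positives that were caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = [1] * 10 + [0] * 9990   # 10 fraud cases in 10,000 transactions
always_negative = [0] * 10000    # a "model" that never predicts fraud

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                      # 0.999
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)
```

Accuracy looks excellent while recall is zero, which is why accuracy is the wrong metric for imbalanced data.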

RMSE vs. MAE

RMSE penalizes large errors more (squares the error) — sensitive to outliers. MAE treats all errors equally (absolute value) — robust to outliers. Use MAE when the dataset has significant outliers. Key cue: 'outliers in the target variable' → MAE.
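
A quick numeric check of that sensitivity, using invented residuals where only the last value differs between the two lists:

```python
import math


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


typical = [1, -1, 1, -1, 1]        # every residual has magnitude 1
with_outlier = [1, -1, 1, -1, 10]  # same, except one large miss

print(mae(typical), rmse(typical))  # 1.0 1.0
print(mae(with_outlier), round(rmse(with_outlier), 2))
```

The single outlier moves RMSE much further than MAE because the error is squared before averaging.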

SparkTrials vs. Trials (Hyperopt)

SparkTrials = distributes trial evaluations across Spark cluster workers (true distributed tuning). Trials = the default class, which runs trials one after another on a single machine. Use SparkTrials when you have a Spark cluster and want cluster-level parallelism for single-machine training code such as scikit-learn.

Scenario Tips

If the question asks about:

A fraud detection model must minimize missed fraud cases, even if some legitimate transactions are flagged. Which metric to optimize?

Answer:

Recall — it measures how many actual fraud cases are caught. Missing fraud (false negative) is the costly error here.

Distractor to avoid:

Accuracy is misleading for imbalanced fraud data. Precision minimizes false positives (wrong for this use case). F1 balances both but is not optimal when Recall is the priority.

If the question asks about:

A Hyperopt objective returns accuracy directly (not negated). What will happen?

Answer:

Hyperopt will converge on the model with the LOWEST accuracy. The fix is to return -accuracy (negative value). Hyperopt minimizes.

Distractor to avoid:

Returning {'loss': accuracy} is also wrong — loss is minimized, so you still need -accuracy.
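
The sign issue can be shown without Hyperopt at all. The sketch below uses a toy stand-in (`toy_fmin`, a hypothetical helper that just picks the smallest objective value) and invented accuracies; with the real library the objective would typically return `{'loss': -accuracy, 'status': STATUS_OK}`.

```python
# Invented validation accuracies for three candidate max-depth values.
accuracy_by_depth = {2: 0.70, 4: 0.90, 8: 0.80}


def toy_fmin(objective, candidates):
    """Stand-in for a minimizer: return the candidate with the smallest objective value."""
    return min(candidates, key=objective)


# Returning raw accuracy: the minimizer picks the LEAST accurate setting.
wrong = toy_fmin(lambda depth: accuracy_by_depth[depth], accuracy_by_depth)

# Returning negated accuracy: minimizing -accuracy maximizes accuracy.
right = toy_fmin(lambda depth: -accuracy_by_depth[depth], accuracy_by_depth)

print(wrong, right)  # 2 4
```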

If the question asks about:

You're tuning 3 hyperparameters with 3 values each using CrossValidator with 5 folds. How many models will be trained?

Answer:

27 combinations × 5 folds = 135 total models. TrainValidationSplit with the same grid: 27 × 1 = 27 models.

Distractor to avoid:

A common mistake is forgetting to multiply by the number of folds. The total is combinations × folds, not just combinations.
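
The count can be verified by enumerating the grid. The parameter names and values below are placeholders chosen only to give three values each.

```python
from itertools import product

param_grid = {
    "maxDepth": [3, 5, 7],
    "numTrees": [50, 100, 200],
    "maxBins": [16, 32, 64],
}
num_folds = 5

combinations = list(product(*param_grid.values()))  # every parameter combination
cv_models = len(combinations) * num_folds           # CrossValidator: combinations x folds
tvs_models = len(combinations)                      # TrainValidationSplit: combinations x 1

# Note: the final refit of the best combination on the full data is not counted here.
print(len(combinations), cv_models, tvs_models)  # 27 135 27
```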

If the question asks about:

A model was trained with log-transformed target (price → log_price). At inference time, what is the correct process to get RMSE on the original price scale?

Answer:

Apply model.predict() to get log_price predictions, then np.exp() to convert back to price, then compute RMSE on original price vs. predicted price.

Distractor to avoid:

Computing RMSE on the log-scale and then exponentiating the RMSE is wrong — you must inverse-transform predictions BEFORE computing the metric.
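
A minimal sketch of the correct order of operations, using invented prices and assuming the target was transformed with a natural log (if log1p was used in training, the inverse would be expm1 instead):

```python
import math

actual_price = [100.0, 200.0]
predicted_log_price = [math.log(110.0), math.log(180.0)]  # model output on the log scale

# Step 1: inverse-transform predictions back to the original price scale.
predicted_price = [math.exp(p) for p in predicted_log_price]

# Step 2: compute RMSE on the original scale.
squared_errors = [(a - p) ** 2 for a, p in zip(actual_price, predicted_price)]
rmse_price = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse_price, 2))  # 15.81
```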

Last-Minute Facts

1. CV score = mean of fold scores (not min, max, or sum).
2. CrossValidator: combinations × folds total models. TrainValidationSplit: combinations × 1.
3. F1 = classification only. Never regression.
4. Hyperopt minimizes → return -accuracy for maximization.
5. SparkTrials = distributed across cluster workers. Trials = single machine.
6. Time series: temporal CV, never random splits.
7. RMSE sensitive to outliers. MAE robust to outliers.
8. Recall = catching all positives. Precision = accuracy of positive predictions.
9. Imbalanced data: accuracy is misleading → use F1 or Recall.
Domain 4 (18% of exam)

Model Deployment and Management

Must-Know Facts

  • mlflow.register_model() URI format: 'runs:/{run_id}/artifact_path' — the forward slash after 'runs:' is REQUIRED. 'runs:{run_id}/...' (no slash) is wrong.
  • Loading from Model Registry: 'models:/model_name/Production' or 'models:/model_name/1' (by version). This uses 'models:/' prefix, NOT 'runs:/'.
  • Workspace Model Registry stage flow: None → Staging → Production → Archived. Each model VERSION has its own stage.
  • Unity Catalog Model Registry does NOT use stages. Instead use model ALIASES (e.g., 'champion', 'challenger'). Stages are deprecated in MLflow 2.9+.
  • Batch inference pattern: mlflow.pyfunc.spark_udf(spark, model_uri) creates a Spark UDF that distributes scoring across all cluster workers.
  • mapInPandas() with iterator-based loading loads the model ONCE per executor and reuses it across batches — avoids repeated model deserialization overhead.
  • Feature Store inference: the model must be logged WITH Feature Store feature lookup specifications. Only then does fs.score_batch() auto-retrieve features by primary key.
  • For Unity Catalog feature tables, use FeatureEngineeringClient (from databricks-feature-engineering). FeatureStoreClient is the legacy client for Workspace Feature Store. Import: from databricks.feature_engineering import FeatureEngineeringClient.
  • Model Serving endpoints are for real-time, low-latency single-record requests — NOT for batch scoring millions of records.
  • Stage transitions are performed on the model VERSION details page in the UI, not the model overview page.
  • PSI (Population Stability Index) measures input data distribution shifts. K-S test compares feature distributions. Jensen-Shannon divergence quantifies distributional difference.

Common Traps

Trap: mlflow.register_model(f'runs:{run_id}/model', 'my_model') is the correct syntax.
Reality: This is missing the forward slash after 'runs:'. Correct: f'runs:/{run_id}/model'. The exam specifically tests this URI format; the slash is required.
Trap: In Unity Catalog, you transition a model to Production stage using client.transition_model_version_stage().
Reality: Unity Catalog does not support stages. You use model aliases instead (set_registered_model_alias). The stages API only works with the workspace (non-UC) model registry.
Trap: For batch inference on a large dataset, deploy a Model Serving endpoint and call it in a loop.
Reality: Model Serving endpoints add latency per request and are not designed for batch workloads. Use mlflow.pyfunc.spark_udf() or mapInPandas() for batch scoring; they run in parallel on Spark workers.
Trap: Feature Store auto-retrieves features during inference for any model registered in the registry.
Reality: Feature auto-retrieval only works if the model was logged WITH Feature Store feature lookup specifications (using FeatureEngineeringClient.log_model for UC, or FeatureStoreClient.log_model for legacy workspace). Without this, all features must be explicitly included in the input DataFrame.
Trap: FeatureStoreClient is the correct client to use for all Feature Store operations on Databricks.
Reality: FeatureStoreClient is the legacy client for Workspace Feature Store. For Unity Catalog feature tables (the current default), use FeatureEngineeringClient from databricks-feature-engineering. The old databricks-feature-store package is deprecated since v0.17.0. Key cue: if the question mentions Unity Catalog or three-level table names → FeatureEngineeringClient.
Trap: Data drift and concept drift are the same problem.
Reality: Data drift = input feature distributions change (but relationships stay the same). Concept drift = the underlying relationship between features and target changes. Data drift is detected by comparing input distributions. Concept drift requires monitoring model performance.

Confusing Pairs

Workspace Model Registry (stages) vs. Unity Catalog Model Registry (aliases)

Workspace registry: fixed lifecycle stages (None/Staging/Production/Archived), transitioned via client.transition_model_version_stage(). Unity Catalog registry: custom aliases (up to 10), set via client.set_registered_model_alias(). Stages are deprecated since MLflow 2.9. Key cue: if question mentions UC or three-level name (catalog.schema.model) → aliases.

Batch Inference (spark_udf / mapInPandas) vs. Real-time Serving (Model Serving Endpoints)

Batch: large datasets, periodic scoring, high throughput — use spark_udf or mapInPandas. Real-time: individual requests, low latency, online APIs — use Model Serving endpoints. Key cues: 'millions of records' or 'nightly scoring' → batch. 'REST API' or 'under 100ms' → real-time serving.

Data Drift vs. Concept Drift

Data drift: input feature distribution shifts (e.g., new customer demographics). Detected by comparing input distributions (PSI, K-S test). Concept drift: relationship between features and target changes (e.g., buying behavior shifts). Detected by monitoring model prediction accuracy or output distributions.
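
As a rough illustration of how PSI quantifies an input shift, the sketch below computes it over pre-binned proportions using the usual formula, the sum over bins of (actual − expected) × ln(actual / expected). The bin shares are invented and the `psi` helper is hypothetical, not a Databricks API.

```python
import math


def psi(expected_share, actual_share, eps=1e-6):
    """Population Stability Index over matching bins of proportions."""
    total = 0.0
    for e, a in zip(expected_share, actual_share):
        e, a = max(e, eps), max(a, eps)  # guard against log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


baseline = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
shifted = [0.10, 0.20, 0.30, 0.40]   # same feature in recent production data

print(psi(baseline, baseline))  # 0.0 when nothing has moved
print(psi(baseline, shifted) > 0)  # True: the distribution has drifted
```

PSI only looks at inputs, so it can stay near zero while concept drift degrades accuracy.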

mlflow.pyfunc.spark_udf() vs. mlflow.pyfunc.load_model()

spark_udf() wraps a model as a Spark UDF for distributed batch scoring across workers. load_model() loads the model as a Python object for single-machine inference (e.g., on the driver). Key cue: Spark DataFrame input → spark_udf(). Local pandas DataFrame → load_model().

FeatureStoreClient (legacy) vs. FeatureEngineeringClient (current)

FeatureStoreClient: legacy Workspace Feature Store client, still works but deprecated in databricks-feature-store v0.17.0. FeatureEngineeringClient: current client for Unity Catalog feature engineering tables, from databricks-feature-engineering package. Key cue: if question mentions Unity Catalog or three-level table names (catalog.schema.table) → FeatureEngineeringClient.

Scenario Tips

If the question asks about:

A question gives you best_run_id='abc123' and model artifact path 'classifier'. What is the exact register_model call?

Answer:

mlflow.register_model('runs:/abc123/classifier', 'model_name'). The URI is 'runs:/' (with slash) then run_id then artifact path.

Distractor to avoid:

Option 'runs:abc123/classifier' (no slash after colon) and 'runs://abc123/classifier' (double slash) are both wrong. The exam tests this exact format.
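
The URI construction for this scenario can be sketched as below. `runs_uri` and `register_from_run` are hypothetical helper names; the MLflow import is placed inside the function so the string-building part can be tried anywhere.

```python
def runs_uri(run_id: str, artifact_path: str) -> str:
    """Build a run-relative model URI: 'runs:/' + run id + '/' + artifact path."""
    return f"runs:/{run_id}/{artifact_path}"


def register_from_run(run_id: str, artifact_path: str, model_name: str):
    """Register the model logged under a run; needs MLflow and a tracking server."""
    import mlflow  # lazy import so the helper above works without MLflow installed

    return mlflow.register_model(runs_uri(run_id, artifact_path), model_name)


print(runs_uri("abc123", "classifier"))  # runs:/abc123/classifier
```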

If the question asks about:

A company uses Unity Catalog. They want to mark a model version as their production model. How?

Answer:

Set a model alias using client.set_registered_model_alias(name, alias='champion', version=3). Load it via 'models:/catalog.schema.model@champion'.

Distractor to avoid:

client.transition_model_version_stage() to 'Production' is wrong for UC — stages are not supported in Unity Catalog.
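
A minimal sketch of the alias workflow, assuming a Databricks workspace with Unity Catalog enabled. The helper names and the example model name are placeholders; the MLflow calls are imported lazily inside the function.

```python
def alias_uri(model_name: str, alias: str) -> str:
    """Build a registry URI that resolves a model by alias rather than stage or version."""
    return f"models:/{model_name}@{alias}"


def promote_and_load(model_name: str, version: int, alias: str = "champion"):
    """Point the alias at a version, then load whatever the alias currently resolves to."""
    import mlflow
    from mlflow import MlflowClient

    mlflow.set_registry_uri("databricks-uc")  # use the Unity Catalog registry
    client = MlflowClient()
    client.set_registered_model_alias(name=model_name, alias=alias, version=str(version))
    return mlflow.pyfunc.load_model(alias_uri(model_name, alias))


print(alias_uri("catalog.schema.model", "champion"))  # models:/catalog.schema.model@champion
```

Moving the alias to a newer version later changes what downstream loaders receive without any code change on their side.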

If the question asks about:

A question asks how to score 10 million records using a scikit-learn model stored in MLflow, distributed across a Spark cluster.

Answer:

Use mapInPandas() with an iterator pattern that loads the model once per executor, OR use mlflow.pyfunc.spark_udf() to create a Spark UDF from the model.

Distractor to avoid:

Calling collect() to bring all 10M rows to the driver and scoring locally defeats distributed computing and will likely OOM.
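
The iterator pattern can be sketched as below. This is one possible shape, not the only one: `make_batch_scorer` and `score_dataframe` are hypothetical helpers, and the model is loaded once per call of the mapped function (Spark invokes it once per partition), not once per batch.

```python
from typing import Callable, Iterable, Iterator


def make_batch_scorer(load_model: Callable[[], object]) -> Callable[[Iterable], Iterator]:
    """Build a function for DataFrame.mapInPandas that loads the model once per call."""

    def score(batches: Iterable) -> Iterator:
        model = load_model()  # runs once per invocation, not once per batch
        for pdf in batches:   # each pdf is a pandas DataFrame chunk of the partition
            yield pdf.assign(prediction=model.predict(pdf))

    return score


def score_dataframe(df, model_uri: str, output_schema: str):
    """Hypothetical wrapper: output_schema must list the input columns plus 'prediction'."""
    import mlflow  # lazy import so this module loads without MLflow installed

    scorer = make_batch_scorer(lambda: mlflow.pyfunc.load_model(model_uri))
    return df.mapInPandas(scorer, schema=output_schema)
```

Passing the loader as a callable keeps the expensive deserialization on the workers instead of pickling a loaded model from the driver.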

If the question asks about:

Model performance starts degrading one month after deployment. Feature distributions look normal. What type of drift is this, and how do you monitor for it?

Answer:

Concept drift — the underlying relationship between features and target has changed. Monitor by tracking model output/prediction accuracy over time, not input feature distributions.

Distractor to avoid:

PSI and K-S test measure data drift (input distribution changes). For concept drift, you need ground truth labels to compare against predictions.

If the question asks about:

A question asks how to perform continuous/streaming inference on arriving data using an MLflow model.

Answer:

Apply the model as a UDF on a Spark Structured Streaming DataFrame — treat the streaming DataFrame exactly like a batch DataFrame. Call predict_udf on the stream and write results with writeStream. Delta Live Tables can also orchestrate streaming inference pipelines.

Distractor to avoid:

Model Serving endpoints are for real-time per-request calls, not continuous stream processing. Batch inference with spark_udf still applies to streaming DataFrames — the API is the same.

Last-Minute Facts

1. URI format: 'runs:/{run_id}/artifact' (forward slash after colon, required).
2. URI to load from registry: 'models:/name/stage' or 'models:/name/version'.
3. Workspace registry: stages (None/Staging/Production/Archived). UC registry: aliases.
4. MLflow 2.9+: stages deprecated. Use aliases in UC.
5. UC model name format: catalog.schema.model_name (three-level).
6. Batch inference: spark_udf or mapInPandas. Real-time: Model Serving endpoint.
7. Feature Store auto-lookup only works if model was logged with FeatureEngineeringClient.log_model (UC) or FeatureStoreClient.log_model (legacy).
8. FeatureEngineeringClient = current UC client. FeatureStoreClient = legacy workspace client (deprecated).
9. PSI = measures input distribution change over time. K-S test = compares two distributions.
10. Data drift = input changes. Concept drift = relationship between features and target changes.
Domain 5 (15% of exam)

ML Operations (MLOps)

Must-Know Facts

  • mlflow.autolog() automatically logs parameters, metrics, and model artifacts for supported frameworks (scikit-learn, XGBoost, LightGBM, Spark ML). No explicit log_param/log_metric calls needed.
  • Autologging does NOT create nested run hierarchies. For parent-child relationships (e.g., grid search parent with trial children), you must explicitly set nested=True.
  • mlflow.start_run(nested=True) creates a child run inside an active parent run context. Without nested=True, starting a run inside another run raises an error.
  • mlflow.search_runs() returns a pandas DataFrame sorted/filtered by any logged metric or parameter — the programmatic way to find the best run.
  • Source tracking: MLflow records the notebook or script that created each run. Accessible via the 'Source' link on the run details page in the UI.
  • Databricks Jobs matrix view shows each task's status and logs individually — use this to pinpoint which task failed in a multi-task job.
  • Git Folders (Repos): create a feature branch for ML experiments to avoid breaking the production pipeline on main branch.
  • Delta Lake time travel enables reproducing training datasets: VERSION AS OF 42 or TIMESTAMP AS OF '2025-01-01'.
  • Model monitoring with Lakehouse Monitoring: track input distribution drift, prediction drift, and data quality over time using Delta tables.
  • To find the best run programmatically: mlflow.search_runs(filter_string='metrics.rmse < 2.0', order_by=['metrics.rmse ASC']).

Common Traps

Trap: Enabling autologging with mlflow.autolog() automatically creates nested runs for each hyperparameter trial.
Reality: Autologging logs flat runs only. It does NOT create parent-child hierarchies. To create nested runs, you must explicitly use mlflow.start_run(nested=True) around each child run.
Trap: The Source link on the MLflow run page shows the MLmodel artifact file contents.
Reality: The Source link shows the notebook or script that produced that run; it links back to the code. The MLmodel file contains model metadata (flavor, schema), not the source notebook.
Trap: When a multi-task Databricks Job fails, rerun all tasks to identify which one caused the failure.
Reality: Use the Job runs matrix view to see each task's status independently. You can identify the failing task and rerun only that task, with no need to rerun the entire job.
Trap: Delta Lake versioning tracks both data versions AND trained model versions.
Reality: Delta Lake versions data (tables). MLflow Model Registry versions trained models. They are separate systems. Use Delta time travel for dataset reproducibility; use Model Registry for model version management.

Confusing Pairs

MLflow Autologging vs. MLflow Nested Runs

Autologging: automatically captures training metrics/params/model without explicit API calls. Works for a single run. Nested Runs: manually create a parent-child run hierarchy using nested=True; used to organize hyperparameter search results (parent = the overall search, children = individual trials). Autologging does NOT produce nested runs.

MLflow Experiments (Tracking) vs. MLflow Model Registry

Experiments track training runs (parameters, metrics, artifacts) during development. Model Registry manages model versions through the deployment lifecycle (Staging → Production → Archived stages in the workspace registry, or aliases in Unity Catalog). Experiments are for exploration; Registry is for production management. To move a model to production: first register it from its run, then manage its stage or alias in the Registry.
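
The register-then-promote flow could be sketched roughly as below. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names, the "model" artifact path, and the "champion" alias are all hypothetical. With Unity Catalog the model name would be a three-level name, and the registry would typically need to be pointed at UC first.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model logged under `artifact_path` in a run (note the 'runs:/' prefix)."""
    return f"runs:/{run_id}/{artifact_path}"


def alias_model_uri(name, alias):
    """URI that resolves a registered model by alias rather than by stage."""
    return f"models:/{name}@{alias}"


def register_and_alias(run_id, name, alias="champion"):
    """Register the run's model as a new version, then point `alias` at it.

    `alias="champion"` and the default artifact path are illustrative assumptions.
    """
    import mlflow  # imported lazily; requires MLflow to be installed
    from mlflow import MlflowClient

    version = mlflow.register_model(run_model_uri(run_id), name)
    MlflowClient().set_registered_model_alias(name, alias, version.version)
    return alias_model_uri(name, alias)
```

The two URI helpers make the prefixes explicit: 'runs:/' addresses an artifact inside a tracking run, while 'models:/' addresses a registered model version.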

Delta Lake Time Travel (data versioning) vs. MLflow Model Registry (model versioning)

Delta Lake versions datasets: use VERSION AS OF or TIMESTAMP AS OF to reproduce an exact training dataset. MLflow Model Registry versions trained models, and each model version links back to the run (code and parameters) that produced it. Both are needed for full ML reproducibility.

Scenario Tips

If the question asks about:

A data scientist wants one MLflow run per hyperparameter search with individual child runs for each combination. How?

Answer:

Create an outer mlflow.start_run() context for the parent, then inside it create each trial with mlflow.start_run(nested=True). Autologging alone will not create this hierarchy.

Distractor to avoid:

Calling mlflow.autolog() before the loop does NOT create parent-child relationships — it just logs each run flat.
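
The parent/child pattern from this scenario could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `expand_grid`, `run_search`, and `train_fn` are hypothetical names, and `mlflow` is imported inside the function only so the grid helper stays usable without a tracking server.

```python
from itertools import product


def expand_grid(grid):
    """Return every parameter combination in `grid` as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def run_search(grid, train_fn, metric_name="val_rmse"):
    """Log one parent run with one nested child run per combination.

    `train_fn` is a hypothetical callable: params dict -> metric value.
    """
    import mlflow  # imported lazily; requires MLflow to be installed

    with mlflow.start_run(run_name="grid_search"):  # parent run
        for params in expand_grid(grid):
            # nested=True attaches this run as a child of the active parent
            with mlflow.start_run(nested=True):
                mlflow.log_params(params)
                mlflow.log_metric(metric_name, train_fn(params))
```

Dropping nested=True from the inner call is the failure mode the exam probes: a second start_run() while the parent is still active raises an error instead of creating a child.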

If the question asks about:

A team wants to find the run with the lowest validation RMSE without manually checking the UI.

Answer:

Use mlflow.search_runs(experiment_ids=['123'], order_by=['metrics.val_rmse ASC'], max_results=1). This returns a one-row pandas DataFrame containing the run with the lowest val_rmse.

Distractor to avoid:

Iterating through mlflow.get_experiment() and calling get_run() for each is inefficient. search_runs() is the purpose-built API.
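
One way to wrap this lookup is sketched below. It is only an illustration: `best_run_query` and `find_best_run` are hypothetical helper names, the experiment ID is whatever your workspace uses, and `mlflow` is imported lazily so the query builder can be checked on its own.

```python
def best_run_query(experiment_id, metric, ascending=True):
    """Build keyword arguments for mlflow.search_runs that select the single best run."""
    direction = "ASC" if ascending else "DESC"
    return {
        "experiment_ids": [experiment_id],
        "order_by": [f"metrics.{metric} {direction}"],
        "max_results": 1,
    }


def find_best_run(experiment_id, metric, ascending=True):
    """Return the best run as a pandas Series, or None if no runs were found."""
    import mlflow  # imported lazily; requires MLflow and a reachable tracking server

    runs = mlflow.search_runs(**best_run_query(experiment_id, metric, ascending))
    return None if runs.empty else runs.iloc[0]
```

For a metric to maximize, such as AUC, pass ascending=False so the ordering clause becomes DESC. The returned row exposes columns such as run_id and metrics.val_rmse.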

If the question asks about:

A complex multi-task Databricks Job starts failing intermittently. A team member wants to rerun the entire job to diagnose it.

Answer:

Open the specific failed job run and use the matrix view to identify which task failed. Click on that task to view its logs. Once diagnosed, rerun only the failed task (Repair run) rather than the whole job.

Distractor to avoid:

Rerunning all tasks wastes compute and obscures which specific task is the root cause.

If the question asks about:

A model needs to be reproduced exactly 3 months later, including using the exact same training data. What enables this?

Answer:

Delta Lake time travel: VERSION AS OF N or TIMESTAMP AS OF '...' queries the exact snapshot of the Delta table used for training. Combined with the table version recorded in the MLflow run metadata, this enables full reproducibility.

Distractor to avoid:

MLflow alone records the model and parameters but not the data. You need Delta Lake for dataset versioning.
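
Pinning the dataset and recording which version was used might look like the sketch below. It assumes an existing SparkSession is passed in as `spark`; the function names and the parameter keys `training_table` and `training_table_version` are hypothetical choices, not a Databricks convention.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time-travel SELECT for exactly one of `version` or `timestamp`."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def load_training_snapshot(spark, table, version):
    """Read a pinned table version and record it on the current MLflow run.

    `spark` is an existing SparkSession (for example, the one a notebook provides).
    """
    import mlflow  # imported lazily; requires MLflow to be installed

    df = spark.sql(time_travel_query(table, version=version))
    # Logging the version lets someone rebuild the same query later.
    mlflow.log_params({"training_table": table, "training_table_version": version})
    return df
```

Time travel only works while the older data files and log entries are still retained, so a three-month horizon assumes the table's retention settings allow it.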

Last-Minute Facts

1. Autologging: logs params/metrics/model automatically. Does NOT create nested runs.
2. nested=True: required for child runs inside a parent run context.
3. search_runs(): returns pandas DataFrame. Supports filter_string and order_by.
4. Source link on run page: shows the notebook/script that created the run.
5. Job runs matrix view: shows each task's status; use it to identify the failing task.
6. Delta Lake time travel: VERSION AS OF or TIMESTAMP AS OF for dataset reproducibility.
7. MLflow Registry: model versioning. Delta Lake: dataset versioning. Different systems.
8. Git Folders: create branches for safe experimentation without breaking production.
