CertPrepNow

ML Professional Exam Notes

Last-minute traps, must-know facts, and scenario tips for the Databricks Certified Machine Learning Professional exam.

General Exam Tips

  • 1. Read ALL answer options before choosing: two options often both work, but one is the Databricks-idiomatic answer
  • 2. When stuck, eliminate options that compromise governance, automation, or scalability. The exam rewards production-ready thinking
  • 3. Scenario phrasing like 'most efficient', 'best approach', or 'production-grade' means one option is clearly better. Look for the automation/governance angle
  • 4. This exam tests judgment, not syntax recall. Ask 'Why choose this over the obvious alternative?' for every answer you pick
  • 5. 59 questions in 120 minutes = ~2 minutes per question. Flag long scenarios and return; do not skip final review
  • 6. No penalty for wrong answers: always guess before moving on
  • 7. The September 2025 version consolidated to 3 domains. If a study resource mentions 4 domains, it is outdated
  • 8. MLOps and Model Development are both 44%, so treat them as equally important. Do not over-invest in Model Deployment (12%)
  • 9. Scenario questions often describe a symptom (e.g., 'model precision dropped') and test whether you diagnose correctly before jumping to a fix
Domain 1 (44% of exam)

Model Development

Must-Know Facts

  • SparkML requires ALL features combined into a single Vector column — VectorAssembler is mandatory as the final feature-preparation stage
  • StringIndexer converts string labels to numeric indices. OneHotEncoder converts those indices to binary vectors. Order matters: StringIndexer MUST precede OneHotEncoder
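  • CrossValidator trains k * n models in total (k folds * n ParamGrid combinations). The count grows multiplicatively with the folds, and the grid itself grows exponentially with the number of tuned hyperparameters; use TrainValidationSplit (one split, n fits) when compute is limited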
  • Optuna's define-by-run API lets you define the search space inside the objective function. Use the MLflow callback to auto-log each trial
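  • Hyperopt's fmin always minimizes, and Optuna minimizes by default. With Hyperopt, negate metrics you want to maximize (return -f1_score, not +f1_score); with Optuna, either negate the metric or create the study with direction='maximize'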
  • Parallelism > completed trials degrades Bayesian optimization — TPE needs prior results to propose better parameters. High parallelism = random search behavior
  • applyInPandas() groups data by a key and applies a function per group — use for training one model per store/product/region
  • mapInPandas() applies a function per partition without grouping — use for distributed batch inference, not group-specific training
  • Vertical scaling adds CPU/RAM per node (memory-bound workloads). Horizontal scaling adds more nodes (compute-bound, data-parallel workloads)
  • Data parallelism: each worker holds the full model and trains on a subset of data — standard SparkML, SparkTrials pattern
  • Model parallelism: model is split across workers — only needed when the model itself exceeds single-node memory (rare for classical ML)
  • Nested MLflow runs: parent groups child runs. Use mlflow.start_run(nested=True) inside a parent context. Required for organized hyperparameter search
  • PyFunc models execute custom Python at prediction time — wrap preprocessing AND postprocessing so they travel with the model artifact
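  • Feature Store point-in-time correctness requires a timestamp key column declared on the feature table (these notes call it timestamp_keys; the exact argument name depends on the client library version) AND a timestamp_lookup_key in the FeatureLookup call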
  • On-demand features are computed at request time from data in the prediction request itself — they are NOT pre-computed and stored in the feature table
  • Online tables are read-only replicas of offline Feature Store tables with millisecond lookup latency — they sync automatically but have a propagation delay
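
The CrossValidator arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that only counts fits; the parameter names and values in `param_grid` are hypothetical, and no Spark session is involved.

```python
from itertools import product

# Hypothetical search space: names and values are illustrative only.
param_grid = {
    "maxDepth": [3, 5, 7],
    "numTrees": [50, 100],
    "subsamplingRate": [0.8, 1.0],
}


def grid_size(grid: dict) -> int:
    """Number of parameter combinations in a full grid."""
    return len(list(product(*grid.values())))


def cross_validator_fits(grid: dict, num_folds: int) -> int:
    """k folds times n combinations, as stated in the notes."""
    return num_folds * grid_size(grid)


def train_validation_split_fits(grid: dict) -> int:
    """A single train/validation split fits each combination once."""
    return grid_size(grid)


n_combinations = grid_size(param_grid)               # 3 * 2 * 2
cv_fits = cross_validator_fits(param_grid, 5)        # 5 folds
tvs_fits = train_validation_split_fits(param_grid)
print(n_combinations, cv_fits, tvs_fits)
```

Adding one more hyperparameter with three candidate values would triple every number above, which is why the grid, rather than the fold count, usually dominates the cost.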

Common Traps

Trap: Forgetting VectorAssembler before the SparkML classifier/regressor
Reality: SparkML models accept exactly one features column of type Vector. Without VectorAssembler combining your feature columns, the pipeline will throw a schema error at fit() time, not at definition time.
Trap: Passing string columns directly to OneHotEncoder
Reality: OneHotEncoder only accepts numeric index columns as input. You must apply StringIndexer first to convert strings to numeric indices. This is the single most common SparkML pipeline construction error.
Trap: Setting parallelism equal to max_evals in Optuna/Hyperopt to 'go faster'
Reality: Bayesian optimization (TPE) works by learning from completed trials. If all trials run simultaneously, the optimizer has no completed results to learn from and degenerates to random search. Keep parallelism well below max_evals.
Trap: Using mapInPandas() to train separate models per group
Reality: mapInPandas() processes data partition by partition. A group can span multiple partitions, so you may train partial models. Use groupBy().applyInPandas() so each group's data is co-located before the function is applied.
Trap: Assuming Point-in-Time correctness is automatic in Feature Store lookups
Reality: Point-in-time correctness only works if (1) the feature table has a timestamp key column defined at creation time, AND (2) the FeatureLookup specifies timestamp_lookup_key. Without both, the latest feature value is always returned regardless of training record timestamps.
Trap: Confusing on-demand features with online table features
Reality: Online tables serve pre-computed features with low latency. On-demand features are computed dynamically at prediction time using data in the request itself (e.g., time-of-day, request IP). They serve different purposes and cannot substitute for each other.
Trap: Logging PyFunc models without specifying dependencies
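Reality: If the PyFunc predict() function imports custom libraries, those libraries must be declared through conda_env or pip_requirements when calling mlflow.pyfunc.log_model(). A missing dependency makes the serving endpoint fail at startup rather than at prediction time, so the error shows up in the endpoint's startup logs, not in prediction responses.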
Trap: Using Ray and Spark interchangeably for distributed training
Reality: Ray distributes independent Python functions (compute-parallel). Spark distributes data-parallel operations. Use Ray when the bottleneck is computation on fixed data. Use Spark when the bottleneck is data volume. They are not interchangeable.

Confusing Pairs

applyInPandas() vs. mapInPandas()

applyInPandas() = called after groupBy(), applies the function to EACH GROUP — entire group data is gathered for the function. Use for training one model per group. mapInPandas() = applied per PARTITION — no grouping guarantee. Use for distributed inference or transformation where partition boundaries don't matter.
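
The difference between per-partition and per-group processing can be simulated without Spark. The sketch below is a toy model of the two behaviours, not the Spark implementation: the partition layout is an assumption, and "training" is replaced by taking a mean.

```python
from collections import defaultdict

# Toy rows: (store_id, sales). Store "A" has four rows, store "B" has two.
rows = [("A", 10), ("B", 7), ("A", 12), ("B", 9), ("A", 11), ("A", 13)]

# Assumed physical layout: two partitions of three rows, so both stores span partitions.
partitions = [rows[:3], rows[3:]]


def fit_mean_model(batch):
    """Stand-in for training: the 'model' is the mean sales per store in the batch."""
    values = defaultdict(list)
    for store, sales in batch:
        values[store].append(sales)
    return {store: sum(v) / len(v) for store, v in values.items()}


# mapInPandas-style: the function sees one partition at a time.
per_partition = [fit_mean_model(p) for p in partitions]

# applyInPandas-style: rows are gathered by key first, then the function sees a whole group.
groups = defaultdict(list)
for store, sales in rows:
    groups[store].append((store, sales))
per_group = {store: fit_mean_model(batch)[store] for store, batch in groups.items()}

print(per_partition)  # two partial "models" per store
print(per_group)      # one "model" per store, fitted on all of its rows
```

In the per-partition output each store appears twice with different values, which is the "partial model" problem; the per-group output has exactly one value per store.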

SparkML Pipeline vs. Single-Node Model with SparkTrials

SparkML Pipeline = distributes data across workers, each worker trains on a data shard — use when data volume is too large for a single node. Single-node model with SparkTrials/Optuna = each trial runs the full single-node model (e.g., scikit-learn) on one worker — use when the algorithm doesn't support distributed training but you want distributed search.

Vertical Scaling vs. Horizontal Scaling

Vertical = bigger nodes (more RAM, more CPU per node). Choose when a single operation must fit in memory (e.g., large model, wide DataFrame join). Horizontal = more nodes. Choose when data volume is the bottleneck and operations are parallelizable. The exam tests which is appropriate given a specific constraint.

Data Parallelism vs. Model Parallelism

Data parallelism = full model copied to each worker, each worker trains on a data partition. Standard SparkML pattern. Model parallelism = model split across workers, each holding a layer or shard. Only needed when the model exceeds single-node memory — rare outside large deep learning models.

Optuna vs. Hyperopt with SparkTrials

Optuna = modern framework with define-by-run API, built-in MLflow callback, pruning support for early trial termination, and multi-objective optimization. Preferred for new Databricks workloads. Hyperopt with SparkTrials = older TPE-based framework with Spark-native distribution. Both minimize by default. The September 2025 exam focuses on Optuna patterns.

PyFunc Custom Model vs. Standard MLflow Model

Standard MLflow model = logged directly from scikit-learn, XGBoost, etc. with autolog. Prediction is just the model's native predict(). PyFunc = wraps any Python logic including pre/post-processing, feature engineering, and label mapping inside predict(). Use PyFunc when the serving logic is more than just calling model.predict().

Scenario Tips

If the question asks about:

When the question asks about training separate models for each entity (store, product, region) and data is in a single Spark DataFrame with a group column...

Answer:

Use df.groupBy('entity_col').applyInPandas(train_fn, schema). This co-locates all data for each group on a single worker before applying the function.

Distractor to avoid:

mapInPandas() is wrong here — it operates per partition, not per group, so the same entity's data may be split across workers.

If the question asks about:

When a question asks how to prevent future data from contaminating historical training features in Feature Store lookups...

Answer:

Use point-in-time correctness by (1) defining timestamp_keys when creating the feature table and (2) setting timestamp_lookup_key in FeatureLookup. This retrieves features as of each record's event timestamp.

Distractor to avoid:

Manually filtering the feature table by date is wrong — it applies a single global cutoff, not per-record timestamps, and does not handle entities that were updated multiple times.
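
A per-record "as of" lookup is easy to see on toy data. The following is a plain-Python sketch of the idea, not the Feature Store implementation; the feature history and event timestamps are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical history for one customer: (timestamp, avg_spend_30d), sorted by time.
feature_history = [(1, 100.0), (5, 140.0), (9, 90.0)]


def latest_value(history):
    """What a lookup without a timestamp key effectively returns: the newest row."""
    return history[-1][1]


def as_of_value(history, event_ts):
    """Point-in-time lookup: newest feature row with timestamp <= event_ts, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx else None


training_events = [2, 6, 10]  # event timestamps of three training records
point_in_time = [as_of_value(feature_history, ts) for ts in training_events]
leaky = [latest_value(feature_history) for _ in training_events]

print(point_in_time)  # each record sees only what was known at its own timestamp
print(leaky)          # every record sees the newest value, including future data
```

A single global date filter would behave like `as_of_value` with one fixed `event_ts` for all records, which is why it cannot replace a per-record lookup.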

If the question asks about:

When Optuna produces results barely better than random despite 100 trials...

Answer:

The parallelism is too high. TPE needs completed trial results to make informed proposals. Reduce parallelism so the optimizer can learn sequentially between batches.

Distractor to avoid:

Increasing max_evals won't help if parallelism already saturates all workers — you're still running random search, just more of it.
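
A rough way to reason about this trade-off is to count how many trials can benefit from earlier results. The sketch below uses a deliberately simplified model (an assumption, not how any tuner schedules work): trials launch in synchronous batches of size `parallelism`, so the entire first batch is proposed with no history.

```python
import math


def informed_trials(n_trials: int, parallelism: int) -> int:
    """Trials proposed after at least one earlier batch has completed."""
    return max(n_trials - parallelism, 0)


def sequential_rounds(n_trials: int, parallelism: int) -> int:
    """Number of batches, i.e. opportunities for the sampler to update."""
    return math.ceil(n_trials / parallelism)


# 100 trials at three parallelism levels: (informed trials, sequential rounds)
summary = {
    p: (informed_trials(100, p), sequential_rounds(100, p))
    for p in (1, 10, 100)
}
print(summary)
```

Under this model, parallelism equal to the trial count leaves zero informed trials, which matches the "random search" symptom in the scenario; a moderate value trades some wall-clock time for many more informed proposals.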

If the question asks about:

When a model needs preprocessing (tokenization) and postprocessing (label decoding) at inference time to avoid training-serving skew...

Answer:

Wrap all three steps (preprocessing + model + postprocessing) in a custom MLflow PyFunc model. The predict() method executes the full pipeline as a single artifact.

Distractor to avoid:

Doing preprocessing in the client application is wrong — any difference between the training preprocessing and client preprocessing causes training-serving skew, a notoriously difficult production bug to diagnose.
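
The three-step wrapper can be sketched as a small PyFunc-style class. Everything inside it is hypothetical (the class name, the keyword "model", the labels); the import falls back to `object` so the sketch also runs where MLflow is not installed.

```python
try:
    from mlflow.pyfunc import PythonModel
except ImportError:  # lets the sketch run without MLflow installed
    PythonModel = object


class SentimentPipeline(PythonModel):
    """Hypothetical wrapper: preprocessing + model + postprocessing in one predict()."""

    LABELS = {0: "negative", 1: "positive"}
    POSITIVE_WORDS = {"good", "great", "love"}

    def _preprocess(self, texts):
        # The same tokenization runs at training and at serving time.
        return [t.lower().split() for t in texts]

    def _score(self, token_lists):
        # Stand-in for a real model: class 1 if any positive word is present.
        return [int(any(w in self.POSITIVE_WORDS for w in toks)) for toks in token_lists]

    def _postprocess(self, class_ids):
        # Label decoding travels with the model instead of living in the client.
        return [self.LABELS[i] for i in class_ids]

    def predict(self, context, model_input, params=None):
        return self._postprocess(self._score(self._preprocess(model_input)))


model = SentimentPipeline()
predictions = model.predict(None, ["I LOVE this", "not for me"])
print(predictions)
```

With MLflow available, an instance like this would typically be passed as `python_model` to `mlflow.pyfunc.log_model()`, so the client only ever sends raw text and receives decoded labels.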

If the question asks about:

When to choose SparkML Pipeline vs single-node model for a new ML workload on 5 million rows...

Answer:

Use SparkML if the algorithm is naturally data-parallel (logistic regression, gradient boosted trees with SparkML implementations). Use single-node model with SparkTrials/Optuna if you need scikit-learn or XGBoost's full feature set — distribute the tuning, not the training.

Distractor to avoid:

Forcing single-node algorithms (sklearn, XGBoost) into SparkML syntax is wrong — SparkML has its own implementations with different APIs and behaviors.

Last-Minute Facts

1. SparkML CrossValidator: n_folds * n_paramgrid_combinations = total models trained
2. Hyperopt always MINIMIZES and Optuna minimizes by default: negate metrics to maximize (e.g., return -accuracy), or in Optuna create the study with direction='maximize'
3. Feature table point-in-time correctness requires timestamp_keys at table creation AND timestamp_lookup_key at lookup time; both are required
4. applyInPandas() = per group (after groupBy). mapInPandas() = per partition (no groupBy needed)
5. Online tables have sync delay from offline Feature Store, so design for eventual consistency in real-time serving
6. PyFunc predict() typically receives a pandas DataFrame and returns a pandas DataFrame, Series, or NumPy array
7. Ray uses Python-native parallelism (compute-bound). Spark uses data-parallel operations (data-bound)
8. Nested MLflow run: mlflow.start_run(nested=True) inside an active parent run context
Domain 2 (44% of exam)

MLOps

Must-Know Facts

  • Deploy CODE, not trained model artifacts, across environments. The same pipeline code trains fresh models in each environment (dev/staging/prod) with environment-specific data and configs
  • Unity Catalog model aliases ('champion', 'challenger', 'baseline') replace legacy stage transitions (Staging/Production/Archived). The September 2025 exam tests aliases exclusively
  • Unit tests validate DETERMINISTIC functions only: data transformations, schema validation, feature logic, edge cases. Never test model accuracy in unit tests
  • Integration tests validate component interactions: feature pipelines produce expected output types, training completes without errors, predictions fall within expected ranges
  • End-to-end tests validate the complete pipeline from feature computation through deployment on test datasets in temporary catalogs or schemas
  • Run unit tests on every commit (fast). Run integration tests on merge to main (slower, require infrastructure). End-to-end tests before release
  • Databricks Asset Bundles (DABs) define ML resources as YAML. Use targets for environment-specific overrides — targets INHERIT from default; only specify what changes per environment
  • Lakehouse Monitoring drift metrics are computed on scheduled REFRESH, not continuously — configure refresh frequency based on your monitoring SLA
  • KS test (Kolmogorov-Smirnov) = numerical features only. Chi-squared = categorical features only. Using the wrong test produces statistically meaningless results
  • Jensen-Shannon divergence = symmetric, bounded in [0,1] when computed with base-2 logarithms, and works for both numerical (binned) and categorical distributions. Use it when you need a comparable score across feature types
  • Three Lakehouse Monitoring profile types: snapshot (point-in-time data quality), time series (temporal trends), inference log — InferenceLog in the API — (model inputs/outputs/performance metrics)
  • Inference tables are auto-created when endpoint logging is enabled — they store raw request/response pairs as Delta tables in Unity Catalog. They are NOT drift metrics; Lakehouse Monitoring analyzes them to compute metrics
  • Automated retraining: always compare the challenger against the champion on a HELD-OUT dataset with IDENTICAL metrics before promoting — never auto-promote without validation
  • Champion-challenger pattern: new model runs in shadow mode alongside production champion. Compare offline. Promote only when challenger beats champion by a statistically significant margin
  • Concept drift = P(Y|X) changes (the relationship between features and target). Data drift = P(X) changes (feature distributions shift). Both can occur independently
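
The promotion rule in the bullets above (same held-out data, same metric, no auto-promotion) can be sketched as a small gate. The labels, predictions, and the fixed `min_gain` threshold are illustrative assumptions; a fixed threshold is a simplification of a proper significance test. The `promote` helper uses `MlflowClient.set_registered_model_alias` and is not executed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def should_promote(y_true, champion_pred, challenger_pred, min_gain=0.01):
    """Score both models on the SAME held-out labels with the SAME metric."""
    champ = accuracy(y_true, champion_pred)
    chall = accuracy(y_true, challenger_pred)
    return chall - champ >= min_gain, champ, chall


def promote(model_name: str, version: str) -> None:
    """Move the 'champion' alias; needs MLflow and a Unity Catalog registry."""
    from mlflow import MlflowClient

    MlflowClient().set_registered_model_alias(model_name, "champion", version)


# Made-up held-out labels and predictions.
y_true          = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
champion_pred   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
challenger_pred = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

ok, champ_acc, chall_acc = should_promote(y_true, champion_pred, challenger_pred)
print(ok, champ_acc, chall_acc)
```

In a retraining job, `promote(...)` would be called only when the gate returns `True`, and the model name would be the three-level Unity Catalog name.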

Common Traps

Trap: Deploying trained model files (pkl, MLflow artifact) from dev to prod instead of retraining in prod
Reality: The professional MLOps pattern is to promote pipeline CODE (and its tests) to production, which then trains a fresh model on production data. Moving model artifacts across environments is an anti-pattern because it ties the model to dev-environment data characteristics.
Trap: Confusing Unity Catalog model aliases with legacy Model Registry stage names
Reality: The exam tests the current alias-based approach. Aliases are flexible custom names ('champion', 'v2-experiment', 'approved-2026-Q1'). Legacy stages (Staging/Production/Archived) are being deprecated. If you see 'set the model version to Production stage' as an answer, it's the old approach.
Trap: Writing unit tests that assert on model accuracy or F1 score
Reality: Model accuracy is non-deterministic: it varies with data, hyperparameters, and random seeds. Unit tests must only test deterministic logic. Accuracy validation belongs in integration or end-to-end tests where you control the test dataset.
Trap: Duplicating the full YAML configuration in each DABs target
Reality: DABs targets inherit from the default configuration. Only specify environment-specific overrides (cluster size, catalog name, permissions, schedule). Full duplication creates a maintenance burden and introduces configuration drift risk between environments.
Trap: Assuming Lakehouse Monitoring detects drift in real time
Reality: Drift metrics are computed only when the monitor refreshes (on a configurable schedule). Drift that occurs and fully resolves within a single refresh interval may never be detected. Set refresh intervals based on your business SLA for drift detection.
Trap: Applying the KS test to categorical features
Reality: The KS test compares continuous cumulative distributions; it is undefined for discrete categorical data. Use Chi-squared for categorical drift detection. This is a high-frequency exam mistake because KS is the more commonly known statistical test.
Trap: Treating inference tables as drift dashboards
Reality: Inference tables only store raw request/response logs. They contain no drift metrics. You must configure Lakehouse Monitoring to analyze the inference table and compute drift metrics, aggregate statistics, and performance trends.
Trap: Assuming data drift always causes model performance degradation
Reality: Data drift (P(X) change) may not degrade performance if the model generalizes well to the shifted distribution. Always monitor prediction quality and business metrics alongside feature distributions. Feature drift is a leading indicator, not a guarantee of degradation.

Confusing Pairs

Unity Catalog Model Aliases vs. Legacy Model Registry Stages

Unity Catalog aliases = flexible custom names ('champion', 'challenger'), each pointing to exactly one model version at a time. Multiple aliases can point to the same version. No fixed vocabulary. Current exam standard. Legacy stages = fixed vocabulary (None/Staging/Production/Archived), assigned through stage transitions, being deprecated. If an answer uses stage transitions, it is the legacy approach and likely wrong.

Data Drift vs. Concept Drift

Data drift = input feature distributions change (P(X) shifts). Detected by KS/Chi-squared on feature columns. May or may not hurt model performance. Concept drift = the relationship between inputs and outputs changes (P(Y|X) shifts). Cannot be detected from features alone — requires labeled data to evaluate. Model performance degrades even with stable feature distributions. Concept drift always requires retraining; data drift may not.

Concept Drift vs. Prediction Drift

Concept drift = root cause, the learned mapping no longer reflects reality. Prediction drift = observable symptom, the distribution of model outputs (P(Y_hat)) has shifted. Prediction drift can be caused by data drift OR concept drift. Always diagnose cause before retraining.

Snapshot Monitor vs. Time Series Monitor vs. Inference Log Monitor

Snapshot = compares current table state against a baseline snapshot. Best for data quality on static or append-only tables. Time Series = analyzes feature distributions over rolling time windows. Best for tracking trends in streaming or frequently updated tables. Inference Log = tracks model inputs, outputs, and performance metrics from a serving endpoint's inference table. Official Databricks API profile type is InferenceLog. Best for production model monitoring. The exam tests which type to create for a given scenario.

Unit Test vs. Integration Test vs. End-to-End Test

Unit = tests a single function in isolation with mocked dependencies. Runs in milliseconds. Tests data transformations, validators, feature logic. Integration = tests how components work together. Requires Databricks infrastructure. Tests feature pipeline output, training completion, prediction ranges. End-to-end = tests the full pipeline from raw data through deployed model. Uses test catalogs. Slowest. Run before releasing to production.
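
A unit test in this sense targets deterministic logic only. The sketch below uses a hypothetical feature function (`bucket_age`) and the standard-library `unittest` module; it asserts on exact outputs and edge cases, never on model accuracy.

```python
import unittest


def bucket_age(age: int) -> str:
    """Deterministic feature logic: the same input always gives the same bucket."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


class BucketAgeTest(unittest.TestCase):
    def test_boundaries(self):
        # Boundary values are where off-by-one bugs hide.
        self.assertEqual(bucket_age(17), "minor")
        self.assertEqual(bucket_age(18), "adult")
        self.assertEqual(bucket_age(65), "senior")

    def test_rejects_negative_age(self):
        # Edge case: invalid input should fail loudly, not produce a bucket.
        with self.assertRaises(ValueError):
            bucket_age(-1)
```

Saved in a test module, this would run with `python -m unittest` in well under a second, which is what makes it suitable for every commit.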

KS Test vs. Chi-Squared Test vs. Jensen-Shannon Divergence

KS test = numerical features only. Compares cumulative distributions. Output: p-value for hypothesis test. Chi-squared = categorical features only. Tests if observed vs expected frequencies differ. Output: p-value. Jensen-Shannon divergence = works for both (numerical must be binned). Symmetric, bounded [0,1]. No p-value — higher value = more divergence. Use JS when you want comparable drift scores across feature types.
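
The symmetry and the [0, 1] bound of Jensen-Shannon divergence can be verified directly. This is a minimal sketch assuming base-2 logarithms and already-binned probability vectors; the distributions are made up.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


baseline = [0.5, 0.3, 0.2]   # e.g. category shares at training time
current = [0.2, 0.3, 0.5]    # shares in the latest window

identical = js_divergence(baseline, baseline)      # no drift
drifted = js_divergence(baseline, current)         # somewhere in between
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # maximal drift
print(identical, drifted, disjoint)
```

Because the score has no p-value attached, a threshold for "too much drift" has to be chosen per use case.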

Scenario Tips

If the question asks about:

When asked how to move an ML pipeline from dev to production...

Answer:

Promote the pipeline CODE and configuration (via DABs or Git) to the production environment. The production environment runs the same code against production data to train a fresh model.

Distractor to avoid:

Exporting the dev-trained model artifact and registering it in production is the anti-pattern. It skips validation with production data and may embed dev-specific data characteristics.

If the question asks about:

When a fraud model's precision drops 15% and you need to determine if it is data drift or concept drift...

Answer:

Use Lakehouse Monitoring to compare input feature distributions against the training baseline (diagnoses data drift), AND separately evaluate the model on recent labeled data (diagnoses concept drift). You cannot distinguish the two by looking at features alone.

Distractor to avoid:

Immediately retraining without diagnosing the cause is wrong — if it's concept drift with a structural change, retraining on recent data may help, but if it's a data quality issue, retraining amplifies the problem.

If the question asks about:

When the question asks which statistical test to configure in Lakehouse Monitoring for numerical feature drift...

Answer:

Kolmogorov-Smirnov (KS) test for numerical/continuous features. Chi-squared for categorical features.

Distractor to avoid:

Chi-squared on numerical features and KS on categorical features are both wrong. This is the highest-frequency statistical trap on the exam.
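
Why KS needs ordered numeric values becomes clear from how its statistic is computed: it is the largest gap between two empirical cumulative distribution functions. The sketch below computes only the statistic, with no p-value, on made-up samples; a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


same = ks_statistic([1, 2, 3], [1, 2, 3])            # identical samples
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # partial overlap
separated = ks_statistic([1, 2, 3], [10, 11, 12])    # no overlap
print(same, shifted, separated)
```

The sort and the "less than or equal" comparison have no meaning for unordered categories, which is the reason the notes pair categorical features with Chi-squared instead.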

If the question asks about:

When a new model outperforms the champion in offline testing and you need to decide how to promote it...

Answer:

Assign the new model the 'challenger' alias in Unity Catalog. Run champion-challenger comparison in production (shadow or A/B). Only assign the 'champion' alias when the challenger has demonstrated production superiority.

Distractor to avoid:

Directly assigning the 'champion' alias without production validation skips the real-world safety check. Offline metrics do not always translate to production performance.

If the question asks about:

When asked which Lakehouse Monitoring table type to create to track model serving endpoint predictions over time...

Answer:

Inference Log monitor (InferenceLog profile type) — it specifically tracks model inputs, outputs, and performance metrics from a serving endpoint's inference log table. Configure it with problem_type, prediction_col, label_col, and timestamp_col.

Distractor to avoid:

Snapshot monitor is wrong — it compares point-in-time states, not temporal model output trends. Time series monitor is closer but doesn't have built-in awareness of model prediction semantics.

If the question asks about:

When configuring Databricks Asset Bundles for dev/staging/prod with different cluster sizes and catalog names...

Answer:

Define the full pipeline once in the bundle's top-level (default) configuration. Create three targets (dev, staging, prod) that each specify ONLY the values that differ, here the cluster size and catalog name. The targets inherit everything else from the top-level configuration.

Distractor to avoid:

Creating three completely separate YAML files is the anti-pattern — it creates configuration drift and maintenance overhead.
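
The inherit-and-override idea can be modelled as a recursive dictionary merge. This is a conceptual sketch only: the keys below are illustrative and do not follow the bundle schema, and the merge function is an assumption about the semantics, not the Databricks CLI's implementation.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied recursively; override wins on conflicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical shared settings, defined once.
defaults = {
    "catalog": "dev_catalog",
    "job": {"name": "train_model", "schedule": "daily", "cluster": {"num_workers": 2}},
}

# Each target lists only what changes.
targets = {
    "dev": {},
    "prod": {"catalog": "prod_catalog", "job": {"cluster": {"num_workers": 8}}},
}

resolved = {name: deep_merge(defaults, overrides) for name, overrides in targets.items()}
print(resolved["prod"])
```

The prod target changes two values and still ends up with the shared job name and schedule, which is the property that full duplication gives up.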

Last-Minute Facts

1. KS test = numerical drift only. Chi-squared = categorical drift only. Mixing them gives meaningless results
2. Lakehouse Monitoring: 3 profile types: snapshot, time series, inference log (InferenceLog). Inference log = model prediction monitoring
3. Drift metrics computed on REFRESH schedule, not in real time. Set refresh interval based on your SLA
4. Deploy CODE not model artifacts across environments: this is the professional MLOps pattern
5. Unity Catalog aliases (champion/challenger) replace legacy stages (Staging/Production) on current exam
6. Unit tests: test deterministic logic only. Never assert on model accuracy or F1 score
7. DABs targets INHERIT from default config: only specify overrides per environment
8. Champion-challenger: compare on identical held-out data with identical metrics before promoting
9. Concept drift requires labeled recent data to detect. Data drift can be detected without labels
10. Inference tables = raw request/response logs. Lakehouse Monitoring = analyzes inference tables to produce drift metrics
Domain 3 (12% of exam)

Model Deployment

Must-Know Facts

  • Blue-green = two full environments running simultaneously. Traffic switches 100% instantly. Rollback is instant by switching back. Costs double during deployment window
  • Canary = gradual traffic routing (5% → 25% → 50% → 100%). Each step monitored before proceeding. Slower full rollout but validates with real traffic at low risk
  • Traffic splitting percentages must sum to exactly 100% across all served model versions on an endpoint — the exam uses this as a distractor in config questions
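  • PyFunc model registration in Unity Catalog: call mlflow.pyfunc.log_model() with registered_model_name set to the 3-level namespace (catalog.schema.model_name) for governance and lineage. The MLflow registry URI must point at Unity Catalog ('databricks-uc') where that is not already the workspace default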
  • Model Serving endpoints are for real-time low-latency predictions only. For batch scoring of large datasets, use spark_udf with batch inference jobs
  • Canary deployment does NOT offer instant rollback — shifting traffic back takes time through the same gradual process. Choose blue-green when instant rollback is required
  • Shadow deployment: both models receive the same traffic simultaneously, but only the incumbent's predictions are served to users. New model's results are logged for offline comparison — zero user impact
  • Model serving REST API requires Bearer token authentication (PAT or service principal). Anonymous access is not supported by default
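  • Custom artifacts (lookup tables, preprocessing pipelines, encoders) must be packaged WITH the model, for example through the artifacts argument of mlflow.pyfunc.log_model(). Files logged only to the run with log_artifact(), or files left on the training cluster, cannot be assumed to exist at the serving endpoint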

Common Traps

Trap: Routing batch scoring jobs through real-time serving endpoints
Reality: Serving endpoints are architected for low-latency single-record or micro-batch predictions. Running thousands of records through REST calls is expensive and inefficient. Use mlflow.pyfunc.spark_udf() for scalable batch inference on Spark.
Trap: Choosing canary deployment when the requirement specifies 'instant rollback'
Reality: Canary cannot instantly roll back: you must route traffic back gradually through the same percentage steps, which takes time. Blue-green maintains a hot standby environment that can receive 100% traffic in seconds.
Trap: Thinking blue-green is always safer than canary
Reality: Blue-green switches all traffic at once, so 100% of users are exposed to the new model immediately. Canary exposes only a small percentage initially. For risk reduction with real production traffic, canary is actually safer per-user; blue-green offers faster rollback capability, not fewer users exposed.
Trap: Forgetting to log all dependencies when packaging PyFunc models
Reality: Serving endpoints load the model in a fresh Python environment. If the PyFunc predict() function imports a library not listed in conda_env or pip_requirements, the endpoint fails to start. These failures are logged as startup errors, not prediction errors, and are easy to miss in monitoring.
Trap: Assuming traffic_percentage can be set to any combination that makes intuitive sense
Reality: All served_models on an endpoint must have traffic_percentage values that sum to exactly 100. A configuration with 90 + 15 = 105 is invalid. This is a common distractor in endpoint configuration questions.
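
The sum-to-100 rule is simple enough to check before sending any endpoint configuration. The sketch below validates a list of route dictionaries; the model names are hypothetical, and the `served_model_name` / `traffic_percentage` keys are assumed to mirror the shape of a serving endpoint's traffic configuration.

```python
def validate_routes(routes):
    """Raise ValueError unless the traffic percentages add up to exactly 100."""
    total = sum(route["traffic_percentage"] for route in routes)
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")
    return True


# A valid canary-style split between two served versions.
canary_routes = [
    {"served_model_name": "fraud-v1", "traffic_percentage": 95},
    {"served_model_name": "fraud-v2", "traffic_percentage": 5},
]

# The invalid combination from the trap above: 90 + 15 = 105.
invalid_routes = [
    {"served_model_name": "fraud-v1", "traffic_percentage": 90},
    {"served_model_name": "fraud-v2", "traffic_percentage": 15},
]

print(validate_routes(canary_routes))
```

Running the same check in CI before an endpoint update catches this class of error earlier than a rejected API call would.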

Confusing Pairs

Blue-Green Deployment vs. Canary Deployment

Blue-Green = instant full switch, instant rollback, double infrastructure cost during deployment. Choose when rollback speed is critical (regulated, high-stakes). Canary = gradual percentage rollout, validates with real traffic, minimal extra cost, slow rollback. Choose when you want incremental confidence with production traffic.

Canary Deployment vs. Shadow Deployment (Champion/Challenger)

Canary = new model SERVES real users with a small percentage of traffic. Users receive new model predictions. Shadow = new model runs on the same traffic but its predictions are NEVER served to users. Results are logged for offline comparison only. Zero user exposure in shadow mode.

Model Serving Endpoint (Real-Time) vs. Batch Inference (spark_udf)

Serving endpoint = REST API, millisecond latency, auto-scaling, for real-time single/micro-batch predictions. Use when the application needs a synchronous prediction response. spark_udf = wraps MLflow model as a Spark UDF, runs distributed on a cluster, for scoring millions of records offline. Use for batch jobs, overnight scoring, or data pipeline enrichment.

MLflow Deployments SDK vs. REST API for Serving

MLflow Deployments SDK = Python interface for creating, updating, querying, and deleting serving endpoints programmatically. Better for automated CI/CD pipelines that manage endpoint lifecycle. REST API = direct HTTP calls to invoke predictions. Use for integration with non-Python clients or when calling a deployed endpoint at prediction time.

Scenario Tips

If the question asks about:

When a regulated financial application requires deploying a new model with the ability to completely reverse the deployment within seconds if it underperforms...

Answer:

Blue-green deployment. Maintain the current (blue) environment live while deploying to green. Switch traffic 100% to green. If issues arise, switch 100% back to blue in seconds.

Distractor to avoid:

Canary cannot instantly roll back — reversing a canary requires gradually shifting traffic back, which takes time proportional to the rollout steps.

If the question asks about:

When a team wants to validate a new recommendation model with real production traffic but cannot risk showing worse recommendations to most users...

Answer:

Canary deployment starting at 5% traffic. Monitor click-through rate, conversion rate, and error rate at each stage before increasing to 25%, 50%, and finally 100%.

Distractor to avoid:

Blue-green exposes 100% of users to the new model on switch — higher per-user risk if the model underperforms. Shadow deployment would provide zero real impact but also zero real feedback about user behavior changes.

If the question asks about:

When a question asks how to score 50 million records overnight using a registered MLflow model...

Answer:

Use mlflow.pyfunc.spark_udf() to wrap the model as a Spark UDF and apply it to the DataFrame with withColumn(). This distributes inference across the cluster.

Distractor to avoid:

Routing 50M records through a real-time serving endpoint via REST API calls is prohibitively expensive, extremely slow, and not how serving endpoints are designed to be used.
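
The batch pattern can be sketched as a small job function. The table and model names are hypothetical, and `score_table` is defined but not executed here because it needs a Spark session with `mlflow` and `pyspark` installed; it assumes every column of the source table is a model input.

```python
def alias_uri(model_name: str, alias: str) -> str:
    """MLflow URI for a registered model version addressed by alias."""
    return f"models:/{model_name}@{alias}"


def score_table(spark, model_uri: str, source_table: str, target_table: str) -> None:
    """Batch-score a table with a registered model wrapped as a Spark UDF."""
    import mlflow
    from pyspark.sql.functions import struct

    # Wrap the model once; Spark then applies it to partitions in parallel.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri)
    df = spark.table(source_table)
    scored = df.withColumn("prediction", predict(struct(*df.columns)))
    scored.write.mode("overwrite").saveAsTable(target_table)


# Hypothetical three-level Unity Catalog model name.
uri = alias_uri("main.ml.churn_model", "champion")
print(uri)
```

Addressing the model by alias means the nightly job picks up whichever version currently holds 'champion' without a code change.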

If the question asks about:

When a custom model requires a label encoder (fitted on training data) to be available at prediction time...

Answer:

Package the encoder with the model by passing it in the artifacts dict of mlflow.pyfunc.log_model(). The PyFunc model then reads it from context.artifacts, typically in load_context(), when the endpoint loads the model.

Distractor to avoid:

Assuming the encoder is accessible from the file system at the serving endpoint is wrong — endpoints run in isolated containers with only what was logged with the model.
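
Loading a packaged artifact can be sketched with a PyFunc-style class and a stand-in context object. The class, file name, and label map are hypothetical; `SimpleNamespace` imitates the `context.artifacts` mapping that MLflow passes to `load_context()`, and the import falls back to `object` so the sketch runs without MLflow.

```python
import json
import os
import tempfile
from types import SimpleNamespace

try:
    from mlflow.pyfunc import PythonModel
except ImportError:  # lets the sketch run without MLflow installed
    PythonModel = object


class DecodingModel(PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local file paths.
        with open(context.artifacts["label_map"]) as f:
            self.id_to_label = {int(k): v for k, v in json.load(f).items()}

    def predict(self, context, model_input, params=None):
        # Stand-in model: inputs are already class ids; only the decoding is shown.
        return [self.id_to_label[i] for i in model_input]


# Write a small label map to disk, as a training job would before logging the model.
tmp_dir = tempfile.mkdtemp()
label_map_path = os.path.join(tmp_dir, "label_map.json")
with open(label_map_path, "w") as f:
    json.dump({"0": "cat", "1": "dog"}, f)

# Local stand-in for what the serving container provides at load time.
model = DecodingModel()
model.load_context(SimpleNamespace(artifacts={"label_map": label_map_path}))
decoded = model.predict(None, [1, 0, 1])
print(decoded)
```

When logging for real, the same path would be supplied as `artifacts={"label_map": label_map_path}` to `mlflow.pyfunc.log_model()`, so the file is copied into the model package instead of being read from the training machine.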

Last-Minute Facts

1. Blue-green = instant rollback (switch 100% traffic back). Canary = gradual rollback (no instant option)
2. Traffic percentages across all endpoint served models must sum to exactly 100%
3. Serving endpoints = real-time only. Batch scoring = spark_udf on a cluster
4. Shadow deployment = new model runs but predictions are NEVER served: zero user exposure
5. PyFunc dependencies MUST be in conda_env or pip_requirements: missing imports = startup failure
6. REST API auth = Bearer token (PAT or service principal). No anonymous access
7. Custom artifacts (encoders, lookup tables) must be logged with the model artifact, not assumed present at serving time
