General Exam Tips
- 1. Read ALL answer options before choosing: two options often both work, but one is the Databricks-idiomatic answer
- 2. When stuck, eliminate options that compromise governance, automation, or scalability; the exam rewards production-ready thinking
- 3. Scenario phrasing like 'most efficient', 'best approach', or 'production-grade' means one option is clearly better; look for the automation/governance angle
- 4. This exam tests judgment, not syntax recall. Ask 'Why choose this over the obvious alternative?' for every answer you pick
- 5. 59 questions in 120 minutes = ~2 minutes per question. Flag long scenarios and return to them; do not skip the final review
- 6. No penalty for wrong answers, so always guess before moving on
- 7. The September 2025 version consolidated to 3 domains. If a study resource mentions 4 domains, it is outdated
- 8. MLOps and Model Development are both 44%, so treat them as equally important. Do not over-invest in Model Deployment (12%)
- 9. Scenario questions often describe a symptom (e.g., 'model precision dropped') and test whether you diagnose correctly before jumping to a fix
Model Development
Must-Know Facts
- SparkML requires ALL features combined into a single Vector column — VectorAssembler is mandatory as the final feature-preparation stage
- StringIndexer converts string labels to numeric indices. OneHotEncoder converts those indices to binary vectors. Order matters: StringIndexer MUST precede OneHotEncoder
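- CrossValidator trains k * n models in total (k folds * n ParamGrid combinations), plus one final refit of the best parameters on the full training set. The cost grows multiplicatively, and n itself grows exponentially with the number of tuned hyperparameters; use TrainValidationSplit (n fits plus the refit) when compute is limited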
- Optuna's define-by-run API lets you define the search space inside the objective function. Use the MLflow callback to auto-log each trial
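- Hyperopt's fmin always minimizes, and Optuna studies minimize by default. Negate metrics you want to maximize (return -f1_score, not +f1_score); in Optuna you can instead create the study with direction='maximize'
- Setting parallelism close to or above the total number of trials degrades Bayesian optimization: TPE needs completed results to propose better parameters, so very high parallelism behaves like random search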
- applyInPandas() groups data by a key and applies a function per group — use for training one model per store/product/region
- mapInPandas() applies a function per partition without grouping — use for distributed batch inference, not group-specific training
- Vertical scaling adds CPU/RAM per node (memory-bound workloads). Horizontal scaling adds more nodes (compute-bound, data-parallel workloads)
- Data parallelism: each worker holds the full model and trains on a subset of data — standard SparkML, SparkTrials pattern
- Model parallelism: model is split across workers — only needed when the model itself exceeds single-node memory (rare for classical ML)
- Nested MLflow runs: parent groups child runs. Use mlflow.start_run(nested=True) inside a parent context. Required for organized hyperparameter search
- PyFunc models execute custom Python at prediction time — wrap preprocessing AND postprocessing so they travel with the model artifact
- Feature Store point-in-time correctness requires a timestamp_keys column in the feature table AND a timestamp_lookup_key in the FeatureLookup call
- On-demand features are computed at request time from data in the prediction request itself — they are NOT pre-computed and stored in the feature table
- Online tables are read-only replicas of offline Feature Store tables with millisecond lookup latency — they sync automatically but have a propagation delay
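The CrossValidator cost mentioned in the list above is simple arithmetic. A minimal sketch in plain Python (no Spark needed), assuming a hypothetical three-parameter grid; `count_fits` is an illustrative helper, not a SparkML function. It also counts the final refit of the best parameter set on the full training data.

```python
from itertools import product

# Hypothetical grid: 3 x 3 x 2 = 18 combinations.
param_grid = {
    "maxDepth": [3, 5, 7],
    "numTrees": [50, 100, 200],
    "subsamplingRate": [0.8, 1.0],
}

def count_fits(grid, num_folds):
    """Return (search_fits, total_fits) for a grid search with num_folds folds."""
    n_combinations = len(list(product(*grid.values())))
    search_fits = num_folds * n_combinations
    return search_fits, search_fits + 1  # +1 for the refit of the best params

# CrossValidator-style search with k = 5 folds.
cv_search, cv_total = count_fits(param_grid, num_folds=5)
# TrainValidationSplit-style search: a single train/validation split.
tvs_search, tvs_total = count_fits(param_grid, num_folds=1)
print(cv_search, cv_total, tvs_search, tvs_total)
```

Adding a fourth hyperparameter with three candidate values would triple every number here, which is where the real cost pressure comes from.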
Common Traps
Confusing Pairs
Scenario Tips
When the question asks about training separate models for each entity (store, product, region) and data is in a single Spark DataFrame with a group column...
Use df.groupBy('entity_col').applyInPandas(train_fn, schema). This co-locates all data for each group on a single worker before applying the function.
mapInPandas() is wrong here — it operates per partition, not per group, so the same entity's data may be split across workers.
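The per-group versus per-partition distinction can be illustrated without Spark. The sketch below is a plain-Python stand-in, not the Spark API: the partitions, the `train_fn` toy "model" (a mean), and both helper functions are hypothetical and only mimic how each method hands data to your function.

```python
from collections import defaultdict

# Hypothetical rows (store_id, sales), spread over two partitions as Spark might hold them.
partitions = [
    [("A", 10.0), ("B", 7.0), ("A", 12.0)],
    [("A", 14.0), ("B", 9.0)],
]

def train_fn(rows):
    """Toy 'model': the mean of sales. Stands in for a real fit call."""
    values = [sales for _, sales in rows]
    return sum(values) / len(values)

def per_group(parts):
    """applyInPandas-like: gather each key's rows together, then train once per key."""
    groups = defaultdict(list)
    for part in parts:
        for row in part:
            groups[row[0]].append(row)
    return {key: train_fn(rows) for key, rows in groups.items()}

def per_partition(parts):
    """mapInPandas-like: the function only ever sees one partition at a time."""
    return [train_fn(part) for part in parts]

group_models = per_group(partitions)          # one result per store
partition_models = per_partition(partitions)  # one result per partition, stores mixed
```

The per-partition results mix stores A and B and never see all of store A's rows at once, which is why that approach cannot produce one model per entity.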
When a question asks how to prevent future data from contaminating historical training features in Feature Store lookups...
Use point-in-time correctness by (1) defining timestamp_keys when creating the feature table and (2) setting timestamp_lookup_key in FeatureLookup. This retrieves features as of each record's event timestamp.
Manually filtering the feature table by date is wrong — it applies a single global cutoff, not per-record timestamps, and does not handle entities that were updated multiple times.
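To make the per-record lookup concrete, here is a minimal as-of join in plain Python. It is a conceptual sketch of what point-in-time correctness means, not the Feature Store implementation; the feature history, the integer timestamps, and `lookup_as_of` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history for one customer: (feature_timestamp, avg_spend_30d),
# sorted by timestamp.
feature_history = [(1, 20.0), (5, 35.0), (9, 50.0)]

def lookup_as_of(history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, or None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None

# Each training record carries its own event timestamp.
training_events = [0, 4, 5, 8, 12]
joined = [(ts, lookup_as_of(feature_history, ts)) for ts in training_events]
```

Each record gets the value that was current at its own event time, so the record at time 4 never sees the update written at time 5. A single global date filter cannot express this.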
When Optuna produces results barely better than random despite 100 trials...
The parallelism is too high. TPE needs completed trial results to make informed proposals. Reduce parallelism so the optimizer can learn sequentially between batches.
Increasing the number of trials (n_trials in Optuna, max_evals in Hyperopt) won't help if parallelism already saturates all workers; you're still running random search, just more of it.
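A rough way to see the effect is to count how many completed results each trial can learn from. The sketch below assumes trials run in synchronized batches of size `parallelism`, which is a simplification (real workers finish asynchronously); both functions are hypothetical helpers, not part of Optuna or Hyperopt.

```python
def history_per_trial(n_trials, parallelism):
    """Completed results visible to each trial when trials run in synchronized batches."""
    return [(i // parallelism) * parallelism for i in range(n_trials)]

def summarize(n_trials, parallelism):
    seen = history_per_trial(n_trials, parallelism)
    return {
        "rounds": -(-n_trials // parallelism),           # ceiling division
        "uninformed": sum(1 for s in seen if s == 0),    # trials with no history at all
        "mean_history": sum(seen) / n_trials,            # average results available
    }

low = summarize(n_trials=100, parallelism=4)
high = summarize(n_trials=100, parallelism=50)
print(low)
print(high)
```

With parallelism 50, half of the 100 trials are proposed with no prior results, and the optimizer gets only one chance to update its proposals.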
When a model needs preprocessing (tokenization) and postprocessing (label decoding) at inference time to avoid training-serving skew...
Wrap all three steps (preprocessing + model + postprocessing) in a custom MLflow PyFunc model. The predict() method executes the full pipeline as a single artifact.
Doing preprocessing in the client application is wrong — any difference between the training preprocessing and client preprocessing causes training-serving skew, a notoriously difficult production bug to diagnose.
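The shape of such a wrapper can be sketched as follows. This is plain Python so it runs anywhere; the assumption is that in real use the class would subclass mlflow.pyfunc.PythonModel and be logged with mlflow.pyfunc.log_model(). `TextClassifierWrapper` and `KeywordModel` are hypothetical names, and the tokenizer is a deliberately trivial stand-in.

```python
class TextClassifierWrapper:
    """Shaped like an mlflow.pyfunc.PythonModel subclass (assumption: real code inherits from it)."""

    def __init__(self, model, labels):
        self.model = model    # any object exposing predict(list_of_token_lists)
        self.labels = labels  # class index -> human-readable label

    def _preprocess(self, texts):
        # Same tokenization at training and serving time, because it lives in the model.
        return [text.lower().split() for text in texts]

    def _postprocess(self, class_ids):
        return [self.labels[i] for i in class_ids]

    def predict(self, context, model_input):
        tokens = self._preprocess(model_input)
        class_ids = self.model.predict(tokens)
        return self._postprocess(class_ids)


class KeywordModel:
    """Hypothetical stand-in for a trained classifier."""

    def predict(self, token_lists):
        return [1 if "refund" in tokens else 0 for tokens in token_lists]


wrapper = TextClassifierWrapper(KeywordModel(), labels=["other", "billing"])
result = wrapper.predict(None, ["Please REFUND my order", "Where is the store?"])
```

The client sends raw text and receives decoded labels, so there is no client-side preprocessing that could diverge from training.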
When to choose SparkML Pipeline vs single-node model for a new ML workload on 5 million rows...
Use SparkML if the algorithm is naturally data-parallel (logistic regression, gradient boosted trees with SparkML implementations). Use single-node model with SparkTrials/Optuna if you need scikit-learn or XGBoost's full feature set — distribute the tuning, not the training.
Forcing single-node algorithms (sklearn, XGBoost) into SparkML syntax is wrong — SparkML has its own implementations with different APIs and behaviors.
Last-Minute Facts
MLOps
Must-Know Facts
- Deploy CODE, not trained model artifacts, across environments. The same pipeline code trains fresh models in each environment (dev/staging/prod) with environment-specific data and configs
- Unity Catalog model aliases ('champion', 'challenger', 'baseline') replace legacy stage transitions (Staging/Production/Archived). The September 2025 exam tests aliases exclusively
- Unit tests validate DETERMINISTIC functions only: data transformations, schema validation, feature logic, edge cases. Never test model accuracy in unit tests
- Integration tests validate component interactions: feature pipelines produce expected output types, training completes without errors, predictions fall within expected ranges
- End-to-end tests validate the complete pipeline from feature computation through deployment on test datasets in temporary catalogs or schemas
- Run unit tests on every commit (fast). Run integration tests on merge to main (slower, require infrastructure). End-to-end tests before release
- Databricks Asset Bundles (DABs) define ML resources as YAML. Use targets for environment-specific overrides: each target INHERITS the bundle's top-level (default) configuration, so only specify what changes per environment
- Lakehouse Monitoring drift metrics are computed on scheduled REFRESH, not continuously — configure refresh frequency based on your monitoring SLA
- KS test (Kolmogorov-Smirnov) = numerical features only. Chi-squared = categorical features only. Using the wrong test produces statistically meaningless results
- Jensen-Shannon divergence = symmetric, bounded [0,1] (when computed with log base 2), works for both numerical (binned) and categorical distributions. Use when you need a comparable score across feature types
- Three Lakehouse Monitoring profile types: snapshot (point-in-time data quality), time series (temporal trends), inference log — InferenceLog in the API — (model inputs/outputs/performance metrics)
- Inference tables are auto-created when endpoint logging is enabled — they store raw request/response pairs as Delta tables in Unity Catalog. They are NOT drift metrics; Lakehouse Monitoring analyzes them to compute metrics
- Automated retraining: always compare the challenger against the champion on a HELD-OUT dataset with IDENTICAL metrics before promoting — never auto-promote without validation
- Champion-challenger pattern: new model runs in shadow mode alongside production champion. Compare offline. Promote only when challenger beats champion by a statistically significant margin
- Concept drift = P(Y|X) changes (the relationship between features and target). Data drift = P(X) changes (feature distributions shift). Both can occur independently
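The Jensen-Shannon properties listed above (symmetric, bounded in [0,1]) are easy to verify by hand. A minimal sketch, assuming both inputs are probability vectors over the same bins or categories and using log base 2; the example distributions are made up.

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

baseline = [0.5, 0.3, 0.2]  # hypothetical training-time category shares
current = [0.2, 0.3, 0.5]   # hypothetical recent category shares

drift_score = js_divergence(baseline, current)
no_drift = js_divergence(baseline, baseline)
max_drift = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions score 0 and completely disjoint ones score 1, which is what makes the score comparable across features.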
Common Traps
Confusing Pairs
Scenario Tips
When asked how to move an ML pipeline from dev to production...
Promote the pipeline CODE and configuration (via DABs or Git) to the production environment. The production environment runs the same code against production data to train a fresh model.
Exporting the dev-trained model artifact and registering it in production is the anti-pattern. It skips validation with production data and may embed dev-specific data characteristics.
When a fraud model's precision drops 15% and you need to determine if it is data drift or concept drift...
Use Lakehouse Monitoring to compare input feature distributions against the training baseline (diagnoses data drift), AND separately evaluate the model on recent labeled data (diagnoses concept drift). You cannot distinguish the two by looking at features alone.
Immediately retraining without diagnosing the cause is wrong — if it's concept drift with a structural change, retraining on recent data may help, but if it's a data quality issue, retraining amplifies the problem.
When the question asks which statistical test to configure in Lakehouse Monitoring for numerical feature drift...
Kolmogorov-Smirnov (KS) test for numerical/continuous features. Chi-squared for categorical features.
Chi-squared on numerical features and KS on categorical features are both wrong. This is the highest-frequency statistical trap on the exam.
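For intuition on why KS suits numerical features, the statistic is just the largest gap between two empirical CDFs, which only makes sense for ordered values. A minimal sketch of the statistic alone (no p-value), not the Lakehouse Monitoring implementation; the sample values are made up.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

baseline = [1.0, 2.0, 3.0, 4.0]
same = ks_statistic(baseline, baseline)
shifted = ks_statistic(baseline, [10.0, 11.0, 12.0, 13.0])
partial = ks_statistic(baseline, [3.0, 4.0, 5.0, 6.0])
```

Unordered categories have no meaningful CDF, so the same calculation on category codes would depend on an arbitrary encoding order.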
When a new model outperforms the champion in offline testing and you need to decide how to promote it...
Assign the new model the 'challenger' alias in Unity Catalog. Run champion-challenger comparison in production (shadow or A/B). Only assign the 'champion' alias when the challenger has demonstrated production superiority.
Directly assigning the 'champion' alias without production validation skips the real-world safety check. Offline metrics do not always translate to production performance.
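One way to make "demonstrated superiority" concrete is a gate that must pass before the alias is moved. The sketch below is one possible check, not a Databricks-prescribed method: it assumes paired scores from the same evaluation slices and the same metric, and uses a simple bootstrap interval on the mean difference. `should_promote` and the score values are hypothetical.

```python
import random

def should_promote(champion, challenger, n_boot=2000, seed=0):
    """True if the 2.5th percentile of the bootstrapped mean gain is above zero."""
    if len(champion) != len(challenger):
        raise ValueError("scores must be paired on the same evaluation slices")
    diffs = [new - old for old, new in zip(champion, challenger)]
    rng = random.Random(seed)  # fixed seed so the decision is reproducible
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs) for _ in range(n_boot)
    )
    lower_bound = means[int(0.025 * n_boot)]
    return lower_bound > 0

# Hypothetical per-slice F1 scores on the same held-out data.
champion_f1 = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83]
challenger_f1 = [0.85, 0.86, 0.84, 0.86, 0.85, 0.87]

promote = should_promote(champion_f1, challenger_f1)
tie = should_promote(champion_f1, champion_f1)
```

Only when the gate passes would the pipeline reassign the 'champion' alias, for example with MlflowClient.set_registered_model_alias().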
When asked which Lakehouse Monitoring profile type to use to track model serving endpoint predictions over time...
Inference Log monitor (InferenceLog profile type) — it specifically tracks model inputs, outputs, and performance metrics from a serving endpoint's inference log table. Configure it with problem_type, prediction_col, label_col, and timestamp_col.
Snapshot monitor is wrong — it compares point-in-time states, not temporal model output trends. Time series monitor is closer but doesn't have built-in awareness of model prediction semantics.
When configuring Databricks Asset Bundles for dev/staging/prod with different cluster sizes and catalog names...
Define the full pipeline once in the bundle's top-level (default) configuration. Create three targets (dev, staging, prod) that each specify ONLY cluster_key and catalog_name overrides. The targets inherit everything else from the top-level configuration.
Creating three completely separate YAML files is the anti-pattern — it creates configuration drift and maintenance overhead.
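The inheritance idea can be modeled as a recursive dictionary merge. This is a simplified illustration in Python rather than actual bundle YAML, and real bundle merge rules have more detail than this; the configuration keys and `resolve_target` are hypothetical.

```python
from copy import deepcopy

def resolve_target(base, overrides):
    """Return base with overrides applied; nested dicts are merged key by key."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = resolve_target(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged

# Shared definition, written once.
base_config = {
    "job": {"name": "train_pipeline", "cluster": {"node_type": "small", "num_workers": 2}},
    "variables": {"catalog_name": "dev_catalog"},
}

# Each target lists only what differs.
targets = {
    "dev": {},
    "prod": {
        "job": {"cluster": {"num_workers": 8}},
        "variables": {"catalog_name": "prod_catalog"},
    },
}

resolved = {name: resolve_target(base_config, ov) for name, ov in targets.items()}
```

The prod target changes two values and inherits the job name and node type untouched, so there is only one place where the pipeline definition can drift.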
Last-Minute Facts
Model Deployment
Must-Know Facts
- Blue-green = two full environments running simultaneously. Traffic switches 100% instantly. Rollback is instant by switching back. Costs double during deployment window
- Canary = gradual traffic routing (5% → 25% → 50% → 100%). Each step monitored before proceeding. Slower full rollout but validates with real traffic at low risk
- Traffic splitting percentages must sum to exactly 100% across all served model versions on an endpoint — the exam uses this as a distractor in config questions
- PyFunc model registration in Unity Catalog: set the registry URI to 'databricks-uc' and call mlflow.pyfunc.log_model() with registered_model_name set to the 3-level name (catalog.schema.model_name) for governance and lineage
- Model Serving endpoints are for real-time low-latency predictions only. For batch scoring of large datasets, use spark_udf with batch inference jobs
- Canary deployment does NOT offer instant rollback — shifting traffic back takes time through the same gradual process. Choose blue-green when instant rollback is required
- Shadow deployment: both models receive the same traffic simultaneously, but only the incumbent's predictions are served to users. New model's results are logged for offline comparison — zero user impact
- Model serving REST API requires Bearer token authentication (PAT or service principal). Anonymous access is not supported by default
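- Custom artifacts (lookup tables, preprocessing pipelines, encoders) must be packaged WITH the model, typically through the artifacts argument of mlflow.pyfunc.log_model(). Files logged only to the run with log_artifact() are not bundled into the model, and nothing can be assumed to exist at the serving endpoint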
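The traffic rules in the list above (percentages sum to exactly 100, canary steps of 5/25/50/100) can be checked with a few lines. A minimal sketch with hypothetical route names; `validate_traffic` and `canary_schedule` are illustrative helpers, not part of the Model Serving API.

```python
def validate_traffic(routes):
    """Raise ValueError unless all percentages are non-negative and sum to exactly 100."""
    if any(pct < 0 for pct in routes.values()):
        raise ValueError("traffic percentages must be non-negative")
    total = sum(routes.values())
    if total != 100:
        raise ValueError(f"traffic percentages sum to {total}, expected 100")
    return routes

def canary_schedule(steps=(5, 25, 50, 100)):
    """Build one validated traffic config per canary step."""
    return [validate_traffic({"champion": 100 - s, "challenger": s}) for s in steps]

schedule = canary_schedule()
blue_green_switch = validate_traffic({"blue": 0, "green": 100})
```

A config such as 60/30 fails validation, which is the kind of distractor the exam likes to include.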
Common Traps
Confusing Pairs
Scenario Tips
When a regulated financial application requires deploying a new model with the ability to completely reverse the deployment within seconds if it underperforms...
Blue-green deployment. Maintain the current (blue) environment live while deploying to green. Switch traffic 100% to green. If issues arise, switch 100% back to blue in seconds.
Canary cannot instantly roll back — reversing a canary requires gradually shifting traffic back, which takes time proportional to the rollout steps.
When a team wants to validate a new recommendation model with real production traffic but cannot risk showing worse recommendations to most users...
Canary deployment starting at 5% traffic. Monitor click-through rate, conversion rate, and error rate at each stage before increasing to 25%, 50%, and finally 100%.
Blue-green exposes 100% of users to the new model on switch — higher per-user risk if the model underperforms. Shadow deployment would provide zero real impact but also zero real feedback about user behavior changes.
When a question asks how to score 50 million records overnight using a registered MLflow model...
Use mlflow.pyfunc.spark_udf() to wrap the model as a Spark UDF and apply it to the DataFrame with withColumn(). This distributes inference across the cluster.
Routing 50M records through a real-time serving endpoint via REST API calls is prohibitively expensive, extremely slow, and not how serving endpoints are designed to be used.
When a custom model requires a label encoder (fitted on training data) to be available at prediction time...
Include the encoder in the artifacts dict passed to mlflow.pyfunc.log_model() so it is packaged with the model. The PyFunc load_context() method then loads it from context.artifacts when the model is loaded at the endpoint, and predict() uses it.
Assuming the encoder is accessible from the file system at the serving endpoint is wrong — endpoints run in isolated containers with only what was logged with the model.
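The loading pattern can be sketched without MLflow installed. The class below only mirrors the shape of an mlflow.pyfunc.PythonModel subclass (that inheritance is an assumption for real use), and `fake_context` stands in for the context object MLflow passes, whose artifacts attribute maps artifact names to local file paths. The encoder here is a hypothetical JSON list rather than a fitted scikit-learn object.

```python
import json
import os
import tempfile
from types import SimpleNamespace

class EncodedModel:
    """Shaped like an mlflow.pyfunc.PythonModel subclass (assumption)."""

    def load_context(self, context):
        # Runs once when the model is loaded: read the packaged encoder from its local path.
        with open(context.artifacts["label_encoder"]) as f:
            self.index_to_label = json.load(f)

    def predict(self, context, model_input):
        return [self.index_to_label[i] for i in model_input]

# Simulate the packaged artifact and the context the serving container would provide.
with tempfile.TemporaryDirectory() as tmp:
    encoder_path = os.path.join(tmp, "label_encoder.json")
    with open(encoder_path, "w") as f:
        json.dump(["cat", "dog", "fish"], f)
    fake_context = SimpleNamespace(artifacts={"label_encoder": encoder_path})
    model = EncodedModel()
    model.load_context(fake_context)

# The temp directory is gone now; the model still works because it loaded the encoder up front.
decoded = model.predict(None, [2, 0])
```

The model never reaches outside its own packaged artifacts, so it behaves the same in a notebook and in an isolated serving container.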