General Exam Tips
- 1. Read ALL answer choices before committing. Many traps rely on plausible-looking options that differ by one character or word (e.g., 'runs:/' vs 'runs:').
- 2. This is a Python-first exam: when the question asks 'how would you do X,' the answer is almost always a specific API call, not a conceptual description.
- 3. 45 questions in 90 minutes = 2 minutes per question. Flag and skip questions that require deep calculation, then come back at the end.
- 4. If you see 'imbalanced dataset' in a question, immediately think: accuracy is misleading, so use F1 or recall.
- 5. If you see 'maximize accuracy/AUC' with Hyperopt, the answer involves returning a negative value.
- 6. If you see 'large dataset' or 'millions of records' for inference, the answer is batch inference (spark_udf or mapInPandas), not a model serving endpoint.
- 7. Unity Catalog models use aliases (not stages). If the question mentions UC model registry, think 'aliases' not 'Staging/Production'.
- 8. For time series data: random splits are WRONG. The answer is temporal/rolling cross-validation, which preserves time order.
- 9. No partial credit: every question is binary. Never leave blank answers; always guess when unsure.
Machine Learning Fundamentals
Must-Know Facts
- Estimator.fit() returns a Transformer (the trained model). Transformer.transform() applies learned parameters. An Estimator is NOT a pre-trained model — it is the algorithm class.
- Bagging = parallel training on random subsets → reduces variance. Random Forest is the canonical bagging algorithm.
- Boosting = sequential training, each tree corrects previous errors → reduces bias. XGBoost and GBT use boosting.
- Gradient boosting iterations are inherently sequential — cannot be parallelized across tree iterations (only within single tree construction).
- High bias = underfitting (model too simple). High variance = overfitting (model too complex). Solutions are opposite.
- SQL Warehouses are for SQL queries only — they cannot run Spark ML, scikit-learn, or MLflow. Use a Standard multi-node cluster for ML workloads.
- Single Node clusters run Python/pandas and Spark in local mode on the driver, but they have no workers, so Spark ML training is not distributed across machines.
- Databricks ML Runtime pre-installs MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch. Standard Runtime does NOT include this full set (for example, the deep learning frameworks), so choose ML Runtime for ML workloads.
- Supervised learning requires labeled data. Unsupervised learning (clustering, dimensionality reduction) does NOT require labels.
Scenario Tips
A question asks you to choose the right cluster type for a data science team using Spark ML and MLflow.
Standard multi-node cluster with ML Runtime. Provides distributed training (multiple workers) and pre-installed ML libraries.
SQL Warehouse sounds like a good idea for 'analytics' but it cannot run Python ML code. Single Node works for pandas but not distributed Spark ML.
A model has 95% training accuracy but 60% test accuracy. What is happening and what should be done?
High variance (overfitting). Fix: simplify the model, add regularization, gather more training data, or use dropout (for neural networks).
Do not confuse with high bias (underfitting) where BOTH training and test accuracy are low.
A question asks which ensemble method to use for a problem where the current model underfits the data.
Boosting (XGBoost, GBT) — it reduces bias by sequentially correcting model errors.
Bagging/Random Forest reduces variance, not bias. It helps when you have overfitting, not underfitting.
ML Development and Feature Engineering
Must-Know Facts
- Spark ML estimators expect all features in a single Vector column, so VectorAssembler is almost always required before a Spark ML model: it combines multiple feature columns into a single 'features' Vector column.
- StringIndexer MUST come before OneHotEncoder. OneHotEncoder expects numeric indices as input, not raw string values.
- Imputer is an Estimator: call fit() to learn statistics (mean/median), THEN call transform() on the returned ImputerModel. The Imputer itself has no transform() method, so skipping fit() throws an error.
- Data leakage: never fit a scaler or encoder on the full dataset before splitting. Fit only on training data. Best practice: include scaler inside a Pipeline used with CrossValidator.
- pandas API on Spark (pyspark.pandas) DataFrames ARE distributed: they are backed by Spark DataFrames. Avoid pulling them to the driver (for example with to_pandas()) unless the data is small.
- Pandas UDFs process data in Apache Arrow columnar batches. They are often cited as 10-100x faster than row-at-a-time Python UDFs because of reduced JVM-Python serialization overhead.
- mapInPandas() applies a function to each Spark partition as a pandas DataFrame — the standard pattern for distributed scikit-learn inference.
- AutoML handles feature scaling, encoding, imputation, and hyperparameter tuning automatically. It does NOT perform EDA.
- AutoML generates fully editable notebooks — it is not a black box. The code uses standard MLflow and scikit-learn.
- Feature Store: at inference time, only pass the primary key column — features are automatically retrieved from the feature table.
- Ordinal encoding preserves order (Low/Medium/High → 0/1/2). One-hot encoding treats categories as unordered. Use OHE for nominal, ordinal for ordered categories.
- For skewed/outlier-heavy features, use median imputation. For roughly normal features, use mean imputation.
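The mean-versus-median rule above can be checked with a few numbers. This is a minimal plain-Python sketch, not Spark's Imputer; the `incomes` values and the `impute` helper are hypothetical and exist only for illustration.

```python
from statistics import mean, median

# Hypothetical feature values: four typical incomes and one extreme outlier.
incomes = [30, 32, 35, 31, 400]

mean_fill = mean(incomes)      # pulled far upward by the single outlier
median_fill = median(incomes)  # stays close to the typical values


def impute(values, fill):
    """Replace missing entries (None) with the chosen fill value."""
    return [fill if v is None else v for v in values]


column_with_gap = [30, None, 35]
filled_with_median = impute(column_with_gap, median_fill)
```

With these values the mean lands well above every typical income, while the median stays among them, which is why median imputation is the safer choice for skewed features.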
Scenario Tips
A Spark ML pipeline with StringIndexer → OneHotEncoder → VectorAssembler → RandomForestClassifier is giving an error during fit().
Check that StringIndexer.outputCol matches OneHotEncoder.inputCol exactly, and that OneHotEncoder.outputCol is one of the inputs to VectorAssembler.inputCols.
A common wrong answer is to swap the order of StringIndexer and OneHotEncoder — but they must stay in this exact order (StringIndexer first).
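One way to avoid the column-name mismatch described above is to derive every intermediate column name in a single place. The sketch below assumes Spark 3.x (where StringIndexer and OneHotEncoder accept inputCols/outputCols); the `_idx` and `_ohe` suffixes and both function names are illustrative choices, not part of any Databricks API.

```python
def stage_columns(categorical_cols, numeric_cols):
    """Derive consistent intermediate column names for each pipeline stage."""
    index_cols = [f"{c}_idx" for c in categorical_cols]
    ohe_cols = [f"{c}_ohe" for c in categorical_cols]
    return index_cols, ohe_cols, ohe_cols + list(numeric_cols)


def build_pipeline(categorical_cols, numeric_cols, label_col="label"):
    # Imported lazily so the naming helper works without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

    index_cols, ohe_cols, assembler_inputs = stage_columns(categorical_cols, numeric_cols)
    # StringIndexer output feeds OneHotEncoder input; the order cannot be swapped.
    indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_cols)
    encoder = OneHotEncoder(inputCols=index_cols, outputCols=ohe_cols)
    # Encoded columns plus numeric columns become the single 'features' vector.
    assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
    rf = RandomForestClassifier(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[indexer, encoder, assembler, rf])
```

On a cluster with an active SparkSession, `build_pipeline(["city"], ["age"]).fit(train_df)` would return a fitted PipelineModel, assuming `train_df` has those columns and a `label` column.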
A data scientist wants to apply a trained scikit-learn model to score 50 million records in a Spark DataFrame.
Use mapInPandas() or mlflow.pyfunc.spark_udf(). Both distribute inference across partitions. With mapInPandas(), use an iterator pattern that loads the model once per task (partition) instead of once per batch, to avoid repeated deserialization.
Model Serving endpoint is wrong here — it is for real-time, low-latency single-record requests, not batch scoring of millions of records.
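The iterator pattern mentioned above can be sketched as a small factory. `make_scoring_fn` and `load_model` are hypothetical names; `load_model` stands for any zero-argument callable that returns an object with a `predict` method, for example `lambda: mlflow.sklearn.load_model(model_uri)`.

```python
def make_scoring_fn(load_model, feature_cols):
    """Build a function suitable for DataFrame.mapInPandas.

    load_model: zero-argument callable returning a fitted model (assumption).
    feature_cols: list of column names the model expects.
    """
    def score(batches):
        # Runs once per call, i.e. once per partition, not once per batch.
        model = load_model()
        for pdf in batches:  # each batch arrives as a pandas DataFrame
            out = pdf.copy()
            out["prediction"] = model.predict(pdf[feature_cols])
            yield out
    return score
```

A possible call, assuming a Spark DataFrame `df` with double columns `f1` and `f2`, is `df.mapInPandas(make_scoring_fn(loader, ["f1", "f2"]), schema="f1 double, f2 double, prediction double")`.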
The question asks how to prevent data leakage when scaling features inside cross-validation.
Wrap the StandardScaler and model into a Pipeline, then pass the Pipeline as the estimator to CrossValidator. This ensures fit() is called only on the training folds.
A wrong approach is to fit the scaler on the entire dataset before CV — this makes validation fold statistics available during scaler fitting.
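A minimal sketch of the leakage-safe setup follows, assuming a regression task with LinearRegression as a stand-in model. The function name, the `raw_features` column name, and the regParam grid values are illustrative assumptions.

```python
def build_cross_validator(feature_cols, label_col="label", num_folds=3):
    # Imported lazily; building these objects requires PySpark and an active SparkSession.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="raw_features")
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    lr = LinearRegression(featuresCol="features", labelCol=label_col)

    # The scaler sits INSIDE the pipeline, so each fold refits it on training rows only.
    pipeline = Pipeline(stages=[assembler, scaler, lr])

    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
    evaluator = RegressionEvaluator(labelCol=label_col, metricName="rmse")
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
```

Calling `build_cross_validator(["x1", "x2"]).fit(train_df)` would then scale and train within each fold, so validation rows never influence the scaler statistics.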
A categorical feature has ordered values (Low, Medium, High, Critical). Should you use StringIndexer → OneHotEncoder or just StringIndexer alone?
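StringIndexer alone (ordinal encoding) is sufficient and appropriate for ordered categories. Adding OneHotEncoder would destroy the ordinal relationship. Note that StringIndexer assigns indices by label frequency by default, so confirm that the resulting indices follow Low < Medium < High < Critical, or map the values explicitly.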
OneHotEncoder is not always necessary after StringIndexer — only add it for nominal (unordered) categories.
Model Training and Evaluation
Must-Know Facts
- RegressionEvaluator metrics: 'rmse', 'mae', 'r2', 'mse'. BinaryClassificationEvaluator: 'areaUnderROC' (default), 'areaUnderPR'. MulticlassClassificationEvaluator: 'f1', 'accuracy', 'precisionByLabel', 'recallByLabel'.
- F1 score is a CLASSIFICATION metric only. Never use F1 for regression — if the exam puts F1 as an option for a regression question, it is wrong.
- For imbalanced datasets, accuracy is misleading — a model predicting the majority class always achieves high accuracy. Use F1 (if you need balance) or Recall (if catching all positives matters most).
- CV score = mean of all fold scores. Example: [2.5, 3.1, 2.8, 3.4, 2.7] → (14.5/5) = 2.9.
- CrossValidator trains (combinations × folds) total models. TrainValidationSplit trains (combinations × 1) models. For 9 param combos + 3 folds: CV = 27 models, TVS = 9 models.
- Hyperopt fmin() MINIMIZES by default. Return -accuracy for maximization metrics. If you return raw accuracy, Hyperopt finds the worst model.
- SparkTrials distributes Hyperopt trials across Spark workers. The default Trials() runs trials sequentially on a single machine.
- Hyperopt uses TPE (Tree of Parzen Estimators) — a Bayesian method that learns from past trials. Too much parallelism degrades it toward random search.
- Time series data requires temporal/rolling cross-validation, NOT random splits. Random splits leak future information into training.
- When the target was log-transformed during training, exponentiate predictions BEFORE computing metrics on the original scale.
- TrainValidationSplit uses a single random split (trainRatio parameter, e.g., 0.8). CrossValidator uses k folds.
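The temporal cross-validation fact above can be sketched as an expanding-window splitter. This is a plain-Python illustration that assumes rows are already sorted by timestamp; `rolling_splits` and its parameters are hypothetical names, not a Spark ML or scikit-learn API.

```python
def rolling_splits(n_rows, n_folds, min_train):
    """Yield (train_indices, validation_indices) pairs in time order.

    Each fold trains on every row before its validation window, so no
    future row ever appears in a training set.
    """
    fold_size = (n_rows - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        val_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, val_end))


splits = list(rolling_splits(n_rows=10, n_folds=3, min_train=4))
```

For ten rows this gives training windows of 4, 6, and 8 rows, each validated on the two rows that immediately follow.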
Scenario Tips
A fraud detection model must minimize missed fraud cases, even if some legitimate transactions are flagged. Which metric to optimize?
Recall — it measures how many actual fraud cases are caught. Missing fraud (false negative) is the costly error here.
Accuracy is misleading for imbalanced fraud data. Precision minimizes false positives (wrong for this use case). F1 balances both but is not optimal when Recall is the priority.
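A small numeric illustration of why accuracy misleads here, using hypothetical data: 1,000 transactions of which 10 are fraud, scored by a model that always predicts "legitimate".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that were predicted positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical imbalanced dataset: 10 fraud cases (1) and 990 legitimate (0).
y_true = [1] * 10 + [0] * 990
always_legit = [0] * 1000  # a useless model that never flags fraud

acc = accuracy(y_true, always_legit)
rec = recall(y_true, always_legit)
```

The useless model scores 99% accuracy while catching none of the fraud, so recall is the metric that exposes it.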
A Hyperopt objective returns accuracy directly (not negated). What will happen?
Hyperopt will converge on the model with the LOWEST accuracy. The fix is to return -accuracy (negative value). Hyperopt minimizes.
Returning {'loss': accuracy} is also wrong — loss is minimized, so you still need -accuracy.
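The negation rule can be sketched as follows. `toy_accuracy` is a hypothetical stand-in for training and scoring a real model, and the search space is an arbitrary example; the literal "ok" is the value of hyperopt.STATUS_OK.

```python
def toy_accuracy(params):
    # Hypothetical stand-in for model training: accuracy peaks at max_depth = 6.
    return 1.0 - abs(params["max_depth"] - 6) / 10


def objective(params):
    # Hyperopt minimizes 'loss', so a metric to maximize must be negated.
    return {"loss": -toy_accuracy(params), "status": "ok"}


def run_search(max_evals=20):
    # Imported lazily; requires the hyperopt package.
    from hyperopt import fmin, hp, tpe

    space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}
    return fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=max_evals)


# Plain-Python check of the same idea, without hyperopt:
candidates = [{"max_depth": d} for d in range(2, 11)]
best = min(candidates, key=lambda p: objective(p)["loss"])
picked_without_negation = min(candidates, key=toy_accuracy)
```

Minimizing the negated value selects the most accurate candidate, while minimizing the raw accuracy selects one of the least accurate.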
You're tuning 3 hyperparameters with 3 values each using CrossValidator with 5 folds. How many models will be trained?
27 combinations × 5 folds = 135 total models. TrainValidationSplit with the same grid: 27 × 1 = 27 models.
A common mistake is forgetting to multiply by the number of folds. The total is combinations × folds, not just combinations.
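The counting rule, together with the fold-averaging rule from the facts above, reduces to two short calculations. The helper name `models_trained` is hypothetical, and the count covers only the models trained during the search, as in the text.

```python
from math import prod


def models_trained(values_per_param, folds=1):
    """Grid combinations times folds; folds=1 corresponds to TrainValidationSplit."""
    return prod(values_per_param) * folds


cv_models = models_trained([3, 3, 3], folds=5)  # 3 hyperparameters, 3 values each, 5 folds
tvs_models = models_trained([3, 3, 3])          # same grid, single split

# CV score is the mean of the per-fold scores.
fold_scores = [2.5, 3.1, 2.8, 3.4, 2.7]
cv_score = sum(fold_scores) / len(fold_scores)
```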
A model was trained with log-transformed target (price → log_price). At inference time, what is the correct process to get RMSE on the original price scale?
Apply model.predict() to get log_price predictions, then np.exp() to convert back to price, then compute RMSE on original price vs. predicted price.
Computing RMSE on the log-scale and then exponentiating the RMSE is wrong — you must inverse-transform predictions BEFORE computing the metric.
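The difference between the two orders of operations shows up clearly with numbers. The prices below are hypothetical, and the "model" is simulated by hand-picked log-scale predictions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual_price = [100.0, 200.0, 400.0]
# Simulated model output on the log scale (equivalent to prices 110, 190, 380).
predicted_log_price = [math.log(110.0), math.log(190.0), math.log(380.0)]

# Correct: inverse-transform the predictions first, then compute the metric.
predicted_price = [math.exp(v) for v in predicted_log_price]
rmse_original_scale = rmse(actual_price, predicted_price)

# Wrong: compute RMSE on the log scale, then exponentiate the metric.
rmse_log_scale = rmse([math.log(a) for a in actual_price], predicted_log_price)
wrong_value = math.exp(rmse_log_scale)
```

The correct value is in price units (about 14 here), while the exponentiated log-scale RMSE is a ratio close to 1 and is not an error in price units at all.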
Model Deployment and Management
Must-Know Facts
- mlflow.register_model() URI format: 'runs:/{run_id}/{artifact_path}'. The forward slash after 'runs:' is REQUIRED. 'runs:{run_id}/...' (no slash) is wrong.
- Loading from Model Registry: 'models:/model_name/Production' or 'models:/model_name/1' (by version). This uses 'models:/' prefix, NOT 'runs:/'.
- Workspace Model Registry stage flow: None → Staging → Production → Archived. Each model VERSION has its own stage.
- Unity Catalog Model Registry does NOT use stages. Instead use model ALIASES (e.g., 'champion', 'challenger'). Stages are deprecated in MLflow 2.9+.
- Batch inference pattern: mlflow.pyfunc.spark_udf(spark, model_uri) creates a Spark UDF that distributes scoring across all cluster workers.
- mapInPandas() with iterator-based loading loads the model ONCE per task (partition) and reuses it across that partition's batches, which avoids repeated model deserialization overhead.
- Feature Store inference: the model must be logged WITH Feature Store feature lookup specifications. Only then does fs.score_batch() auto-retrieve features by primary key.
- For Unity Catalog feature tables, use FeatureEngineeringClient (from databricks-feature-engineering). FeatureStoreClient is the legacy client for Workspace Feature Store. Import: from databricks.feature_engineering import FeatureEngineeringClient.
- Model Serving endpoints are for real-time, low-latency single-record requests — NOT for batch scoring millions of records.
- Stage transitions are performed on the model VERSION details page in the UI, not the model overview page.
- PSI (Population Stability Index) measures input data distribution shifts. K-S test compares feature distributions. Jensen-Shannon divergence quantifies distributional difference.
Scenario Tips
A question gives you best_run_id='abc123' and model artifact path 'classifier'. What is the exact register_model call?
mlflow.register_model('runs:/abc123/classifier', 'model_name'). The URI is 'runs:/' (with slash) then run_id then artifact path.
Option 'runs:abc123/classifier' (no slash after colon) and 'runs://abc123/classifier' (double slash) are both wrong. The exam tests this exact format.
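Building the URI in one helper makes the single-slash format hard to get wrong. `runs_uri` and `register_best_model` are hypothetical helper names; `mlflow.register_model(model_uri, name)` is the call the text refers to.

```python
def runs_uri(run_id, artifact_path):
    """Return 'runs:/<run_id>/<artifact_path>' with exactly one slash after 'runs:'."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(run_id, artifact_path, model_name):
    # Imported lazily; requires mlflow and a reachable tracking server.
    import mlflow

    return mlflow.register_model(runs_uri(run_id, artifact_path), model_name)


uri = runs_uri("abc123", "classifier")
```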
A company uses Unity Catalog. They want to mark a model version as their production model. How?
Set a model alias using client.set_registered_model_alias(name, alias='champion', version=3). Load it via 'models:/catalog.schema.model@champion'.
client.transition_model_version_stage() to 'Production' is wrong for UC — stages are not supported in Unity Catalog.
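A minimal sketch of the alias workflow, assuming MLflow 2.x with the registry pointed at Unity Catalog. The function names and the three-level model name used in the example are illustrative.

```python
def alias_uri(model_name, alias):
    """Return the 'models:/<name>@<alias>' URI used to load an aliased version."""
    return f"models:/{model_name}@{alias}"


def promote(model_name, version, alias="champion"):
    # Imported lazily; requires mlflow and access to a Unity Catalog registry.
    import mlflow
    from mlflow import MlflowClient

    mlflow.set_registry_uri("databricks-uc")
    client = MlflowClient()
    # Aliases replace stages: point the alias at the chosen version.
    client.set_registered_model_alias(model_name, alias, version)
    return alias_uri(model_name, alias)


champion_uri = alias_uri("catalog.schema.model", "champion")
```

The returned URI could then be passed to `mlflow.pyfunc.load_model` to load whichever version currently holds the alias.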
A question asks how to score 10 million records using a scikit-learn model stored in MLflow, distributed across a Spark cluster.
Use mapInPandas() with an iterator pattern that loads the model once per task (partition), OR use mlflow.pyfunc.spark_udf() to create a Spark UDF from the model.
Calling collect() to bring all 10M rows to the driver and scoring locally defeats distributed computing and will likely OOM.
Model performance starts degrading one month after deployment. Feature distributions look normal. What type of drift is this, and how do you monitor for it?
Concept drift — the underlying relationship between features and target has changed. Monitor by tracking model output/prediction accuracy over time, not input feature distributions.
PSI and K-S test measure data drift (input distribution changes). For concept drift, you need ground truth labels to compare against predictions.
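For reference, PSI on pre-binned counts is a short formula: the sum over bins of (actual% - expected%) * ln(actual% / expected%). The sketch below is plain Python with hypothetical bin counts; the `eps` floor for empty bins is an implementation assumption. It detects input (data) drift only and would stay near zero under pure concept drift.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over counts that share the same bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # eps avoids log(0) for empty bins
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


training_bins = [25, 25, 25, 25]   # feature distribution at training time
unchanged_bins = [50, 50, 50, 50]  # same shape, different volume
shifted_bins = [10, 20, 30, 40]    # distribution has moved

psi_unchanged = psi(training_bins, unchanged_bins)
psi_shifted = psi(training_bins, shifted_bins)
```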
A question asks how to perform continuous/streaming inference on arriving data using an MLflow model.
Apply the model as a UDF on a Spark Structured Streaming DataFrame — treat the streaming DataFrame exactly like a batch DataFrame. Call predict_udf on the stream and write results with writeStream. Delta Live Tables can also orchestrate streaming inference pipelines.
Model Serving endpoints are for real-time per-request calls, not continuous stream processing. Batch inference with spark_udf still applies to streaming DataFrames — the API is the same.
ML Operations (MLOps)
Must-Know Facts
- mlflow.autolog() automatically logs parameters, metrics, and model artifacts for supported frameworks (scikit-learn, XGBoost, LightGBM, Spark ML). No explicit log_param/log_metric calls needed.
- Autologging does NOT create nested run hierarchies. For parent-child relationships (e.g., grid search parent with trial children), you must explicitly set nested=True.
- mlflow.start_run(nested=True) creates a child run inside an active parent run context. Without nested=True, starting a run inside another run raises an error.
- mlflow.search_runs() returns a pandas DataFrame of runs that you can filter (filter_string) and sort (order_by) by any logged metric or parameter. It is the programmatic way to find the best run.
- Source tracking: MLflow records the notebook or script that created each run. Accessible via the 'Source' link on the run details page in the UI.
- Databricks Jobs matrix view shows each task's status and logs individually — use this to pinpoint which task failed in a multi-task job.
- Git Folders (Repos): create a feature branch for ML experiments to avoid breaking the production pipeline on main branch.
- Delta Lake time travel enables reproducing training datasets: VERSION AS OF 42 or TIMESTAMP AS OF '2025-01-01'.
- Model monitoring with Lakehouse Monitoring: track input distribution drift, prediction drift, and data quality over time using Delta tables.
- To find the best run programmatically: mlflow.search_runs(filter_string='metrics.rmse < 2.0', order_by=['metrics.rmse ASC']).
Scenario Tips
A data scientist wants one MLflow run per hyperparameter search with individual child runs for each combination. How?
Create an outer mlflow.start_run() context for the parent, then inside it create each trial with mlflow.start_run(nested=True). Autologging alone will not create this hierarchy.
Calling mlflow.autolog() before the loop does NOT create parent-child relationships — it just logs each run flat.
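The parent-child structure can be sketched as below. `train_and_score` is a hypothetical callable that trains a model for one parameter combination and returns its validation RMSE; the run name and metric name are also illustrative.

```python
def param_grid(max_depths, learning_rates):
    """Expand two value lists into a list of parameter dictionaries."""
    return [
        {"max_depth": d, "learning_rate": lr}
        for d in max_depths
        for lr in learning_rates
    ]


def run_search(train_and_score, grid):
    # Imported lazily; requires mlflow.
    import mlflow

    with mlflow.start_run(run_name="grid_search") as parent:  # one parent run
        for params in grid:
            with mlflow.start_run(nested=True):  # one child run per combination
                mlflow.log_params(params)
                mlflow.log_metric("val_rmse", train_and_score(params))
    return parent.info.run_id


grid = param_grid([3, 5], [0.1, 0.3])
```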
A team wants to find the run with the lowest validation RMSE without manually checking the UI.
Use mlflow.search_runs(experiment_ids=['123'], order_by=['metrics.val_rmse ASC'], max_results=1). This returns a pandas DataFrame with the best run on top.
Iterating through mlflow.get_experiment() and calling get_run() for each is inefficient. search_runs() is the purpose-built API.
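A small wrapper around search_runs might look like the following. The helper names are hypothetical, and the metric name `val_rmse` is assumed to match whatever was logged during training.

```python
def order_clause(metric, lower_is_better=True):
    """Build an order_by entry such as 'metrics.val_rmse ASC'."""
    direction = "ASC" if lower_is_better else "DESC"
    return f"metrics.{metric} {direction}"


def best_run_id(experiment_id, metric="val_rmse", lower_is_better=True):
    # Imported lazily; requires mlflow and access to the tracking server.
    import mlflow

    runs = mlflow.search_runs(
        experiment_ids=[experiment_id],
        order_by=[order_clause(metric, lower_is_better)],
        max_results=1,
    )
    # search_runs returns a pandas DataFrame; it is empty if no runs matched.
    return None if runs.empty else runs.iloc[0]["run_id"]
```

For a metric where higher is better, such as accuracy, pass `lower_is_better=False` so the best run still sorts to the top.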
A complex multi-task Databricks Job starts failing intermittently. A team member wants to rerun the entire job to diagnose it.
Open the specific failed job run, use the matrix view to identify which task failed. Click on that task to view its logs. Rerun only the failed task once diagnosed.
Rerunning all tasks wastes compute and obscures which specific task is the root cause.
A model needs to be reproduced exactly 3 months later, including using the exact same training data. What enables this?
Delta Lake time travel — VERSION AS OF N or TIMESTAMP AS OF '...' queries the exact snapshot of the Delta table used for training. Combined with MLflow run metadata (which version was used), enables full reproducibility.
MLflow alone records the model and parameters but not the data. You need Delta Lake for dataset versioning.
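One way to tie the two together is to log the Delta table version as a run parameter when loading the training data. The helper names, the parameter keys, and the example table name are assumptions for illustration; `spark` is assumed to be an active SparkSession.

```python
def snapshot_query(table, version=None, timestamp=None):
    """Build a Delta time travel query for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


def load_training_snapshot(spark, table, version):
    # Imported lazily; requires mlflow and should be called inside an active run.
    import mlflow

    df = spark.sql(snapshot_query(table, version=version))
    # Record which snapshot was used so the run can be reproduced later.
    mlflow.log_param("training_table", table)
    mlflow.log_param("training_table_version", version)
    return df


query = snapshot_query("ml.sales", version=42)
```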