You Can Pass This Exam For Free
Choose Your Study Path
No prior Spark ML or MLflow experience. You may have basic Python and ML theory knowledge. You'll build foundational knowledge from scratch over 3 weeks.
Exam Overview
Format
45 questions, 90 minutes. Multiple choice (single select and multiple select). Python-first — questions prefer PySpark and MLflow Python syntax, with some conceptual ML theory questions.
Scoring
Pass/fail based on percentage score. Passing: 70%. No penalty for wrong answers — always guess if unsure.
Domains & Weights
- Machine Learning Fundamentals: 18%
- ML Development and Feature Engineering: 27%
- Model Training and Evaluation: 22%
- Model Deployment and Management: 18%
- ML Operations (MLOps): 15%
Registration
$200 USD per attempt. Available through Kryterion testing centers or as an online proctored exam. Schedule at databricks.com/certification.
Topic Priority Table
Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.
Machine Learning Fundamentals
This domain covers ~18% of the exam and tests foundational ML concepts: supervised vs unsupervised learning, classification vs regression, ensemble methods (bagging vs boosting), bias-variance tradeoff, overfitting, and Databricks platform components for ML (ML Runtime, clusters, notebooks).
Key Topics
Must-Know Concepts
- Supervised learning: labeled data, includes classification (predicting categories) and regression (predicting continuous values)
- Unsupervised learning: unlabeled data, includes clustering (grouping similar items) and dimensionality reduction
- Spark ML Estimator: implements fit() and produces a Transformer. Transformer: implements transform() to apply learned parameters
- Bagging (Random Forest): trains models in parallel on random data subsets, reduces variance. Boosting (XGBoost, GBT): trains models sequentially, each correcting previous errors, reduces bias
- Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex)
- Databricks ML Runtime: pre-installed ML libraries (MLflow, scikit-learn, XGBoost, TensorFlow, PyTorch) — eliminates manual installation
- Standard multi-node cluster for distributed ML workloads; SQL Warehouses cannot run Spark ML
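The Estimator/Transformer contract described above can be illustrated with a plain-Python analogy. This is a minimal sketch, not Spark's own classes: `StandardScalerEstimator` and `StandardScalerModel` are hypothetical names, and the data is made up. It only shows the shape of the pattern, where fit() learns parameters and returns an object whose transform() applies them.

```python
from statistics import mean, pstdev


class StandardScalerModel:
    """Transformer: applies parameters that were learned during fit()."""

    def __init__(self, mean_, std_):
        self.mean_ = mean_
        self.std_ = std_

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


class StandardScalerEstimator:
    """Estimator: fit() learns parameters and returns a Transformer."""

    def fit(self, values):
        return StandardScalerModel(mean(values), pstdev(values))


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

model = StandardScalerEstimator().fit(train)  # statistics come from train only
scaled_train = model.transform(train)
scaled_test = model.transform(test)           # reuses the train statistics
```

Because the test values are scaled with statistics learned from the training values only, the same pattern also avoids leaking test information into preprocessing.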
Common Exam Traps
ML Development and Feature Engineering
The largest domain at ~27%. Tests feature engineering techniques (encoding, scaling, imputation), Spark ML transformers, data preparation patterns, pandas API on Spark, Pandas UDFs, Feature Store, AutoML, and data leakage prevention. Expect code-heavy questions on VectorAssembler, StringIndexer, OneHotEncoder, and pipeline construction.
Key Topics
Must-Know Concepts
- VectorAssembler: combines multiple feature columns into a single Vector column — REQUIRED for all Spark ML models
- StringIndexer → OneHotEncoder chain: strings must be indexed first (StringIndexer), then encoded (OneHotEncoder). Direct string input to OneHotEncoder fails
- Imputer: fit() learns statistics (mean/median), transform() applies them. Calling transform() without fit() fails
- Data leakage prevention: fit scaler/encoder on training data only, then transform both train and test. Use Pipeline inside CrossValidator to ensure this
- Feature Store: centralized feature repository. Features are looked up by primary key during inference — only pass keys, not all features
- Pandas API on Spark (pyspark.pandas): pandas-compatible API on distributed Spark. Minimal code changes from pandas — just change the import
- Pandas UDFs (Vectorized UDFs): process data in Arrow batches, which can be far faster than row-at-a-time Python UDFs (speedups of 10-100x are commonly cited, depending on the workload)
- AutoML: handles feature scaling, encoding, tuning automatically. Generates editable notebooks. EDA must be done separately
- One-hot encoding should be deferred to model pipelines (not feature store) because different models need different encoding strategies
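The two-step StringIndexer → OneHotEncoder chain from the list above can be mimicked in plain Python to show why indexing has to come first. This is a sketch, not the Spark API: `fit_string_indexer` and `one_hot` are hypothetical helpers. The frequency-descending ordering and the dropped last category are modeled on what Spark's defaults are understood to be, so treat those details as assumptions and confirm them in the API documentation.

```python
from collections import Counter


def fit_string_indexer(values):
    """Learn a label -> index mapping, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an integer index as a 0/1 list, optionally dropping the last slot."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0] * size
    if index < size:
        vector[index] = 1
    return vector


colors = ["red", "blue", "red", "green", "red", "blue"]

mapping = fit_string_indexer(colors)                   # step 1: strings -> indices
indices = [mapping[c] for c in colors]
encoded = [one_hot(i, len(mapping)) for i in indices]  # step 2: indices -> vectors
```

Note that `one_hot` only accepts integer indices, which mirrors the point that the encoder cannot consume raw strings directly.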
Common Exam Traps
Model Training and Evaluation
Covers ~22% of the exam. Tests model training, evaluation metrics, hyperparameter tuning, cross-validation, ensemble methods, and distributed training with Spark ML. Expect questions on choosing the right metric, computing cross-validation scores, and configuring Hyperopt for distributed tuning.
Key Topics
Must-Know Concepts
- Regression metrics: RMSE (penalizes large errors), MAE (less sensitive to outliers than RMSE), R² (proportion of variance explained). Use RegressionEvaluator
- Classification metrics: Accuracy (overall correct), Precision (of predicted positives, how many are correct), Recall (of actual positives, how many are found), F1 (harmonic mean of precision and recall)
- For imbalanced datasets: accuracy is misleading, so use F1 score or recall. A model that always predicts the majority class gets high accuracy but misses the minority class entirely
- Cross-validation score = mean of fold scores. Example: 5 fold scores [2.5, 3.1, 2.8, 3.4, 2.7] → CV score = 2.9
- Grid search total models = (hyperparameter combinations) × (CV folds). Example: 9 combinations × 3 folds = 27 models
- Hyperopt uses TPE (Bayesian): learns from past trials. SparkTrials distributes tuning across Spark workers
- When target is log-transformed: exponentiate predictions BEFORE computing metrics on original scale
- DataFrame.randomSplit() for train/test splitting in Spark
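The arithmetic behind several of the bullets above can be checked in plain Python. This is a sketch with made-up data: the hyperparameter names in `param_grid`, the 95/5 class split, and the price values are all illustrative assumptions, not exam content.

```python
import math
from itertools import product
from statistics import mean

# Cross-validation score: the mean of the per-fold scores.
fold_scores = [2.5, 3.1, 2.8, 3.4, 2.7]
cv_score = mean(fold_scores)

# Grid search: every hyperparameter combination is trained once per fold.
param_grid = {"max_depth": [3, 5, 7], "num_trees": [50, 100, 200]}  # hypothetical
combinations = list(product(*param_grid.values()))
num_folds = 3
models_trained = len(combinations) * num_folds

# Imbalanced data: 95 negatives, 5 positives, model always predicts negative.
actual = [0] * 95 + [1] * 5
predicted = [0] * 100
pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Log-transformed target: exponentiate predictions before scoring.
actual_price = [100.0, 200.0]
predicted_log_price = [math.log(110.0), math.log(180.0)]
predicted_price = [math.exp(p) for p in predicted_log_price]
rmse = math.sqrt(mean((a - p) ** 2 for a, p in zip(actual_price, predicted_price)))
```

Here the always-negative model reaches 95% accuracy while its recall and F1 are both zero, which is the reason accuracy alone is a poor choice on imbalanced data.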
Common Exam Traps
Model Deployment and Management
Covers ~18% of the exam. Tests model registration, stage transitions in MLflow Model Registry, batch inference patterns, real-time serving, Feature Store integration for inference, distributed inference with Pandas UDFs, and data pipeline considerations for deployment.
Key Topics
Must-Know Concepts
- Register a model: mlflow.register_model(f'runs:/{run_id}/artifact_path', 'model_name') — note the 'runs:/' prefix with forward slash
- Stage transitions: None → Staging → Production → Archived. Performed on the model version details page in the UI or via API
- Batch inference: mlflow.pyfunc.spark_udf(spark, model_uri) creates a Spark UDF for distributed scoring across partitions
- mapInPandas(): applies a function that receives an iterator of pandas DataFrames (batches drawn from each Spark partition) and yields pandas DataFrames. Ideal for applying scikit-learn models to distributed data
- Feature Store inference: register model with Feature Store feature sets, then only pass primary key at inference time — features are auto-retrieved
- Iterator-based inference: model loaded once per executor and reused across batches — avoids repeated model loading overhead
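The iterator-based inference pattern in the last bullet can be sketched in plain Python. This is an analogy under stated assumptions: `load_model` and `predict_batches` are hypothetical names, the "model" is a toy function, and batches are plain lists, where a function passed to mapInPandas or an iterator-style Pandas UDF would receive pandas objects instead.

```python
from typing import Iterator, List

load_count = 0  # tracks how often the (hypothetical) model is loaded


def load_model():
    """Stand-in for an expensive model load, e.g. from a model registry."""
    global load_count
    load_count += 1
    return lambda features: sum(features)  # toy "model": sums the features


def predict_batches(batches: Iterator[List[List[float]]]) -> Iterator[List[float]]:
    """Load the model once, then reuse it for every batch in the iterator."""
    model = load_model()
    for batch in batches:
        yield [model(row) for row in batch]


batches = iter([[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]])
predictions = list(predict_batches(batches))
```

The load happens before the loop, so its cost is paid once per call of the function rather than once per batch.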
Common Exam Traps
ML Operations (MLOps)
Covers ~15% of the exam. Tests MLflow experiment tracking, autologging, reproducibility, run comparison, model monitoring, job orchestration for ML pipelines, and debugging failed ML jobs. Expect questions on mlflow.search_runs(), nested runs, notebook source tracking, and job diagnostics.
Key Topics
Must-Know Concepts
- MLflow autologging: mlflow.autolog() automatically logs parameters, metrics, and models for supported frameworks (scikit-learn, XGBoost, Spark ML)
- Nested runs: mlflow.start_run(nested=True) creates parent-child hierarchies — parent for the experiment, children for individual trials
- mlflow.search_runs(): programmatically filter and sort runs by metrics to find the best model without using the UI
- Source tracking: MLflow automatically records which notebook/script produced each run — accessible via the 'Source' link on run details page
- Databricks Jobs matrix view: diagnose multi-task job failures by viewing status and logs for each task individually
- Git Folders (Repos): create branches for safe ML experimentation without affecting production pipelines
Common Exam Traps
Key ML Concepts Compared
These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.
Top Mistakes to Avoid
Exam-Ready Checklist
Recommended Resources
Free & Official Resources
Paid Courses & Practice Exams
These are recommended if you prefer a structured learning path. They can save time but are not required to pass.