You Can Pass This Exam For Free
Choose Your Study Path
You have data analysis experience (SQL, Excel, basic statistics) but limited machine learning or advanced math background. You need to build up mathematical foundations and ML skills.
Exam Overview
Format
Up to 90 questions, 165 minutes. Multiple-choice and performance-based questions (PBQs).
Scoring
Pass/fail only. No scaled score or numeric passing threshold is published.
Domains & Weights
- Mathematics and Statistics: 17%
- Modeling, Analysis, and Outcomes: 24%
- Machine Learning: 24%
- Operations and Processes: 22%
- Specialized Applications of Data Science: 13%
Registration
$544 USD. Available at Pearson VUE testing centers or online with remote proctoring from home.
Topic Priority Table
Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.
Mathematics and Statistics
This domain tests your ability to apply mathematical and statistical methods to data science problems. It covers statistical tests, probability distributions, linear algebra, calculus fundamentals, and temporal models including time series analysis and causal inference. Although it carries only 17% of the exam weight, the mathematical foundations here underpin every other domain on the exam.
Key Topics
Must-Know Concepts
- Statistical tests and when to use each: t-tests (comparing means), chi-squared (categorical independence), ANOVA (comparing multiple group means), and their assumptions
- Confusion matrix metrics: accuracy, precision, recall, F1 score, MCC — know how to calculate each from TP, FP, TN, FN
- ROC curves and AUC: how to interpret the curve, what AUC values mean (0.5 = random guessing, 1.0 = perfect separation), and their role in model evaluation
- Regression performance metrics: R-squared, Adjusted R-squared, RMSE, and F-statistic — know what each measures and when to use it
- Probability distributions: normal, uniform, Poisson, t, binomial, power law — know their shapes, parameters, and typical use cases
- Distribution characteristics: skewness (asymmetry), kurtosis (tail heaviness), heteroskedasticity (non-constant variance) — know implications for model assumptions
- Type I error (false positive, rejecting true null) vs Type II error (false negative, failing to reject false null) and their relationship to significance level
- Linear algebra essentials: matrix operations (multiplication, transposition, inversion, decomposition), eigenvalues/eigenvectors, rank, and span
- Distance metrics: Euclidean (straight line), Manhattan (grid), cosine (angle between vectors) — know when each is appropriate
- Time series models: AR, MA, ARIMA — understand stationarity requirements and model selection
- Causal inference: difference between correlation and causation, DAGs, A/B testing, difference-in-differences, and RCTs
- Model selection criteria: AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) — lower values indicate better model fit with penalty for complexity
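The confusion matrix bullet above asks you to calculate each metric from TP, FP, TN, and FN. The following is a minimal sketch of those formulas in plain Python. The function name `confusion_metrics` and the example counts are illustrative assumptions, not exam material, and returning 0.0 for an undefined ratio is one convention among several.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts.

    Hypothetical helper for study purposes; ratios with a zero
    denominator are reported as 0.0 (an assumed convention).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much was correct?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC: correlation between predicted and actual labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Working one example by hand and checking it against a helper like this is a quick way to confirm you have the numerators and denominators the right way round.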
Common Exam Traps
Modeling, Analysis, and Outcomes
This domain covers the full modeling workflow from exploratory data analysis through results communication. You must demonstrate mastery of EDA techniques, data issue identification and resolution, feature engineering, model iteration, and presenting findings to stakeholders. It is one of the two heaviest domains at 24% and emphasizes practical, scenario-based application of data analysis skills.
Key Topics
Must-Know Concepts
- Exploratory Data Analysis: univariate analysis (single variable distributions) and multivariate analysis (relationships between variables) — know when and why to use each
- Visualization types and when to use each: bar plot (categorical comparisons), scatter plot (two continuous variables), box plot (distribution summary), violin plot (distribution shape), heat map (correlation matrices), line plot (trends over time)
- Feature types: categorical, discrete, continuous, ordinal, nominal, binary — must correctly identify each and know appropriate analysis methods for each type
- Data issues and solutions: sparse data/matrices, non-linearity, non-stationarity, multicollinearity, seasonality, granularity misalignment, insufficient features, multivariate outliers
- Feature engineering techniques: one-hot encoding (categorical to binary columns), label encoding (categorical to integers), normalization, binning, log/exponential transformation, Box-Cox transformation, ratio creation
- Handling multicollinearity: detection using VIF (Variance Inflation Factor), resolution through feature removal, PCA, or regularization
- Model design iteration: defining constraints (time, resources, hardware, cost), hyperparameter tuning, experiment tracking, diagnostic plots for architecture decisions
- Results communication: benchmarking against baselines, aligning with business requirements, accessibility in charts (font size, color choice, content tagging), documentation best practices
- Data enrichment: incorporating external data sources, synthetic data generation, and data augmentation techniques
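To make the feature engineering bullet above concrete, here is a small sketch of label encoding, one-hot encoding, and min-max normalization in plain Python. The function names and the sample values are assumptions made for illustration. In practice a library such as pandas or scikit-learn would usually handle this.

```python
def label_encode(values):
    """Map each category to an integer (alphabetical order is an assumption)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Turn each category into its own 0/1 column."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


def min_max_normalize(numbers):
    """Rescale numbers to the [0, 1] range; a constant column maps to 0.0."""
    low, high = min(numbers), max(numbers)
    if high == low:
        return [0.0 for _ in numbers]
    return [(x - low) / (high - low) for x in numbers]


# Illustrative data only.
colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
rows, columns = one_hot_encode(colors)
print(mapping)   # integer code per category
print(columns)   # column order of the one-hot matrix
print(rows)
print(min_max_normalize([10, 20, 30]))
```

Note the design difference: label encoding implies an ordering between categories, which suits ordinal features, while one-hot encoding avoids that implied order at the cost of more columns.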
Common Exam Traps
Machine Learning
This domain covers the full spectrum of machine learning from foundational concepts through deep learning. You must understand supervised learning (regression and classification), tree-based methods, unsupervised learning, and deep learning architectures. At 24%, this domain shares the top weight with Modeling and requires both theoretical understanding and practical application knowledge.
Key Topics
Must-Know Concepts
- Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex). Goal is to minimize total prediction error by balancing both
- Feature selection methods: importance metrics, VIF for multicollinearity, and model-based selection. Know when to reduce features vs engineer new ones
- Class imbalance handling: oversampling (SMOTE), undersampling, stratified sampling — know the tradeoffs of each approach
- Regularization types: L1/LASSO (feature selection), L2/Ridge (coefficient shrinkage), Elastic Net (combined), dropout (neural networks), early stopping, batch normalization
- Supervised statistical methods: linear regression (OLS, Ridge, LASSO, Elastic Net), logistic (logit) and probit regression, discriminant analysis (LDA/QDA), Naive Bayes, association rules (confidence, lift, support)
- Tree-based methods: decision trees, random forests (bagging), gradient boosting, XGBoost — know the algorithm differences and when each excels
- Deep learning architecture: perceptron, multilayer perceptron, activation functions (ReLU, Sigmoid, Tanh, Softmax), backpropagation, layer types (input, hidden, pooling, output)
- Deep learning models: CNN (images), RNN (sequences), LSTM (long sequences), GANs (generation), autoencoders (compression), transformers (attention-based)
- Optimizers: SGD (including its mini-batch and momentum variants), RMSprop, Adam; know their characteristics and when to use each
- Unsupervised methods: k-means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP, SVD — know method selection criteria
- Data leakage: information from outside the training dataset improperly influencing the model. Common when feature engineering or preprocessing is fit on the full dataset before splitting, or outside the cross-validation folds
- Hyperparameter tuning: grid search (exhaustive) vs random search (sampled) — know efficiency tradeoffs
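The grid search versus random search tradeoff from the last bullet can be sketched in a few lines. Everything here is hypothetical: the `search_space` values, the `score` function (a stand-in for a real cross-validated score), and the function names are assumptions chosen only to show how the two strategies differ in the number of candidates they evaluate.

```python
import itertools
import math
import random

# Hypothetical search space: 3 x 3 = 9 combinations.
search_space = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [2, 4, 8],
}


def score(params):
    """Toy stand-in for a validation score; best at learning_rate=0.1, max_depth=4."""
    lr_penalty = abs(math.log10(params["learning_rate"]) + 1)
    depth_penalty = abs(params["max_depth"] - 4) / 4
    return -(lr_penalty + depth_penalty)


def grid_search(space, score_fn):
    """Evaluate every combination in the space (exhaustive)."""
    names = list(space)
    candidates = [
        dict(zip(names, combo))
        for combo in itertools.product(*(space[name] for name in names))
    ]
    return max(candidates, key=score_fn), len(candidates)


def random_search(space, score_fn, n_iter, seed=0):
    """Evaluate only n_iter randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score_fn), len(candidates)


best_grid, n_grid = grid_search(search_space, score)
best_random, n_random = random_search(search_space, score, n_iter=4)
print(f"grid:   {n_grid} evaluations, best={best_grid}")
print(f"random: {n_random} evaluations, best={best_random}")
```

Grid search is guaranteed to find the best point on the grid but its cost grows multiplicatively with each added hyperparameter. Random search caps the cost at `n_iter` evaluations and may miss the best combination.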
Common Exam Traps
Operations and Processes
This domain covers the operational side of data science: from business requirements gathering through data acquisition, infrastructure, wrangling, lifecycle management, and MLOps deployment. At 22%, it tests your ability to translate business needs into technical solutions and maintain production data science systems. Expect scenario questions about data pipelines, deployment strategies, and operational best practices.
Key Topics
Must-Know Concepts
- Compliance and security: PII identification and protection, proprietary data handling, anonymization techniques, obfuscation methods — these appear throughout the domain
- Business translation: establishing measures, metrics, and KPIs; requirements gathering with cost-benefit analysis; translating business needs into data science solutions
- Data acquisition sources: surveys, administrative data, sensor data, transactional data, experimental data, synthetic data (costs, benefits, limitations), commercial/public data (licensing, restrictions)
- Data infrastructure: resource sizing, GPU/TPU considerations, data formats (CSV, JSON, Parquet, compressed), storage types (structured, semi-structured, unstructured), streaming vs batching
- Data pipeline implementation: orchestration, automation, data lineage tracking, and archiving strategies
- Data wrangling: merging techniques (defining keys, fuzzy joins), deduplication, standardization, unit conversion, regular expressions, outlier handling (winsorization), imputation strategies, ground truth labeling
- CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment — know what happens at each phase
- Version control for data science: code versioning, data versioning, hyperparameter tracking, model versioning — all four must be managed
- MLOps practices: data replication, CI/CD pipelines for models, container orchestration, model validation (online, offline, A/B testing), continuous performance monitoring
- Deployment environments: containerization, cloud, cluster, hybrid, edge, on-premises — know tradeoffs and appropriate use cases for each
- Clean code practices: unit testing, documentation (markdown, docstrings, code comments), dependency licensing management, API access patterns
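Two of the data wrangling techniques listed above, imputation and winsorization, are easy to confuse with simply deleting data. The sketch below shows one simple version of each in plain Python. The function names, the choice of median imputation, and the count-based winsorizing rule are assumptions for illustration. Libraries define percentile cutoffs in slightly different ways.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def winsorize(values, fraction=0.1):
    """Cap the lowest and highest `fraction` of values instead of dropping them.

    Assumed rule: with k = int(n * fraction), the k smallest values are raised
    to the (k+1)-th smallest, and the k largest are lowered to the (k+1)-th largest.
    """
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    k = int(len(values) * fraction)
    if k == 0:
        return list(values)
    ordered = sorted(values)
    lower, upper = ordered[k], ordered[-k - 1]
    return [min(max(v, lower), upper) for v in values]


# Illustrative data only.
print(impute_median([1.0, None, 3.0, 5.0]))
print(winsorize([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], fraction=0.1))
```

Winsorization keeps the row count unchanged, which matters when outliers sit in records you still need for other features.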
Common Exam Traps
Specialized Applications of Data Science
This domain covers advanced and specialized data science applications including optimization, NLP, computer vision, and emerging techniques. At 13%, it is the lightest domain but contains highly specific technical content. Expect questions on NLP preprocessing, computer vision techniques, optimization methods, and applications like fraud detection and reinforcement learning.
Key Topics
Must-Know Concepts
- Constrained optimization: linear programming (simplex method), network topology optimization, scheduling, non-linear solvers, pricing, and resource allocation
- Unconstrained optimization: multi-armed bandit (exploration vs exploitation tradeoff), local extrema finding, gradient-based methods
- NLP preprocessing pipeline: tokenization, bag of words, lemmatization vs stemming, stop word removal, n-grams — know the correct ordering and purpose of each step
- NLP representations: TF-IDF (term frequency-inverse document frequency), word embeddings (Word2Vec, GloVe), document-term matrices
- NLP applications: sentiment analysis, named entity recognition (NER), question answering, text generation, text summarization, speech recognition, NLU/NLG
- Topic modeling: Latent Dirichlet Allocation (LDA) — unsupervised method for discovering topics in document collections
- Computer vision core concepts: CNNs for feature extraction, OCR (optical character recognition), object detection and tracking, semantic segmentation, sensor fusion
- Computer vision data augmentation: rotation, flipping, scaling, cropping, noise injection, occlusion, filter application, masking — know why each is used
- Specialized applications: graph analysis, heuristics, greedy algorithms, reinforcement learning, event/fraud/anomaly detection, multimodal ML, edge computing optimization, signal processing
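The NLP preprocessing and TF-IDF bullets above can be tied together with a short sketch: tokenize, remove stop words, build n-grams, then weight terms. The stop word list, function names, and sample sentences are illustrative assumptions. The IDF formula used here is the plain `log(N / df)` variant. Libraries often apply smoothing, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list, not a standard one.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def preprocess(text):
    """Lowercase, tokenize on letters/digits/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return consecutive n-token tuples (bigrams by default)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Weight each term by term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Illustrative corpus only.
corpus = ["The cat sat", "The dog sat", "The cat ran"]
docs = [preprocess(text) for text in corpus]
print(docs)
print(ngrams(docs[0]))
for doc_weights in tf_idf(docs):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Terms that appear in fewer documents receive higher weights, and a term present in every document scores zero under this variant, which is the intuition behind using TF-IDF instead of raw counts.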
Common Exam Traps
Concepts You Must Not Confuse
These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.
Top Mistakes to Avoid
Exam-Ready Checklist
Recommended Resources
Free & Official Resources
Paid Courses & Practice Exams
These are recommended if you prefer a structured learning path. They can save time but are not required to pass.