CertPrepNow
CompTIA · DY0-001 · Updated 2026-06-15

DY0-001 Study Guide

Everything you need to pass the CompTIA DataAI (formerly DataX) exam. Structured study plans, key services, common traps, and practice questions.

You Can Pass This Exam For Free

The DY0-001 exam is passable with free resources if you study consistently for 3-6 months, though this is an expert-level exam requiring deep data science knowledge:

  • CompTIA official DY0-001 exam objectives PDF (free download)
  • Khan Academy statistics and linear algebra courses (free)
  • Andrew Ng's Machine Learning Specialization on Coursera (audit for free)
  • Scikit-learn, TensorFlow, and PyTorch official documentation (free)
  • CRISP-DM methodology documentation and guides (free)
  • 500+ free practice questions on this site

This is an expert-level certification with a high failure rate. The exam tests applied knowledge of statistics, machine learning, and MLOps — not just definitions. Hands-on experience with Python/R, ML frameworks, and real datasets is essential and cannot be fully replaced by study materials alone.

Choose Your Study Path

You have data analysis experience (SQL, Excel, basic statistics) but limited machine learning or advanced math background. You need to build up mathematical foundations and ML skills.

Month 1, Weeks 1-2: Build math foundations: review linear algebra (matrices, eigenvalues, decomposition), calculus (partial derivatives, chain rule, gradient), and probability distributions (normal, Poisson, binomial, t-distribution)
Month 1, Weeks 3-4: Study statistics in depth: hypothesis testing (t-tests, chi-squared, ANOVA), p-values, Type I/II errors, confidence intervals, regression metrics (R-squared, RMSE), confusion matrix metrics (precision, recall, F1, MCC)
Month 2, Weeks 1-2: Learn Domain 2 EDA and modeling: univariate/multivariate analysis, visualization types, feature engineering (one-hot encoding, binning, Box-Cox), handling multicollinearity, outliers, and missing data patterns
Month 2, Weeks 3-4: Study supervised learning: linear regression (OLS, Ridge, LASSO, Elastic Net), logistic regression, Naive Bayes, discriminant analysis, decision trees, and ensemble methods (random forests, gradient boosting, XGBoost)
Month 3, Weeks 1-2: Learn unsupervised learning (k-means, hierarchical clustering, DBSCAN, PCA, t-SNE) and deep learning fundamentals (neural network architecture, activation functions, backpropagation, CNNs, RNNs, LSTMs, transformers)
Month 3, Weeks 3-4: Study bias-variance tradeoff, overfitting/underfitting, regularization (dropout, L1/L2), cross-validation, hyperparameter tuning (grid search, random search), model drift, and data leakage prevention
Month 4, Weeks 1-2: Cover Domain 4 operations: data acquisition, data wrangling (joins, deduplication, imputation), data infrastructure (Parquet, streaming vs batching), CRISP-DM framework, version control for code/data/models
Month 4, Weeks 3-4: Study MLOps and deployment: CI/CD pipelines, containerization, model validation (A/B testing, online/offline), deployment environments (cloud, edge, hybrid), and continuous model monitoring
Month 5, Weeks 1-2: Cover Domain 5 specialized applications: NLP (tokenization, TF-IDF, embeddings, sentiment analysis, LDA), computer vision (OCR, object detection, data augmentation), optimization (simplex, multi-armed bandit), and reinforcement learning
Month 5, Weeks 3-4: Take full-length practice exams under timed conditions. Review all incorrect answers thoroughly. Focus extra time on the two 24%-weight domains (Modeling/Analysis and Machine Learning)
Month 6: Final review: re-study weak areas identified in practice exams, review all confusable concepts, ensure you can apply formulas and methods to scenario-based questions. Schedule your exam when consistently passing practice tests

Exam Overview

Format

Up to 90 questions, 165 minutes. Multiple-choice and performance-based questions (PBQs).

Scoring

Pass/fail only. No scaled score is reported, and there is no published numeric passing threshold.

Domains & Weights

  • Mathematics and Statistics: 17%
  • Modeling, Analysis, and Outcomes: 24%
  • Machine Learning: 24%
  • Operations and Processes: 22%
  • Specialized Applications of Data Science: 13%

Registration

$544 USD. Available at Pearson VUE testing centers or online proctored from home.

Topic Priority Table

Not all topics are tested equally. Focus your study time on Tier 1 first, then Tier 2. Tier 3 topics rarely appear — just recognize what they do.

Tier 1: Must Know. You must understand these concepts deeply, know the math behind them, and be able to apply them in scenario-based questions. These appear across multiple domains and questions.
Tier 2: Should Know. Understand what these are, their key characteristics, and when to apply them. May appear in 2-5 questions each.
Tier 3: Recognize Only. Know what these are at a high level and their primary use case. Rarely more than 1-2 questions each.
Domain 1 · 17% of exam

Mathematics and Statistics

This domain tests your ability to apply mathematical and statistical methods to data science problems. It covers statistical tests, probability distributions, linear algebra, calculus fundamentals, and temporal models including time series analysis and causal inference. At 17% it is one of the lighter domains by weight (only Domain 5 is lighter), but the mathematical foundations here underpin every other domain on the exam.

Key Topics

Hypothesis Testing, Probability Distributions, Linear Algebra, Calculus, Time Series (ARIMA), Confusion Matrix Metrics, ROC/AUC, Causal Inference

Must-Know Concepts

  • Statistical tests and when to use each: t-tests (comparing means), chi-squared (categorical independence), ANOVA (comparing multiple group means), and their assumptions
  • Confusion matrix metrics: accuracy, precision, recall, F1 score, MCC — know how to calculate each from TP, FP, TN, FN
  • ROC/AUC curves: how to interpret them, what AUC values mean (0.5 = random, 1.0 = perfect), and their role in model evaluation
  • Regression performance metrics: R-squared, Adjusted R-squared, RMSE, and F-statistic — know what each measures and when to use it
  • Probability distributions: normal, uniform, Poisson, t, binomial, power law — know their shapes, parameters, and typical use cases
  • Distribution characteristics: skewness (asymmetry), kurtosis (tail heaviness), heteroskedasticity (non-constant variance) — know implications for model assumptions
  • Type I error (false positive, rejecting true null) vs Type II error (false negative, failing to reject false null) and their relationship to significance level
  • Linear algebra essentials: matrix operations (multiplication, transposition, inversion, decomposition), eigenvalues/eigenvectors, rank, and span
  • Distance metrics: Euclidean (straight line), Manhattan (grid), cosine (angle between vectors) — know when each is appropriate
  • Time series models: AR, MA, ARIMA — understand stationarity requirements and model selection
  • Causal inference: difference between correlation and causation, DAGs, A/B testing, difference-in-differences, and RCTs
  • Model selection criteria: AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) — lower values indicate better model fit with penalty for complexity
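The confusion matrix metrics listed above can all be computed from the four raw counts. The following is a minimal sketch in plain Python, not an official exam formula sheet; the function name `confusion_metrics` and the example counts are made up for illustration.

```python
import math

def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common classification metrics from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. a model that never predicts positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient uses all four cells of the matrix
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Illustrative counts only
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision is 40/50 and recall is 40/45, so the model is slightly better at finding positives than at keeping its positive predictions clean.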

Common Exam Traps

Precision and recall are NOT the same thing — precision = TP/(TP+FP), recall = TP/(TP+FN). The exam will test whether you can identify which metric matters in a given business scenario
Type I error is a FALSE POSITIVE (rejecting a true null hypothesis). Type II error is a FALSE NEGATIVE (failing to reject a false null). Do not mix these up
AIC and BIC both penalize model complexity but BIC penalizes more heavily. Lower values are better for both — the exam may test which to use with small vs large datasets
Pearson correlation measures LINEAR relationships only. Spearman measures MONOTONIC relationships and handles non-linear data. The exam tests when to use each
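ARIMA's AR and MA components assume STATIONARY data. If the data has a trend, it must be differenced first (the "I" term, controlled by the differencing order d, does this); seasonality calls for seasonal differencing. Non-stationarity is a common trap in time series questions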
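To make the stationarity trap concrete, here is a small sketch of differencing in plain Python. The helper `difference` and the toy series are illustrative assumptions; in practice you would also confirm stationarity with a formal test rather than by eye.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A series with a linear trend is not stationary (its mean keeps rising)
trend = [10 + 3 * t for t in range(8)]
first_diff = difference(trend)            # the trend is removed after one difference

# A quadratic trend needs two rounds of differencing
quadratic = [t * t for t in range(6)]
second_diff = difference(difference(quadratic))

print(first_diff)
print(second_diff)
```

The number of differencing rounds needed corresponds to the d parameter when an ARIMA(p, d, q) model is specified.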
Quick Check: Mathematics and Statistics

A data scientist is evaluating a fraud detection model where missing actual fraud cases is far more costly than flagging legitimate transactions. Which metric should they prioritize?

Domain 2 · 24% of exam

Modeling, Analysis, and Outcomes

This domain covers the full modeling workflow from exploratory data analysis through results communication. You must demonstrate mastery of EDA techniques, data issue identification and resolution, feature engineering, model iteration, and presenting findings to stakeholders. It is one of the two heaviest domains at 24% and emphasizes practical, scenario-based application of data analysis skills.

Key Topics

EDA, Feature Engineering, Visualization, Data Issues, Hyperparameter Tuning, Model Iteration, Results Communication

Must-Know Concepts

  • Exploratory Data Analysis: univariate analysis (single variable distributions) and multivariate analysis (relationships between variables) — know when and why to use each
  • Visualization types and when to use each: bar plot (categorical comparisons), scatter plot (two continuous variables), box plot (distribution summary), violin plot (distribution shape), heat map (correlation matrices), line plot (trends over time)
  • Feature types: categorical, discrete, continuous, ordinal, nominal, binary — must correctly identify each and know appropriate analysis methods for each type
  • Data issues and solutions: sparse data/matrices, non-linearity, non-stationarity, multicollinearity, seasonality, granularity misalignment, insufficient features, multivariate outliers
  • Feature engineering techniques: one-hot encoding (categorical to binary columns), label encoding (categorical to integers), normalization, binning, log/exponential transformation, Box-Cox transformation, ratio creation
  • Handling multicollinearity: detection using VIF (Variance Inflation Factor), resolution through feature removal, PCA, or regularization
  • Model design iteration: defining constraints (time, resources, hardware, cost), hyperparameter tuning, experiment tracking, diagnostic plots for architecture decisions
  • Results communication: benchmarking against baselines, aligning with business requirements, accessibility in charts (font size, color choice, content tagging), documentation best practices
  • Data enrichment: incorporating external data sources, synthetic data generation, and data augmentation techniques
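The difference between label encoding and one-hot encoding from the feature engineering bullet is easier to see on a tiny example. This is a hand-rolled sketch in plain Python; the helper names and the sample values are illustrative, and real projects would normally use a library encoder instead.

```python
def label_encode(values, order):
    """Map ordinal categories to integers that respect the given order."""
    mapping = {category: index for index, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Map nominal categories to one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Ordinal feature: the order small < medium < large is meaningful
sizes = ["small", "large", "medium", "small"]
size_codes = label_encode(sizes, order=["small", "medium", "large"])

# Nominal feature: no order exists, so each category gets its own column
colors = ["red", "green", "blue", "green"]
color_columns, color_rows = one_hot_encode(colors)

print(size_codes)
print(color_columns)
print(color_rows)
```

Label-encoding `colors` instead would tell a linear model that one color is "greater than" another, which is the false ordering the exam asks about.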

Common Exam Traps

One-hot encoding creates k binary columns for k categories. Label encoding assigns integers. Use one-hot for nominal data (no order) and label encoding only for ordinal data — using label encoding for nominal data implies false ordering
Box-and-whisker plots show the MEDIAN (not mean), quartiles, and outliers. Do not confuse the center line with the mean
Multicollinearity does not necessarily reduce predictive accuracy, but it inflates the variance of coefficient estimates and makes individual coefficient interpretation unreliable. The exam may present scenarios where prediction is fine but feature importance is misleading
Normalization (scaling to 0-1) and standardization (mean=0, std=1) are DIFFERENT techniques. Know which algorithms require which preprocessing
A heat map shows correlation strength, but correlation does not imply causation. The exam may test this distinction in interpretation questions
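The normalization versus standardization trap can be checked numerically. The sketch below uses only the standard library; the function names are illustrative, and both helpers assume the input is not constant (otherwise the denominators would be zero).

```python
import statistics

def min_max_normalize(xs):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

values = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(values)
standardized = standardize(values)

print(normalized)     # bounded between 0 and 1
print(standardized)   # centered on 0, not bounded
```

Normalized values are bounded, which suits inputs that must stay in a fixed range. Standardized values are centered, which suits methods that assume features on a comparable scale around zero.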
Quick Check: Modeling, Analysis, and Outcomes

A data scientist discovers that two predictor variables in a regression model have a Variance Inflation Factor (VIF) above 10. What does this indicate and what is the most appropriate action?

Domain 3 · 24% of exam

Machine Learning

This domain covers the full spectrum of machine learning from foundational concepts through deep learning. You must understand supervised learning (regression and classification), tree-based methods, unsupervised learning, and deep learning architectures. At 24%, this domain shares the top weight with Modeling and requires both theoretical understanding and practical application knowledge.

Key Topics

Supervised Learning, Unsupervised Learning, Deep Learning, Ensemble Methods, Regularization, Feature Selection, Hyperparameter Tuning, Neural Networks

Must-Know Concepts

  • Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex). Goal is to minimize total prediction error by balancing both
  • Feature selection methods: importance metrics, VIF for multicollinearity, and model-based selection. Know when to reduce features vs engineer new ones
  • Class imbalance handling: oversampling (SMOTE), undersampling, stratified sampling — know the tradeoffs of each approach
  • Regularization types: L1/LASSO (feature selection), L2/Ridge (coefficient shrinkage), Elastic Net (combined), dropout (neural networks), early stopping, batch normalization
  • Supervised statistical methods: linear regression (OLS, Ridge, LASSO, Elastic Net), logistic regression (probit/logit), discriminant analysis (LDA/QDA), Naive Bayes, association rules (confidence, lift, support)
  • Tree-based methods: decision trees, random forests (bagging), gradient boosting, XGBoost — know the algorithm differences and when each excels
  • Deep learning architecture: perceptron, multilayer perceptron, activation functions (ReLU, Sigmoid, Tanh, Softmax), backpropagation, layer types (input, hidden, pooling, output)
  • Deep learning models: CNN (images), RNN (sequences), LSTM (long sequences), GANs (generation), autoencoders (compression), transformers (attention-based)
  • Optimizers: Adam, SGD, RMSprop, momentum, mini-batch — know their characteristics and when to use each
  • Unsupervised methods: k-means, hierarchical clustering, DBSCAN, PCA, t-SNE, UMAP, SVD — know method selection criteria
  • Data leakage: information from outside the training dataset improperly influencing the model. Common in feature engineering and cross-validation
  • Hyperparameter tuning: grid search (exhaustive) vs random search (sampled) — know efficiency tradeoffs
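The efficiency tradeoff between grid search and random search comes down to how many configurations get evaluated. The sketch below is a toy illustration: `validation_score` is a hypothetical stand-in for a real cross-validated score, and the hyperparameter ranges are made up.

```python
import itertools
import random

def validation_score(learning_rate, depth):
    # Hypothetical scoring function that peaks at learning_rate=0.1, depth=5.
    # In practice this would train a model and return a validation metric.
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 5) ** 2

learning_rates = [0.001, 0.01, 0.1, 1.0]
depths = [3, 5, 7]

# Grid search: exhaustively evaluate every combination (4 * 3 = 12 runs)
grid = list(itertools.product(learning_rates, depths))
best_grid = max(grid, key=lambda params: validation_score(*params))

# Random search: evaluate a fixed budget of sampled combinations (5 runs),
# drawing the learning rate on a log scale between 0.001 and 1.0
rng = random.Random(0)
budget = 5
samples = [(10 ** rng.uniform(-3, 0), rng.randint(3, 7)) for _ in range(budget)]
best_random = max(samples, key=lambda params: validation_score(*params))

print(len(grid), best_grid)
print(len(samples), best_random)
```

Grid search cost grows multiplicatively with each added hyperparameter, while the random search budget stays whatever you set it to, at the price of no guarantee that the best grid point is visited.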

Common Exam Traps

Random Forest uses BAGGING (parallel, reduces variance). XGBoost uses BOOSTING (sequential, reduces bias). The exam specifically tests this distinction
LASSO (L1) can zero out coefficients, performing feature selection. Ridge (L2) CANNOT — it only shrinks them toward zero. This is a high-frequency exam question
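Softmax activation is used for MULTI-CLASS classification output layers, where exactly one class is correct. Sigmoid is for BINARY classification (and for multi-label outputs, where each label is an independent yes/no). Do not use a single Sigmoid output for single-label multi-class problems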
k-Nearest Neighbors (kNN) is technically a supervised method despite being listed near unsupervised methods. It classifies based on labeled neighbor data
Data leakage can occur during PREPROCESSING if you fit transformations (like scaling) on the full dataset before splitting into train/test. Always fit on training data only
Learning rate in deep learning: too high causes divergence (overshooting), too low causes extremely slow convergence. The exam tests understanding of this tradeoff
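The preprocessing leakage trap can be shown with a few numbers. In this sketch the dataset, the split point, and the helper names `fit_scaler` and `transform` are all illustrative assumptions; the point is only where the scaling statistics come from.

```python
import statistics

def fit_scaler(rows):
    """Learn scaling statistics (mean, population std) from the given rows."""
    return statistics.fmean(rows), statistics.pstdev(rows)

def transform(rows, mean, std):
    """Apply previously learned statistics to any rows."""
    return [(x - mean) / std for x in rows]

data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]      # the extreme value lands in the test set

# Correct: statistics come from the training rows only
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky: the test row shifts the statistics used to scale the training data
leaky_mean, leaky_std = fit_scaler(data)

print(mean, leaky_mean)
```

The leaky mean is pulled far from the training mean by a value the model should never have seen during training.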
Quick Check: Machine Learning

A data scientist trains a model that achieves 99% accuracy on training data but only 65% on the test set. Which technique would MOST effectively address this problem?

Domain 4 · 22% of exam

Operations and Processes

This domain covers the operational side of data science: from business requirements gathering through data acquisition, infrastructure, wrangling, lifecycle management, and MLOps deployment. At 22%, it tests your ability to translate business needs into technical solutions and maintain production data science systems. Expect scenario questions about data pipelines, deployment strategies, and operational best practices.

Key Topics

CRISP-DM, DAMA Framework, Data Pipelines, MLOps, CI/CD, Containerization, Data Wrangling, Version Control

Must-Know Concepts

  • Compliance and security: PII identification and protection, proprietary data handling, anonymization techniques, obfuscation methods — these appear throughout the domain
  • Business translation: establishing measures, metrics, and KPIs; requirements gathering with cost-benefit analysis; translating business needs into data science solutions
  • Data acquisition sources: surveys, administrative data, sensor data, transactional data, experimental data, synthetic data (costs, benefits, limitations), commercial/public data (licensing, restrictions)
  • Data infrastructure: resource sizing, GPU/TPU considerations, data formats (CSV, JSON, Parquet, compressed), storage types (structured, semi-structured, unstructured), streaming vs batching
  • Data pipeline implementation: orchestration, automation, data lineage tracking, and archiving strategies
  • Data wrangling: merging techniques (defining keys, fuzzy joins), deduplication, standardization, unit conversion, regular expressions, outlier handling (winsorization), imputation strategies, ground truth labeling
  • CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment — know what happens at each phase
  • Version control for data science: code versioning, data versioning, hyperparameter tracking, model versioning — all four must be managed
  • MLOps practices: data replication, CI/CD pipelines for models, container orchestration, model validation (online, offline, A/B testing), continuous performance monitoring
  • Deployment environments: containerization, cloud, cluster, hybrid, edge, on-premises — know tradeoffs and appropriate use cases for each
  • Clean code practices: unit testing, documentation (markdown, docstrings, code comments), dependency licensing management, API access patterns

Common Exam Traps

CRISP-DM is a DATA MINING lifecycle framework. Do not confuse it with DAMA (data management body of knowledge). The exam tests both frameworks and their different focus areas
Streaming and batching are different data processing strategies. Streaming processes data in real-time as it arrives. Batching collects data and processes it in bulk at intervals. Know which scenarios require which
Parquet is a COLUMNAR storage format optimized for analytics. CSV is row-based. The exam may test when Parquet outperforms CSV (large analytical queries) vs when CSV is sufficient (small datasets, interoperability)
Winsorization CAPS outliers at a percentile threshold rather than removing them. This is different from trimming (removal) and imputation (replacement with estimated values)
Edge deployment puts models on devices with limited compute (IoT, mobile). Cloud deployment offers scalable compute. The exam tests when edge is preferred (low latency, offline capability, privacy) vs cloud (complex models, scalability)
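Winsorization, as described in the traps above, keeps every row and only caps extreme values. The following is a minimal sketch that assumes a simple floor-based percentile rule; library implementations may interpolate percentiles differently, and the `winsorize` helper here is written for illustration only.

```python
def winsorize(xs, lower_pct=0.05, upper_pct=0.95):
    """Cap values below/above the given percentiles instead of removing them."""
    ordered = sorted(xs)
    n = len(ordered)
    # Simple percentile lookup: floor of the fractional index
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    return [min(max(x, lo), hi) for x in xs]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
capped = winsorize(values, lower_pct=0.1, upper_pct=0.9)

print(capped)
print(len(values) == len(capped))   # no rows were dropped
```

Trimming would have shortened the list by removing the outlier, whereas winsorization replaces it with the cap value.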
Quick Check: Operations and Processes

A data science team needs to track changes to their model architecture, training data, hyperparameters, and code simultaneously across experiments. Which practice BEST addresses this requirement?

Domain 5 · 13% of exam

Specialized Applications of Data Science

This domain covers advanced and specialized data science applications including optimization, NLP, computer vision, and emerging techniques. At 13%, it is the lightest domain but contains highly specific technical content. Expect questions on NLP preprocessing, computer vision techniques, optimization methods, and applications like fraud detection and reinforcement learning.

Key Topics

NLP, Computer Vision, Optimization, Reinforcement Learning, Graph Analysis, Anomaly Detection, Edge Computing

Must-Know Concepts

  • Constrained optimization: linear programming (simplex method), network topology optimization, scheduling, non-linear solvers, pricing, and resource allocation
  • Unconstrained optimization: multi-armed bandit (exploration vs exploitation tradeoff), local extrema finding, gradient-based methods
  • NLP preprocessing pipeline: tokenization, bag of words, lemmatization vs stemming, stop word removal, n-grams — know the correct ordering and purpose of each step
  • NLP representations: TF-IDF (term frequency-inverse document frequency), word embeddings (Word2Vec, GloVe), document-term matrices
  • NLP applications: sentiment analysis, named entity recognition (NER), question answering, text generation, text summarization, speech recognition, NLU/NLG
  • Topic modeling: Latent Dirichlet Allocation (LDA) — unsupervised method for discovering topics in document collections
  • Computer vision core concepts: CNNs for feature extraction, OCR (optical character recognition), object detection and tracking, semantic segmentation, sensor fusion
  • Computer vision data augmentation: rotation, flipping, scaling, cropping, noise injection, occlusion, filter application, masking — know why each is used
  • Specialized applications: graph analysis, heuristics, greedy algorithms, reinforcement learning, event/fraud/anomaly detection, multimodal ML, edge computing optimization, signal processing

Common Exam Traps

TF-IDF is NOT the same as bag of words. Bag of words counts word frequency. TF-IDF weights words by their importance across documents — common words get lower weight
Word2Vec and GloVe produce DENSE vector embeddings. Bag of words and TF-IDF produce SPARSE vectors. The exam tests whether you know the difference and when each is appropriate
Latent Dirichlet Allocation (LDA) in NLP is a TOPIC MODEL, not the same as Linear Discriminant Analysis (LDA) in supervised learning. Same abbreviation, completely different algorithms
Reinforcement learning is NEITHER supervised NOR unsupervised. It learns through reward/penalty feedback from an environment. The exam may include it as a distractor in supervised/unsupervised questions
Data augmentation in computer vision (rotation, flipping) increases training data diversity WITHOUT collecting new images. It does not improve image quality or resolution
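The bag of words versus TF-IDF distinction can be seen on a three-document toy corpus. This sketch uses the plain unsmoothed formula tf × log(N / df); libraries often apply smoothed variants, so exact numbers will differ. The corpus and helper names are made up for illustration.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [doc.split() for doc in docs]

# Bag of words: raw term counts per document
bow = [Counter(tokens) for tokens in tokenized]

def idf(term):
    """Inverse document frequency: rarer terms across documents score higher."""
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / doc_freq)

def tf_idf(tokens):
    """Weight each term's relative frequency by its IDF."""
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * idf(term)
            for term, count in counts.items()}

weights = tf_idf(tokenized[1])   # "the dog sat"

print(bow[1])
print(weights)
```

In bag of words, "the", "dog", and "sat" all count once in the second document. Under TF-IDF, "the" drops to zero because it appears in every document, and "dog" outweighs "sat" because it is rarer.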
Quick Check: Specialized Applications of Data Science

A data scientist needs to build a system that discovers the main themes across 100,000 customer support tickets without predefined categories. Which technique is most appropriate?

Concepts You Must Not Confuse

These pairs appear on nearly every exam. Learn the difference and you'll avoid the most common traps.

Overfitting vs Underfitting

Overfitting

Model is too complex and learns noise in the training data. Performs well on training data but poorly on unseen data. High variance, low bias.

Underfitting

Model is too simple and fails to capture the underlying pattern. Performs poorly on both training and test data. High bias, low variance.

Exam trap

Overfitting means the model memorized the training data (high variance). Underfitting means the model is too simplistic (high bias). The exam tests whether you can identify each from performance metrics and choose the correct remedy: regularization for overfitting, more complexity for underfitting.

Ridge Regression (L2) vs LASSO Regression (L1)

Ridge Regression (L2)

Adds L2 penalty (sum of squared coefficients) to prevent overfitting. Shrinks coefficients toward zero but never eliminates them entirely. Best when all features contribute.

LASSO Regression (L1)

Adds L1 penalty (sum of absolute coefficients) to prevent overfitting. Can shrink coefficients to exactly zero, effectively performing feature selection. Best when many features are irrelevant.

Exam trap

LASSO performs automatic feature selection by zeroing out coefficients. Ridge does NOT eliminate features — it only shrinks them. If the question asks about feature selection through regularization, the answer is LASSO (L1), not Ridge (L2). Elastic Net combines both.
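The two penalties can be written side by side. This is the standard textbook formulation rather than anything taken from the exam objectives; the notation (β for coefficients, λ for penalty strength, p for the number of features) is chosen here for illustration.

```latex
\begin{aligned}
\text{Ridge (L2):}\quad & \min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2} \\
\text{LASSO (L1):}\quad & \min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert \\
\text{Elastic Net:}\quad & \min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda_1 \sum_{j=1}^{p}\lvert\beta_j\rvert + \lambda_2 \sum_{j=1}^{p}\beta_j^{2}
\end{aligned}
```

The absolute-value penalty has a corner at zero, which is why LASSO can set coefficients exactly to zero, while the squared penalty only shrinks them smoothly.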

Bagging (Bootstrap Aggregation) vs Boosting

Bagging (Bootstrap Aggregation)

Trains multiple models independently on random subsets of data and averages their predictions. Reduces variance. Random Forest is the classic bagging algorithm.

Boosting

Trains models sequentially, with each new model focusing on errors made by previous models. Reduces bias. Gradient Boosting and XGBoost are key boosting algorithms.

Exam trap

Bagging reduces VARIANCE (parallel models, averaging). Boosting reduces BIAS (sequential models, error correction). The exam tests whether you know which ensemble approach addresses which problem. Random Forest = bagging. XGBoost = boosting.

Precision vs Recall

Precision

Of all predictions labeled positive, what proportion was actually positive? High precision means few false positives. Critical when false positives are costly (spam filtering).

Recall

Of all actual positives, what proportion was correctly identified? High recall means few false negatives. Critical when false negatives are costly (disease detection, fraud detection).

Exam trap

Precision focuses on the quality of positive predictions (minimize false positives). Recall focuses on finding all actual positives (minimize false negatives). The F1 score is their harmonic mean. The exam will present scenarios where you must choose which metric matters more based on business context.

Supervised Learning vs Unsupervised Learning

Supervised Learning

Training with labeled data where correct outputs are known. Used for classification (categorical target) and regression (continuous target). Examples: logistic regression, decision trees, SVM.

Unsupervised Learning

Training with unlabeled data to discover hidden patterns. Used for clustering, dimensionality reduction, and anomaly detection. Examples: k-means, PCA, DBSCAN.

Exam trap

Supervised = labeled data, predefined target variable. Unsupervised = unlabeled data, discovers structure. The exam also tests semi-supervised learning (mix of labeled and unlabeled) and reinforcement learning (learns from rewards/penalties), which are distinct categories.

Data Drift vs Concept Drift

Data Drift

The statistical distribution of input features changes over time while the underlying relationship between features and target remains the same.

Concept Drift

The relationship between input features and the target variable changes over time, even if input distributions remain stable.

Exam trap

Data drift means the INPUT distribution shifts (e.g., customer demographics change). Concept drift means the RELATIONSHIP between inputs and outputs changes (e.g., what predicts churn evolves). Both degrade model performance but require different monitoring and remediation strategies.

Stemming vs Lemmatization

Stemming

Crude rule-based method that chops word endings to find the root form. Fast but imprecise — 'running' becomes 'run' but 'better' might become 'bet'.

Lemmatization

Uses vocabulary and morphological analysis to return the dictionary base form (lemma). Slower but accurate — 'better' correctly becomes 'good'.

Exam trap

Stemming is fast but can produce non-words. Lemmatization is accurate but slower. The exam tests whether you understand the quality-speed tradeoff in NLP preprocessing and can choose the appropriate method for a given scenario.

k-Means Clustering vs DBSCAN

k-Means Clustering

Partitions data into k clusters based on distance to centroids. Requires specifying k in advance. Works well with spherical, evenly-sized clusters. The optimal k is typically chosen with the elbow method or silhouette score.

DBSCAN

Density-based clustering that finds clusters of arbitrary shape. Does not require specifying the number of clusters. Can identify outliers as noise points. Struggles with varying density clusters.

Exam trap

k-Means requires you to specify k beforehand and assumes spherical clusters. DBSCAN discovers the number of clusters automatically and handles irregular shapes. If the question mentions unknown number of clusters or non-spherical data, DBSCAN is likely the answer.

Top Mistakes to Avoid

Confusing overfitting (high variance, memorizes training data) with underfitting (high bias, too simplistic) — the remedies are opposite: regularize for overfitting, add complexity for underfitting
Mixing up LASSO (L1, performs feature selection by zeroing coefficients) and Ridge (L2, shrinks but never zeroes coefficients) — the exam heavily tests this distinction
Using accuracy as the primary metric for imbalanced datasets — a model predicting all majority class achieves high accuracy but misses the minority class entirely. Use precision, recall, or F1 instead
Confusing bagging (parallel, reduces variance, e.g. Random Forest) with boosting (sequential, reduces bias, e.g. XGBoost) — know which ensemble method addresses which type of error
Applying label encoding to nominal categorical data — this falsely implies an ordering. Use one-hot encoding for nominal features like color or city
Forgetting that ARIMA requires stationary data — non-stationary time series must be differenced before applying ARIMA, or the results will be unreliable
Confusing LDA (Latent Dirichlet Allocation, a topic model in NLP) with LDA (Linear Discriminant Analysis, a supervised classification method) — same abbreviation, completely different algorithms
Assuming data augmentation in computer vision creates new data — it only creates transformed copies of existing images to improve model robustness, not new independent samples
Fitting preprocessing transformations (scaling, encoding) on the full dataset before train/test split — this causes data leakage from the test set and inflates performance metrics
Treating correlation as causation — high correlation between variables does not establish a causal relationship. Use causal inference methods (A/B tests, RCTs, DAGs) to establish causation
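The accuracy mistake on imbalanced data from the list above is easy to reproduce. The sketch below assumes a made-up dataset of 1,000 transactions with 10 fraud cases and a "model" that always predicts the majority class.

```python
# Illustrative imbalanced labels: 10 positives (fraud), 990 negatives
y_true = [1] * 10 + [0] * 990
# A useless model that predicts "not fraud" for every row
y_pred = [0] * 1000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.2f}")   # looks excellent
print(f"recall:   {recall:.2f}")     # every fraud case was missed
```

Accuracy rewards the model for the 990 easy negatives, while recall exposes that it never found a single positive.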

Exam-Ready Checklist

Can explain all 5 exam domains and their relative weights (17%, 24%, 24%, 22%, 13%)
Know which statistical test to apply for each scenario: t-tests (2 group means), chi-squared (categorical), ANOVA (3+ group means), Pearson/Spearman correlation
Can calculate and interpret confusion matrix metrics: accuracy, precision, recall, F1, MCC, and know when each metric is most important
Understand the bias-variance tradeoff and can identify overfitting vs underfitting from training/test performance gaps
Can distinguish all regression variants: OLS, Ridge (L2), LASSO (L1), Elastic Net — and know when to use each
Know the complete supervised learning catalog: linear regression, logistic regression, Naive Bayes, LDA/QDA, decision trees, random forests, gradient boosting, XGBoost
Understand deep learning architecture: activation functions (ReLU, Sigmoid, Tanh, Softmax), layer types, backpropagation, and model architectures (CNN, RNN, LSTM, transformers)
Can explain unsupervised methods: k-means vs DBSCAN vs hierarchical clustering, PCA vs t-SNE vs UMAP, and when to use each
Know the NLP pipeline in order: tokenization, stop words, stemming/lemmatization, bag of words, TF-IDF, embeddings — and can distinguish TF-IDF from bag of words
Understand CRISP-DM phases and can map data science activities to the correct lifecycle stage
Know MLOps practices: version control (code, data, models, hyperparameters), CI/CD, containerization, A/B testing, continuous monitoring, and deployment environments
Can identify and resolve data issues: multicollinearity (VIF), class imbalance (SMOTE), missing data patterns, outliers (winsorization), and data leakage
Understand optimization concepts: constrained (linear programming, simplex) vs unconstrained (multi-armed bandit, exploration-exploitation)
Have practiced with PBQs — performance-based questions test applied skills, not just recall
Reviewed all confusable concepts: Ridge vs LASSO, bagging vs boosting, precision vs recall, stemming vs lemmatization, data drift vs concept drift

Frequently Asked Questions