General Exam Tips
- 1. Read ALL answer choices before committing: two options often look correct but differ on one critical word (e.g., 'shrinks toward zero' vs 'zeros out coefficients')
- 2. PBQs appear at the START of the exam. Flag simulation PBQs and return to them after completing multiple-choice; virtual PBQs must be completed on first encounter
- 3. Pass/fail scoring means there is no partial-credit safety net. You must demonstrate competence across ALL five domains; neglecting any domain risks a fail even if you ace the rest
- 4. The exam is 165 minutes for up to 90 questions. Budget roughly 1 minute per multiple-choice question and 4-6 minutes per PBQ. Never spend more than 8 minutes on any single question
- 5. Scenario questions always embed a constraint that narrows the correct answer. Identify the constraint FIRST (e.g., 'low latency', 'limited compute', 'imbalanced classes') before evaluating options
- 6. When a question mentions a business context (e.g., 'missing fraud cases is costly'), translate it to a metric: costly false negatives means maximize recall; costly false positives means maximize precision
- 7. The Operations and Processes domain (22%) is the most commonly neglected by working data scientists. Candidates who only study ML theory frequently fail here on MLOps and governance questions
- 8. Ethics and AI governance appear in the exam content despite not having a dedicated domain, so do not skip these topics entirely
- 9. The exam was rebranded from CompTIA DataX to CompTIA DataAI in January 2026. Content and objectives are IDENTICAL, same DY0-001 code
Mathematics and Statistics
Must-Know Facts
- The distinction between Type I error (rejecting a TRUE null, a false positive) and Type II error (failing to reject a FALSE null, a false negative). The probability of a Type I error is alpha (the significance level); the probability of a Type II error is beta, and power = 1 - beta
- Which statistical test to use by scenario: t-test for two group means, ANOVA for three or more group means, chi-squared for association between two categorical variables, Pearson for linear relationships between continuous variables, Spearman for monotonic relationships (ordinal data, or non-linear but consistently increasing or decreasing)
- Confusion matrix mechanics: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of both, MCC = most reliable metric under severe class imbalance
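- ROC/AUC interpretation: AUC = 0.5 means the model is no better than random guessing, AUC = 1.0 is a perfect classifier. Above 0.5, a higher AUC means the model ranks positives above negatives more reliably; an AUC well below 0.5 means the predictions are systematically inverted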
- AIC vs BIC: both penalize model complexity — LOWER is better for both. BIC applies a heavier penalty and is preferred when the sample size is large
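- ARIMA's ARMA component assumes a stationary series. Trend is removed by differencing (the 'I' term, order d) and seasonality by seasonal differencing (SARIMA). Check stationarity and settle the differencing BEFORE trusting an ARIMA fit
- Pearson measures LINEAR correlation only. Spearman measures MONOTONIC correlation on ranks, so it is more robust to outliers and captures non-linear relationships as long as they are monotonic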
- Distance metrics: Euclidean measures straight-line distance (sensitive to magnitude), Manhattan measures grid distance, cosine measures angle between vectors (magnitude-invariant, preferred for text)
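A minimal sketch of the three distance metrics in the last bullet, using only the standard library. The two vectors are made-up term counts for a short and a long document that point in the same direction; the function names are illustrative.

```python
import math

def euclidean(a, b):
    # Straight-line distance: grows with differences in magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Grid distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; ignores their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_short = [1.0, 2.0, 0.0]    # hypothetical term counts
doc_long = [10.0, 20.0, 0.0]   # same direction, 10x the magnitude

print("euclidean:", euclidean(doc_short, doc_long))
print("manhattan:", manhattan(doc_short, doc_long))
print("cosine similarity:", cosine_similarity(doc_short, doc_long))
```

Euclidean and Manhattan distance are large because the vectors differ in length, while cosine similarity is essentially 1 because the angle between them is zero. That is the sense in which cosine is magnitude-invariant.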
Common Traps
Confusing Pairs
Scenario Tips
When the question describes a binary medical test or fraud model and asks which metric to prioritize given 'missing actual cases is very costly'...
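Choose Recall (sensitivity). Missing actual cases = false negatives. Recall = TP/(TP+FN), so maximizing recall minimizes false negatives
Precision is tempting but it targets false POSITIVES, not false negatives. Accuracy is wrong because fraud and disease datasets are typically imbalanced, so a model can score high accuracy while missing most positive cases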
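One way to see the recall-versus-precision tradeoff from this scenario is to move the decision threshold on a scored dataset. The labels and scores below are invented for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, scores, threshold):
    # Predict positive when the score reaches the threshold.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical fraud labels (1 = fraud) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.6, 0.35, 0.28, 0.1, 0.1, 0.05]

strict = precision_recall(y_true, scores, threshold=0.5)
lenient = precision_recall(y_true, scores, threshold=0.25)
print("threshold 0.50 (precision, recall):", strict)
print("threshold 0.25 (precision, recall):", lenient)
```

Lowering the threshold catches every fraud case in this toy data (recall rises from 0.5 to 1.0) at the price of more false alarms (precision falls). When missed cases are the expensive error, that trade is the one the question wants.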
When a question asks you to compare three or more models, training methods, or groups for statistical significance...
Choose ANOVA. It compares means across three or more groups in a single test and avoids inflating Type I error
Multiple t-tests seem reasonable but they inflate the chance of a false positive. The correct answer is always ANOVA for 3+ groups
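The inflation of Type I error can be worked out directly. The sketch below assumes the pairwise t-tests are independent, which is a simplification, and uses alpha = 0.05; `familywise_error` is an illustrative name.

```python
from math import comb

def familywise_error(n_groups, alpha=0.05):
    # Number of pairwise comparisons among n_groups groups.
    n_tests = comb(n_groups, 2)
    # Chance of at least one false positive across all tests,
    # assuming the tests are independent (a simplification).
    return n_tests, 1 - (1 - alpha) ** n_tests

tests_3, fwer_3 = familywise_error(3)
tests_5, fwer_5 = familywise_error(5)
print(f"3 groups: {tests_3} t-tests, family-wise error = {fwer_3:.3f}")
print(f"5 groups: {tests_5} t-tests, family-wise error = {fwer_5:.3f}")
```

With three groups the chance of at least one false positive is already about 14% instead of 5%, and with five groups it is about 40%. A single ANOVA keeps the overall test at the chosen alpha.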
When a question presents time series data with seasonal patterns and asks which model is appropriate...
Use SARIMA, the seasonal extension of ARIMA. Difference the data to achieve stationarity (regular differencing for trend, seasonal differencing for the seasonal pattern), then fit the model
Applying ARIMA directly without checking for stationarity is the trap. Non-stationary data produces unreliable ARIMA results
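A small sketch of what differencing does, on a synthetic series built from a linear trend plus a repeating pattern of period 4. The series and the `difference` helper are illustrative; a real workflow would also run a stationarity test before fitting.

```python
def difference(series, lag=1):
    # y[t] - y[t - lag]; lag=1 is regular differencing,
    # lag=season_length is seasonal differencing.
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Synthetic quarterly data: trend of +2 per step plus a seasonal pattern.
seasonal_pattern = [5, 0, -3, 1]
series = [2 * t + seasonal_pattern[t % 4] for t in range(12)]

first_diff = difference(series, lag=1)       # removes the trend only
seasonal_diff = difference(series, lag=4)    # removes the seasonal pattern
both = difference(seasonal_diff, lag=1)      # removes what is left

print("original:      ", series)
print("first diff:    ", first_diff)
print("seasonal diff: ", seasonal_diff)
print("both:          ", both)
```

Regular differencing alone still leaves the period-4 pattern in the output, which is why a seasonal series calls for the seasonal differencing that SARIMA adds.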
Last-Minute Facts
Modeling, Analysis, and Outcomes
Must-Know Facts
- Exploratory Data Analysis requires both univariate analysis (single variable distributions) AND multivariate analysis (relationships between variables). Do not skip one type
- Visualization purpose: bar chart = categorical comparison, scatter plot = two continuous variables, box plot = distribution summary where the CENTER LINE is the MEDIAN not the mean, heat map = correlation matrix, violin plot = distribution shape
- Label encoding implies ordinal order (Red=1, Blue=2, Green=3 implies Green > Blue). Use one-hot encoding for nominal categories with no inherent order
- Normalization (scales to 0-1) vs standardization (mean=0, std=1): distance-based algorithms (kNN, k-means, SVM) and neural networks need normalization or standardization. Linear/logistic regression and PCA typically use standardization
- VIF > 10 signals severe multicollinearity: the fix is to remove one correlated variable, combine via PCA, or apply regularization
- Multicollinearity does NOT reduce prediction accuracy — it makes individual coefficient interpretation unreliable. The model can still predict well even with multicollinear features
- Box-Cox transformation is a power transformation that makes skewed continuous data more normal-like. It requires strictly positive values. Lambda = 0 is equivalent to a log transform
- When communicating results, always benchmark against a baseline (e.g., a naive classifier, previous model, or random guess)
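The Box-Cox bullet can be checked numerically. This is a sketch of the standard one-parameter formula, (x^lambda - 1) / lambda, with the lambda = 0 case defined as the natural log; the function name and sample values are illustrative.

```python
import math

def box_cox(x, lam):
    # Box-Cox is only defined for strictly positive inputs.
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

x = 5.0
log_case = box_cox(x, 0)            # exactly the natural log
near_zero = box_cox(x, 1e-6)        # approaches the log as lambda -> 0
sqrt_like = box_cox(4.0, 0.5)       # square-root-style transform
identity_like = box_cox(x, 1.0)     # just shifts x by 1

print(log_case, near_zero, sqrt_like, identity_like)
```

As lambda shrinks toward zero the power transform converges to the log transform, which is why the two are treated as the same case.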
Common Traps
Confusing Pairs
Scenario Tips
A question describes a regression model where two features have VIF > 10 and asks what action to take...
Multicollinearity detected. Options: remove one of the correlated variables, combine them via PCA, or apply Ridge regression (L2 regularization). The answer depends on the stated goal — if prediction only, Ridge works; if interpretability matters, remove a variable
Increasing training data or adding more features will not fix multicollinearity. Normalizing features also does not resolve it
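For the special case of exactly two predictors, VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation. The sketch below uses that shortcut with invented feature values; with more predictors each VIF comes from regressing one feature on all the others, which this example does not do.

```python
import math

def pearson_r(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_features(x, y):
    # Two-predictor shortcut: R^2 of one on the other equals r^2.
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical features: x2 is almost exactly 2 * x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]

vif_collinear = vif_two_features(x1, x2)
vif_unrelated = vif_two_features(x1, x3)
print("VIF(x1, x2):", vif_collinear)
print("VIF(x1, x3):", vif_unrelated)
```

The near-duplicate pair lands far above the VIF > 10 threshold, while the unrelated pair sits at the minimum value of 1.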
A model achieves 97% accuracy on a dataset where 96% of records are the negative class...
Accuracy is misleading — the model may be predicting all samples as negative. Evaluate with precision, recall, F1, or MCC. Investigate with a confusion matrix
Do not celebrate 97% accuracy on an imbalanced dataset. A naive model that predicts all negative would score 96% already
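The numbers in this scenario can be reproduced with one possible confusion matrix. The counts below are an assumption chosen to give 97% accuracy on 1,000 records with 96% negatives; other splits are possible.

```python
import math

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Assumed counts: 40 positives, 960 negatives.
model = metrics(tp=10, fp=0, fn=30, tn=960)
# Naive baseline that predicts negative for every record.
baseline = metrics(tp=0, fp=0, fn=40, tn=960)

print("model:   ", model)
print("baseline:", baseline)
```

The model beats the all-negative baseline by a single point of accuracy, yet it finds only a quarter of the positive cases. Recall, F1, and MCC expose that gap; accuracy hides it.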
When the question asks how to encode a nominal feature with 200 unique categories for a gradient boosting model...
Use target encoding or binary encoding for high-cardinality nominal features. One-hot encoding on 200 categories would create 200 sparse columns, increasing dimensionality dramatically
One-hot encoding is correct for low-cardinality nominal features but creates excessive dimensionality for high-cardinality ones
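A rough sketch of mean target encoding, which replaces each category with the average target value seen for it. The category names and targets are invented, and the helper names are illustrative. In practice the means should be fitted on training data only and usually smoothed, which this sketch leaves out.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    # Average target per category, computed from training rows.
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

def one_hot_width(categories):
    # One-hot encoding needs one column per distinct category.
    return len(set(categories))

# Hypothetical nominal feature and binary target.
merchant = ["A", "B", "A", "C", "B", "A"]
churned = [1, 0, 1, 0, 1, 0]

means = fit_target_means(merchant, churned)
encoded = [means[m] for m in merchant]   # a single numeric column

print("target-encoded column:", encoded)
print("one-hot columns needed:", one_hot_width(merchant))
```

Target encoding always produces one column regardless of cardinality, whereas the one-hot width grows with the number of categories: 3 here, 200 in the scenario.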
Last-Minute Facts
Machine Learning
Must-Know Facts
- Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex). Total error = Bias^2 + Variance + Irreducible noise
- LASSO (L1) can reduce coefficients to exactly zero and performs automatic feature selection. Ridge (L2) shrinks coefficients toward zero but NEVER to exactly zero
- Random Forest uses BAGGING (parallel training on random subsets, reduces VARIANCE). Gradient Boosting and XGBoost use BOOSTING (sequential training correcting errors, reduces BIAS)
- For multi-class classification output: use Softmax activation + categorical cross-entropy. For binary: use Sigmoid + binary cross-entropy. ReLU is for HIDDEN layers, not output layers
- Data leakage: fitting preprocessing on the full dataset before splitting leaks test information into training. Always fit on TRAINING data, then transform both sets
- k-Nearest Neighbors (kNN) is a SUPERVISED algorithm — it predicts using labeled neighbors. Despite appearing near clustering methods, it is not unsupervised
- Curse of dimensionality: as features increase relative to samples, data becomes sparse in high-dimensional space and model performance degrades. Motivates dimensionality reduction
- SMOTE generates synthetic minority class examples by interpolating between existing minority samples — it does not duplicate them
- Cross-validation: k-fold splits data into k folds, trains on k-1, tests on 1, rotates. Average score estimates generalization. Use stratified k-fold for imbalanced classification
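The data leakage bullet comes down to where the scaling statistics are computed. Below is a minimal sketch with a hand-rolled standard scaler; the values and function names are made up for illustration.

```python
import statistics

def fit_scaler(values):
    # Learn the mean and (population) standard deviation.
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [100.0]   # an extreme value the model has never seen

# Correct: fit on the training split only, then transform both splits.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Leaky: statistics computed on train + test absorb the test value.
leaky_mean, leaky_std = fit_scaler(train + test)

print("train-only mean/std:", mean, std)
print("leaky mean/std:     ", leaky_mean, leaky_std)
print("test value scaled with train statistics:", test_scaled)
```

The leaky statistics shift noticeably because the held-out value has influenced them, which is exactly the information a test set is supposed to withhold.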
Common Traps
Confusing Pairs
Scenario Tips
A model achieves 99% train accuracy and 65% test accuracy — which technique most directly addresses this...
Apply regularization (L1 or L2) or reduce model complexity. This is classic overfitting (high variance). Regularization penalizes complexity and closes the train/test gap
Adding more features or increasing complexity would worsen overfitting. Removing cross-validation removes a diagnostic tool without solving the problem
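To see why L1 zeros coefficients while L2 only shrinks them, it helps to look at the one-coefficient versions of the two penalized problems, which have closed-form solutions. This is a simplified sketch: it assumes a single coefficient with unpenalized estimate `b_ols`, and the objectives in the comments are the assumed formulations, not a full regression solver.

```python
def ridge_shrink(b_ols, lam):
    # Minimizer of 0.5 * (b - b_ols)**2 + 0.5 * lam * b**2.
    # Scales the coefficient but never makes a nonzero one zero.
    return b_ols / (1.0 + lam)

def lasso_shrink(b_ols, lam):
    # Minimizer of 0.5 * (b - b_ols)**2 + lam * abs(b).
    # Soft-thresholding: anything within lam of zero becomes exactly zero.
    magnitude = max(abs(b_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if b_ols > 0 else -magnitude

coefficients = [3.0, -0.4, 0.2]   # hypothetical unpenalized estimates
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in coefficients]
lasso = [lasso_shrink(b, lam) for b in coefficients]
print("ridge:", ridge)
print("lasso:", lasso)
```

Ridge keeps all three coefficients, just smaller. LASSO drops the two weak ones to exactly zero, which is the automatic feature selection the exam wording points at.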
A deep learning model for 10-class image classification — which output activation + loss function combination is correct...
Softmax activation + categorical cross-entropy loss. Softmax produces a probability distribution across 10 classes, categorical cross-entropy is the correct loss for multi-class classification
Sigmoid + binary cross-entropy is correct only for binary classification. ReLU is a hidden layer activation, not an output activation for classification
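The two output-layer pairings can be written out in a few lines. This sketch uses three classes instead of ten to keep the numbers readable, and the logits are invented.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; result sums to 1.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_index):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[true_index])

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Multi-class: one logit per class, softmax + categorical cross-entropy.
probs = softmax([2.0, 1.0, 0.1])
multi_loss = categorical_cross_entropy(probs, true_index=0)

# Binary: a single logit, sigmoid + binary cross-entropy.
p_positive = sigmoid(0.0)
binary_loss = binary_cross_entropy(p_positive, y=1)

print("softmax probabilities:", probs, "loss:", multi_loss)
print("sigmoid probability:", p_positive, "loss:", binary_loss)
```

Softmax yields one probability per class that together sum to 1, while sigmoid yields a single probability for the positive class. ReLU has no such probabilistic reading, which is why it stays in the hidden layers.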
A dataset has 300 features and 150 samples, causing poor model performance — what concept explains this and what is the fix...
Curse of dimensionality. Features far outnumber samples causing sparse high-dimensional space. Fix: dimensionality reduction (PCA, feature selection) or collect more data
This is not data leakage or concept drift. The mismatch between feature count and sample count is the defining characteristic of the curse of dimensionality
When the question describes an anomaly detection task with unlabeled data and asks for the right algorithm category...
Use unsupervised anomaly detection: isolation forest, autoencoder, or statistical methods (z-score, IQR). Unsupervised because no labeled 'anomaly' examples are available
Do not choose supervised classification — that requires labeled examples of anomalies, which are typically unavailable in anomaly detection scenarios
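Of the unsupervised options listed, the IQR rule is the simplest to sketch. The sensor readings are invented and `iqr_outliers` is an illustrative helper; the 1.5 multiplier is the conventional default.

```python
import statistics

def iqr_outliers(values, k=1.5):
    # Quartiles from the data itself; no labels are needed.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 95]   # hypothetical sensor data
flagged = iqr_outliers(readings)
print("flagged as anomalies:", flagged)
```

Nothing in the function ever sees an 'anomaly' label: the fences are derived purely from the distribution of the inputs.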
Last-Minute Facts
Operations and Processes
Must-Know Facts
- CRISP-DM has six phases in order: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Know what activities happen at each phase
- Version control for data science covers FOUR dimensions: code (git), data (DVC, Delta Lake), hyperparameters (MLflow experiment tracking), and model artifacts. Versioning code alone is insufficient
- Data drift = input feature distribution changes over time. Concept drift = the RELATIONSHIP between inputs and target changes over time. Both degrade model performance but require different responses
- Deployment environments: edge = on-device, no internet required, low latency, limited compute. Cloud = scalable, internet required. Hybrid = combines both. On-premises = maximum data control
- Streaming processes data in real-time as it arrives (low latency, high complexity). Batching collects data and processes in bulk at intervals (higher latency, simpler)
- Parquet is a COLUMNAR storage format optimized for analytical queries. CSV is row-based. Parquet dramatically outperforms CSV for large aggregation queries
- Winsorization CAPS outliers at a percentile threshold (e.g., values above 99th percentile are set to the 99th percentile value). It does not remove them — different from trimming
- PII must be identified and protected before model training. Anonymization and pseudonymization are the primary techniques. This is a compliance requirement, not optional
- CI/CD for ML: code commit triggers automated testing, model training, validation against threshold, and deployment with rollback capability if performance drops
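The difference between winsorizing and trimming is easy to show on a short list. In this sketch the lower and upper caps are passed in directly and stand in for percentile values computed elsewhere; the data is invented.

```python
def winsorize(values, lower, upper):
    # Cap values at the thresholds; every row is kept.
    return [min(max(v, lower), upper) for v in values]

def trim(values, lower, upper):
    # Drop values outside the thresholds; rows are lost.
    return [v for v in values if lower <= v <= upper]

values = [1, 5, 6, 7, 8, 9, 100]
lower_cap, upper_cap = 5, 9   # assumed percentile values

winsorized = winsorize(values, lower_cap, upper_cap)
trimmed = trim(values, lower_cap, upper_cap)
print("winsorized:", winsorized)
print("trimmed:   ", trimmed)
```

Winsorizing preserves the row count and pulls the extremes in to the caps, while trimming shortens the dataset.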
Common Traps
Confusing Pairs
Scenario Tips
An IoT device needs real-time predictions on sensor data with no internet connectivity...
Edge deployment. The model runs directly on the device. No connectivity required, low latency guaranteed. Model must be compressed/quantized for limited compute
Cloud deployment requires connectivity and introduces latency. Hybrid with cloud as primary still depends on connectivity for main inference. Edge is the only answer that works offline
A question asks how to merge customer records from two sources where names are spelled differently across systems...
Fuzzy join using string similarity matching (edit distance, Jaccard). It matches records that are similar but not identical, handling typos and name variations
An inner join on exact name match would miss valid records. Cross join with filtering creates a cartesian product (too expensive). Left join on exact match also misses fuzzy matches
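A fuzzy join can be sketched with a plain Levenshtein edit distance. The customer names, the `max_distance` cutoff of 2, and the function names are all assumptions for illustration; production matching usually adds blocking and normalization steps that are omitted here.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def fuzzy_join(left, right, max_distance=2):
    # Pair each left name with its closest right name, if close enough.
    matches = []
    for name in left:
        best = min(right, key=lambda r: edit_distance(name.lower(), r.lower()))
        if edit_distance(name.lower(), best.lower()) <= max_distance:
            matches.append((name, best))
    return matches

crm = ["Jon Smith", "Katherine Lee", "Omar Haddad"]
billing = ["John Smith", "Catherine Lee", "Maria Gomez"]

matched = fuzzy_join(crm, billing)
print(matched)
```

An exact-match inner join on these two lists would return nothing, while the fuzzy join links the two spelling variants and leaves the genuinely unmatched customer out.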
A data science team detects that model predictions have drifted in production — how to determine if it is data drift or concept drift...
Monitor input feature distributions (PSI, KL divergence) to detect data drift. Monitor model performance on fresh labeled data to detect concept drift. If input distributions shifted but labels are unavailable, assume data drift. If you have new ground truth and performance degraded, it could be concept drift
You cannot diagnose drift without monitoring. Retraining immediately without diagnosis may not solve the root cause
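PSI, one of the data drift monitors named above, compares binned proportions of a feature between training and production. The sketch assumes the binning has already been done and uses invented proportions; the small `eps` guard against empty bins is an implementation choice, not part of the definition.

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    # Population Stability Index: sum of (a - e) * ln(a / e) over bins.
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)   # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

training = [0.25, 0.25, 0.25, 0.25]     # hypothetical bin proportions
production = [0.10, 0.20, 0.30, 0.40]   # mass has moved to higher bins

psi_same = psi(training, training)
psi_shifted = psi(training, production)
print("PSI, identical distributions:", psi_same)
print("PSI, shifted distribution:   ", psi_shifted)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, though teams set their own thresholds. PSI only looks at inputs, so it can flag data drift but says nothing about whether the input-to-target relationship has changed.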
When the question describes a CRISP-DM scenario and asks which phase involves evaluating whether the model meets business objectives...
Evaluation phase (Phase 5). This is where you assess the model against business success criteria BEFORE deciding to deploy. It is separate from model validation metrics done in the Modeling phase
Deployment phase (Phase 6) comes AFTER evaluation — you do not evaluate against business objectives during deployment. Model validation (train/test metrics) happens in Modeling phase, not the Evaluation phase
Last-Minute Facts
Specialized Applications of Data Science
Must-Know Facts
- NLP pipeline order: tokenization → stop word removal → stemming or lemmatization → bag of words or TF-IDF → embeddings. Each step must precede the next
- LDA in NLP context = Latent Dirichlet Allocation = unsupervised TOPIC MODEL for discovering themes in document collections. LDA in ML context = Linear Discriminant Analysis = supervised classification. Same abbreviation, completely different algorithms
- TF-IDF is NOT bag of words. Bag of words counts raw word frequencies. TF-IDF weights by term frequency multiplied by inverse document frequency — downweighting common words
- Word2Vec and GloVe produce DENSE vector embeddings that capture semantic relationships. Bag of words and TF-IDF produce SPARSE vectors. Dense embeddings are more powerful for similarity tasks
- Reinforcement learning is NEITHER supervised NOR unsupervised — it is a third learning paradigm where an agent learns through reward and penalty signals from an environment
- Data augmentation in computer vision (rotation, flipping, scaling, noise) creates transformed copies of existing images to improve model robustness — it does NOT create new independent data or improve resolution
- Multi-armed bandit: balances exploration (trying unknown options) vs exploitation (using the known best option). Used in adaptive A/B testing and recommendation optimization
- Simplex method is the algorithm for solving linear programming (constrained optimization) problems — it traverses constraint boundary vertices to find the optimal solution
Common Traps
Confusing Pairs
Scenario Tips
A question asks to discover hidden themes in 50,000 customer reviews without predefined categories...
Use LDA (Latent Dirichlet Allocation) — unsupervised topic modeling. It identifies latent topics without predefined labels
Named entity recognition extracts specific entities (names, places) not themes. Text classification requires predefined categories. Sentiment analysis measures tone, not topics
A computer vision team has only 500 training images and asks the best way to improve model performance without collecting new data...
Apply data augmentation: rotation, flipping, scaling, cropping, color jitter, noise injection. This artificially increases training set diversity and reduces overfitting
Increasing learning rate risks divergence. Switching to logistic regression loses spatial feature extraction. Removing pooling layers increases computation without solving the data limitation
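Two of the listed augmentations, flipping and rotation, can be shown on a tiny image stored as a nested list. This is a toy sketch; real pipelines apply such transforms randomly at training time using an image library.

```python
def flip_horizontal(image):
    # Mirror each row left to right.
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    # Reverse the row order, then transpose.
    return [list(row) for row in zip(*image[::-1])]

image = [[1, 2],
         [3, 4]]   # hypothetical 2x2 grayscale image

flipped = flip_horizontal(image)
rotated = rotate_90_clockwise(image)
augmented_set = [image, flipped, rotated]

print("flipped:", flipped)
print("rotated:", rotated)
print("training examples from one image:", len(augmented_set))
```

Every augmented example contains the same pixels as the original in a new arrangement, so the model sees more variety without any new information being collected.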
A question asks which technique balances showing known high-performing product recommendations while also exploring potentially better new options...
Multi-armed bandit. It explicitly models the exploration vs exploitation tradeoff, routing more traffic to known performers while testing alternatives
Linear programming is for constrained optimization with a known objective function. Greedy algorithms always exploit the current best without any exploration. Gradient descent optimizes parameters, not allocation decisions
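Epsilon-greedy is one simple strategy for the multi-armed bandit problem and makes the exploration-versus-exploitation split explicit. The click rates, epsilon, round count, and seed below are arbitrary assumptions; other strategies such as UCB or Thompson sampling are not shown.

```python
import random

def epsilon_greedy(true_rates, epsilon=0.1, rounds=5000, seed=0):
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms
    rewards = [0.0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: try a random option.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the option with the best observed rate.
            arm = max(range(n_arms), key=lambda i: rewards[i] / pulls[i])
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        rewards[arm] += reward
    return pulls

# Hypothetical click-through rates for three recommendations.
pulls = epsilon_greedy([0.1, 0.2, 0.6])
print("times each option was shown:", pulls)
```

Most traffic ends up on the strongest option, but the weaker ones keep receiving a trickle of exposure so that a change in their performance would still be noticed.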
An NLP question describes processing steps in the wrong order — which ordering is correct...
Correct pipeline order: tokenization first, then stop word removal, then stemming or lemmatization, then vectorization (bag of words or TF-IDF), then optionally embedding. Never stem before tokenizing or vectorize before normalizing
Vectorizing before removing stop words keeps common words like 'the' in the vocabulary: they dominate raw bag-of-words counts and add dimensions that carry no signal. Always normalize text first, then vectorize
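The pipeline order can be sketched end to end with the standard library. Everything here is deliberately simplistic and assumed for illustration: the stop word list is tiny, `crude_stem` is a toy suffix stripper rather than a real stemmer, and the IDF uses the unsmoothed form ln(N / df), which differs from what some libraries compute.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "and", "was"}   # tiny illustrative list

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    # Toy stemmer: strips a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Step 1 tokenize, step 2 remove stop words, step 3 stem.
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

def tf_idf(docs):
    # Step 4 vectorize: term frequency times inverse document frequency.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

reviews = [
    "The delivery was late and the box was damaged",
    "The delivery was fast",
    "Refund requested: the box arrived damaged",
]
processed = [preprocess(r) for r in reviews]
bag_of_words = [Counter(doc) for doc in processed]   # raw counts
weights = tf_idf(processed)                          # weighted counts

print(processed)
print(weights[0])
```

In the first review, 'late' outweighs 'delivery' under TF-IDF because it appears in only one document, whereas the bag-of-words counts treat the two terms identically.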