
DY0-001 Exam Notes

Last-minute traps, must-know facts, and scenario tips for the CompTIA DataAI (formerly DataX) exam.

General Exam Tips

1. Read ALL answer choices before committing. Two options often look correct but differ on one critical word (e.g., 'shrinks toward zero' vs 'zeros out coefficients')
2. PBQs appear at the START of the exam. Simulation PBQs can be flagged and revisited after you finish the multiple-choice questions; virtual PBQs must be completed on first encounter
3. Pass/fail scoring means there is no partial-credit safety net. You must demonstrate competence across ALL five domains; neglecting any one domain risks a fail even if you ace the rest
4. The exam is 165 minutes for up to 90 questions. Budget roughly 1 minute per multiple-choice question and 4-6 minutes per PBQ. Never spend more than 8 minutes on any single question
5. Scenario questions always embed a constraint that narrows the correct answer. Identify the constraint FIRST (e.g., 'low latency', 'limited compute', 'imbalanced classes') before evaluating options
6. When a question mentions a business context (e.g., 'missing fraud cases is costly'), translate it to a metric: costly false negatives means maximize recall; costly false positives means maximize precision
7. The Operations and Processes domain (22%) is the most commonly neglected by working data scientists. Candidates who only study ML theory frequently fail here on MLOps and governance questions
8. Ethics and AI governance appear in the exam content despite not having a dedicated domain, so do not skip these topics entirely
9. The exam was rebranded from CompTIA DataX to CompTIA DataAI in January 2026. Content and objectives are IDENTICAL, and the exam code is still DY0-001


Domain 1: 17% of exam

Mathematics and Statistics

Must-Know Facts

The distinction between Type I error (rejecting a TRUE null — false positive) and Type II error (failing to reject a FALSE null — false negative). Type I = alpha, Type II = beta
Which statistical test to use by scenario: t-test for two group means, ANOVA for three or more group means, chi-squared for two categorical variables, Pearson for linear continuous relationships, Spearman for monotonic/ordinal/non-linear relationships
Confusion matrix mechanics: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of both, MCC = most reliable metric under severe class imbalance
ROC/AUC interpretation: AUC = 0.5 means the model is no better than random guessing, AUC = 1.0 is a perfect classifier. Higher AUC is always better
AIC vs BIC: both penalize model complexity — LOWER is better for both. BIC applies a heavier penalty and is preferred when the sample size is large
ARIMA requires stationary data. Non-stationarity (trend, seasonality) must be removed through differencing BEFORE fitting ARIMA
Pearson measures LINEAR correlation only. Spearman measures MONOTONIC correlation and is robust to outliers and non-linear relationships
Distance metrics: Euclidean measures straight-line distance (sensitive to magnitude), Manhattan measures grid distance, cosine measures angle between vectors (magnitude-invariant, preferred for text)
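
The confusion-matrix formulas above can be checked by hand with a short sketch. The counts below are hypothetical (1,000 records, 30 actual positives), and the function name is illustrative rather than taken from any library.

```python
import math

def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # MCC uses all four cells, which is why it stays informative under imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

# Hypothetical imbalanced dataset: 30 positives, 970 negatives.
metrics = classification_metrics(tp=10, fp=5, fn=20, tn=965)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is 0.975 while recall is only one third, which is the gap the exam scenarios are built around.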

Common Traps

Trap: Confusing Type I and Type II errors. Many candidates memorize which is the 'false positive' but reverse the two under exam pressure

Reality: Type I = false POSITIVE (rejecting a true null). Type II = false NEGATIVE (keeping a false null). A useful anchor: Type I is the 'optimistic' error, where you thought something was there when it was not

Trap: Assuming accuracy is always the right metric to report for a classifier

Reality: Accuracy is misleading with imbalanced classes. A model that predicts every sample as negative achieves 97% accuracy on a dataset that is 97% negative while detecting zero positive cases. Use F1, MCC, or precision/recall instead

Trap: Thinking AIC and BIC values should be maximized

Reality: Lower AIC/BIC values indicate a better model. Both penalize model complexity, so you want the model with the best fit at the lowest complexity penalty

Trap: Using Pearson correlation when data is skewed or contains outliers

Reality: Pearson measures only LINEAR relationships and is sensitive to outliers. Spearman rank correlation handles non-linear monotonic relationships and is robust to extreme values

Trap: Applying ARIMA to a time series without checking for stationarity

Reality: ARIMA assumes stationary data. You must check stationarity (e.g., with the Augmented Dickey-Fuller test) and apply differencing to remove trend or seasonality before fitting ARIMA
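
The Pearson versus Spearman trap above is easy to see on a small monotonic but non-linear series. This is a minimal sketch using only the standard library; the helper names are illustrative, and the ranking function assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the LINEAR relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n; this simple version assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]        # monotonic, but clearly not linear

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Spearman reports a perfect monotonic relationship, while Pearson is pulled below 1 because the points do not lie on a straight line.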

Confusing Pairs

Precision vs Recall

Precision = quality of positive predictions (minimize false positives — use for spam filtering). Recall = coverage of actual positives (minimize false negatives — use for fraud and disease detection). The business context in the question tells you which one matters

Pearson Correlation vs Spearman Correlation

Pearson = linear relationship between two continuous variables. Spearman = monotonic relationship using ranks, handles non-linear data and outliers. If the question mentions non-linearity, ordinal data, or outliers, Spearman is correct

t-test vs ANOVA

t-test = compare means between exactly TWO groups. ANOVA = compare means across THREE or more groups. Using multiple t-tests to compare three groups inflates Type I error — ANOVA is the correct choice

AIC vs BIC

Both are model selection criteria where LOWER is better. BIC penalizes complexity more heavily and is preferred for large datasets. AIC is preferred for small samples where predictive accuracy matters more
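
A small worked example shows why BIC is described as the heavier penalty. It uses the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L); the log-likelihoods, parameter counts, and sample size below are made up for illustration.

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

n = 1000  # hypothetical sample size

# Hypothetical fits: the complex model fits slightly better but uses more parameters.
aic_simple, bic_simple = aic(-520.0, k=3), bic(-520.0, k=3, n=n)
aic_complex, bic_complex = aic(-512.0, k=10), bic(-512.0, k=10, n=n)

print(f"simple : AIC={aic_simple:.1f}  BIC={bic_simple:.1f}")
print(f"complex: AIC={aic_complex:.1f}  BIC={bic_complex:.1f}")
```

With these numbers AIC slightly favors the complex model while BIC favors the simple one, because BIC charges ln(n) per parameter instead of 2.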

Scenario Tips

If the question asks about:

When the question describes a binary medical test or fraud model and asks which metric to prioritize given 'missing actual cases is very costly'...

Answer:

Choose Recall (sensitivity). Missing actual cases = false negatives. Recall = TP/(TP+FN) minimizes false negatives

Distractor to avoid:

Precision is tempting but it minimizes false POSITIVES, not false negatives. Accuracy is wrong because of class imbalance

If the question asks about:

When a question asks you to compare three or more models, training methods, or groups for statistical significance...

Answer:

Choose ANOVA. It compares means across three or more groups in a single test and avoids inflating Type I error

Distractor to avoid:

Multiple t-tests seem reasonable but they inflate the chance of a false positive. The correct answer is always ANOVA for 3+ groups

If the question asks about:

When a question presents time series data with seasonal patterns and asks which model is appropriate...

Answer:

Use SARIMA (seasonal ARIMA), because the series is seasonal. Apply regular differencing to remove trend and seasonal differencing to remove the repeating pattern, then fit the model on the stationary series

Distractor to avoid:

Applying ARIMA directly without checking for stationarity is the trap. Non-stationary data produces unreliable ARIMA results
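
Differencing itself is simple arithmetic, as this sketch shows on a toy series with a linear trend and a period-4 seasonal pattern. The series and the helper name are hypothetical; on real data you would follow this with a stationarity check such as the Augmented Dickey-Fuller test.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Toy series: linear trend (2 per step) plus a repeating period-4 pattern.
seasonal_pattern = [5, -2, 3, -6]
series = [2 * t + seasonal_pattern[t % 4] for t in range(16)]

first_diff = difference(series, lag=1)     # removes the trend, keeps the seasonality
seasonal_diff = difference(series, lag=4)  # removes the period-4 pattern

print("first difference:   ", first_diff)
print("seasonal difference:", seasonal_diff)
```

The lag-1 difference still oscillates with the season, while the lag-4 difference is constant, which is the intuition behind the seasonal differencing step in SARIMA.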

Last-Minute Facts

1. Type I error = alpha = significance level (e.g., 0.05). Type II error = beta. Power = 1 - beta

2. AUC = 0.5 is random, AUC = 1.0 is perfect. Higher is always better

3. F1 = 2 * (Precision * Recall) / (Precision + Recall). It is the HARMONIC mean, not the arithmetic mean

4. VIF > 10 indicates severe multicollinearity between predictors

5. AIC and BIC: LOWER is better. BIC penalizes complexity more than AIC

Domain 2: 24% of exam

Modeling, Analysis, and Outcomes

Must-Know Facts

Exploratory Data Analysis requires both univariate analysis (single variable distributions) AND multivariate analysis (relationships between variables). Do not skip one type
Visualization purpose: bar chart = categorical comparison, scatter plot = two continuous variables, box plot = distribution summary where the CENTER LINE is the MEDIAN not the mean, heat map = correlation matrix, violin plot = distribution shape
Label encoding implies ordinal order (Red=1, Blue=2, Green=3 implies Green > Blue). Use one-hot encoding for nominal categories with no inherent order
Normalization (scales to 0-1) vs standardization (mean=0, std=1): distance-based algorithms (kNN, k-means, SVM) and neural networks need normalization or standardization. Linear/logistic regression and PCA typically use standardization
VIF > 10 signals severe multicollinearity: the fix is to remove one correlated variable, combine via PCA, or apply regularization
Multicollinearity does NOT reduce prediction accuracy — it makes individual coefficient interpretation unreliable. The model can still predict well even with multicollinear features
Box-Cox transformation is a power transformation that normalizes skewed continuous data. Lambda = 0 is equivalent to a log transform
When communicating results, always benchmark against a baseline (e.g., a naive classifier, previous model, or random guess)

Common Traps

Trap: Using label encoding for a nominal categorical variable (city, color, animal species)

Reality: Label encoding assigns integers (Dog=0, Cat=1, Bird=2), which implies Cat is between Dog and Bird in some ordering. For nominal features with no inherent order, use one-hot encoding. Only use label encoding for ordinal features (Low=0, Medium=1, High=2)

Trap: Thinking the center line of a box plot represents the mean

Reality: The center line of a box plot is always the MEDIAN (50th percentile). The box spans from Q1 to Q3. The mean may appear as a dot in some implementations but is NOT the default center line

Trap: Concluding that multicollinearity is making predictions worse

Reality: Multicollinearity damages the INTERPRETABILITY of individual coefficients: it inflates standard errors and makes feature importance unreliable. Prediction accuracy may still be fine. The exam tests this distinction

Trap: Treating a heat map's high correlation as causal evidence

Reality: Correlation is not causation. A heat map shows statistical association. Establishing causation requires controlled experiments (A/B tests, RCTs) or causal inference methods (DAGs, difference-in-differences)

Trap: Fitting preprocessing (scalers, encoders, imputers) on the combined train and test data

Reality: Fitting scalers, encoders, or imputers on the full dataset before the train/test split causes data leakage. Always fit transformations on the TRAINING set only, then apply (transform) them to both train and test
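
The split-then-fit rule can be sketched with a tiny hand-rolled z-score scaler. The class below is illustrative only (it is not a library API), and the data values are made up.

```python
import statistics

class StandardScaler1D:
    """Tiny z-score scaler for a single numeric feature (illustrative only)."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

data = [10.0, 12.0, 11.0, 13.0, 9.0, 50.0, 55.0, 60.0]
train, test = data[:5], data[5:]          # 1. split FIRST

scaler = StandardScaler1D().fit(train)    # 2. learn mean/std from training data only
train_scaled = scaler.transform(train)    # 3. apply the same statistics to both sets
test_scaled = scaler.transform(test)

# The leaky alternative: statistics computed on train + test together.
leaky_scaler = StandardScaler1D().fit(data)

print(f"train-only mean: {scaler.mean_:.2f}")
print(f"leaky mean:      {leaky_scaler.mean_:.2f}")
```

The leaky scaler's mean is shifted by test values the model should never have seen, which is exactly the information that leaks into training.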

Confusing Pairs

Normalization (min-max scaling) vs Standardization (z-score)

Normalization scales to [0, 1]. Standardization transforms to mean=0, std=1. Distance-based algorithms (kNN, k-means, SVM) and neural networks often need normalization. PCA and linear regression typically use standardization. The exam may test which is appropriate for a specific algorithm

One-Hot Encoding vs Label Encoding

One-hot = k binary columns for k categories, no ordering implied — use for NOMINAL data (colors, cities). Label encoding = integers assigned to categories, ordering implied — use ONLY for ORDINAL data (Low/Medium/High). Using label encoding on nominal data is an exam trap

Overfitting vs Underfitting

Overfitting = high training accuracy, low test accuracy = model too complex = HIGH VARIANCE. Underfitting = low accuracy on BOTH train and test = model too simple = HIGH BIAS. The remedy is opposite: regularize for overfitting, increase complexity or add features for underfitting
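
The one-hot versus label encoding pair above can be made concrete in a few lines. This is a minimal sketch with hypothetical helper names and toy values, not a substitute for a library encoder.

```python
def label_encode(values, order):
    """Map categories to integers using an explicit, meaningful order."""
    mapping = {category: i for i, category in enumerate(order)}
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """One binary column per category; no ordering is implied."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# ORDINAL feature: the integer order carries real meaning.
sizes = ["Low", "High", "Medium", "Low"]
size_codes = label_encode(sizes, order=["Low", "Medium", "High"])

# NOMINAL feature: no order exists, so use one column per category.
colors = ["Red", "Blue", "Green", "Blue"]
color_columns, color_rows = one_hot_encode(colors)

print(size_codes)
print(color_columns)
for row in color_rows:
    print(row)
```

Note that the one-hot output grows by one column per distinct category, which is why the high-cardinality scenario below points to target or binary encoding instead.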

Scenario Tips

If the question asks about:

A question describes a regression model where two features have VIF > 10 and asks what action to take...

Answer:

Multicollinearity detected. Options: remove one of the correlated variables, combine them via PCA, or apply Ridge regression (L2 regularization). The answer depends on the stated goal — if prediction only, Ridge works; if interpretability matters, remove a variable

Distractor to avoid:

Increasing training data or adding more features will not fix multicollinearity. Normalizing features also does not resolve it

If the question asks about:

A model achieves 97% accuracy on a dataset where 96% of records are the negative class...

Answer:

Accuracy is misleading — the model may be predicting all samples as negative. Evaluate with precision, recall, F1, or MCC. Investigate with a confusion matrix

Distractor to avoid:

Do not celebrate 97% accuracy on an imbalanced dataset. A naive model that predicts all negative would score 96% already

If the question asks about:

When the question asks how to encode a nominal feature with 200 unique categories for a gradient boosting model...

Answer:

Use target encoding or binary encoding for high-cardinality nominal features. One-hot encoding on 200 categories would create 200 sparse columns, increasing dimensionality dramatically

Distractor to avoid:

One-hot encoding is correct for low-cardinality nominal features but creates excessive dimensionality for high-cardinality ones

Last-Minute Facts

1. Box plot center line = MEDIAN, not mean

2. VIF > 10 = severe multicollinearity. Fix: remove variable, PCA, or regularization

3. Multicollinearity harms interpretability, NOT necessarily prediction accuracy

4. One-hot encoding for NOMINAL data. Label encoding only for ORDINAL data

5. Data leakage: always fit preprocessing (scalers, encoders) on TRAINING data only

Domain 3: 24% of exam

Machine Learning

Must-Know Facts

Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex). Total error = Bias^2 + Variance + Irreducible noise
LASSO (L1) can reduce coefficients to exactly zero and performs automatic feature selection. Ridge (L2) shrinks coefficients toward zero but NEVER to exactly zero
Random Forest uses BAGGING (parallel training on random subsets, reduces VARIANCE). Gradient Boosting and XGBoost use BOOSTING (sequential training correcting errors, reduces BIAS)
For multi-class classification output: use Softmax activation + categorical cross-entropy. For binary: use Sigmoid + binary cross-entropy. ReLU is for HIDDEN layers, not output layers
Data leakage: fitting preprocessing on the full dataset before splitting leaks test information into training. Always fit on TRAINING data, then transform both sets
k-Nearest Neighbors (kNN) is a SUPERVISED algorithm — it predicts using labeled neighbors. Despite appearing near clustering methods, it is not unsupervised
Curse of dimensionality: as features increase relative to samples, data becomes sparse in high-dimensional space and model performance degrades. Motivates dimensionality reduction
SMOTE generates synthetic minority class examples by interpolating between existing minority samples — it does not duplicate them
Cross-validation: k-fold splits data into k folds, trains on k-1, tests on 1, rotates. Average score estimates generalization. Use stratified k-fold for imbalanced classification
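
The output-activation fact above (Sigmoid for binary, Softmax for multi-class) comes down to what each function returns. A minimal sketch with made-up logits:

```python
import math

def sigmoid(z: float) -> float:
    """Squash one logit into a single probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Turn a vector of logits into a probability distribution."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_binary = sigmoid(0.0)                       # binary output: one probability
probs = softmax([2.0, 1.0, 0.1])              # 3-class output: one probability per class

print(f"sigmoid(0.0) = {p_binary}")
print("softmax     =", [round(p, 3) for p in probs])
print("sum         =", round(sum(probs), 6))
```

Softmax outputs compete with one another and sum to 1, which is what a single-label multi-class prediction needs; a lone sigmoid gives only one number.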

Common Traps

Trap: Confusing which ensemble method is parallel vs sequential

Reality: Random Forest = PARALLEL (bagging). Gradient Boosting / XGBoost = SEQUENTIAL (boosting). Bagging reduces variance. Boosting reduces bias. This distinction is a top exam question pattern

Trap: Thinking Ridge (L2) can perform feature selection

Reality: Ridge NEVER zeros out a coefficient; it only shrinks them. The L1 penalty is what performs automatic feature selection by zeroing out coefficients: LASSO uses it alone, and Elastic Net inherits the behavior through its L1 component. This is tested explicitly

Trap: Using Sigmoid in a multi-class output layer

Reality: Sigmoid outputs a single probability for binary classification. For multi-class problems (3+ mutually exclusive classes), Softmax is required because it outputs a probability distribution across all classes that sums to 1

Trap: Thinking kNN is an unsupervised method because it involves distances

Reality: kNN is supervised. It classifies a new point based on the MAJORITY LABEL of its k nearest labeled training points. The label requirement makes it supervised

Trap: Scaling or encoding the entire dataset before train/test split

Reality: Fitting a scaler on the full dataset before splitting leaks test set statistics into training. Always split first, then fit the scaler on training data only, then transform both train and test

Trap: Assuming more training data always fixes overfitting

Reality: More data helps reduce overfitting (variance) but only up to a point. Regularization (L1, L2, dropout) and simpler model architectures are the primary remedies, with cross-validation used to detect the problem and tune the fix
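
Why L1 zeros coefficients while L2 only shrinks them can be seen in the textbook one-coefficient case. The sketch below assumes an orthonormal design, where each penalized coefficient has a closed form in terms of the unpenalized estimate b: minimizing 0.5*(b - w)^2 + lam*|w| gives soft-thresholding, and minimizing 0.5*(b - w)^2 + 0.5*lam*w^2 gives proportional shrinkage. The coefficient values and lam are made up.

```python
def lasso_shrink(b: float, lam: float) -> float:
    """L1 solution (soft-thresholding): small coefficients become exactly 0."""
    if abs(b) <= lam:
        return 0.0
    return b - lam if b > 0 else b + lam

def ridge_shrink(b: float, lam: float) -> float:
    """L2 solution: every coefficient is scaled down, none reaches 0."""
    return b / (1.0 + lam)

ols_coefficients = [3.0, -0.4, 0.2, -2.5]   # hypothetical unpenalized estimates
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefficients]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefficients]

print("LASSO:", lasso_coefs)
print("Ridge:", [round(c, 3) for c in ridge_coefs])
```

The two weak coefficients are dropped entirely under the L1 rule, while the L2 rule keeps all four features with smaller magnitudes.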

Confusing Pairs

Random Forest (Bagging) vs Gradient Boosting (Boosting)

Random Forest: trains trees IN PARALLEL on random data subsets, averages predictions, reduces VARIANCE. Gradient Boosting: trains trees SEQUENTIALLY where each corrects previous errors, reduces BIAS. If the question describes a high-variance problem, think Random Forest. If high-bias underfitting, think Boosting

LASSO (L1) vs Ridge (L2)

LASSO: adds penalty on absolute coefficient values, CAN zero out coefficients, performs feature selection. Ridge: adds penalty on squared coefficient values, CANNOT zero out coefficients, all features remain. Elastic Net combines both. Exam question: which method removes irrelevant features? LASSO

Softmax vs Sigmoid

Sigmoid: binary classification output (single probability 0-1). Softmax: multi-class classification output (probability distribution summing to 1 across all classes). The exam will specify the number of target classes — if more than 2, always Softmax

k-Means vs DBSCAN

k-Means: requires pre-specifying k, assumes spherical clusters of similar size, no noise handling. DBSCAN: discovers cluster count automatically, handles arbitrary shapes, identifies outliers as noise. If question says 'unknown number of clusters' or 'irregular shapes', DBSCAN is correct

Grid Search vs Random Search

Grid search exhaustively tests ALL hyperparameter combinations — thorough but computationally expensive. Random search samples combinations randomly — faster and often achieves comparable results, especially in high-dimensional hyperparameter spaces
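
The cost difference between the two search strategies is mostly a counting argument. The sketch below uses a hypothetical hyperparameter grid and, for simplicity, draws the random candidates from the same grid; real random search often samples from continuous distributions instead.

```python
import itertools
import random

# Hypothetical hyperparameter grid for a boosted-tree model.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 500],
}

# Grid search: every combination is evaluated.
all_combos = [dict(zip(grid, values))
              for values in itertools.product(*grid.values())]

# Random search: a fixed budget of randomly chosen combinations.
rng = random.Random(42)
random_combos = rng.sample(all_combos, k=10)

print(f"grid search evaluates   {len(all_combos)} combinations")
print(f"random search evaluates {len(random_combos)} combinations")
```

Adding one more hyperparameter multiplies the grid size, while the random-search budget stays whatever you set it to.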

Scenario Tips

If the question asks about:

A model achieves 99% train accuracy and 65% test accuracy — which technique most directly addresses this...

Answer:

Apply regularization (L1 or L2) or reduce model complexity. This is classic overfitting (high variance). Regularization penalizes complexity and closes the train/test gap

Distractor to avoid:

Adding more features or increasing complexity would worsen overfitting. Removing cross-validation removes a diagnostic tool without solving the problem

If the question asks about:

A deep learning model for 10-class image classification — which output activation + loss function combination is correct...

Answer:

Softmax activation + categorical cross-entropy loss. Softmax produces a probability distribution across 10 classes, categorical cross-entropy is the correct loss for multi-class classification

Distractor to avoid:

Sigmoid + binary cross-entropy is correct only for binary classification. ReLU is a hidden layer activation, not an output activation for classification

If the question asks about:

A dataset has 300 features and 150 samples, causing poor model performance — what concept explains this and what is the fix...

Answer:

Curse of dimensionality. Features far outnumber samples causing sparse high-dimensional space. Fix: dimensionality reduction (PCA, feature selection) or collect more data

Distractor to avoid:

This is not data leakage or concept drift. The mismatch between feature count and sample count is the defining characteristic of the curse of dimensionality

If the question asks about:

When the question describes an anomaly detection task with unlabeled data and asks for the right algorithm category...

Answer:

Use unsupervised anomaly detection: isolation forest, autoencoder, or statistical methods (z-score, IQR). Unsupervised because no labeled 'anomaly' examples are available

Distractor to avoid:

Do not choose supervised classification — that requires labeled examples of anomalies, which are typically unavailable in anomaly detection scenarios

Last-Minute Facts

1. LASSO (L1) = feature selection via zeroing. Ridge (L2) = shrinkage only, never zeros

2. Random Forest = bagging = parallel = reduces VARIANCE. XGBoost = boosting = sequential = reduces BIAS

3. Softmax = multi-class output (3+ classes). Sigmoid = binary output

4. kNN is SUPERVISED; do not confuse it with unsupervised clustering

5. Adam = default deep learning optimizer: combines momentum + RMSprop, adaptive learning rate

6. SMOTE = creates SYNTHETIC minority samples by interpolation, not duplication

7. Dropout = randomly disables neurons during training, reduces overfitting in neural networks

Domain 4: 22% of exam

Operations and Processes

Must-Know Facts

CRISP-DM has six phases in order: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Know what activities happen at each phase
Version control for data science covers FOUR dimensions: code (git), data (DVC, Delta Lake), hyperparameters (MLflow experiment tracking), and model artifacts. Versioning code alone is insufficient
Data drift = input feature distribution changes over time. Concept drift = the RELATIONSHIP between inputs and target changes over time. Both degrade model performance but require different responses
Deployment environments: edge = on-device, no internet required, low latency, limited compute. Cloud = scalable, internet required. Hybrid = combines both. On-premises = maximum data control
Streaming processes data in real-time as it arrives (low latency, high complexity). Batching collects data and processes in bulk at intervals (higher latency, simpler)
Parquet is a COLUMNAR storage format optimized for analytical queries. CSV is row-based. Parquet dramatically outperforms CSV for large aggregation queries
Winsorization CAPS outliers at a percentile threshold (e.g., values above 99th percentile are set to the 99th percentile value). It does not remove them — different from trimming
PII must be identified and protected before model training. Anonymization and pseudonymization are the primary techniques. This is a compliance requirement, not optional
CI/CD for ML: code commit triggers automated testing, model training, validation against threshold, and deployment with rollback capability if performance drops

Common Traps

Trap: Confusing CRISP-DM with the DAMA framework

Reality: CRISP-DM is a DATA MINING process model covering the lifecycle from business understanding to deployment. DAMA is a data GOVERNANCE body of knowledge focused on data stewardship and management. They are completely different frameworks with different scope and purpose

Trap: Thinking code version control (git) alone is sufficient for data science reproducibility

Reality: Data science reproducibility requires versioning across four dimensions: code, data, hyperparameters, and model artifacts. Changing training data without data versioning makes experiments non-reproducible even with perfect code versioning

Trap: Confusing data drift and concept drift

Reality: Data drift: the INPUT distribution changes (new customer demographics shift feature values) while the feature-to-target relationship stays the same. Concept drift: the RELATIONSHIP between inputs and target changes (what predicted churn two years ago may not predict it today). Concept drift requires model retraining; data drift may only require recalibration

Trap: Assuming Parquet is always better than CSV

Reality: Parquet excels for large analytical queries where you select a subset of columns, because it reads only the needed columns. For small datasets, row-level operations, or human-readable interchange, CSV is often sufficient. The exam tests when each format is the RIGHT choice

Trap: Confusing winsorization with data removal (trimming)

Reality: Winsorization REPLACES outliers with the value at a percentile threshold, so the outlier is transformed, not removed. Trimming REMOVES outlier observations entirely. The sample size changes with trimming but not with winsorization

Confusing Pairs

CRISP-DM vs DAMA Framework

CRISP-DM = process model for data mining projects — six lifecycle phases. DAMA = data management body of knowledge — data governance, stewardship, quality. The exam may present both and ask which framework applies to a given scenario. Project lifecycle = CRISP-DM. Enterprise data governance = DAMA

Streaming vs Batching

Streaming: processes events as they arrive, millisecond latency, higher infrastructure complexity, needed for real-time dashboards and fraud detection. Batching: accumulates data then processes periodically, simpler, acceptable for overnight reports and model retraining. The question's latency requirement is the decision key

Data Drift vs Concept Drift

Data drift: input feature distributions change (the 'X' changes). Concept drift: the mapping from features to target changes (the 'f(X) = Y' relationship changes). Data drift can sometimes be handled with data augmentation; concept drift requires model retraining. Both trigger alerts in model monitoring

Edge Deployment vs Cloud Deployment

Edge: model runs ON the device, no internet dependency, low latency, limited compute — requires model compression. Cloud: model runs on remote servers, scalable compute, requires stable connectivity. If the question mentions IoT, offline capability, or privacy-sensitive local data, edge deployment is the answer

Winsorization vs Trimming (Truncation)

Winsorization: replaces outlier values with the boundary value at a specified percentile — sample size unchanged. Trimming: removes observations identified as outliers — sample size decreases. When the question asks about outlier handling that preserves sample size, the answer is winsorization
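
The sample-size difference between the two outlier treatments is easy to verify. This sketch uses a nearest-rank percentile (one of several common definitions), hypothetical helper names, and made-up data with one extreme value.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile; other definitions interpolate between values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def winsorize(values, lower_p=10, upper_p=90):
    """Cap values at the percentile bounds; nothing is removed."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [min(max(v, lo), hi) for v in values]

def trim(values, lower_p=10, upper_p=90):
    """Drop values outside the percentile bounds."""
    lo, hi = percentile(values, lower_p), percentile(values, upper_p)
    return [v for v in values if lo <= v <= hi]

data = [11, 12, 13, 14, 15, 16, 17, 18, 19, 250]   # 250 is the outlier

winsorized = winsorize(data)
trimmed = trim(data)

print("winsorized:", winsorized, "n =", len(winsorized))
print("trimmed:   ", trimmed, "n =", len(trimmed))
```

Winsorization keeps all ten observations and replaces 250 with the cap, while trimming returns only nine.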

Scenario Tips

If the question asks about:

An IoT device needs real-time predictions on sensor data with no internet connectivity...

Answer:

Edge deployment. The model runs directly on the device. No connectivity required, low latency guaranteed. Model must be compressed/quantized for limited compute

Distractor to avoid:

Cloud deployment requires connectivity and introduces latency. Hybrid with cloud as primary still depends on connectivity for main inference. Edge is the only answer that works offline

If the question asks about:

A question asks how to merge customer records from two sources where names are spelled differently across systems...

Answer:

Fuzzy join using string similarity matching (edit distance, Jaccard). It matches records that are similar but not identical, handling typos and name variations

Distractor to avoid:

An inner join on exact name match would miss valid records. Cross join with filtering creates a cartesian product (too expensive). Left join on exact match also misses fuzzy matches

If the question asks about:

A data science team detects that model predictions have drifted in production — how to determine if it is data drift or concept drift...

Answer:

Monitor input feature distributions (PSI, KL divergence) to detect data drift. Monitor model performance on fresh labeled data to detect concept drift. If input distributions have shifted and no new labels are available yet, data drift is the working diagnosis. If inputs look stable but performance on new ground truth has degraded, suspect concept drift

Distractor to avoid:

You cannot diagnose drift without monitoring. Retraining immediately without diagnosis may not solve the root cause

If the question asks about:

When the question describes a CRISP-DM scenario and asks which phase involves evaluating whether the model meets business objectives...

Answer:

Evaluation phase (Phase 5). This is where you assess the model against business success criteria BEFORE deciding to deploy. It is separate from model validation metrics done in the Modeling phase

Distractor to avoid:

Deployment phase (Phase 6) comes AFTER evaluation — you do not evaluate against business objectives during deployment. Model validation (train/test metrics) happens in Modeling phase, not the Evaluation phase
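
For the drift-diagnosis scenario above, the Population Stability Index (PSI) is one way to quantify how far a feature's production distribution has moved from training. This is a minimal sketch using the usual definition, the sum over bins of (actual% - expected%) * ln(actual% / expected%); the bin counts are hypothetical and the small eps guard against empty bins is an assumption of this sketch.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)   # eps avoids log(0) for empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Hypothetical counts of one feature across four fixed bins.
training_bins = [250, 250, 250, 250]
stable_bins = [25, 25, 25, 25]          # same shape, smaller sample
shifted_bins = [100, 200, 300, 400]     # mass has moved to the higher bins

psi_stable = psi(training_bins, stable_bins)
psi_shifted = psi(training_bins, shifted_bins)

print(f"PSI (stable):  {psi_stable:.3f}")
print(f"PSI (shifted): {psi_shifted:.3f}")
```

A PSI near zero means the input distribution is unchanged; larger values signal data drift. Alert thresholds are a team convention rather than part of the formula, and PSI alone says nothing about concept drift, which still needs fresh labels.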

Last-Minute Facts

1. CRISP-DM phases in order: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment

2. Version control for ML = code + data + hyperparameters + model artifacts (4 dimensions)

3. Parquet = columnar = fast for analytical SELECT column queries. CSV = row-based = simple but slow at scale

4. Winsorization = CAPS outliers at percentile. Trimming = REMOVES outliers. Winsorization preserves sample size

5. Data drift = X distribution changes. Concept drift = f(X)→Y relationship changes

6. Edge deployment = no internet required, low latency, limited compute, needs model compression

Domain 5: 13% of exam

Specialized Applications of Data Science

Must-Know Facts

NLP pipeline order: tokenization → stop word removal → stemming or lemmatization → vectorization (bag of words, TF-IDF, or embeddings). Text must be cleaned and normalized before it is vectorized
LDA in NLP context = Latent Dirichlet Allocation = unsupervised TOPIC MODEL for discovering themes in document collections. LDA in ML context = Linear Discriminant Analysis = supervised classification. Same abbreviation, completely different algorithms
TF-IDF is NOT bag of words. Bag of words counts raw word frequencies. TF-IDF weights by term frequency multiplied by inverse document frequency — downweighting common words
Word2Vec and GloVe produce DENSE vector embeddings that capture semantic relationships. Bag of words and TF-IDF produce SPARSE vectors. Dense embeddings are more powerful for similarity tasks
Reinforcement learning is NEITHER supervised NOR unsupervised — it is a third learning paradigm where an agent learns through reward and penalty signals from an environment
Data augmentation in computer vision (rotation, flipping, scaling, noise) creates transformed copies of existing images to improve model robustness — it does NOT create new independent data or improve resolution
Multi-armed bandit: balances exploration (trying unknown options) vs exploitation (using the known best option). Used in adaptive A/B testing and recommendation optimization
Simplex method is the algorithm for solving linear programming (constrained optimization) problems — it traverses constraint boundary vertices to find the optimal solution

Common Traps

Trap: Treating LDA as a single concept across all questions

Reality: LDA has two completely different meanings depending on context. In NLP: Latent Dirichlet Allocation, used for unsupervised topic discovery. In supervised ML: Linear Discriminant Analysis, used for classification and dimensionality reduction. The question context tells you which one applies

Trap: Classifying reinforcement learning as a type of unsupervised learning

Reality: Reinforcement learning is its own learning paradigm. It is NOT unsupervised (no label discovery from unlabeled data) and NOT supervised (no labeled training examples). An agent learns by interacting with an environment and receiving rewards or penalties

Trap: Assuming data augmentation in computer vision is equivalent to collecting new data

Reality: Augmentation creates transformed copies of existing images (rotated, flipped, scaled). These are NOT independent samples; they are derived from the same underlying images. Augmentation improves diversity and reduces overfitting but does not substitute for genuinely new data

Trap: Confusing TF-IDF with bag of words

Reality: Bag of words counts raw word occurrences per document, so common words like 'the' get high counts. TF-IDF multiplies term frequency by the inverse document frequency, so words that appear in many documents get low weights. TF-IDF is more discriminative

Trap: Thinking stemming and lemmatization produce the same results

Reality: Stemming is fast and rule-based but crude: it chops word endings and can produce non-words ('studies' → 'studi'). Lemmatization uses vocabulary and morphology to return valid dictionary base forms ('better' → 'good'). Stemming sacrifices accuracy for speed; lemmatization sacrifices speed for accuracy
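
The bag of words versus TF-IDF trap above is clearest on a tiny corpus. This sketch uses the plain textbook formula tf * ln(N / df) on three made-up documents; library implementations usually add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the model predicts churn",
    "the model detects fraud",
    "the report covers churn",
]
tokenized = [d.split() for d in docs]

def bag_of_words(tokens):
    """Raw term counts for one document."""
    return Counter(tokens)

def tf_idf(tokens, corpus):
    """Textbook TF-IDF: term frequency times ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = tf * math.log(n_docs / df)
    return weights

bow = bag_of_words(tokenized[1])
weights = tf_idf(tokenized[1], tokenized)

print("bag of words:", dict(bow))
print("tf-idf:      ", {t: round(w, 3) for t, w in weights.items()})
```

In the second document every word has a raw count of 1, but TF-IDF gives 'the' a weight of zero (it appears in all three documents) and ranks 'fraud' above 'model'.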

Confusing Pairs

Stemming vs Lemmatization

Stemming: fast, rule-based, chops endings, may produce non-words ('running' → 'run', 'studies' → 'studi'). Lemmatization: uses vocabulary, always returns valid dictionary form ('better' → 'good', 'running' → 'run'). If speed is the constraint, stemming. If accuracy is the constraint, lemmatization

Bag of Words vs TF-IDF

Bag of words: count raw word frequency per document, SPARSE vector, ignores how common a word is across documents. TF-IDF: weights words by frequency in document * rarity across corpus, common words ('the', 'is') get low weight, distinctive words get high weight. TF-IDF is better for document similarity and retrieval
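A small worked example may make the difference concrete. This is a sketch under stated assumptions: the three-document corpus is made up, and the IDF used here is the plain textbook form log(N / df) with no smoothing, which is only one of several variants in use.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already lowercased.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]

def bag_of_words(tokens):
    """Raw term counts for one document."""
    return Counter(tokens)

def idf(term, corpus):
    """Unsmoothed inverse document frequency: log(N / document frequency)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(tokens, corpus):
    """Term frequency (count / document length) times IDF, per term."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf(t, corpus) for t, c in counts.items()}

bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0], tokenized)

print(bow["the"], bow["mat"])          # 'the' has the highest raw count
print(weights["the"], weights["mat"])  # 'the' drops to zero, 'mat' stands out
```

In the first document 'the' has the largest bag-of-words count, but because it appears in every document its unsmoothed IDF is log(1) = 0, so its TF-IDF weight vanishes. Libraries often use a smoothed IDF, in which case the weight would be small rather than exactly zero.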

Word2Vec / GloVe vs TF-IDF

Word2Vec/GloVe: DENSE embeddings that capture semantic meaning (similar words cluster together in vector space). TF-IDF: SPARSE vectors based on frequency statistics with no semantic understanding. For downstream tasks requiring meaning (sentiment, Q&A), embeddings are superior. For keyword extraction or document retrieval, TF-IDF can suffice

Multi-Armed Bandit vs A/B Testing

Traditional A/B testing: assigns equal traffic to variants for a fixed duration, then picks a winner, so the weaker variant keeps receiving its full share of traffic (and losing revenue) until the test ends. Multi-armed bandit: dynamically routes more traffic to better-performing variants during the test, minimizing lost opportunity while still exploring alternatives
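The traffic-allocation difference can be sketched with a short simulation. Epsilon-greedy is used here as one simple bandit strategy among several; the conversion rates, round count, seed, and function names are all hypothetical choices for illustration.

```python
import random

def run_ab_test(rates, rounds, rng):
    """Fixed split: alternate variants so each gets an equal share of traffic."""
    pulls = [0] * len(rates)
    reward = 0
    for i in range(rounds):
        arm = i % len(rates)
        pulls[arm] += 1
        reward += int(rng.random() < rates[arm])
    return pulls, reward

def run_epsilon_greedy(rates, rounds, rng, epsilon=0.1):
    """Bandit: explore a random variant with probability epsilon, else exploit."""
    pulls = [0] * len(rates)
    wins = [0] * len(rates)
    reward = 0
    for _ in range(rounds):
        if 0 in pulls or rng.random() < epsilon:
            arm = rng.randrange(len(rates))  # explore
        else:
            # exploit the variant with the best observed conversion rate
            arm = max(range(len(rates)), key=lambda a: wins[a] / pulls[a])
        hit = int(rng.random() < rates[arm])
        pulls[arm] += 1
        wins[arm] += hit
        reward += hit
    return pulls, reward

rng = random.Random(0)
rates = [0.05, 0.20]  # hypothetical true conversion rates, unknown to the algorithms
ab_pulls, ab_reward = run_ab_test(rates, 10_000, rng)
bandit_pulls, bandit_reward = run_epsilon_greedy(rates, 10_000, rng)
print("A/B pulls:   ", ab_pulls, "conversions:", ab_reward)
print("Bandit pulls:", bandit_pulls, "conversions:", bandit_reward)
```

The A/B split sends exactly half the visitors to the weaker variant regardless of results, while the bandit shifts most traffic to the stronger one as evidence accumulates. Setting `epsilon=0` after the first pulls would reduce it to a purely greedy strategy with no further exploration.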

Scenario Tips

If the question asks about:

A question asks to discover hidden themes in 50,000 customer reviews without predefined categories...

Answer:

Use LDA (Latent Dirichlet Allocation) — unsupervised topic modeling. It identifies latent topics without predefined labels

Distractor to avoid:

Named entity recognition extracts specific entities (names, places) not themes. Text classification requires predefined categories. Sentiment analysis measures tone, not topics

If the question asks about:

A computer vision team has only 500 training images and asks the best way to improve model performance without collecting new data...

Answer:

Apply data augmentation: rotation, flipping, scaling, cropping, color jitter, noise injection. This artificially increases training set diversity and reduces overfitting

Distractor to avoid:

Increasing learning rate risks divergence. Switching to logistic regression loses spatial feature extraction. Removing pooling layers increases computation without solving the data limitation
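To see why augmented images are transformed copies rather than new samples, here is a minimal sketch on a tiny grayscale "image" stored as a nested list. The function names are hypothetical, and real pipelines would use an image library rather than plain lists.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]

def rotate_90_clockwise(img):
    """Rotate the pixel grid a quarter turn clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def augment(img):
    """Return the original plus three derived variants."""
    rotated = rotate_90_clockwise(img)
    return [img, flip_horizontal(img), rotated, flip_horizontal(rotated)]

image = [[1, 2],
         [3, 4]]
variants = augment(image)
for v in variants:
    print(v)
```

One image becomes four training examples, but every variant contains exactly the same pixel values in a different arrangement. That added variety helps against overfitting, yet no information beyond the original image has entered the dataset.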

If the question asks about:

A question asks which technique balances showing known high-performing product recommendations while also exploring potentially better new options...

Answer:

Multi-armed bandit. It explicitly models the exploration vs exploitation tradeoff, routing more traffic to known performers while testing alternatives

Distractor to avoid:

Linear programming is for constrained optimization with a known objective function. Greedy algorithms always exploit the current best without any exploration. Gradient descent optimizes parameters, not allocation decisions

If the question asks about:

An NLP question describes processing steps in the wrong order — which ordering is correct...

Answer:

Correct pipeline order: tokenization first, then stop word removal, then stemming or lemmatization, then vectorization (bag of words or TF-IDF), then optionally embedding. Never stem before tokenizing or vectorize before normalizing

Distractor to avoid:

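Vectorizing before removing stop words keeps common words like 'the' in the vocabulary: raw counts are dominated by them, and TF-IDF only down-weights them rather than removing them, so the feature space stays larger and noisier than necessary. Always normalize text first, then vectorize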
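The ordering can be sketched as a chain of small functions. This is an illustrative outline only: the stop word list, the single-rule stemmer, and all function names are hypothetical simplifications.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "on", "and"}  # hypothetical tiny list

def tokenize(text):
    """Step 1: lowercase and split into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Step 2: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Step 3: crude normalization, stripping a trailing 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def vectorize(tokens):
    """Step 4: bag-of-words counts over the normalized tokens."""
    return Counter(tokens)

def preprocess(text):
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = [stem(t) for t in tokens]
    return vectorize(tokens)

vector = preprocess("The cats sat on the mats and the cat is happy")
print(vector)
```

Because stemming runs before vectorization, 'cats' and 'cat' are merged into a single feature with count 2. Reversing those two steps would leave them as separate columns in the vector.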

Last-Minute Facts

1. LDA in NLP = Latent Dirichlet Allocation = TOPIC MODEL (unsupervised). LDA in ML = Linear Discriminant Analysis = CLASSIFIER (supervised). Same acronym, different algorithms

2. Reinforcement learning = third paradigm: agent + environment + reward signal. NOT supervised, NOT unsupervised

3. NLP pipeline order: tokenize → stop words → stem/lemmatize → bag of words/TF-IDF → embeddings

4. TF-IDF > bag of words for distinguishing documents. Word2Vec > TF-IDF for semantic similarity

5. Stemming = fast, crude, may produce non-words. Lemmatization = slow, accurate, always valid base form

6. Data augmentation creates TRANSFORMED COPIES, not new independent samples; it does not substitute for real data collection

7. Multi-armed bandit = exploration + exploitation tradeoff. Greedy = exploitation ONLY with no exploration
