CertPrepNow
CompTIA · DY0-001 · 5 domains

DY0-001 Exam Notes

Last-minute traps, must-know facts, and scenario tips for the CompTIA DataAI (formerly DataX) exam.

General Exam Tips

  • 1. Read ALL answer choices before committing: two options often look correct but differ on one critical word (e.g., 'shrinks toward zero' vs 'zeros out coefficients')
  • 2. PBQs appear at the START of the exam. Simulation PBQs can be flagged and revisited after the multiple-choice questions; virtual PBQs must be completed on first encounter
  • 3. Pass/fail scoring means there is no partial-credit safety net. You must demonstrate competence across ALL five domains: neglecting any domain risks a fail even if you ace the rest
  • 4. The exam is 165 minutes for up to 90 questions. Budget roughly 1 minute per multiple-choice question and 4-6 minutes per PBQ. Never spend more than 8 minutes on any single question
  • 5. Scenario questions always embed a constraint that narrows the correct answer. Identify the constraint FIRST (e.g., 'low latency', 'limited compute', 'imbalanced classes') before evaluating options
  • 6. When a question mentions a business context (e.g., 'missing fraud cases is costly'), translate it to a metric: costly false negatives means maximize recall; costly false positives means maximize precision
  • 7. The Operations and Processes domain (22%) is the most commonly neglected by working data scientists. Candidates who only study ML theory frequently fail here on MLOps and governance questions
  • 8. Ethics and AI governance appear in the exam content despite not having a dedicated domain, so do not skip these topics entirely
  • 9. The exam was rebranded from CompTIA DataX to CompTIA DataAI in January 2026. Content and objectives are IDENTICAL, with the same DY0-001 code
Domain 1 (17% of exam)

Mathematics and Statistics

Must-Know Facts

  • Know the distinction between Type I error (rejecting a TRUE null, a false positive) and Type II error (failing to reject a FALSE null, a false negative). The probability of a Type I error is alpha; the probability of a Type II error is beta
  • Which statistical test to use by scenario: t-test for two group means, ANOVA for three or more group means, chi-squared for two categorical variables, Pearson for linear continuous relationships, Spearman for monotonic (possibly non-linear) or ordinal relationships
  • Confusion matrix mechanics: Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = harmonic mean of both, MCC = most reliable metric under severe class imbalance
  • ROC/AUC interpretation: AUC = 0.5 means the model is no better than random guessing, AUC = 1.0 is a perfect classifier. Higher AUC is always better
  • AIC vs BIC: both penalize model complexity — LOWER is better for both. BIC applies a heavier penalty and is preferred when the sample size is large
  • ARIMA requires stationary data. Non-stationarity (trend, seasonality) must be removed through differencing BEFORE fitting ARIMA
  • Pearson measures LINEAR correlation only. Spearman measures MONOTONIC correlation, so it handles non-linear monotonic relationships and is more robust to outliers
  • Distance metrics: Euclidean measures straight-line distance (sensitive to magnitude), Manhattan measures grid distance, cosine measures angle between vectors (magnitude-invariant, preferred for text)
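The three distance metrics in the last bullet can be compared with a small standard-library sketch. The two vectors are hypothetical term counts chosen so that they point in the same direction but differ in magnitude; the function names are illustrative, not from any particular library.

```python
import math

def euclidean(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid (city-block) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-count vectors: same word proportions, 10x the length.
short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]

print(euclidean(short_doc, long_doc))          # large, driven by magnitude
print(manhattan(short_doc, long_doc))          # 27.0
print(cosine_similarity(short_doc, long_doc))  # approximately 1.0
```

Euclidean and Manhattan report the two documents as far apart, while cosine similarity treats them as pointing the same way, which is why it is usually the preferred choice for text vectors.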

Common Traps

Trap: Confusing Type I and Type II errors. Many candidates memorize which is 'false positive' but reverse under exam pressure
Reality: Type I = false POSITIVE (rejecting a true null). Type II = false NEGATIVE (keeping a false null). A useful anchor: Type I is the 'optimistic' error, where you thought something was there when it was not
Trap: Assuming accuracy is always the right metric to report for a classifier
Reality: Accuracy is misleading with imbalanced classes. A model that predicts every sample as negative achieves 97% accuracy on a dataset that is 97% negative while detecting zero positive cases. Use F1, MCC, or precision/recall instead
Trap: Thinking AIC and BIC values should be maximized
Reality: Lower AIC/BIC values indicate a better model. Both penalize model complexity, so you want the model with the best fit at the lowest complexity penalty
Trap: Using Pearson correlation when data is skewed or contains outliers
Reality: Pearson measures only LINEAR relationships and is sensitive to outliers. Spearman rank correlation handles non-linear monotonic relationships and is robust to extreme values
Trap: Applying ARIMA to a time series without checking for stationarity
Reality: ARIMA assumes stationary data. You must check stationarity (e.g., with the Augmented Dickey-Fuller test) and apply differencing to remove trend or seasonality before fitting ARIMA
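As an illustration of the differencing step in the ARIMA trap above, here is a minimal sketch on two made-up series. It only shows what differencing does to a trend and to a seasonal pattern; it does not run a stationarity test such as Augmented Dickey-Fuller, which would need a statistics library.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Hypothetical series with a linear trend: 3, 5, 7, ...
trend = [3 + 2 * t for t in range(8)]
print(difference(trend))             # [2, 2, 2, 2, 2, 2, 2]

# Hypothetical series with a repeating period of 4 (e.g., quarterly data).
seasonal = [10, 20, 30, 40] * 3
print(difference(seasonal, lag=4))   # [0, 0, 0, 0, 0, 0, 0, 0]
```

A first difference (lag 1) removes the linear trend, and a seasonal difference (lag equal to the period) removes the repeating pattern; these correspond to the "d" and seasonal "D" terms in ARIMA and SARIMA.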

Confusing Pairs

Precision vs. Recall

Precision = quality of positive predictions (minimize false positives — use for spam filtering). Recall = coverage of actual positives (minimize false negatives — use for fraud and disease detection). The business context in the question tells you which one matters
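The formulas for precision, recall, F1, and MCC can be checked on a confusion matrix by hand. The sketch below uses hypothetical counts and an illustrative function name; zero denominators are mapped to 0.0 as a simplifying assumption.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical fraud model: 60 real fraud cases among 1,000 transactions.
m = classification_metrics(tp=40, fp=10, fn=20, tn=930)
print(m)
```

With these counts precision is 0.8 (few false alarms) while recall is about 0.67 (a third of the fraud cases are missed), so a question that stresses the cost of missed fraud is pointing at the recall figure.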

Pearson Correlation vs. Spearman Correlation

Pearson = linear relationship between two continuous variables. Spearman = monotonic relationship using ranks, handles non-linear data and outliers. If the question mentions non-linearity, ordinal data, or outliers, Spearman is correct
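To see the difference on a monotonic but non-linear relationship, the following sketch computes both coefficients by hand. The data is made up (y = e^x), and the rank helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Rank from 1..n; assumes no ties (a simplification for this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [math.exp(x) for x in xs]   # strictly increasing, strongly non-linear

print(pearson(xs, ys))    # noticeably below 1
print(spearman(xs, ys))   # 1.0 up to floating-point rounding
```

Because y always increases with x, the ranks agree perfectly and Spearman reports a perfect monotonic relationship, while Pearson is pulled down by the curvature.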

t-test vs. ANOVA

t-test = compare means between exactly TWO groups. ANOVA = compare means across THREE or more groups. Using multiple t-tests to compare three groups inflates Type I error — ANOVA is the correct choice

AIC vs. BIC

Both are model selection criteria where LOWER is better. BIC penalizes complexity more heavily and is preferred for large datasets. AIC is preferred for small samples where predictive accuracy matters more
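The heavier BIC penalty follows from the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the sample size, and L the maximized likelihood. These formulas are not spelled out in the notes above; the sketch below applies them to two hypothetical models whose log-likelihoods are invented for illustration.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

n = 200  # hypothetical sample size
simple = {"log_likelihood": -120.0, "k": 3}
complex_ = {"log_likelihood": -116.0, "k": 6}  # fits slightly better

print(aic(simple["log_likelihood"], simple["k"]))      # 246.0
print(aic(complex_["log_likelihood"], complex_["k"]))  # 244.0
print(bic(simple["log_likelihood"], simple["k"], n))
print(bic(complex_["log_likelihood"], complex_["k"], n))
```

In this made-up case AIC prefers the more complex model (lower value), while BIC prefers the simpler one, because each extra parameter costs ln(200), about 5.3, under BIC but only 2 under AIC.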

Scenario Tips

If the question asks about:

When the question describes a binary medical test or fraud model and asks which metric to prioritize given 'missing actual cases is very costly'...

Answer:

Choose Recall (sensitivity). Missing actual cases = false negatives. Recall = TP/(TP+FN) minimizes false negatives

Distractor to avoid:

Precision is tempting but it minimizes false POSITIVES, not false negatives. Accuracy is wrong because of class imbalance

If the question asks about:

When a question asks you to compare three or more models, training methods, or groups for statistical significance...

Answer:

Choose ANOVA. It compares means across three or more groups in a single test and avoids inflating Type I error

Distractor to avoid:

Multiple t-tests seem reasonable but they inflate the chance of a false positive. The correct answer is always ANOVA for 3+ groups

If the question asks about:

When a question presents time series data with seasonal patterns and asks which model is appropriate...

Answer:

Use ARIMA with differencing (SARIMA if seasonal). First difference the data to achieve stationarity, then apply ARIMA

Distractor to avoid:

Applying ARIMA directly without checking for stationarity is the trap. Non-stationary data produces unreliable ARIMA results

Last-Minute Facts

1. Type I error rate = alpha = significance level (e.g., 0.05). Type II error rate = beta. Power = 1 - beta
2. AUC = 0.5 is random, AUC = 1.0 is perfect. Higher is always better
3. F1 = 2 * (Precision * Recall) / (Precision + Recall). It is the HARMONIC mean, not the arithmetic mean
4. VIF > 10 indicates severe multicollinearity between predictors
5. AIC and BIC: LOWER is better. BIC penalizes complexity more than AIC
Domain 2 (24% of exam)

Modeling, Analysis, and Outcomes

Must-Know Facts

  • Exploratory Data Analysis requires both univariate analysis (single variable distributions) AND multivariate analysis (relationships between variables). Do not skip one type
  • Visualization purpose: bar chart = categorical comparison, scatter plot = two continuous variables, box plot = distribution summary where the CENTER LINE is the MEDIAN not the mean, heat map = correlation matrix, violin plot = distribution shape
  • Label encoding implies ordinal order (Red=1, Blue=2, Green=3 implies Green > Blue). Use one-hot encoding for nominal categories with no inherent order
  • Normalization (scales to 0-1) vs standardization (mean=0, std=1): distance-based algorithms (kNN, k-means, SVM) and neural networks need normalization or standardization. Linear/logistic regression and PCA typically use standardization
  • VIF > 10 signals severe multicollinearity: the fix is to remove one correlated variable, combine via PCA, or apply regularization
  • Multicollinearity does NOT reduce prediction accuracy — it makes individual coefficient interpretation unreliable. The model can still predict well even with multicollinear features
  • Box-Cox transformation is a power transformation that normalizes skewed continuous data. Lambda = 0 is equivalent to a log transform
  • When communicating results, always benchmark against a baseline (e.g., a naive classifier, previous model, or random guess)
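The Box-Cox fact in the list above (lambda = 0 behaves like a log transform) can be illustrated with the usual one-parameter definition, (x^lambda - 1) / lambda for lambda != 0 and ln(x) for lambda = 0. This is a minimal sketch for a single positive value, not a replacement for a library routine that also estimates lambda.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

print(box_cox(10.0, 1))      # 9.0: lambda = 1 only shifts the data
print(box_cox(10.0, 0))      # ln(10)
print(box_cox(10.0, 1e-6))   # very close to ln(10)
```

As lambda approaches 0 the power form converges to the natural log, which is why the lambda = 0 case is defined as the log transform.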

Common Traps

Trap: Using label encoding for a nominal categorical variable (city, color, animal species)
Reality: Label encoding assigns integers (Dog=0, Cat=1, Bird=2), which implies Cat is between Dog and Bird in some ordering. For nominal features with no inherent order, use one-hot encoding. Only use label encoding for ordinal features (Low=0, Medium=1, High=2)
Trap: Thinking the center line of a box plot represents the mean
Reality: The center line of a box plot is always the MEDIAN (50th percentile). The box spans from Q1 to Q3. The mean may appear as a dot in some implementations but is NOT the default center line
Trap: Concluding that multicollinearity is making predictions worse
Reality: Multicollinearity damages the INTERPRETABILITY of individual coefficients: it inflates standard errors and makes feature importance unreliable. Prediction accuracy may still be fine. The exam tests this distinction
Trap: Treating a heat map's high correlation as causal evidence
Reality: Correlation is not causation. A heat map shows statistical association. Establishing causation requires controlled experiments (A/B tests, RCTs) or causal inference methods (DAGs, difference-in-differences)
Trap: Applying the same preprocessing to train and test sets simultaneously
Reality: Fitting scalers, encoders, or imputers on the full dataset before train/test split causes data leakage. Always fit transformations on the TRAINING set only, then apply (transform) to both train and test
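The leakage trap above comes down to where the scaling parameters are learned. The sketch below uses made-up numbers and hypothetical helper names to contrast fitting on the training split with fitting on the combined data.

```python
import statistics

def fit_standardizer(values):
    """Learn the mean and population standard deviation from the given data."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, params):
    """Apply previously learned parameters; nothing is re-estimated here."""
    mean, std = params
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 40.0]   # includes a value unlike anything in training

# Correct: parameters come from the training split only.
params = fit_standardizer(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)

# Leaky: the test values influence the mean and spread used for training.
leaky_params = fit_standardizer(train + test)

print(params)        # mean is 13.0
print(leaky_params)  # mean and spread shifted by the test values
```

The leaky parameters differ from the training-only ones, which means information about the test set has already shaped the features the model trains on.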

Confusing Pairs

Normalization (min-max scaling) vs. Standardization (z-score)

Normalization scales to [0, 1]. Standardization transforms to mean=0, std=1. Distance-based algorithms (kNN, k-means, SVM) and neural networks often need normalization. PCA and linear regression typically use standardization. The exam may test which is appropriate for a specific algorithm
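Both rescaling methods are short enough to write out directly. This sketch uses an invented feature with one large value and assumes the values are not all identical (otherwise both denominators would be zero).

```python
import statistics

def min_max_scale(values):
    """Normalization: map the minimum to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

feature = [2.0, 4.0, 6.0, 8.0, 100.0]   # hypothetical, with one outlier

normalized = min_max_scale(feature)
standardized = standardize(feature)

print(normalized)     # bounded in [0, 1]; the outlier squeezes the rest near 0
print(standardized)   # mean 0, std 1; values are not bounded
```

Min-max scaling guarantees a fixed range but lets a single extreme value compress everything else, whereas z-scores keep relative spread at the cost of an unbounded range.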

One-Hot Encoding vs. Label Encoding

One-hot = k binary columns for k categories, no ordering implied — use for NOMINAL data (colors, cities). Label encoding = integers assigned to categories, ordering implied — use ONLY for ORDINAL data (Low/Medium/High). Using label encoding on nominal data is an exam trap
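A small sketch of the two encodings, using hypothetical color and size columns. The helper names are illustrative; the ordinal mapping takes the category order as an explicit argument because that order is domain knowledge, not something the data provides.

```python
def one_hot(values):
    """Return the category list and one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def ordinal_encode(values, order):
    """Map each category to its position in an explicitly supplied order."""
    mapping = {category: index for index, category in enumerate(order)}
    return [mapping[v] for v in values]

colors = ["red", "blue", "green", "blue"]          # nominal: no order
categories, rows = one_hot(colors)
print(categories)   # ['blue', 'green', 'red']
print(rows)         # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

sizes = ["Low", "High", "Medium"]                  # ordinal: real order
print(ordinal_encode(sizes, order=["Low", "Medium", "High"]))  # [0, 2, 1]
```

Each one-hot row has exactly one 1, so no color is "greater" than another, while the ordinal codes deliberately preserve Low < Medium < High.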

Overfitting vs. Underfitting

Overfitting = high training accuracy, low test accuracy = model too complex = HIGH VARIANCE. Underfitting = low accuracy on BOTH train and test = model too simple = HIGH BIAS. The remedy is opposite: regularize for overfitting, increase complexity or add features for underfitting

Scenario Tips

If the question asks about:

A question describes a regression model where two features have VIF > 10 and asks what action to take...

Answer:

Multicollinearity detected. Options: remove one of the correlated variables, combine them via PCA, or apply Ridge regression (L2 regularization). The answer depends on the stated goal — if prediction only, Ridge works; if interpretability matters, remove a variable

Distractor to avoid:

Increasing training data or adding more features will not fix multicollinearity. Normalizing features also does not resolve it

If the question asks about:

A model achieves 97% accuracy on a dataset where 96% of records are the negative class...

Answer:

Accuracy is misleading — the model may be predicting all samples as negative. Evaluate with precision, recall, F1, or MCC. Investigate with a confusion matrix

Distractor to avoid:

Do not celebrate 97% accuracy on an imbalanced dataset. A naive model that predicts all negative would score 96% already
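The naive-baseline arithmetic in this scenario is easy to reproduce. The sketch below builds a hypothetical 100-record dataset that is 96% negative and scores a "model" that always predicts negative.

```python
# Hypothetical labels: 4 positives, 96 negatives.
y_true = [1] * 4 + [0] * 96
# A naive model that predicts the negative class for every record.
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)   # 0.96
print(recall)     # 0.0
```

The baseline reaches 96% accuracy while finding none of the positive cases, so a reported 97% is only one point better than doing nothing.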

If the question asks about:

When the question asks how to encode a nominal feature with 200 unique categories for a gradient boosting model...

Answer:

Use target encoding or binary encoding for high-cardinality nominal features. One-hot encoding on 200 categories would create 200 sparse columns, increasing dimensionality dramatically

Distractor to avoid:

One-hot encoding is correct for low-cardinality nominal features but creates excessive dimensionality for high-cardinality ones

Last-Minute Facts

1. Box plot center line = MEDIAN, not mean
2. VIF > 10 = severe multicollinearity. Fix: remove variable, PCA, or regularization
3. Multicollinearity harms interpretability, NOT necessarily prediction accuracy
4. One-hot encoding for NOMINAL data. Label encoding only for ORDINAL data
5. Data leakage: always fit preprocessing (scalers, encoders) on TRAINING data only
Domain 3 (24% of exam)

Machine Learning

Must-Know Facts

  • Bias-variance tradeoff: high bias = underfitting (model too simple), high variance = overfitting (model too complex). Total error = Bias^2 + Variance + Irreducible noise
  • LASSO (L1) can reduce coefficients to exactly zero and performs automatic feature selection. Ridge (L2) shrinks coefficients toward zero but NEVER to exactly zero
  • Random Forest uses BAGGING (parallel training on random subsets, reduces VARIANCE). Gradient Boosting and XGBoost use BOOSTING (sequential training correcting errors, reduces BIAS)
  • For multi-class classification output: use Softmax activation + categorical cross-entropy. For binary: use Sigmoid + binary cross-entropy. ReLU is for HIDDEN layers, not output layers
  • Data leakage: fitting preprocessing on the full dataset before splitting leaks test information into training. Always fit on TRAINING data, then transform both sets
  • k-Nearest Neighbors (kNN) is a SUPERVISED algorithm — it predicts using labeled neighbors. Despite appearing near clustering methods, it is not unsupervised
  • Curse of dimensionality: as features increase relative to samples, data becomes sparse in high-dimensional space and model performance degrades. Motivates dimensionality reduction
  • SMOTE generates synthetic minority class examples by interpolating between existing minority samples — it does not duplicate them
  • Cross-validation: k-fold splits data into k folds, trains on k-1, tests on 1, rotates. Average score estimates generalization. Use stratified k-fold for imbalanced classification
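The k-fold rotation described in the last bullet can be sketched as index bookkeeping. This version is deliberately simple: it does not shuffle and does not stratify by class, both of which a real workflow would usually want.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for plain k-fold CV."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=3)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in exactly one test fold, and the reported cross-validation score would be the average of the k per-fold scores.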

Common Traps

Trap: Confusing which ensemble method is parallel vs sequential
Reality: Random Forest = PARALLEL (bagging). Gradient Boosting / XGBoost = SEQUENTIAL (boosting). Bagging reduces variance. Boosting reduces bias. This distinction is a top exam question pattern
Trap: Thinking Ridge (L2) can perform feature selection
Reality: Ridge NEVER zeros out a coefficient; it only shrinks them. LASSO (L1) is the linear regularization method that performs automatic feature selection by zeroing out coefficients (Elastic Net can also do so through its L1 component). This is tested explicitly
Trap: Using Sigmoid in a multi-class output layer
Reality: Sigmoid outputs a single probability for binary classification. For multi-class (3+ classes), Softmax is required: it outputs a probability distribution across all classes that sums to 1
Trap: Thinking kNN is an unsupervised method because it involves distances
Reality: kNN is supervised. It classifies a new point based on the MAJORITY LABEL of its k nearest labeled training points. The label requirement makes it supervised
Trap: Scaling or encoding the entire dataset before train/test split
Reality: Fitting a scaler on the full dataset before splitting leaks test set statistics into training. Always split first, then fit the scaler on training data only, then transform both train and test
Trap: Assuming more training data always fixes overfitting
Reality: More data helps reduce overfitting (variance) but only up to a point. Regularization (L1, L2, dropout), cross-validation, and simpler model architectures are the primary remedies

Confusing Pairs

Random Forest (Bagging) vs. Gradient Boosting (Boosting)

Random Forest: trains trees IN PARALLEL on random data subsets, averages predictions, reduces VARIANCE. Gradient Boosting: trains trees SEQUENTIALLY where each corrects previous errors, reduces BIAS. If the question describes a high-variance problem, think Random Forest. If high-bias underfitting, think Boosting

LASSO (L1) vs. Ridge (L2)

LASSO: adds penalty on absolute coefficient values, CAN zero out coefficients, performs feature selection. Ridge: adds penalty on squared coefficient values, CANNOT zero out coefficients, all features remain. Elastic Net combines both. Exam question: which method removes irrelevant features? LASSO
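Why L1 can reach exactly zero while L2 cannot is easiest to see in the one-coefficient case. The sketch below assumes a simplified setting (a single standardized feature, so the unpenalized estimate is a plain number b) and uses the closed-form minimizers of 0.5*(b - beta)^2 + lam*|beta| for LASSO and 0.5*(b - beta)^2 + 0.5*lam*beta^2 for Ridge. Real solvers handle many correlated features and behave less cleanly.

```python
def lasso_coefficient(b, lam):
    """Soft-thresholding: minimizer of 0.5*(b - beta)**2 + lam*abs(beta)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0   # small estimates are set exactly to zero

def ridge_coefficient(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrink."""
    return b / (1 + lam)

# Hypothetical unpenalized estimates: one strong feature, two weak ones.
unpenalized = [3.0, 0.4, -0.2]
lam = 0.5

lasso = [lasso_coefficient(b, lam) for b in unpenalized]
ridge = [ridge_coefficient(b, lam) for b in unpenalized]

print(lasso)   # [2.5, 0.0, 0.0]
print(ridge)   # every coefficient shrunk, none exactly zero
```

LASSO subtracts a fixed amount and clips at zero, which removes the weak features outright; Ridge divides by a constant, so a non-zero estimate stays non-zero.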

Softmax vs. Sigmoid

Sigmoid: binary classification output (single probability 0-1). Softmax: multi-class classification output (probability distribution summing to 1 across all classes). The exam will specify the number of target classes — if more than 2, always Softmax
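Both output activations are a few lines of standard-library Python. The logits below are made-up numbers; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def sigmoid(z):
    """Map one logit to a single probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a list of logits to a probability distribution that sums to 1."""
    peak = max(logits)                       # subtract for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # 0.5

probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for 3 classes
print(probs)
print(sum(probs))                # 1.0 up to floating-point rounding
```

Sigmoid yields one number for a yes/no question, while softmax spreads probability across all classes so the largest logit gets the largest share.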

k-Means vs. DBSCAN

k-Means: requires pre-specifying k, assumes spherical clusters of similar size, no noise handling. DBSCAN: discovers cluster count automatically, handles arbitrary shapes, identifies outliers as noise. If question says 'unknown number of clusters' or 'irregular shapes', DBSCAN is correct

Grid Search vs. Random Search

Grid search exhaustively tests ALL hyperparameter combinations — thorough but computationally expensive. Random search samples combinations randomly — faster and often achieves comparable results, especially in high-dimensional hyperparameter spaces
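The cost difference shows up as soon as the candidate combinations are counted. In this sketch the hyperparameter names and values are hypothetical, no model is trained, and the random-search budget of 10 trials is an arbitrary choice.

```python
import itertools
import random

# Hypothetical search space for a boosted-tree model.
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 200, 500],
}

# Grid search: every combination (3 * 4 * 3 of them).
grid = list(itertools.product(*space.values()))
print(len(grid))   # 36

# Random search: a fixed budget of independently sampled combinations.
rng = random.Random(0)
random_trials = [tuple(rng.choice(options) for options in space.values())
                 for _ in range(10)]
print(len(random_trials))   # 10
```

Adding one more hyperparameter multiplies the grid size, while the random-search budget stays whatever you set it to; sampled trials may occasionally repeat, which a fuller implementation might deduplicate.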

Scenario Tips

If the question asks about:

A model achieves 99% train accuracy and 65% test accuracy — which technique most directly addresses this...

Answer:

Apply regularization (L1 or L2) or reduce model complexity. This is classic overfitting (high variance). Regularization penalizes complexity and closes the train/test gap

Distractor to avoid:

Adding more features or increasing complexity would worsen overfitting. Removing cross-validation removes a diagnostic tool without solving the problem

If the question asks about:

A deep learning model for 10-class image classification — which output activation + loss function combination is correct...

Answer:

Softmax activation + categorical cross-entropy loss. Softmax produces a probability distribution across 10 classes, categorical cross-entropy is the correct loss for multi-class classification

Distractor to avoid:

Sigmoid + binary cross-entropy is correct only for binary classification. ReLU is a hidden layer activation, not an output activation for classification

If the question asks about:

A dataset has 300 features and 150 samples, causing poor model performance — what concept explains this and what is the fix...

Answer:

Curse of dimensionality. Features far outnumber samples causing sparse high-dimensional space. Fix: dimensionality reduction (PCA, feature selection) or collect more data

Distractor to avoid:

This is not data leakage or concept drift. The mismatch between feature count and sample count is the defining characteristic of the curse of dimensionality

If the question asks about:

When the question describes an anomaly detection task with unlabeled data and asks for the right algorithm category...

Answer:

Use unsupervised anomaly detection: isolation forest, autoencoder, or statistical methods (z-score, IQR). Unsupervised because no labeled 'anomaly' examples are available

Distractor to avoid:

Do not choose supervised classification — that requires labeled examples of anomalies, which are typically unavailable in anomaly detection scenarios

Last-Minute Facts

1. LASSO (L1) = feature selection via zeroing. Ridge (L2) = shrinkage only, never zeros
2. Random Forest = bagging = parallel = reduces VARIANCE. XGBoost = boosting = sequential = reduces BIAS
3. Softmax = multi-class output (3+ classes). Sigmoid = binary output
4. kNN is SUPERVISED; do not confuse it with unsupervised clustering
5. Adam = default deep learning optimizer: combines momentum + RMSprop, adaptive learning rate
6. SMOTE = creates SYNTHETIC minority samples by interpolation, not duplication
7. Dropout = randomly disables neurons during training, reduces overfitting in neural networks
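The interpolation step behind the SMOTE fact above can be sketched in isolation. This is only the core idea: full SMOTE also searches for the k nearest minority-class neighbors of each sample, which is skipped here by passing a neighbor in directly. The points are invented.

```python
import random

def synthetic_sample(point, neighbor, rng):
    """Create a new point on the line segment between two minority samples."""
    gap = rng.random()   # random position along the segment, in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
minority_a = [1.0, 2.0]   # hypothetical minority-class sample
minority_b = [3.0, 6.0]   # one of its (assumed) nearest minority neighbors

new_point = synthetic_sample(minority_a, minority_b, rng)
print(new_point)   # lies between the two originals, equal to neither in general
```

The synthetic point is a blend of two real samples rather than a copy of either, which is the distinction between SMOTE and simple random oversampling.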
Domain 4 (22% of exam)

Operations and Processes

Must-Know Facts

  • CRISP-DM has six phases in order: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. Know what activities happen at each phase
  • Version control for data science covers FOUR dimensions: code (git), data (DVC, Delta Lake), hyperparameters (MLflow experiment tracking), and model artifacts. Versioning code alone is insufficient
  • Data drift = input feature distribution changes over time. Concept drift = the RELATIONSHIP between inputs and target changes over time. Both degrade model performance but require different responses
  • Deployment environments: edge = on-device, no internet required, low latency, limited compute. Cloud = scalable, internet required. Hybrid = combines both. On-premises = maximum data control
  • Streaming processes data in real-time as it arrives (low latency, high complexity). Batching collects data and processes in bulk at intervals (higher latency, simpler)
  • Parquet is a COLUMNAR storage format optimized for analytical queries. CSV is row-based. Parquet dramatically outperforms CSV for large aggregation queries
  • Winsorization CAPS outliers at a percentile threshold (e.g., values above 99th percentile are set to the 99th percentile value). It does not remove them — different from trimming
  • PII must be identified and protected before model training. Anonymization and pseudonymization are the primary techniques. This is a compliance requirement, not optional
  • CI/CD for ML: code commit triggers automated testing, model training, validation against threshold, and deployment with rollback capability if performance drops

Common Traps

Trap: Confusing CRISP-DM with DAMA framework
Reality: CRISP-DM is a DATA MINING process model covering the lifecycle from business understanding to deployment. DAMA is a data GOVERNANCE body of knowledge focused on data stewardship and management. They are completely different frameworks with different scope and purpose
Trap: Thinking code version control (git) alone is sufficient for data science reproducibility
Reality: Data science reproducibility requires versioning across four dimensions: code, data, hyperparameters, and model artifacts. Changing training data without data versioning makes experiments non-reproducible even with perfect code versioning
Trap: Confusing data drift and concept drift
Reality: Data drift: the INPUT distribution changes (new customer demographics shift feature values) while the feature-to-target relationship stays the same. Concept drift: the RELATIONSHIP between inputs and target changes (what predicted churn two years ago may not predict it today). Concept drift requires model retraining; data drift may only require recalibration
Trap: Assuming Parquet is always better than CSV
Reality: Parquet excels for large analytical queries where you select a subset of columns, because it reads only the needed columns. For small datasets, row-level operations, or human-readable interchange, CSV is often sufficient. The exam tests when each format is the RIGHT choice
Trap: Confusing winsorization with data removal (trimming)
Reality: Winsorization REPLACES outliers with the value at a percentile threshold, so the outlier is transformed, not removed. Trimming REMOVES outlier observations entirely. The sample size changes with trimming but not with winsorization

Confusing Pairs

CRISP-DM vs. DAMA Framework

CRISP-DM = process model for data mining projects — six lifecycle phases. DAMA = data management body of knowledge — data governance, stewardship, quality. The exam may present both and ask which framework applies to a given scenario. Project lifecycle = CRISP-DM. Enterprise data governance = DAMA

Streaming vs. Batching

Streaming: processes events as they arrive, millisecond latency, higher infrastructure complexity, needed for real-time dashboards and fraud detection. Batching: accumulates data then processes periodically, simpler, acceptable for overnight reports and model retraining. The question's latency requirement is the decision key

Data Drift vs. Concept Drift

Data drift: input feature distributions change (the 'X' changes). Concept drift: the mapping from features to target changes (the 'f(X) = Y' relationship changes). Data drift can sometimes be handled with data augmentation; concept drift requires model retraining. Both trigger alerts in model monitoring

Edge Deployment vs. Cloud Deployment

Edge: model runs ON the device, no internet dependency, low latency, limited compute — requires model compression. Cloud: model runs on remote servers, scalable compute, requires stable connectivity. If the question mentions IoT, offline capability, or privacy-sensitive local data, edge deployment is the answer

Winsorization vs. Trimming (Truncation)

Winsorization: replaces outlier values with the boundary value at a specified percentile — sample size unchanged. Trimming: removes observations identified as outliers — sample size decreases. When the question asks about outlier handling that preserves sample size, the answer is winsorization
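The sample-size difference is visible in a few lines. In this sketch the lower and upper bounds are hard-coded hypothetical values standing in for percentile thresholds, which would normally be computed from the data.

```python
def winsorize(values, lower, upper):
    """Clamp values outside [lower, upper] to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

def trim(values, lower, upper):
    """Drop values outside [lower, upper] entirely."""
    return [v for v in values if lower <= v <= upper]

data = [1, 12, 13, 14, 15, 16, 17, 250]   # two obvious outliers
lower, upper = 10, 20                     # assumed percentile cut-offs

print(winsorize(data, lower, upper))   # [10, 12, 13, 14, 15, 16, 17, 20]
print(trim(data, lower, upper))        # [12, 13, 14, 15, 16, 17]
```

Winsorizing keeps all eight observations with the extremes capped, while trimming leaves only six.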

Scenario Tips

If the question asks about:

An IoT device needs real-time predictions on sensor data with no internet connectivity...

Answer:

Edge deployment. The model runs directly on the device. No connectivity required, low latency guaranteed. Model must be compressed/quantized for limited compute

Distractor to avoid:

Cloud deployment requires connectivity and introduces latency. Hybrid with cloud as primary still depends on connectivity for main inference. Edge is the only answer that works offline

If the question asks about:

A question asks how to merge customer records from two sources where names are spelled differently across systems...

Answer:

Fuzzy join using string similarity matching (edit distance, Jaccard). It matches records that are similar but not identical, handling typos and name variations

Distractor to avoid:

An inner join on exact name match would miss valid records. Cross join with filtering creates a cartesian product (too expensive). Left join on exact match also misses fuzzy matches
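The two similarity measures named in the answer can be sketched without any libraries. Edit distance here is the classic Levenshtein dynamic program, and Jaccard is computed over lowercase word tokens; the customer names are invented, and a production fuzzy join would add a match threshold and blocking to avoid comparing every pair.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def jaccard(a, b):
    """Jaccard similarity of the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a | set_b:
        return 1.0   # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

print(edit_distance("Jon Smith", "John Smith"))   # 1
print(edit_distance("kitten", "sitting"))         # 3
print(jaccard("John A Smith", "Smith John"))      # 2 shared of 3 total words
```

Edit distance tolerates typos inside a name, while token-level Jaccard tolerates reordered or missing words; an exact-match join would treat all of these pairs as different customers.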

If the question asks about:

A data science team detects that model predictions have drifted in production — how to determine if it is data drift or concept drift...

Answer:

Monitor input feature distributions (PSI, KL divergence) to detect data drift. Monitor model performance on fresh labeled data to detect concept drift. If input distributions shifted but labels are unavailable, assume data drift. If you have new ground truth and performance degraded, it could be concept drift

Distractor to avoid:

You cannot diagnose drift without monitoring. Retraining immediately without diagnosis may not solve the root cause
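For the input-distribution side of this diagnosis, the Population Stability Index (PSI) mentioned in the answer can be computed from binned counts as the sum over bins of (actual% - expected%) * ln(actual% / expected%). The bin counts below are invented, and the small epsilon guarding against empty bins is an implementation assumption.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        expected_pct = max(expected / expected_total, eps)  # avoid log(0)
        actual_pct = max(actual / actual_total, eps)
        total += (actual_pct - expected_pct) * math.log(actual_pct / expected_pct)
    return total

# Hypothetical histogram of one feature over the same four bins.
training_bins = [100, 300, 400, 200]
production_same = [100, 300, 400, 200]
production_shifted = [300, 400, 200, 100]

print(psi(training_bins, production_same))      # 0.0: no shift
print(psi(training_bins, production_shifted))   # clearly above 0
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though the exact threshold is a convention rather than a statistical guarantee. PSI only looks at inputs, so it can flag data drift but says nothing about whether the input-to-target relationship has changed.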

If the question asks about:

When the question describes a CRISP-DM scenario and asks which phase involves evaluating whether the model meets business objectives...

Answer:

Evaluation phase (Phase 5). This is where you assess the model against business success criteria BEFORE deciding to deploy. It is separate from model validation metrics done in the Modeling phase

Distractor to avoid:

Deployment phase (Phase 6) comes AFTER evaluation — you do not evaluate against business objectives during deployment. Model validation (train/test metrics) happens in Modeling phase, not the Evaluation phase

Last-Minute Facts

1. CRISP-DM phases in order: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
2. Version control for ML = code + data + hyperparameters + model artifacts (4 dimensions)
3. Parquet = columnar = fast for analytical SELECT column queries. CSV = row-based = simple but slow at scale
4. Winsorization = CAPS outliers at percentile. Trimming = REMOVES outliers. Winsorization preserves sample size
5. Data drift = X distribution changes. Concept drift = f(X)→Y relationship changes
6. Edge deployment = no internet required, low latency, limited compute, needs model compression
Domain 5 (13% of exam)

Specialized Applications of Data Science

Must-Know Facts

  • NLP pipeline order: tokenization → stop word removal → stemming or lemmatization → bag of words or TF-IDF → embeddings. Each step must precede the next
  • LDA in NLP context = Latent Dirichlet Allocation = unsupervised TOPIC MODEL for discovering themes in document collections. LDA in ML context = Linear Discriminant Analysis = supervised classification. Same abbreviation, completely different algorithms
  • TF-IDF is NOT bag of words. Bag of words counts raw word frequencies. TF-IDF weights by term frequency multiplied by inverse document frequency — downweighting common words
  • Word2Vec and GloVe produce DENSE vector embeddings that capture semantic relationships. Bag of words and TF-IDF produce SPARSE vectors. Dense embeddings are more powerful for similarity tasks
  • Reinforcement learning is NEITHER supervised NOR unsupervised — it is a third learning paradigm where an agent learns through reward and penalty signals from an environment
  • Data augmentation in computer vision (rotation, flipping, scaling, noise) creates transformed copies of existing images to improve model robustness — it does NOT create new independent data or improve resolution
  • Multi-armed bandit: balances exploration (trying unknown options) vs exploitation (using the known best option). Used in adaptive A/B testing and recommendation optimization
  • Simplex method is the algorithm for solving linear programming (constrained optimization) problems — it traverses constraint boundary vertices to find the optimal solution

Common Traps

Trap: Treating LDA as a single concept across all questions
Reality: LDA has two completely different meanings depending on context. In NLP: Latent Dirichlet Allocation, for unsupervised topic discovery. In supervised ML: Linear Discriminant Analysis, for dimensionality reduction and classification. The question context tells you which one applies
Trap: Classifying reinforcement learning as a type of unsupervised learning
Reality: Reinforcement learning is its own learning paradigm. It is NOT unsupervised (no label discovery from unlabeled data) and NOT supervised (no labeled training examples). An agent learns by interacting with an environment and receiving rewards or penalties
Trap: Assuming data augmentation in computer vision is equivalent to collecting new data
Reality: Augmentation creates transformed copies of existing images (rotated, flipped, scaled). These are NOT independent samples; they are derived from the same underlying images. Augmentation improves diversity and reduces overfitting but does not substitute for genuinely new data
Trap: Confusing TF-IDF with bag of words
Reality: Bag of words counts raw word occurrences per document, so common words like 'the' get high counts. TF-IDF multiplies term frequency by the inverse of how many documents contain the term, so words that are common across documents get low weights. TF-IDF is more discriminative
Trap: Thinking stemming and lemmatization produce the same results
Reality: Stemming is fast and rule-based but crude: it chops word endings and can produce non-words ('studies' → 'studi'). Lemmatization uses vocabulary and morphology to return valid dictionary base forms ('better' → 'good'). Stemming sacrifices accuracy for speed; lemmatization sacrifices speed for accuracy

Confusing Pairs

Stemming vs. Lemmatization

Stemming: fast, rule-based, chops endings, may produce non-words ('running' → 'run', 'studies' → 'studi'). Lemmatization: uses vocabulary, always returns valid dictionary form ('better' → 'good', 'running' → 'run'). If speed is the constraint, stemming. If accuracy is the constraint, lemmatization

Bag of Words vs TF-IDF

Bag of words: count raw word frequency per document, SPARSE vector, ignores how common a word is across documents. TF-IDF: weights words by frequency in document * rarity across corpus, common words ('the', 'is') get low weight, distinctive words get high weight. TF-IDF is better for document similarity and retrieval
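
The difference can be sketched in plain Python, assuming a three-document toy corpus and an unsmoothed IDF of log(N / df). The corpus and function names are illustrative assumptions; library implementations typically add smoothing, so their exact numbers will differ:

```python
import math
from collections import Counter

# Hypothetical toy corpus; "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
tokenized = [d.split() for d in docs]


def bag_of_words(tokens):
    """Raw term counts for one document."""
    return Counter(tokens)


def tf_idf(tokens, corpus):
    """TF-IDF weights for one document, using unsmoothed IDF = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for doc in corpus if term in doc)
        weights[term] = tf * math.log(n_docs / df)
    return weights


bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0], tokenized)
print("bag of words:", dict(bow))
print("tf-idf:", {t: round(w, 3) for t, w in weights.items()})
```

In the first document, bag of words gives 'the' the highest count, while TF-IDF gives it a weight of zero because it occurs in every document. 'mat', which occurs in only one document, gets the highest TF-IDF weight.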

Word2Vec / GloVe vs TF-IDF

Word2Vec/GloVe: DENSE embeddings that capture semantic meaning (similar words cluster together in vector space). TF-IDF: SPARSE vectors based on frequency statistics with no semantic understanding. For downstream tasks requiring meaning (sentiment, Q&A), embeddings are superior. For keyword extraction or document retrieval, TF-IDF can suffice
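
A rough illustration of why sparse term vectors carry no semantic similarity, assuming one dimension per vocabulary word for the sparse case. The dense vectors are hand-made, hypothetical values rather than trained Word2Vec or GloVe embeddings:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Sparse term space: each word owns one dimension.
sparse = {
    "car": [1, 0, 0],
    "automobile": [0, 1, 0],
    "banana": [0, 0, 1],
}

# Hypothetical dense vectors, invented so that synonyms sit close together.
dense = {
    "car": [0.90, 0.10],
    "automobile": [0.85, 0.15],
    "banana": [0.10, 0.95],
}

print("sparse car/automobile:", cosine(sparse["car"], sparse["automobile"]))
print("dense  car/automobile:", round(cosine(dense["car"], dense["automobile"]), 3))
print("dense  car/banana:    ", round(cosine(dense["car"], dense["banana"]), 3))
```

In the sparse space, 'car' and 'automobile' are orthogonal, so their similarity is zero even though they mean the same thing. A dense embedding space can place them close together, which is the property the semantic tasks above rely on.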

Multi-Armed Bandit vs A/B Testing

Traditional A/B testing: assigns equal traffic to variants for a fixed duration, then picks a winner — misses revenue during the test period. Multi-armed bandit: dynamically routes more traffic to better-performing variants during the test, minimizing lost opportunity while still exploring alternatives
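
A small simulation of the two allocation strategies, assuming epsilon-greedy as the bandit policy (one of several possible policies) and hypothetical conversion rates. `TRUE_RATES`, `ROUNDS`, and the function names are invented for illustration:

```python
import random

TRUE_RATES = [0.2, 0.6]  # hypothetical conversion rates per variant
ROUNDS = 2000


def ab_test(rng):
    """Fixed 50/50 split for the whole test."""
    pulls, reward = [0, 0], 0
    for t in range(ROUNDS):
        arm = t % 2
        pulls[arm] += 1
        reward += int(rng.random() < TRUE_RATES[arm])
    return pulls, reward


def epsilon_greedy(rng, epsilon=0.1):
    """Explore at random with probability epsilon, otherwise exploit the best estimate."""
    pulls, wins, reward = [0, 0], [0, 0], 0
    for _ in range(ROUNDS):
        if 0 in pulls or rng.random() < epsilon:
            arm = rng.randrange(2)  # explore
        else:
            arm = max(range(2), key=lambda a: wins[a] / pulls[a])  # exploit
        hit = int(rng.random() < TRUE_RATES[arm])
        pulls[arm] += 1
        wins[arm] += hit
        reward += hit
    return pulls, reward


ab_pulls, ab_reward = ab_test(random.Random(0))
mab_pulls, mab_reward = epsilon_greedy(random.Random(0))
print("A/B test:", ab_pulls, ab_reward)
print("Bandit:  ", mab_pulls, mab_reward)
```

The A/B test sends exactly half the traffic to the weaker variant for the full duration. The bandit shifts most traffic to the stronger variant once its estimates separate, while the epsilon share keeps exploring.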

Scenario Tips

If the question asks about:

A question asks how to discover hidden themes in 50,000 customer reviews without predefined categories...

Answer:

Use LDA (Latent Dirichlet Allocation) — unsupervised topic modeling. It identifies latent topics without predefined labels

Distractor to avoid:

Named entity recognition extracts specific entities (names, places), not themes. Text classification requires predefined categories. Sentiment analysis measures tone, not topics

If the question asks about:

A computer vision team has only 500 training images and asks the best way to improve model performance without collecting new data...

Answer:

Apply data augmentation: rotation, flipping, scaling, cropping, color jitter, noise injection. This artificially increases training set diversity and reduces overfitting

Distractor to avoid:

Increasing learning rate risks divergence. Switching to logistic regression loses spatial feature extraction. Removing pooling layers increases computation without solving the data limitation
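
A minimal sketch of two of the augmentations named above, assuming an image is represented as a nested list of pixel values. The helper names and the tiny 2x3 image are hypothetical; a real pipeline would use an image library:

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


# Hypothetical 2x3 grayscale "image".
image = [
    [1, 2, 3],
    [4, 5, 6],
]

augmented = [image, flip_horizontal(image), rotate_90(image)]
for variant in augmented:
    print(variant)

# Every variant contains exactly the same pixel values as the original.
pixel_sets = [sorted(p for row in v for p in row) for v in augmented]
print(all(s == pixel_sets[0] for s in pixel_sets))
```

One image becomes three training examples, but every variant is a rearrangement of the same pixels. That is why augmentation adds diversity without adding independent samples.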

If the question asks about:

A question asks which technique balances showing known high-performing product recommendations while also exploring potentially better new options...

Answer:

Multi-armed bandit. It explicitly models the exploration vs exploitation tradeoff, routing more traffic to known performers while testing alternatives

Distractor to avoid:

Linear programming is for constrained optimization with a known objective function. Greedy algorithms always exploit the current best without any exploration. Gradient descent optimizes parameters, not allocation decisions

If the question asks about:

An NLP question describes processing steps in the wrong order — which ordering is correct...

Answer:

Correct pipeline order: tokenization first, then stop word removal, then stemming or lemmatization, then vectorization (bag of words or TF-IDF), then optionally embedding. Never stem before tokenizing or vectorize before normalizing

Distractor to avoid:

Vectorizing before removing stop words leaves common words like 'the' in the vocabulary: bag of words gives them the highest counts, and TF-IDF only down-weights them rather than removing them. Always normalize text first, then vectorize
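
The ordering can be sketched as a chain of functions, assuming whitespace tokenization, a tiny hand-written stop word list, and a crude plural-stripping stemmer. `STOP_WORDS` and all function names are hypothetical, and the embedding step is omitted:

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "on", "and"}  # hypothetical tiny list


def tokenize(text):
    """Step 1: lowercase and split on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Step 2: drop uninformative tokens."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(tokens):
    """Step 3: strip a trailing 's' from longer tokens (toy rule)."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def vectorize(tokens):
    """Step 4: bag-of-words counts."""
    return Counter(tokens)


def pipeline(text):
    return vectorize(crude_stem(remove_stop_words(tokenize(text))))


counts = pipeline("The cats and the dogs chase the cat")
print(dict(counts))
```

Because stop words are removed and tokens are stemmed before counting, 'cats' and 'cat' merge into one feature and 'the' never reaches the vector.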

Last-Minute Facts

1. LDA in NLP = Latent Dirichlet Allocation = TOPIC MODEL (unsupervised). LDA in ML = Linear Discriminant Analysis = CLASSIFIER (supervised). Same acronym, different algorithms
2. Reinforcement learning = third paradigm: agent + environment + reward signal. NOT supervised, NOT unsupervised
3. NLP pipeline order: tokenize → stop words → stem/lemmatize → bag of words/TF-IDF → embeddings
4. TF-IDF > bag of words for distinguishing documents. Word2Vec > TF-IDF for semantic similarity
5. Stemming = fast, crude, may produce non-words. Lemmatization = slow, accurate, always valid base form
6. Data augmentation creates TRANSFORMED COPIES, not new independent samples, so it does not substitute for real data collection
7. Multi-armed bandit = exploration + exploitation tradeoff. Greedy = exploitation ONLY with no exploration
