
DY0-001 Cheat Sheet

Quick reference for the CompTIA DataAI (formerly DataX) exam.

Statistical Tests — When to Use Each

t-test (Independent)
Compares means of two independent groups (e.g., control vs treatment); Student's version assumes approximately normal data and equal variances; use Welch's t-test when the variances differ.
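A minimal sketch of the independent t-test using SciPy's `ttest_ind`. The two samples are hypothetical values chosen for illustration; `equal_var=False` switches from Student's test to Welch's variant, which does not assume equal variances.

```python
from scipy import stats

# Hypothetical measurements for two independent groups (illustrative values only).
control = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9]
treatment = [5.6, 5.8, 5.5, 5.9, 5.7, 5.6, 5.8, 5.7]

# equal_var=True is Student's t-test; equal_var=False is Welch's t-test.
student = stats.ttest_ind(control, treatment, equal_var=True)
welch = stats.ttest_ind(control, treatment, equal_var=False)

alpha = 0.05
reject_null = welch.pvalue < alpha  # reject H0 (equal means) when p < alpha
print(f"Student p={student.pvalue:.3g}, Welch p={welch.pvalue:.3g}, reject H0: {reject_null}")
```

With group means this far apart relative to their spread, both variants should reject the null at α = 0.05.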
ANOVA (Analysis of Variance)
Compares means across three or more groups simultaneously; use instead of multiple t-tests to avoid inflated Type I error.
Chi-squared Test
Tests independence between two categorical variables (e.g., does gender affect product preference?); operates on frequency counts, not means.
Pearson vs Spearman Correlation
Pearson measures linear relationships between continuous variables; Spearman measures monotonic relationships and handles non-linear data or ordinal scales.
Type I Error (α) vs Type II Error (β)
Type I = false positive (reject a true null hypothesis); Type II = false negative (fail to reject a false null); significance level α controls the Type I error rate.
AIC vs BIC
Both penalize model complexity; lower values are better; BIC applies a heavier penalty and favors simpler models more aggressively than AIC, especially with large datasets.
p-value Interpretation
Probability of observing results at least as extreme as the data assuming the null hypothesis is true; p < α (typically 0.05) means reject the null hypothesis.

Confusion Matrix Metrics

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Proportion of all correct predictions; misleading on imbalanced datasets where predicting only the majority class yields high accuracy.
Precision = TP / (TP + FP)
Of all predicted positives, what fraction is actually positive; optimize when false positives are costly (e.g., spam filtering).
Recall = TP / (TP + FN)
Of all actual positives, what fraction was correctly identified; optimize when false negatives are costly (e.g., fraud or disease detection).
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall; use when both false positives and false negatives matter and classes are imbalanced.
ROC / AUC
ROC plots true positive rate vs false positive rate across all thresholds; AUC = 0.5 means random classifier, AUC = 1.0 means perfect classifier.
MCC (Matthews Correlation Coefficient)
Balanced metric for binary classification on imbalanced datasets; ranges from -1 to +1; considers all four confusion matrix cells, making it more reliable than accuracy or F1 alone.
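The formulas in this section can be checked with a few lines of plain Python. This is a sketch only: the function name `confusion_metrics` and the counts are hypothetical, and it does not guard against zero denominators (for example, a model that never predicts the positive class).

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute the metrics above from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical imbalanced case: 50 actual positives among 1,000 samples.
m = confusion_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.98 here while precision, recall, and F1 are 0.80 and MCC is about 0.79, which illustrates why accuracy alone flatters a model on imbalanced data.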

Regression Variants and Regularization

R-squared (Coefficient of Determination)
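Proportion of variance in the target explained by the model; ranges 0–1 on training data for OLS with an intercept, but can be negative on held-out data when the model does worse than predicting the mean; adjusted R-squared penalizes adding non-informative features and is preferred for model comparison.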
RMSE (Root Mean Squared Error)
Square root of average squared residuals; sensitive to large errors; reported in the same units as the target variable, making it interpretable alongside the target range.
OLS Linear Regression
Ordinary Least Squares minimizes the sum of squared residuals; assumes linearity, independence of errors, homoscedasticity, and normality of residuals.
Ridge (L2): penalty = λ × Σ β²
Shrinks all coefficients toward zero but never to exactly zero; prevents overfitting when all features contribute; does not perform feature selection.
LASSO (L1): penalty = λ × Σ |β|
Can shrink coefficients to exactly zero, automatically performing feature selection; best when many features are irrelevant; sparsity increases as λ increases.
Elastic Net = α×L1 + (1-α)×L2
Combines LASSO and Ridge penalties via mixing parameter α; provides feature selection while handling correlated features better than LASSO alone.
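A small sketch contrasting the LASSO and Ridge behavior described above, on synthetic data where only two of ten features matter. The data, seed, and penalty strengths are assumptions for illustration; note that scikit-learn calls the penalty strength `alpha`, which corresponds to λ in the formulas above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features drive the target; the other eight are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("LASSO coefficients set exactly to zero:", n_zero_lasso)
print("Ridge coefficients set exactly to zero:", n_zero_ridge)
```

LASSO should zero out the irrelevant features while keeping shrunken weights on the two informative ones; Ridge keeps small non-zero weights everywhere.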

Feature Engineering and EDA

One-Hot Encoding vs Label Encoding
One-hot creates k binary columns for k nominal categories (no implied order); label encoding assigns integers — only correct for ordinal features, not nominal ones like color or city.
Normalization (MinMaxScaler) vs Standardization (StandardScaler)
Normalization scales to [0, 1]; standardization transforms to mean=0, std=1; distance-based algorithms (kNN, SVM, k-means) require one of these before training.
VIF (Variance Inflation Factor) > 10
VIF above 10 indicates severe multicollinearity between features; resolve by removing one correlated feature, applying PCA, or using Ridge/Elastic Net regularization.
Box-Cox Transformation
Power transformation that makes skewed data approximate a normal distribution; requires all positive values; λ=0 is equivalent to log transform; stabilizes variance.
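A sketch of the Box-Cox formula itself, written with NumPy for a fixed λ. The helper name `box_cox` is hypothetical; in practice a library routine such as `scipy.stats.boxcox` can also estimate λ from the data, which this sketch does not attempt.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for a given lambda: (x**lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

skewed = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
logged = box_cox(skewed, 0)        # identical to a log transform
near_log = box_cox(skewed, 1e-6)   # lambda close to 0 approaches the log transform
print(logged)
print(near_log)
```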
Visualization Types
Bar (categorical comparisons), scatter (two continuous variables), box plot (distribution + outliers; center line = median not mean), violin (distribution shape), heat map (correlations), line (time trends).
Winsorization
Caps outliers at a percentile threshold (e.g., 5th/95th percentile) rather than removing them; preserves sample size while limiting outlier influence — different from trimming (removal).
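Winsorization versus trimming can be sketched with NumPy's `percentile` and `clip`. The values below are hypothetical, with one obvious outlier.

```python
import numpy as np

values = np.array([12, 15, 14, 13, 16, 15, 14, 13, 15, 14,
                   16, 13, 15, 14, 12, 16, 14, 15, 13, 250], dtype=float)

# Caps at the 5th and 95th percentiles of the data.
low, high = np.percentile(values, [5, 95])

winsorized = np.clip(values, low, high)                    # cap: same number of rows
trimmed = values[(values >= low) & (values <= high)]       # remove: fewer rows

print("original max:", values.max(), "winsorized max:", winsorized.max())
print("rows kept:", len(winsorized), "after winsorizing vs", len(trimmed), "after trimming")
```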
Data Leakage Prevention
Fit all preprocessing transformers (scaler, encoder, imputer) on training data only, then transform both train and test; fitting on the full dataset before splitting leaks test set statistics into training.
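A minimal sketch of the leakage-safe order of operations: split first, fit the scaler on the training portion only, then transform both portions. The data is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Fit on training data only, so test-set statistics never reach the scaler.
scaler = StandardScaler().fit(X_train)

# 3. Apply the same fitted transform to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("scaler means come from the training set:", scaler.mean_)
```

Wrapping the scaler and model in a scikit-learn `Pipeline` achieves the same effect inside cross-validation, since the pipeline is refit on each training fold.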

Supervised Machine Learning Algorithms

Bias-Variance Tradeoff
High bias = underfitting (model too simple, poor on train and test); high variance = overfitting (model memorizes training data, poor on test); total error = bias² + variance + irreducible noise.
Random Forest (Bagging — parallel, reduces variance)
Trains multiple decision trees in parallel on bootstrap samples with random feature subsets; averages predictions for regression or takes a majority vote for classification; reduces variance with little increase in bias.
XGBoost / Gradient Boosting (sequential, reduces bias)
Trains trees sequentially; each new tree corrects errors of previous trees; XGBoost adds L1/L2 regularization; typically outperforms random forest on tabular data.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='f1')
Runs 5-fold cross-validation; use scoring='roc_auc', 'precision', 'recall', or 'f1' depending on the business objective; returns an array of per-fold scores.
from sklearn.model_selection import GridSearchCV
gcv = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
gcv.fit(X_train, y_train)
Exhaustive search over all parameter combinations; use RandomizedSearchCV(n_iter=50) for large grids — samples random combinations and is often as effective with much less compute.
SMOTE for Class Imbalance
Generates synthetic minority-class samples by interpolating between existing samples; apply only on training data after the train/test split — never on test data.
Naive Bayes (assumes feature independence)
Applies Bayes' theorem with a strong independence assumption between features; fast and effective for text classification despite the assumption frequently being violated in practice.

Unsupervised Learning

k-Means: requires k, assumes spherical clusters
Partitions data into k clusters by minimizing the within-cluster sum of squared distances to centroids; use elbow method or silhouette score to select k; performs poorly with non-spherical or unequal-density clusters.
DBSCAN: no k required, handles arbitrary shapes
Groups points in dense regions; automatically discovers cluster count; marks low-density points as noise (outliers); preferred over k-means when cluster count is unknown or shapes are irregular.
Hierarchical Clustering (dendrogram)
Builds a tree of clusters by iteratively merging (agglomerative) or splitting (divisive) clusters; cut the dendrogram at a chosen height to determine the number of clusters.
PCA (Principal Component Analysis)
Projects data onto orthogonal directions of maximum variance (principal components = eigenvectors of covariance matrix); reduces dimensionality while preserving global structure.
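The eigenvector view of PCA can be sketched directly in NumPy. The data is synthetic (two strongly correlated features plus one independent feature); a library implementation such as scikit-learn's `PCA` is normally used instead of this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 1))
X = np.hstack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=(300, 1)),  # nearly collinear with the first
    rng.normal(scale=0.5, size=(300, 1)),               # independent feature
])

X_centered = X - X.mean(axis=0)               # PCA assumes centered data
cov = np.cov(X_centered, rowvar=False)        # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues

order = np.argsort(eigvals)[::-1]             # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
X_2d = X_centered @ eigvecs[:, :2]            # project onto the top two components
print("explained variance ratio:", np.round(explained_ratio, 3))
```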
t-SNE vs UMAP
t-SNE: non-linear dimensionality reduction for 2D/3D visualization, preserves local structure, slow on large datasets. UMAP: faster, preserves both local and global structure, better for very large datasets.
Silhouette Score: range -1 to +1
Measures cluster quality: compares intra-cluster cohesion vs inter-cluster separation; higher is better; score near 0 means overlapping clusters; negative score means misassigned points.
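One way to use the silhouette score for choosing k, sketched with scikit-learn on three synthetic, well-separated blobs. The blob centers, spread, and candidate range of k are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.5, size=(100, 2)),
])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points

best_k = max(scores, key=scores.get)
print({k: round(s, 3) for k, s in scores.items()}, "best k:", best_k)
```

On data like this the score should peak at the true number of blobs; on real data the peak is often less clear-cut and is best read alongside the elbow plot.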

Deep Learning Architecture

Activation Functions by Layer Type
Hidden layers: ReLU (mitigates vanishing gradient), Tanh (zero-centered); Binary output: Sigmoid (squashes to 0–1); Multi-class output: Softmax (probability distribution summing to 1).
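The activation functions above are short enough to write out in NumPy. This is a sketch for intuition; the `softmax` subtracts the maximum logit first, a common numerical-stability trick that does not change the result.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    shifted = z - np.max(z, axis=-1, keepdims=True)   # stability: avoids large exponents
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, -1.0])   # hypothetical raw scores for three classes
probs = softmax(logits)
print("relu:", relu(logits))
print("sigmoid(0):", sigmoid(0.0))
print("softmax:", np.round(probs, 3), "sum:", probs.sum())
```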
CNN (images): Conv → Pool → Fully Connected
Convolutional layers extract local spatial features (edges, textures); pooling layers reduce spatial dimensions and provide translation invariance; FC layers perform final classification.
RNN → LSTM → GRU (sequences)
RNN: processes sequences recurrently but suffers vanishing gradient; LSTM: adds cell state and gating to retain long-term dependencies; GRU: simplified LSTM with fewer parameters, similar performance.
Transformers (self-attention mechanism)
Processes all sequence positions in parallel using self-attention weights; avoids RNN sequential bottleneck; foundation of BERT, GPT, and modern NLP/vision models.
GANs: Generator vs Discriminator
Generator creates synthetic samples; discriminator distinguishes real from fake; adversarial training improves both simultaneously; used for synthetic data generation and augmentation.
Autoencoders: Encoder → Latent Space → Decoder
Encoder compresses input to a low-dimensional latent representation; decoder reconstructs the original; anomaly detection: high reconstruction error signals an anomaly.
Dropout Regularization
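Randomly disables a fraction of neurons during each training step to prevent co-adaptation and overfitting; all neurons are active during inference. Classic dropout scales weights by the keep probability at inference; the inverted dropout used by most modern frameworks instead scales activations by 1/keep probability during training, so inference needs no rescaling.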
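A sketch of the "inverted" form of dropout, which rescales the surviving activations during training rather than rescaling weights at inference; both conventions keep the expected activation the same. The function name and shapes are illustrative.

```python
import numpy as np

def dropout(activations, keep_prob, training, rng):
    """Inverted dropout: zero units with probability 1 - keep_prob, rescale the rest."""
    if not training:
        return activations                       # inference: every unit active, no rescaling
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = dropout(a, keep_prob=0.8, training=True, rng=rng)
infer_out = dropout(a, keep_prob=0.8, training=False, rng=rng)
print("fraction dropped:", np.mean(train_out == 0), "mean activation:", train_out.mean())
```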
Learning Rate: too high = divergence, too low = slow convergence
Adam optimizer adapts per-parameter learning rates and is the widely recommended default; SGD with momentum may generalize slightly better with careful tuning; RMSprop suits RNNs.

Time Series and Probability

ARIMA(p, d, q)
p = autoregressive order (past values); d = differencing order to achieve stationarity; q = moving average order (past forecast errors); the series must be stationary after d rounds of differencing.
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(data, order=(1,1,1), seasonal_order=(1,1,1,12))
result = model.fit(disp=False)
Fits a seasonal ARIMA; seasonal_order=(P,D,Q,s) adds seasonal components; s=12 for monthly data with annual seasonality; use disp=False to suppress convergence output.
Stationarity (ADF Test)
A stationary series has constant mean, variance, and autocorrelation over time; required for ARIMA; test with Augmented Dickey-Fuller test; if non-stationary, apply differencing (d=1 or d=2).
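A sketch of what first-order differencing (d=1) does to a trending series, using NumPy only. The series is synthetic, and comparing half-sample means is just an informal check; a formal test such as ADF (for example `statsmodels.tsa.stattools.adfuller`) is not run here.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
series = 2.0 * t + rng.normal(scale=1.0, size=200)   # linear trend: the mean drifts over time

diff1 = np.diff(series, n=1)                         # first difference, i.e. d=1

# Informal check: compare the mean of the first and second halves.
raw_gap = abs(series[100:].mean() - series[:100].mean())
diff_gap = abs(diff1[100:].mean() - diff1[:100].mean())
print(f"mean shift in raw series: {raw_gap:.1f}")
print(f"mean shift after differencing: {diff_gap:.3f}")
```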
Probability Distributions
Normal (continuous, symmetric, bell-shaped); Poisson (count data, events per fixed interval); Binomial (discrete, n Bernoulli trials with probability p); t-distribution (heavier tails, used for small samples).
Survival Analysis
Analyzes time-to-event data (customer churn, equipment failure); Kaplan-Meier curve is non-parametric and handles right-censored data; parametric methods assume an underlying survival distribution.
Causal Inference: Correlation ≠ Causation
Support causal claims with randomized controlled trials (RCTs) or A/B tests where possible; with observational data, use methods such as difference-in-differences analysis, guided by directed acyclic graphs (DAGs) that explicitly model the assumed causal structure.

NLP Pipeline

NLP Preprocessing Order
Tokenization → stop word removal → stemming or lemmatization → feature representation (bag of words, TF-IDF, or embeddings); tokenize before any other text transformation.
Stemming vs Lemmatization
Stemming: fast rule-based suffix stripping, may produce non-words ('studies' → 'studi'); lemmatization: returns the dictionary base form using vocabulary and part of speech ('better' → 'good'); stemming trades accuracy for speed, lemmatization the reverse.
Bag of Words vs TF-IDF
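Bag of words: sparse vector of raw word counts; TF-IDF: multiplies a term's in-document frequency (TF) by its inverse document frequency (IDF, typically the log of total documents divided by documents containing the term), so words that appear in nearly every document, like 'the', get lower weight.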
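A plain-Python sketch of one textbook TF-IDF variant, with TF as relative frequency and IDF as log(N / document frequency). The three documents are made up, and library implementations (for example scikit-learn's `TfidfVectorizer`) use smoothed formulas, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)        # relative frequency within the document
    idf = math.log(n_docs / doc_freq[term])      # rarer across documents means larger IDF
    return tf * idf

weight_the = tf_idf("the", tokenized[0])   # appears in every document
weight_mat = tf_idf("mat", tokenized[0])   # appears in one document only
print(f"'the': {weight_the:.3f}, 'mat': {weight_mat:.3f}")
```

Because 'the' occurs in all three documents its IDF is log(1) = 0, so its weight vanishes even though it is the most frequent word in the first document.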
Word Embeddings: Word2Vec, GloVe (dense vectors)
Dense low-dimensional vectors that capture semantic similarity; semantically similar words have similar vectors; unlike sparse bag-of-words, allow vector arithmetic ('king' - 'man' + 'woman' ≈ 'queen').
LDA = Latent Dirichlet Allocation (topic model, NOT discriminant analysis)
Unsupervised method that discovers hidden topics in document collections without predefined labels; completely different from Linear Discriminant Analysis (LDA) in supervised classification.
NLP Application Types
Sentiment analysis (tone), named entity recognition / NER (extract persons, places, orgs), text classification, summarization, question answering, machine translation, speech recognition.

MLOps and Data Operations

CRISP-DM 6 Phases
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment; phases are iterative; data preparation is commonly cited as consuming 60–80% of total project time.
Version Control for Data Science (4 dimensions)
Must version: code (git), data (DVC), hyperparameters (MLflow/experiment tracker), and model artifacts; versioning code alone is insufficient for experiment reproducibility.
Data Drift vs Concept Drift
Data drift: input feature distribution shifts (e.g., customer demographics change); concept drift: relationship between features and target changes (e.g., what predicts churn evolves); both degrade model performance.
Deployment Environments
Cloud: scalable, complex models; edge: on-device inference, low latency, works offline, limited compute (IoT/mobile); hybrid: cloud primary with edge fallback; on-premises: data sovereignty or compliance requirements.
Streaming vs Batching
Streaming: processes events in real-time as they arrive (fraud detection, alerting); batching: accumulates data and processes in bulk on a schedule (daily model retraining, reports).
Parquet (columnar) vs CSV (row-based)
Parquet: compressed, fast for analytical column scans on large datasets, schema-enforced; CSV: human-readable, no schema, slow for analytics — use Parquet for data pipelines at scale.
Fuzzy Join (Record Linkage)
Joins records across datasets with approximate string matches using similarity metrics like Levenshtein distance; resolves entity duplication when names differ slightly between sources.
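A sketch of Levenshtein distance and a toy fuzzy match built on it, in plain Python. The helper `fuzzy_match`, the `max_distance` threshold, and the sample names are all hypothetical; production record linkage usually adds blocking and dedicated libraries.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))               # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]

def fuzzy_match(name, candidates, max_distance=2):
    """Return the closest candidate, or None if nothing is within max_distance edits."""
    best = min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
    return best if levenshtein(name.lower(), best.lower()) <= max_distance else None

print(levenshtein("kitten", "sitting"))
print(fuzzy_match("Jon Smith", ["John Smith", "Jane Smythe"]))
```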
Winsorization vs Trimming vs Imputation
Winsorization caps outliers at a percentile threshold (preserves rows); trimming removes outlier rows (reduces sample); imputation replaces outliers or missing values with estimated values.

Computer Vision and Optimization

CNN for Computer Vision
Convolutional layers learn spatial feature hierarchies (edges → textures → shapes → objects); pooling layers downsample and provide translation invariance; FC layers perform classification.
Data Augmentation Techniques
Artificially increases training diversity via rotation, horizontal/vertical flipping, scaling, cropping, brightness adjustment, noise injection, and occlusion; does not add new real images.
Object Detection vs Semantic Segmentation
Object detection identifies objects with bounding boxes (YOLO, Faster R-CNN); semantic segmentation classifies every pixel by category (U-Net); segmentation is finer-grained than detection.
Multi-Armed Bandit (Exploration vs Exploitation)
Models the tradeoff between exploiting known best options and exploring potentially better ones; used in A/B testing, ad serving, and recommendation systems; a greedy algorithm only exploits.
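Epsilon-greedy is one simple strategy for the exploration/exploitation tradeoff, sketched here with the standard library. The arm payout probabilities, step count, and ε are hypothetical; setting `epsilon=0` gives the purely greedy behavior mentioned above.

```python
import random

def epsilon_greedy(true_probs, steps=5000, epsilon=0.1, seed=0):
    """Simulate a Bernoulli bandit; returns pull counts and estimated reward per arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                          # explore: random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a]) # exploit: best estimate
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean update
    return counts, estimates

counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print("pulls per arm:", counts)
print("estimated reward per arm:", [round(e, 2) for e in estimates])
```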
Linear Programming (Simplex Method)
Optimizes a linear objective function subject to linear inequality constraints; simplex method traverses vertices of the feasible polytope; used for resource allocation, scheduling, and pricing.
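A small linear program solved with SciPy's `linprog`, as a sketch. The problem itself is hypothetical, and recent SciPy versions default to the HiGHS solvers, so this shows the problem setup rather than the classic tableau simplex steps.

```python
from scipy.optimize import linprog

# Hypothetical problem: maximize 3x + 2y
# subject to  x +  y <= 4
#             x + 3y <= 6
#             x >= 0, y >= 0
c = [-3.0, -2.0]                  # linprog minimizes, so negate the objective
A_ub = [[1.0, 1.0],
        [1.0, 3.0]]
b_ub = [4.0, 6.0]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
best_x, best_y = result.x
max_value = -result.fun           # undo the negation
print(f"x={best_x:.2f}, y={best_y:.2f}, objective={max_value:.2f}")
```

The optimum lands on a vertex of the feasible region (x = 4, y = 0, objective 12), which matches evaluating the objective at each corner by hand.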
Reinforcement Learning
Agent learns an optimal policy by taking actions and receiving reward or penalty signals from an environment; neither supervised nor unsupervised; used in robotics, game AI, and adaptive recommendation.

Key Exam Distinctions and Traps

Bagging (Random Forest) vs Boosting (XGBoost)
Bagging = parallel independent trees, reduces variance, Random Forest; Boosting = sequential error-correcting trees, reduces bias, XGBoost/GBM — the exam tests this distinction directly.
LASSO zeroes coefficients; Ridge does not
LASSO (L1) drives irrelevant feature coefficients to exactly zero, performing feature selection; Ridge (L2) only shrinks toward zero — if the question asks about feature selection via regularization, the answer is LASSO.
Softmax (multi-class) vs Sigmoid (binary) output
Multi-class output layer must use Softmax + categorical cross-entropy; binary output uses Sigmoid + binary cross-entropy; ReLU and Tanh are used in hidden layers only.
Curse of Dimensionality
As features grow relative to samples, data becomes sparse in high-dimensional space, degrading model performance; mitigate with PCA, feature selection, or regularization.
LDA (Topic Model in NLP) vs LDA (Discriminant Analysis in ML)
Latent Dirichlet Allocation = unsupervised NLP topic model; Linear Discriminant Analysis = supervised classification/dimensionality reduction; same abbreviation, completely different algorithms — a deliberate exam trap.
Accuracy is misleading on imbalanced datasets
Predicting only the majority class can achieve high accuracy while completely failing to detect the minority class; use precision, recall, F1, or MCC for imbalanced classification.
ARIMA requires stationary data
Non-stationary time series (with trend or seasonality) must be made stationary for ARIMA; the d parameter sets how many times the model differences the series, and seasonal patterns call for seasonal differencing as in SARIMA; verify stationarity with the ADF test before fitting.
