Quick Navigation
- Statistical Tests — When to Use Each
- Confusion Matrix Metrics
- Regression Variants and Regularization
- Feature Engineering and EDA
- Supervised Machine Learning Algorithms
- Unsupervised Learning
- Deep Learning Architecture
- Time Series and Probability
- NLP Pipeline
- MLOps and Data Operations
- Computer Vision and Optimization
- Key Exam Distinctions and Traps
Statistical Tests — When to Use Each
- t-test (Independent)
- Compares means of two independent groups (e.g., control vs treatment); the standard (Student's) version assumes approximately normal data and equal variances; use Welch's t-test when the variances differ.
- ANOVA (Analysis of Variance)
- Compares means across three or more groups simultaneously; use instead of multiple t-tests to avoid inflated Type I error.
- Chi-squared Test
- Tests independence between two categorical variables (e.g., does gender affect product preference?); operates on frequency counts, not means.
- Pearson vs Spearman Correlation
- Pearson measures linear relationships between continuous variables; Spearman works on ranks, so it measures monotonic relationships (including non-linear ones) and handles ordinal scales.
- Type I Error (α) vs Type II Error (β)
- Type I = false positive (reject a true null hypothesis); Type II = false negative (fail to reject a false null); significance level α controls the Type I error rate.
- AIC vs BIC
- Both penalize model complexity; lower values are better; BIC's penalty per parameter grows with sample size (ln n versus AIC's constant 2), so it favors simpler models more aggressively than AIC, especially with large datasets.
- p-value Interpretation
- Probability of observing results at least as extreme as the data assuming the null hypothesis is true; p < α (typically 0.05) means reject the null hypothesis.
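The Pearson vs Spearman distinction above can be checked on a small monotonic but non-linear series. This is a minimal standard-library sketch; the helper names (pearson, ranks, spearman) and the sample data are illustrative assumptions, not part of any library.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]  # monotonic, but clearly not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman reaches 1, while Pearson stays below 1 because the relationship is not a straight line.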
Confusion Matrix Metrics
- Accuracy = (TP + TN) / (TP + FP + TN + FN)
- Proportion of all correct predictions; misleading on imbalanced datasets where predicting only the majority class yields high accuracy.
- Precision = TP / (TP + FP)
- Of all predicted positives, what fraction is actually positive; optimize when false positives are costly (e.g., spam filtering).
- Recall = TP / (TP + FN)
- Of all actual positives, what fraction was correctly identified; optimize when false negatives are costly (e.g., fraud or disease detection).
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
- Harmonic mean of precision and recall; use when both false positives and false negatives matter and classes are imbalanced.
- ROC / AUC
- ROC plots true positive rate vs false positive rate across all thresholds; AUC = 0.5 means random classifier, AUC = 1.0 means perfect classifier.
- MCC (Matthews Correlation Coefficient)
- Balanced metric for binary classification on imbalanced datasets; ranges from -1 to +1; considers all four confusion matrix cells, making it more reliable than accuracy or F1 alone.
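The formulas in this section can be computed directly from the four confusion matrix counts. The sketch below is illustrative only: the function name classification_metrics, the counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts.

    Any ratio with a zero denominator is reported as 0.0 (a convention
    chosen for this sketch).
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

# 1,000 samples with 10 positives; the model always predicts "negative"
majority = classification_metrics(tp=0, fp=0, tn=990, fn=10)

# A model that catches 8 of the 10 positives at the cost of 20 false alarms
useful = classification_metrics(tp=8, fp=20, tn=970, fn=2)

for name, m in (("majority-only", majority), ("useful", useful)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The majority-only model has the higher accuracy yet zero recall and zero MCC, which is the imbalanced-data pitfall described above.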
Regression Variants and Regularization
- R-squared (Coefficient of Determination)
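- Proportion of variance in the target explained by the model (0–1 on training data for OLS with an intercept; can be negative on held-out data when the model does worse than predicting the mean); adjusted R-squared penalizes adding non-informative features and is preferred for model comparison.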
- RMSE (Root Mean Squared Error)
- Square root of average squared residuals; sensitive to large errors; reported in the same units as the target variable, making it interpretable alongside the target range.
- OLS Linear Regression
- Ordinary Least Squares minimizes the sum of squared residuals; assumes linearity, independence of errors, homoscedasticity, and normality of residuals.
- Ridge (L2): penalty = λ × Σ β²
- Shrinks all coefficients toward zero but never to exactly zero; prevents overfitting when all features contribute; does not perform feature selection.
- LASSO (L1): penalty = λ × Σ |β|
- Can shrink coefficients to exactly zero, automatically performing feature selection; best when many features are irrelevant; sparsity increases as λ increases.
- Elastic Net = α×L1 + (1-α)×L2
- Combines LASSO and Ridge penalties via mixing parameter α; provides feature selection while handling correlated features better than LASSO alone.
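The three penalty formulas above can be worked through on a small coefficient vector. This sketch follows the notation of this section (λ as the penalty strength, α as the mixing parameter); the function names and coefficients are illustrative assumptions, and real libraries may scale or parametrize the terms differently.

```python
def ridge_penalty(coefs, lam):
    """L2 penalty: lambda times the sum of squared coefficients."""
    return lam * sum(b ** 2 for b in coefs)

def lasso_penalty(coefs, lam):
    """L1 penalty: lambda times the sum of absolute coefficients."""
    return lam * sum(abs(b) for b in coefs)

def elastic_net_penalty(coefs, lam, alpha):
    """Mix of the two: alpha * L1 + (1 - alpha) * L2."""
    return alpha * lasso_penalty(coefs, lam) + (1 - alpha) * ridge_penalty(coefs, lam)

coefs = [3.0, -4.0, 0.0]  # hypothetical fitted coefficients (intercept excluded)
lam = 0.5

print("ridge      :", ridge_penalty(coefs, lam))             # 0.5 * (9 + 16)
print("lasso      :", lasso_penalty(coefs, lam))             # 0.5 * (3 + 4)
print("elastic net:", elastic_net_penalty(coefs, lam, 0.5))  # halfway mix
```

Setting alpha to 1 recovers the LASSO penalty and alpha to 0 recovers the Ridge penalty, which is why Elastic Net is described as a blend of the two.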
Feature Engineering and EDA
- One-Hot Encoding vs Label Encoding
- One-hot creates k binary columns for k nominal categories (no implied order); label encoding assigns integers — only correct for ordinal features, not nominal ones like color or city.
- Normalization (MinMaxScaler) vs Standardization (StandardScaler)
- Normalization scales to [0, 1]; standardization transforms to mean=0, std=1; distance-based algorithms (kNN, SVM, k-means) require one of these before training.
- VIF (Variance Inflation Factor) > 10
- VIF above 10 indicates severe multicollinearity between features; resolve by removing one correlated feature, applying PCA, or using Ridge/Elastic Net regularization.
- Box-Cox Transformation
- Power transformation that makes skewed data approximate a normal distribution; requires strictly positive values; λ=0 is equivalent to a log transform; stabilizes variance.
- Visualization Types
- Bar (categorical comparisons), scatter (two continuous variables), box plot (distribution + outliers; center line = median not mean), violin (distribution shape), heat map (correlations), line (time trends).
- Winsorization
- Caps outliers at a percentile threshold (e.g., 5th/95th percentile) rather than removing them; preserves sample size while limiting outlier influence — different from trimming (removal).
- Data Leakage Prevention
- Fit all preprocessing transformers (scaler, encoder, imputer) on training data only, then transform both train and test; fitting on the full dataset before splitting leaks test set statistics into training.
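The leakage rule above can be illustrated with a hand-rolled standardizer. This is a minimal sketch, not a replacement for a library scaler: the function names and the toy values are assumptions, and the population standard deviation is used for simplicity.

```python
import math

def fit_standard_scaler(train_values):
    """Learn the mean and population standard deviation from training data only."""
    n = len(train_values)
    mean = sum(train_values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in train_values) / n)
    return mean, std

def standardize(values, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 30.0]

# Correct: statistics come from the training split only
mean, std = fit_standard_scaler(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky: statistics computed on train + test let test values shape the transform
leaky_mean, leaky_std = fit_standard_scaler(train + test)

print("train-only mean/std:", mean, round(std, 3))
print("leaky mean/std     :", round(leaky_mean, 3), round(leaky_std, 3))
```

The two sets of statistics differ, so a model trained on the leaky version has indirectly seen the test set before evaluation.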
Supervised Machine Learning Algorithms
- Bias-Variance Tradeoff
- High bias = underfitting (model too simple, poor on train and test); high variance = overfitting (model memorizes training data, poor on test); total error = bias² + variance + irreducible noise.
- Random Forest (Bagging — parallel, reduces variance)
- Trains multiple decision trees in parallel on bootstrap samples with random feature subsets; averages predictions (regression) or takes a majority vote (classification); reduces variance with little increase in bias.
- XGBoost / Gradient Boosting (sequential, reduces bias)
- Trains trees sequentially; each new tree corrects errors of previous trees; XGBoost adds L1/L2 regularization; often outperforms random forest on tabular data when carefully tuned.
- from sklearn.model_selection import cross_val_score
  scores = cross_val_score(model, X, y, cv=5, scoring='f1')
- Runs 5-fold cross-validation; use scoring='roc_auc', 'precision', 'recall', or 'f1' depending on the business objective; returns an array of per-fold scores.
- from sklearn.model_selection import GridSearchCV
  gcv = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
  gcv.fit(X_train, y_train)
- Exhaustive search over all parameter combinations; use RandomizedSearchCV(n_iter=50) for large grids — samples random combinations and is often as effective with much less compute.
- SMOTE for Class Imbalance
- Generates synthetic minority-class samples by interpolating between existing samples; apply only on training data after the train/test split — never on test data.
- Naive Bayes (assumes feature independence)
- Applies Bayes' theorem with a strong independence assumption between features; fast and effective for text classification despite the assumption frequently being violated in practice.
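To make the 5-fold cross-validation bullet concrete, the sketch below shows how sample indices could be partitioned into folds. It is a hand-rolled illustration, not scikit-learn's implementation: the function name k_fold_indices is hypothetical, and there is no shuffling or stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: validate on {val_idx}, train on {train_idx}")
```

A model would be fit on each train_idx and scored on the matching val_idx, giving one score per fold, as in the array returned by cross_val_score.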
Unsupervised Learning
- k-Means: requires k, assumes spherical clusters
- Partitions data into k clusters by minimizing intra-cluster distances to centroids; use elbow method or silhouette score to select k; fails with non-spherical or unequal-density clusters.
- DBSCAN: no k required, handles arbitrary shapes
- Groups points in dense regions; automatically discovers cluster count; marks low-density points as noise (outliers); preferred over k-means when cluster count is unknown or shapes are irregular.
- Hierarchical Clustering (dendrogram)
- Builds a tree of clusters by iteratively merging (agglomerative) or splitting (divisive) clusters; cut the dendrogram at a chosen height to determine the number of clusters.
- PCA (Principal Component Analysis)
- Projects data onto orthogonal directions of maximum variance (principal components = eigenvectors of covariance matrix); reduces dimensionality while preserving global structure.
- t-SNE vs UMAP
- t-SNE: non-linear dimensionality reduction for 2D/3D visualization, preserves local structure, slow on large datasets. UMAP: faster, preserves local structure and tends to retain more global structure than t-SNE, better for very large datasets.
- Silhouette Score: range -1 to +1
- Measures cluster quality: compares intra-cluster cohesion vs inter-cluster separation; higher is better; score near 0 means overlapping clusters; negative score means misassigned points.
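The silhouette definition above can be traced on four one-dimensional points. This is a minimal sketch under stated assumptions: absolute difference as the distance, at least two clusters, and a score of 0.0 for singleton clusters; the function name and data are illustrative.

```python
def silhouette_scores(points, labels):
    """Per-point silhouette for 1-D points using absolute distance.

    Assumes at least two distinct cluster labels.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation)
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_scores(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_scores(points, [0, 1, 0, 1])   # deliberately mixed grouping

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
print("good labels:", round(mean_good, 3))
print("bad labels :", round(mean_bad, 3))
```

The natural grouping scores close to +1, while the mixed grouping goes negative, matching the "misassigned points" interpretation.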
Deep Learning Architecture
- Activation Functions by Layer Type
- Hidden layers: ReLU (mitigates vanishing gradients), Tanh (zero-centered); Binary output: Sigmoid (squashes to 0–1); Multi-class output: Softmax (probability distribution summing to 1).
- CNN (images): Conv → Pool → Fully Connected
- Convolutional layers extract local spatial features (edges, textures); pooling layers reduce spatial dimensions and provide translation invariance; FC layers perform final classification.
- RNN → LSTM → GRU (sequences)
- RNN: processes sequences recurrently but suffers vanishing gradient; LSTM: adds cell state and gating to retain long-term dependencies; GRU: simplified LSTM with fewer parameters, similar performance.
- Transformers (self-attention mechanism)
- Processes all sequence positions in parallel using self-attention weights; avoids RNN sequential bottleneck; foundation of BERT, GPT, and modern NLP/vision models.
- GANs: Generator vs Discriminator
- Generator creates synthetic samples; discriminator distinguishes real from fake; adversarial training improves both simultaneously; used for synthetic data generation and augmentation.
- Autoencoders: Encoder → Latent Space → Decoder
- Encoder compresses input to a low-dimensional latent representation; decoder reconstructs the original; anomaly detection: high reconstruction error signals an anomaly.
- Dropout Regularization
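- Randomly disables a fraction of neurons during each training step to prevent co-adaptation and overfitting; all neurons are active during inference; classic dropout scales weights by the keep probability at inference, while most frameworks use inverted dropout, scaling activations by 1/keep probability during training instead.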
- Learning Rate: too high = divergence, too low = slow convergence
- Adam optimizer adapts per-parameter learning rates and is the widely recommended default; SGD with momentum may generalize slightly better with careful tuning; RMSprop suits RNNs.
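The activation functions listed in this section are short enough to write out by hand. The sketch below uses scalar and list inputs for clarity rather than tensors; the function names mirror the common names but are defined here, not imported from any framework.

```python
import math

def relu(x):
    """Hidden-layer activation: passes positives through, zeroes negatives."""
    return max(0.0, x)

def sigmoid(x):
    """Binary-output activation: squashes any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow for large negative x
    return e / (1.0 + e)

def softmax(logits):
    """Multi-class output activation: non-negative values that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes
probs = softmax(logits)

print("relu(-2.0)  =", relu(-2.0))
print("sigmoid(0.0) =", sigmoid(0.0))
print("softmax      =", [round(p, 3) for p in probs])
```

The largest logit receives the largest probability, and the softmax outputs can be read as a distribution over classes.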
Time Series and Probability
- ARIMA(p, d, q)
- p = autoregressive order (past values); d = differencing order to achieve stationarity; q = moving average order (past forecast errors); the series must be stationary after d rounds of differencing.
- from statsmodels.tsa.statespace.sarimax import SARIMAX
  model = SARIMAX(data, order=(1,1,1), seasonal_order=(1,1,1,12))
  result = model.fit(disp=False)
- Fits a seasonal ARIMA; seasonal_order=(P,D,Q,s) adds seasonal components; s=12 for monthly data with annual seasonality; use disp=False to suppress convergence output.
- Stationarity (ADF Test)
- A stationary series has constant mean, variance, and autocorrelation over time; required for ARIMA; test with the Augmented Dickey-Fuller test (null hypothesis: the series has a unit root, so a small p-value supports stationarity); if non-stationary, apply differencing (d=1 or d=2).
- Probability Distributions
- Normal (continuous, symmetric, bell-shaped); Poisson (count data, events per fixed interval); Binomial (discrete, n Bernoulli trials with probability p); t-distribution (heavier tails, used for small samples).
- Survival Analysis
- Analyzes time-to-event data (customer churn, equipment failure); Kaplan-Meier curve is non-parametric and handles right-censored data; parametric methods assume an underlying survival distribution.
- Causal Inference: Correlation ≠ Causation
- Establish causation via randomized controlled trials (RCTs), A/B tests, difference-in-differences analysis, or directed acyclic graphs (DAGs) that explicitly model causal structure.
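The role of the differencing order d can be seen on two synthetic trends. This is a minimal sketch with made-up series and a hypothetical helper named difference; real data would also need a formal stationarity check such as the ADF test.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] - y[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

linear_trend = [3 + 2 * t for t in range(8)]   # mean rises steadily
quadratic_trend = [t * t for t in range(8)]    # trend itself accelerates

linear_d1 = difference(linear_trend, d=1)
quadratic_d1 = difference(quadratic_trend, d=1)
quadratic_d2 = difference(quadratic_trend, d=2)

print("linear, d=1   :", linear_d1)
print("quadratic, d=1:", quadratic_d1)
print("quadratic, d=2:", quadratic_d2)
```

One round of differencing flattens the linear trend, while the quadratic trend needs two, which mirrors choosing d=1 versus d=2. Each round also shortens the series by one observation.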
NLP Pipeline
- NLP Preprocessing Order
- Tokenization → stop word removal → stemming or lemmatization → feature representation (bag of words, TF-IDF, or embeddings); tokenize first before any other text transformation.
- Stemming vs Lemmatization
- Stemming: fast rule-based suffix stripping that may produce non-words ('studies' → 'studi'); lemmatization: returns the dictionary base form using vocabulary and part of speech ('better' → 'good'); stemming favors speed, lemmatization favors accuracy.
- Bag of Words vs TF-IDF
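- Bag of words: sparse vector of raw word counts; TF-IDF: weights each word by its in-document term frequency multiplied by its inverse document frequency (typically log(N / number of documents containing the word)), so words that appear in almost every document, like 'the', get a weight near zero.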
- Word Embeddings: Word2Vec, GloVe (dense vectors)
- Dense low-dimensional vectors that capture semantic similarity; semantically similar words have similar vectors; unlike sparse bag-of-words, allow vector arithmetic ('king' - 'man' + 'woman' ≈ 'queen').
- LDA = Latent Dirichlet Allocation (topic model, NOT discriminant analysis)
- Unsupervised method that discovers hidden topics in document collections without predefined labels; completely different from Linear Discriminant Analysis (LDA) in supervised classification.
- NLP Application Types
- Sentiment analysis (tone), named entity recognition / NER (extract persons, places, orgs), text classification, summarization, question answering, machine translation, speech recognition.
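The TF-IDF weighting described in this section can be computed by hand on a toy corpus. The sketch uses the textbook form, term frequency times log(N / document frequency), which is an assumption; library implementations often add smoothing and normalization, so their numbers will differ. The function name tf_idf and the documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),   # tokenization by whitespace only
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)

for term in ("the", "cat", "mat"):
    print(f"{term!r} in doc 0: {weights[0][term]:.3f}")
```

'the' appears in every document, so its inverse document frequency is log(1) = 0 and its weight vanishes, while the rarer 'mat' outweighs 'cat'.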
MLOps and Data Operations
- CRISP-DM 6 Phases
- Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment; phases are iterative; data preparation is commonly cited as consuming 60–80% of total project time.
- Version Control for Data Science (4 dimensions)
- Must version: code (git), data (DVC), hyperparameters (MLflow/experiment tracker), and model artifacts; versioning code alone is insufficient for experiment reproducibility.
- Data Drift vs Concept Drift
- Data drift: input feature distribution shifts (e.g., customer demographics change); concept drift: relationship between features and target changes (e.g., what predicts churn evolves); both degrade model performance.
- Deployment Environments
- Cloud: scalable, complex models; edge: on-device inference, low latency, works offline, limited compute (IoT/mobile); hybrid: cloud primary with edge fallback; on-premises: data sovereignty or compliance requirements.
- Streaming vs Batching
- Streaming: processes events in real-time as they arrive (fraud detection, alerting); batching: accumulates data and processes in bulk on a schedule (daily model retraining, reports).
- Parquet (columnar) vs CSV (row-based)
- Parquet: compressed, fast for analytical column scans on large datasets, schema-enforced; CSV: human-readable, no schema, slow for analytics — use Parquet for data pipelines at scale.
- Fuzzy Join (Record Linkage)
- Joins records across datasets with approximate string matches using similarity metrics like Levenshtein distance; resolves entity duplication when names differ slightly between sources.
- Winsorization vs Trimming vs Imputation
- Winsorization caps outliers at a percentile threshold (preserves rows); trimming removes outlier rows (reduces sample); imputation replaces outliers or missing values with estimated values.
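The fuzzy join bullet relies on an edit-distance metric, which can be sketched with the classic dynamic-programming Levenshtein distance. The fuzzy_match helper, its max_distance threshold, and the sample names are hypothetical; a production record-linkage job would typically add blocking and normalization steps.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]

def fuzzy_match(name, candidates, max_distance=2):
    """Closest candidate within max_distance edits, else None.

    Assumes candidates is non-empty; comparison is case-insensitive.
    """
    best = min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))
    if levenshtein(name.lower(), best.lower()) <= max_distance:
        return best
    return None

crm_names = ["John Smith", "Jane Doe"]  # hypothetical reference table
print(levenshtein("kitten", "sitting"))
print(fuzzy_match("Jon Smith", crm_names))
print(fuzzy_match("Alice", ["Bob", "Carol"]))
```

The threshold controls the trade-off: a larger max_distance links more near-duplicates but raises the risk of joining records that are genuinely different entities.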
Computer Vision and Optimization
- CNN for Computer Vision
- Convolutional layers learn spatial feature hierarchies (edges → textures → shapes → objects); pooling layers downsample and provide translation invariance; FC layers perform classification.
- Data Augmentation Techniques
- Artificially increases training diversity via rotation, horizontal/vertical flipping, scaling, cropping, brightness adjustment, noise injection, and occlusion; does not add new real images.
- Object Detection vs Semantic Segmentation
- Object detection identifies objects with bounding boxes (YOLO, Faster R-CNN); semantic segmentation classifies every pixel by category (U-Net); segmentation is finer-grained than detection.
- Multi-Armed Bandit (Exploration vs Exploitation)
- Models the tradeoff between exploiting known best options and exploring potentially better ones; used in A/B testing, ad serving, and recommendation systems; a greedy algorithm only exploits.
- Linear Programming (Simplex Method)
- Optimizes a linear objective function subject to linear inequality constraints; simplex method traverses vertices of the feasible polytope; used for resource allocation, scheduling, and pricing.
- Reinforcement Learning
- Agent learns an optimal policy by taking actions and receiving reward or penalty signals from an environment; neither supervised nor unsupervised; used in robotics, game AI, and adaptive recommendation.
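The exploration vs exploitation tradeoff in the multi-armed bandit bullet can be sketched with an epsilon-greedy policy, one common strategy among several. Everything named here (select_arm, run_bandit, the reward probabilities, the seed) is an illustrative assumption.

```python
import random

def select_arm(estimates, epsilon, rng):
    """Explore a random arm with probability epsilon, otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    # Ties resolve to the lowest index
    return max(range(len(estimates)), key=lambda i: estimates[i])

def run_bandit(true_probs, steps, epsilon, seed=0):
    """Simulate Bernoulli arms; return pull counts and estimated reward per arm."""
    rng = random.Random(seed)
    counts = [0] * len(true_probs)
    estimates = [0.0] * len(true_probs)
    for _ in range(steps):
        arm = select_arm(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Arm 0 never pays out, arm 1 always does
greedy_counts, _ = run_bandit([0.0, 1.0], steps=100, epsilon=0.0)
explore_counts, explore_estimates = run_bandit([0.0, 1.0], steps=100, epsilon=0.1)

print("greedy pulls :", greedy_counts)
print("explore pulls:", explore_counts)
```

With epsilon set to 0 the policy never tries arm 1, so it stays on the worthless arm for all 100 pulls, which is the "a greedy algorithm only exploits" failure mode. A non-zero epsilon gives the better arm a chance to be discovered.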
Key Exam Distinctions and Traps
- Bagging (Random Forest) vs Boosting (XGBoost)
- Bagging = parallel independent trees, reduces variance, Random Forest; Boosting = sequential error-correcting trees, reduces bias, XGBoost/GBM — the exam tests this distinction directly.
- LASSO zeroes coefficients; Ridge does not
- LASSO (L1) drives irrelevant feature coefficients to exactly zero, performing feature selection; Ridge (L2) only shrinks toward zero — if the question asks about feature selection via regularization, the answer is LASSO.
- Softmax (multi-class) vs Sigmoid (binary) output
- Multi-class output layer must use Softmax + categorical cross-entropy; binary output uses Sigmoid + binary cross-entropy; ReLU and Tanh are typically used in hidden layers, not as classification outputs.
- Curse of Dimensionality
- As features grow relative to samples, data becomes sparse in high-dimensional space, degrading model performance; mitigate with PCA, feature selection, or regularization.
- LDA (Topic Model in NLP) vs LDA (Discriminant Analysis in ML)
- Latent Dirichlet Allocation = unsupervised NLP topic model; Linear Discriminant Analysis = supervised classification/dimensionality reduction; same abbreviation, completely different algorithms — a deliberate exam trap.
- Accuracy is misleading on imbalanced datasets
- Predicting only the majority class can achieve high accuracy while completely failing to detect the minority class; use precision, recall, F1, or MCC for imbalanced classification.
- ARIMA requires stationary data
- Non-stationary time series (with trend or seasonality) must be differenced before ARIMA; the d parameter controls differencing; verify stationarity with the ADF test before fitting.
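The "LASSO zeroes coefficients; Ridge does not" distinction can be seen in closed form for a single coefficient. The sketch assumes a simplified one-dimensional problem: minimizing 0.5·(z − b)² plus the penalty, where z is the unpenalized estimate. Under that assumption the L1 solution is soft-thresholding and the L2 solution is proportional shrinkage; the function names and values are illustrative.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*abs(b): soft-thresholding at lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(z - b)**2 + lam*b**2: scales z down, never to zero."""
    return z / (1 + 2 * lam)

lam = 0.5
unpenalized = [2.0, -2.0, 0.3, -0.1]  # hypothetical unpenalized estimates

lasso_coefs = [lasso_shrink(z, lam) for z in unpenalized]
ridge_coefs = [ridge_shrink(z, lam) for z in unpenalized]

print("lasso:", lasso_coefs)
print("ridge:", ridge_coefs)
```

The two small coefficients are removed entirely by the L1 rule but only halved by the L2 rule, which is why LASSO is the expected answer for feature selection via regularization.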