5 exercises — practice structuring strong English answers to data science and ML engineering interview questions: model drift, precision vs recall, model explainability, overfitting, and feature engineering.
How to structure ML interview answers
Drift questions: always distinguish data drift (P(X) changes) from concept drift (P(Y|X) changes) — most candidates miss this
Metrics questions: give the formula → explain the threshold mechanism → give contrasting real-world scenarios with reasoning
Explainability questions: three layers — frame as a decision, translate metrics to business language, use SHAP for specific predictions
Overfitting questions: give the loss curve signature → remedies with mechanism → connect to bias-variance trade-off
Feature engineering questions: open with domain knowledge → structured categories (temporal, encoding, interaction) → feature selection to prune noise
1 / 5
The interviewer asks: "How do you detect and handle model drift in a production ML system?" Which answer best demonstrates ML engineering maturity?
Option B is the strongest: it makes the critical distinction between data drift and concept drift (most candidates conflate them), names specific statistical tests (PSI, KS test) rather than just saying "monitor statistics", explains what drift looks like in practice (output distribution shift), and gives a complete set of response options including roll-back — showing that retraining is not always the only answer.
Data drift vs concept drift — the key distinction:
Data drift (covariate shift) — the distribution of X (input features) changes: P(X) changes, but P(Y|X) stays the same. Example: a new device type appears in traffic; the model was never trained on it.
Concept drift — the relationship between features and the target changes: P(Y|X) changes. Example: the definition of a "fraudulent transaction" shifts as fraud patterns evolve.
Detection tools:
PSI (Population Stability Index) — the industry standard for feature drift; PSI > 0.25 indicates major drift.
KS test (Kolmogorov-Smirnov) — a statistical test for distributional differences between two samples.
ADWIN, Page-Hinkley — drift detection algorithms for streaming data.
Response options hierarchy:
1. Retrain on fresh data (most common).
2. Retrain with time-decayed weights (emphasise recent data).
3. Roll back while investigating.
4. Feature engineering to capture the drifted dimension.
Option D is also strong (it mentions shadow deployments, a standard production ML deployment pattern) but misses the data/concept drift distinction.
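To make the detection tools concrete, here is a minimal sketch of a drift check combining PSI and the KS test. The `baseline` and `current` arrays, the bin count, and the synthetic data are illustrative assumptions, not part of the exercise.

```python
# Minimal sketch: feature-drift check with PSI and a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index of one feature between two samples."""
    # Bin edges come from quantiles of the baseline (training-time) data.
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    # Clip production values into the baseline range so every value lands in a bin.
    actual = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0] / len(current)
    expected, actual = expected + eps, actual + eps  # avoid log(0)
    return np.sum((actual - expected) * np.log(actual / expected))

baseline = np.random.normal(0.0, 1.0, 10_000)   # stand-in for training data
current = np.random.normal(0.5, 1.0, 10_000)    # shifted production data

score = psi(baseline, current)
stat, p_value = ks_2samp(baseline, current)
print(f"PSI = {score:.3f}  (> 0.25 suggests major drift)")
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")
```

In practice this check would run per feature on a schedule, with alerts feeding the response hierarchy above.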
2 / 5
The interviewer asks: "Explain the difference between precision and recall, and describe a real-world scenario where you would optimise for one over the other." Which answer demonstrates the deepest understanding?
Option B is the strongest: it gives the precise formulas with TP/FP/FN notation, explains the mechanism of the trade-off (the decision threshold), gives two contrasting real-world scenarios with the reasoning behind each choice, and addresses the imbalanced dataset problem — a key practical issue that exposes ML depth.
The formulas — always know these for interviews:
$\text{Precision} = \frac{TP}{TP + FP}$ (of what I predicted positive, how many were right?)
$\text{Recall} = \frac{TP}{TP + FN}$ (of all actual positives, how many did I find?)
$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (harmonic mean)
The threshold mechanism: every classifier outputs a probability. At threshold 0.5, above = positive. Lowering to 0.3 catches more positives (recall up) but with more false positives (precision down). Raising to 0.8 is very conservative (precision up, recall down).
Imbalanced datasets: with 99% negative class, a model that always predicts negative has 99% accuracy but 0% recall — useless. Use precision-recall AUC, not accuracy.
Scenario decision framework:
False negative is more costly → optimise recall (medical, fraud, security).
False positive is more costly → optimise precision (spam, content moderation, legal).
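A small sketch of the threshold mechanism on a toy imbalanced dataset with a logistic regression (both are stand-ins chosen for illustration): sweeping the threshold shows precision and recall moving in opposite directions.

```python
# Minimal sketch: precision/recall trade-off as the decision threshold moves.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: roughly 90% negative, 10% positive.
X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

for threshold in (0.3, 0.5, 0.8):
    preds = (proba >= threshold).astype(int)
    p = precision_score(y_te, preds, zero_division=0)
    r = recall_score(y_te, preds)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

The printout makes the interview point directly: the lower threshold lifts recall at the cost of precision, the higher one does the reverse.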
3 / 5
The interviewer asks: "How would you explain your machine learning model and its predictions to a non-technical stakeholder?" Which answer demonstrates the best communication strategy?
Option B is the strongest: it presents a systematic three-layer approach (framing, performance translation, explainability), gives concrete before/after examples for each translation, names SHAP specifically (the industry-standard tool for ML explainability), and — critically — includes transparency about model limitations, which builds genuine stakeholder trust.
ML communication framework for interviews:
Layer 1: Frame in terms of decisions — connect the model output to the action the stakeholder will take. "The model gives you a ranked list of customers to call" is more useful than "the model outputs a probability vector."
Layer 2: Translate metrics — precision/recall → business hit-rate language; AUC → lift over baseline. "If we use the model we reach 75% of churners by contacting 10% of customers; without the model we'd need to contact 50% to reach the same 75%." This is called the lift.
Layer 3: Explainability tools —
SHAP (SHapley Additive exPlanations): assigns a contribution value to each feature for each individual prediction. Answers "why was this specific customer predicted to churn?"
LIME: local approximation — builds a simple interpretable model around a single prediction.
Feature importance (global): which features the model uses most across all predictions.
Trust through limitations — telling stakeholders where the model performs less well builds more confidence than hiding limitations.
Options C and D are competent but less structured and miss the three-layer framework.
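For the SHAP layer, a minimal sketch of explaining one individual prediction with `shap.TreeExplainer`. The churn features, toy labels, and model choice are assumptions; the sketch needs the `shap` package, and the exact shape returned by `shap_values` can vary between shap versions and model types.

```python
# Minimal sketch: per-feature SHAP contributions for one customer's prediction.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "days_since_last_login": rng.integers(0, 90, 1_000),
    "monthly_spend": rng.normal(50, 20, 1_000),
    "support_tickets": rng.poisson(1, 1_000),
})
y = (X["days_since_last_login"] > 45).astype(int)  # toy churn label

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer attributes each prediction to per-feature contributions.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # (n_samples, n_features) for this model

customer = 0  # "why was this specific customer predicted to churn?"
for name, value in sorted(zip(X.columns, shap_values[customer]),
                          key=lambda kv: -abs(kv[1])):
    print(f"{name:>25s}: {value:+.3f}")
```

The sorted printout is the kind of artefact that translates directly into stakeholder language: "this customer was flagged mainly because of 60 days without a login."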
4 / 5
The interviewer asks: "What is the difference between overfitting and underfitting, and how do you address each?" Which answer demonstrates the clearest mental model?
Option B is the strongest: it gives both the error pattern signatures (training vs. validation), prescribes the learning curve as the diagnostic tool, provides a comprehensive remedy list for each with the mechanism of each fix, and precisely maps the problem to the bias-variance trade-off framework — which is the theoretical foundation that interviewers are often probing for.
Bias-variance trade-off — the theoretical framework:
Bias — error from wrong model assumptions. A linear model fitting a quadratic relationship has high bias.
Variance — error from sensitivity to small fluctuations in training data. A deep tree that perfectly fits 100 training points has high variance.
Total error = Bias² + Variance + Irreducible noise.
Underfitting = high bias. Overfitting = high variance.
Regularisation remedies explained:
L2 (Ridge) — adds λ·Σw² to the loss; penalises large weights, drives them towards zero but not exactly zero.
L1 (Lasso) — adds λ·Σ|w| to the loss; drives some weights exactly to zero (sparse solutions, feature selection).
Dropout — randomly sets neurons to zero during training, forcing the network to learn redundant representations.
Diagnosis sequence:
1. Plot training loss vs. validation loss.
2. If the gap is large → overfitting.
3. If both are high → underfitting.
4. The learning curve also shows whether more data would help (overfitting curves converge with more data; underfitting curves don't).
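A small sketch of the diagnosis plus the L1/L2 mechanism on a toy dataset with many irrelevant features (all names, values, and regularisation strengths are illustrative): the unregularised model shows the train/validation gap, Ridge shrinks weights, and Lasso drives some exactly to zero.

```python
# Minimal sketch: train/validation gap as an overfitting signal,
# and L2 (Ridge) vs L1 (Lasso) regularisation effects on the weights.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                    # many features, few samples
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 matter
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for name, model in [("unregularised", LinearRegression()),
                    ("ridge (L2)", Ridge(alpha=10.0)),
                    ("lasso (L1)", Lasso(alpha=0.1))]:
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    val = mean_squared_error(y_val, model.predict(X_val))
    zeros = int(np.sum(np.isclose(model.coef_, 0.0)))
    print(f"{name:>14s}  train MSE={tr:.3f}  val MSE={val:.3f}  zero weights={zeros}")
```

A large gap between train and validation MSE for the unregularised model is the overfitting signature described above; the Lasso row shows the sparsity mechanism (most of the 50 weights end up exactly zero).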
5 / 5
The interviewer asks: "How do you approach feature engineering, and what techniques do you commonly use?" Which answer best demonstrates practical ML experience?
Option B is the strongest: it opens with an important framing statement ("feature engineering is often more impactful than model selection" — a widely cited piece of practitioner wisdom), structures the answer by technique categories with concrete examples tied to a specific problem domain (a churn model), gives decision criteria for encoding choices (why target encoding for high cardinality), and includes the important counter-point that more features is not always better.
Feature engineering vocabulary:
Temporal features — time-since events, rolling window aggregations, trend/velocity. Critical for user behaviour models.
Encoding strategies:
Label encoding — ordinal categories only (e.g., small/medium/large → 0/1/2).
One-hot encoding — nominal categories with low cardinality (< ~20 values). Creates binary columns.
Target encoding — replace the category with the mean target value; handles high cardinality but risks target leakage — use cross-validation or add smoothing.
Entity embeddings — learned dense representations for very high cardinality (neural networks).
Interaction features — explicit multiplication/division when domain knowledge suggests a ratio or product is meaningful. Tree models find these automatically; linear models need them made explicit.
Feature selection — SHAP feature importance (model-agnostic), recursive feature elimination (RFE), Pearson/Spearman correlation for linear feature-target relationships.
Curse of dimensionality: many irrelevant features add noise, hurt generalisation, and slow training — especially in distance-based models (KNN, SVM).
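To illustrate the target-encoding-with-smoothing point, a minimal sketch follows. The column names and the smoothing constant are assumptions; in practice the encoding should be fitted inside cross-validation folds so the target of a row never leaks into its own encoding.

```python
# Minimal sketch: smoothed target encoding for a high-cardinality categorical.
import pandas as pd

def target_encode(train, column, target, smoothing=10.0):
    """Replace each category with a smoothed mean of the target."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Shrink category means towards the global mean when counts are small.
    smooth = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return train[column].map(smooth)

df = pd.DataFrame({
    "city": ["berlin", "berlin", "paris", "lyon", "paris", "lyon", "berlin"],
    "churned": [1, 0, 1, 1, 0, 1, 0],
})
df["city_encoded"] = target_encode(df, "city", "churned")
print(df)
```

The smoothing term is what keeps rare categories from getting extreme encoded values, which is the practical answer to the leakage and overfitting risk mentioned above.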