Advanced (15 terms)

Data Science & ML

Vocabulary for ML engineers and data scientists: model types, training concepts, evaluation metrics, pipelines, and deployment terms used in technical discussions and documentation.

  • Feature /ˈfiːtʃər/

    An individual measurable property or variable used as input to a machine learning model. A dataset's feature matrix (X) contains all features for all samples. Feature engineering: the process of selecting, transforming, and creating features to improve model performance.

    "We added purchase_frequency and days_since_last_visit as features — these two engineered features improved the churn prediction model's F1 score from 0.71 to 0.84."
  • Label / Target /ˈleɪbəl / ˈtɑːrɡɪt/

    The output variable a supervised model is trained to predict. In classification: a category (spam/not-spam). In regression: a number (predicted price). The label column (y) is what the model learns to predict from the features (X). Unlabelled data: data without targets, used in unsupervised learning.

    "Our training dataset has 500,000 rows with a binary label: churned (1) or retained (0). The class imbalance — 94% retained — required oversampling the minority class before training."
  • Training / Validation / Test Split /ˈtreɪnɪŋ / ˌvælɪˈdeɪʃən / test splɪt/

    Dividing labelled data into three sets. Training set: data the model learns from (~70%). Validation set: used to tune hyperparameters and evaluate progress during training (~15%). Test set: held out until final evaluation — never used during development (~15%). Using the test set during development causes data leakage.

    "We use a 70/15/15 training/validation/test split stratified by the label distribution — each split has the same 6% churn rate as the full dataset to avoid biased evaluation."
  • Overfitting /ˌoʊvərˈfɪtɪŋ/

    When a model learns the training data too well — including noise and random patterns — and fails to generalise to new data. Signs: very high training accuracy, much lower validation accuracy. Fixes: more training data, regularisation (L1/L2), dropout, cross-validation, simpler model architecture.

    "The decision tree overfit badly — 99% training accuracy but only 62% on validation. Adding max_depth=6 as a constraint dropped training accuracy to 88% but raised validation to 85%, a much better generalisation."
  • Underfitting /ˌʌndərˈfɪtɪŋ/

    When a model is too simple to capture the underlying patterns in the data. Both training and validation accuracy are low. Fixes: more complex model, more features, longer training, reduced regularisation. Opposite of overfitting — the bias-variance spectrum runs from underfitting (high bias) to overfitting (high variance).

    "A linear model underfit the house price data — the relationship between square footage and price is non-linear in our market. Switching to a gradient boosting model captured the non-linearity and halved the RMSE."
  • Inference /ˈɪnfərəns/

    Using a trained model to make predictions on new, unseen data. Distinct from training (learning from data). Inference happens in production — it must be fast and reliable. Batch inference: predictions on large datasets offline. Real-time inference: predictions on single requests, latency-sensitive.

    "Our recommendation engine serves real-time inference at P99 latency under 50ms — we pre-compute embeddings daily but the nearest-neighbour lookup happens at inference time per request."
  • Precision vs Recall /prɪˈsɪʒən vs rɪˈkɔːl/

    Two complementary metrics for classification quality. Precision: of all items predicted as positive, what fraction are actually positive (accuracy of positive predictions). Recall: of all actual positives, what fraction did the model find (completeness). F1 score: harmonic mean of both. Trade-off: high precision → fewer false positives; high recall → fewer false negatives.

    "For fraud detection we optimise for recall — it is worse to miss a real fraud (false negative) than to flag a legitimate transaction by mistake (false positive). We accept 85% precision to achieve 97% recall."
  • Pipeline /ˈpaɪplaɪn/

    A sequence of data processing steps applied in order: data ingestion → cleaning → feature engineering → model training → evaluation → deployment. Pipelines are automated and reproducible. ML pipeline frameworks: scikit-learn Pipeline, Kubeflow, MLflow, Apache Airflow.

    "Our training pipeline runs every Sunday night — it pulls fresh data, applies the same preprocessing as production, retrains the model, runs evaluation checks, and if all thresholds pass, promotes it to the staging model registry."
  • Hyperparameter /ˌhaɪpərˈpærəmɪtər/

    A configuration value set before training that controls the learning process itself. Examples: learning rate, number of trees, max depth, batch size, dropout rate, number of layers. Distinct from model parameters (weights), which are learned during training. Tuned via grid search, random search, or Bayesian optimisation.

    "The most impactful hyperparameter for our gradient boosting model was learning_rate — we ran a random search over 50 combinations and found that 0.05 with 500 estimators outperformed the default 0.1 with 100."
  • Cross-validation /ˈkrɒs ˌvælɪˈdeɪʃən/

    A technique for evaluating model performance by training and testing on multiple different subsets of the data. K-fold: split data into K folds, train on K-1, test on 1, repeat K times, average the results. More reliable than a single train/test split, especially on small datasets.

    "We use 5-fold cross-validation for model selection — each configuration is evaluated 5 times on different data splits and we report the mean and standard deviation of F1 scores to choose the most robust model."
  • Embeddings /ɪmˈbedɪŋz/

    Dense numerical vector representations of categorical data (words, sentences, users, products) that capture semantic relationships. Words or items with similar meaning are close in the embedding space. Used in NLP (word2vec, BERT embeddings), recommendation systems, and vector databases.

    "We embed product descriptions using a sentence transformer model — similar products cluster together in the 768-dimensional space. The recommendation engine finds the 10 nearest neighbours in embedding space for each viewed item."
  • Data Leakage /ˈdeɪtə ˈliːkɪdʒ/

    When information from outside the training set inappropriately influences the model, leading to over-optimistic evaluation metrics that don't hold in production. Common causes: using the test set for decisions, computing feature statistics on the full dataset before splitting, using future data to predict past events.

    "Our model showed suspiciously high validation accuracy — we discovered the normalisation scaler was fit on the full dataset before splitting, leaking validation distribution into training. Re-running correctly dropped accuracy from 94% to 81%, which was honest."
  • Model Drift /ˈmɒdəl drɪft/

    Degradation in model performance over time as the real-world data distribution changes (data drift) or the relationship between features and target changes (concept drift). Requires monitoring production predictions and periodic retraining. Example: a fraud detection model trained pre-COVID underperforms post-COVID as spending patterns changed.

    "We monitor model drift weekly — if the PSI (Population Stability Index) of any feature exceeds 0.25, it triggers an automatic retraining pipeline. Our churn model was drifting seasonally and needed quarterly retraining."
  • A/B Test (Model) /eɪ biː test/

    A randomized controlled experiment comparing two model versions on real users. Traffic is split: group A receives predictions from model A, group B from model B. Business metrics (revenue, CTR, churn) are compared after a statistically significant period. Used to validate model improvements before full rollout.

    "Before promoting the new recommendation model to 100%, we ran a 2-week A/B test on 10% of traffic — the new model increased click-through rate by 12% with p-value < 0.001, which gave us confidence to fully roll out."
  • MLOps /em el ɒps/

    Machine Learning Operations — practices and tools for deploying, monitoring, and maintaining ML models in production. Covers: experiment tracking, model versioning and registry, automated retraining pipelines, model monitoring, and feature stores. The "DevOps for ML" discipline.

    "We introduced MLflow for experiment tracking and a model registry — every training run logs parameters, metrics, and artifacts. Promoting a model to production requires passing automated evaluation gates, not a manual decision."