Advanced (10 terms)

ML Platform & MLOps Tools

Feature stores, experiment tracking, model registries, drift detection, ONNX serialisation, training-serving skew, and ML infrastructure vocabulary for platform engineers.

  • Feature Store /ˈfiːtʃər stɔːr/

    A centralised repository that manages, stores, and serves machine learning features — providing an offline store (for training data) and an online store (for low-latency serving), with consistent feature computation across both contexts.

    "Before the feature store, training and serving used different feature computation code, causing training-serving skew. Now both use the same feature definitions from Feast — the offline store serves training pipelines, the online store serves the prediction API with sub-10ms reads."
  • Experiment Tracking /ɪkˈsperɪmənt ˈtrækɪŋ/

    The systematic logging of ML experiment parameters, metrics, artefacts, and environment information to enable comparison, reproducibility, and auditability of model training runs.

    "MLflow logs every training run: hyperparameters, dataset version, git commit, conda environment, training metrics by epoch, and the trained model artefact. We can reproduce any experiment from 6 months ago with a single CLI command."
  • Model Registry /ˈmɒdəl ˈreʤɪstri/

    A centralised catalogue that tracks model versions, their metadata (training data, metrics, framework), and their lifecycle stage (Staging, Production, Archived). The authoritative source of truth for which model version is in production.

    "The model registry shows version 14 is in Production (precision 0.91, deployed 3 weeks ago), version 15 is in Staging (precision 0.93, pending A/B test approval), and version 13 is Archived. Promotion from Staging to Production requires sign-off from the ML platform team."
  • Training-Serving Skew /ˈtreɪnɪŋ ˈsɜːrvɪŋ skjuː/

    The divergence between feature values or computation logic used during model training versus those used during online inference, causing the model to see different input distributions in production than it was trained on.

    "Training-serving skew was the root cause of the 15% precision drop in production. The training pipeline used pandas to compute the rolling average feature; the serving pipeline used a Java implementation with a different edge-case handling for null values. Centralising feature computation in the feature store eliminated the skew."
  • Model Drift /ˈmɒdəl drɪft/

    The degradation of a model's performance in production over time caused by changes in the statistical properties of input data (data drift) or in the relationship between inputs and the target (concept drift).

    "Model drift monitoring shows the churn prediction model's precision has dropped from 0.87 to 0.71 over 6 weeks. PSI for the 'days since last login' feature is 0.31 (above our 0.2 alert threshold) — data drift in user behaviour patterns post-product-redesign is the likely cause."
  • PSI (Population Stability Index) /ˌpiː es ˈaɪ/

    A statistic measuring how much a feature's distribution has shifted between a reference period (training time) and the current period. PSI < 0.1 = no significant change; 0.1–0.2 = moderate shift (investigate); > 0.2 = major shift (retrain likely required).

    "PSI monitoring runs daily for all 47 features in the credit risk model. The 'income_bracket' feature hit PSI 0.24 this morning — above the 0.2 threshold — triggered by the new higher salary bands applied by HR. A retraining run using updated data has been scheduled."
  • ONNX (Open Neural Network Exchange) /ˈɒnɪks/

    An open format for representing machine learning models, enabling a model trained in one framework (e.g., PyTorch) to be serialised and served in a different runtime (e.g., ONNX Runtime, TensorRT), typically for improved inference performance.

    "We train in PyTorch but serve via ONNX Runtime — the ONNX serialisation step is part of the CI/CD pipeline. ONNX Runtime on the serving infrastructure delivers 3× throughput and 60% lower p99 latency compared to the native PyTorch model server."
  • Shadow Mode Deployment /ˈʃædoʊ moʊd dɪˈplɔɪmənt/

    A deployment strategy where a new model version receives a copy of live production traffic and generates predictions, but those predictions are not returned to users — enabling comparison of the new model's outputs against the current model before full rollout.

    "Version 15 ran in shadow mode for 2 weeks, receiving 100% of production traffic alongside version 14. Shadow mode metrics showed v15 precision 0.93 vs v14 0.89, with no latency regression. These results justified the A/B rollout decision."
  • Champion/Challenger /ˈtʃæmpiən ˈtʃælɪndʒər/

    A model deployment pattern where the current production model (champion) and a candidate model (challenger) both receive a percentage of live traffic, with performance compared in real conditions before the challenger is promoted.

    "Champion/challenger split: v14 (champion) handles 90% of traffic, v15 (challenger) handles 10%. After 1 week, v15 shows higher precision and lower false positive rate on the same traffic distribution. IC approved full promotion to challenger → champion."
  • Lineage (ML context) /ˈlɪniɪdʒ/

    The traceable chain of provenance from raw data through feature computation, training runs, and model version to a specific production prediction — enabling root-cause analysis of prediction errors and compliance demonstration.

    "Full ML lineage for prediction ID P-8472: input data from dataset v3.1 (snapshot 2024-11-15), feature computation via feature pipeline v8, training run ID exp_447 (commit a3f9c), model registry version 14, deployed 2024-11-22. Audit trail complete, reviewable by the compliance team."