How to Write a Data Contract in English
A complete guide for data engineers: what a data contract is, how to write one in English, the required sections, vocabulary, and ready-to-use templates.
A data contract is a formal agreement between the producer of a dataset and its consumers, defining the structure, quality, freshness, and semantics of the data. As data mesh and data platform architectures mature, data contracts have become a core communication tool — and writing them clearly in English is a professional skill every data engineer needs. This guide covers the full structure, vocabulary, and language patterns.
What Is a Data Contract?
A data contract is a formal specification that describes:
- What data a producer provides
- What schema (fields, types, formats) the data has
- What quality guarantees the producer makes
- What service level the consumer can expect (freshness, availability)
- Who owns the data and who to contact when things break
Data contracts codify the implicit agreement between teams into an explicit, versioned document — preventing the classic problem where schema changes break downstream consumers without warning.
“Before we let any team consume the orders topic in Kafka, they sign off on the data contract. Any schema change requires a new contract version and consumer notification.”
Why Data Contracts Matter
Without data contracts, breaking changes flow silently downstream:
- Column renamed → analytics dashboard breaks
- Timestamp format changes → ML feature pipeline fails
- New nullable field added → downstream null-handling breaks
With data contracts:
- Changes are versioned and announced
- Consumers know what stability guarantees they can rely on
- Producers know who will be affected before they change anything
Data Contract Structure
Section 1: Header / Metadata
```yaml
# orders-v2.data-contract.yaml
id: "orders-v2"
version: "2.1.0"
status: "active"   # draft | active | deprecated
title: "Customer Orders"
description: >
  Contract for the customer orders dataset. Contains all confirmed orders
  placed through the web and mobile channels, refreshed hourly.
owner: "data-platform-team@acme.com"
contact:
  slack: "#data-platform"
  oncall: "https://pagerduty.com/teams/data-platform"
created: "2024-06-01"
updated: "2026-03-15"
```
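A minimal sketch of how a CI check might validate this header before a contract is published. The field names mirror the YAML above; the helper name and the strict semver pattern are illustrative assumptions, not part of any standard tooling.

```python
import re

# Allowed lifecycle states and a strict MAJOR.MINOR.PATCH pattern
# (assumptions for illustration; adapt to your contract tooling).
VALID_STATUSES = {"draft", "active", "deprecated"}
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")

def validate_header(header: dict) -> list[str]:
    """Return a list of problems found in a contract header (empty = valid)."""
    problems = []
    for field in ("id", "version", "status", "owner"):
        if not header.get(field):
            problems.append(f"missing required field: {field}")
    if header.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {header.get('status')!r}")
    if header.get("version") and not SEMVER_RE.match(header["version"]):
        problems.append(f"version is not semver: {header['version']!r}")
    return problems

header = {"id": "orders-v2", "version": "2.1.0", "status": "active",
          "owner": "data-platform-team@acme.com"}
print(validate_header(header))  # []
```

Running a check like this on every pull request catches malformed contracts before consumers ever see them.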
Key terms:
Status — lifecycle state. draft means not yet stable; active means in use; deprecated means consumers should migrate.
Versioning (Semantic Versioning):
- Major version (2.0.0 → 3.0.0) — breaking change (field removed, type changed)
- Minor version (2.0.0 → 2.1.0) — backward-compatible addition (new optional field)
- Patch (2.0.0 → 2.0.1) — documentation or metadata update only
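The bump rules above can be sketched as a small helper (a hypothetical function name, shown only to make the three rules concrete):

```python
def bump(version: str, change: str) -> str:
    """Apply a semver bump: 'major' for breaking changes, 'minor' for
    backward-compatible additions, 'patch' for docs/metadata updates."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"   # breaking: reset minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"   # additive: reset patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change}")

print(bump("2.0.0", "minor"))  # 2.1.0
```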
“This is a breaking change — we’re removing the legacy_id field. This requires a major version bump from v2 to v3.”
Section 2: Schema Definition
The schema section describes every field in the dataset.
```yaml
schema:
  - name: order_id
    type: string
    required: true
    description: "Unique identifier for the order (UUID v4)"
    example: "550e8400-e29b-41d4-a716-446655440000"
  - name: customer_id
    type: string
    required: true
    description: "Foreign key referencing the customers table"
    example: "cust-12345"
  - name: status
    type: string
    required: true
    description: "Current order status"
    enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
  - name: total_amount
    type: decimal(10,2)
    required: true
    description: "Total order value in EUR, excluding VAT"
    constraints:
      minimum: 0.01
  - name: placed_at
    type: timestamp
    required: true
    description: "Timestamp of order placement in UTC (ISO 8601)"
    example: "2026-03-15T14:30:00Z"
  - name: metadata
    type: object
    required: false
    description: "Optional unstructured metadata from the order source"
    nullable: true
```
Schema vocabulary:
Required / nullable — whether the field must be present and non-null. “The placed_at field is required and non-nullable — any record missing it is a producer bug.”
Enum — an explicit list of valid values for a field. “The status field is an enum — any value not in this list indicates a data quality issue.”
Type precision — use precise types: decimal(10,2) instead of just “number”; timestamp with timezone instead of just “date”.
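To show how the required/enum/precision rules translate into a runtime check, here is a minimal validator sketch. The in-memory `SCHEMA` dict is a hypothetical, trimmed-down mirror of the contract's schema section, not a real contract-tooling API.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical in-memory mirror of the contract's schema section —
# just the rules needed to illustrate validation.
SCHEMA = {
    "order_id":     {"required": True},
    "customer_id":  {"required": True},
    "status":       {"required": True,
                     "enum": {"pending", "confirmed", "shipped",
                              "delivered", "cancelled"}},
    "total_amount": {"required": True, "minimum": Decimal("0.01")},
    "placed_at":    {"required": True},
    "metadata":     {"required": False},
}

def validate_record(record: dict) -> list[str]:
    """Check one record against the schema; return violations (empty = valid)."""
    errors = []
    for name, rules in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: required field is missing or null")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} not in allowed enum values")
        if "minimum" in rules:
            try:
                if Decimal(str(value)) < rules["minimum"]:
                    errors.append(f"{name}: below minimum {rules['minimum']}")
            except InvalidOperation:
                errors.append(f"{name}: not a valid decimal")
    return errors
```

A record with `status: "refunded"` would fail the enum check here, which is exactly the “data quality issue” the vocabulary above describes.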
Section 3: Quality SLOs (Service Level Objectives)
This section defines what quality guarantees the producer commits to.
```yaml
quality:
  completeness:
    - field: order_id
      threshold: "100%"
      description: "Every record must have an order_id"
    - field: customer_id
      threshold: "99.9%"
      description: "Occasional nulls accepted for guest checkouts"
  freshness:
    lag_slo: "< 30 minutes"
    description: "New orders must appear in the dataset within 30 minutes of confirmation"
  volume:
    daily_min: 1000
    daily_max: 500000
    description: "Expected daily record volume range. Alerts fire outside this range."
  uniqueness:
    - field: order_id
      constraint: "Must be globally unique — no duplicates"
```
Quality vocabulary:
SLO (Service Level Objective) — a target metric that the producer commits to. “Our freshness SLO is 30 minutes — if data is older than 30 minutes, it’s an incident.”
SLA (Service Level Agreement) — a contractual commitment with consequences for breach. “The SLA says we’ll fix data quality breaches within 4 hours during business hours.”
Completeness — the proportion of records with non-null values in required fields.
Uniqueness constraint — a guarantee that a field (or combination of fields) contains no duplicate values.
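Completeness and uniqueness are straightforward to compute over a batch of records. A minimal sketch (the function names are illustrative; real pipelines would typically use a data quality framework instead):

```python
def completeness(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is present and non-null."""
    if not records:
        return 1.0
    non_null = sum(1 for r in records if r.get(field) is not None)
    return non_null / len(records)

def is_unique(records: list[dict], field: str) -> bool:
    """True if no two records share the same value for `field`."""
    values = [r.get(field) for r in records]
    return len(values) == len(set(values))

batch = [
    {"order_id": "a", "customer_id": "c1"},
    {"order_id": "b", "customer_id": None},  # guest checkout
]
print(completeness(batch, "customer_id"))  # 0.5
print(is_unique(batch, "order_id"))        # True
```

Comparing these numbers against the contract's thresholds (100% for order_id, 99.9% for customer_id) is what turns an SLO into an actionable alert.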
Section 4: Freshness and Availability
```yaml
freshness:
  schedule: "0 * * * *"   # every hour, on the hour
  lag_slo: "< 30 minutes"
  availability_slo: "99.5%"
history:
  retention: "3 years"
  description: "Rolling 3-year history retained in the warehouse"
```
Freshness vocabulary:
Lag — the delay between an event occurring in the source and it appearing in the downstream dataset. “Our current lag is 45 minutes — we’re in breach of the 30-minute SLO.”
Availability — percentage of time the dataset/endpoint is accessible and current.
Retention — how long historical data is kept. “The contract guarantees 3-year retention — don’t depend on data older than that being available.”
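A lag check like the one quoted above (“our current lag is 45 minutes”) boils down to a timestamp comparison. A minimal sketch, assuming event timestamps are stored in UTC as the contract requires:

```python
from datetime import datetime, timedelta, timezone

LAG_SLO = timedelta(minutes=30)  # from the contract's freshness section

def lag(event_time: datetime, now: datetime) -> timedelta:
    """Delay between the source event and its visibility downstream."""
    return now - event_time

def in_breach(event_time: datetime, now: datetime) -> bool:
    """True if the observed lag exceeds the 30-minute SLO."""
    return lag(event_time, now) > LAG_SLO

now = datetime(2026, 3, 15, 15, 15, tzinfo=timezone.utc)
placed = datetime(2026, 3, 15, 14, 30, tzinfo=timezone.utc)
print(in_breach(placed, now))  # True: 45 minutes > 30-minute SLO
```

In practice the "event time" would be the maximum `placed_at` in the latest partition; comparing timezone-aware datetimes avoids the classic UTC/local-time off-by-hours bug.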
Section 5: Ownership and Support
```yaml
support:
  owner: "data-platform-team"
  steward: "alice@acme.com"   # day-to-day contact
  oncall: "https://pagerduty.com/teams/data-platform"
  incident_slo:
    critical: "< 1 hour"    # data missing for > 1 hour
    major: "< 4 hours"      # quality SLO breach
    minor: "< 24 hours"     # documentation issue
```
Ownership vocabulary:
Data owner — the team accountable for the dataset’s quality and availability.
Data steward — the individual responsible for day-to-day data quality and consumer support.
Incident SLO — how quickly the team commits to acknowledging and resolving data issues by severity.
Section 6: Change Management
```yaml
versioning:
  policy: "Semantic versioning (semver)"
  deprecation_notice: "30 days"
  breaking_changes:
    - "Field removal"
    - "Type narrowing"
    - "Enum value removal"
    - "Previously optional field made required"
  backward_compatible:
    - "New optional field added"
    - "Enum value added"
    - "Description updated"
```
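The two lists above can drive an automated "which bump do I need?" check. A minimal sketch; the change-kind identifiers and the mapping to bump levels are illustrative assumptions based on the lists in this section:

```python
# Change kinds mirroring the contract's change-management lists
# (the identifiers themselves are illustrative).
BREAKING = {"field_removed", "type_narrowed", "enum_value_removed",
            "field_made_required"}
COMPATIBLE = {"optional_field_added", "enum_value_added",
              "description_updated"}

def required_bump(changes: set[str]) -> str:
    """Decide the semver bump a set of changes requires."""
    if changes & BREAKING:
        return "major"          # any breaking change forces a major bump
    if changes - COMPATIBLE:
        raise ValueError(f"unclassified changes: {changes - COMPATIBLE}")
    if not changes or changes == {"description_updated"}:
        return "patch"          # docs/metadata only
    return "minor"              # backward-compatible additions

print(required_bump({"optional_field_added"}))              # minor
print(required_bump({"field_removed", "enum_value_added"})) # major
```

Raising on unclassified changes is deliberate: a change the contract's policy doesn't name should be reviewed by a human, not silently waved through.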
Change management phrases:
“This field rename is a breaking change — we’ll give consumers 30 days’ notice and publish the v3 contract before deprecating v2.”
“Adding an optional field is backward-compatible — we’ll bump the minor version and notify consumers, but they don’t need to make any changes.”
“We’re setting v2 status to ‘deprecated’. Consumers have until Q3 to migrate to v3.”
Cover Note When Sharing a Contract
When emailing or messaging a data contract to a consumer team:
“Hi [Team], I’ve attached the data contract for the orders dataset (v2.1.0). Key points for your integration:
- Schema is in orders-v2.data-contract.yaml
- Freshness SLO: 30-minute lag guarantee
- The metadata field is nullable — please handle null values in your pipeline
- Breaking changes: we’ll notify you with 30 days’ advance notice
Please review and confirm you’re aligned with the contract before building your pipeline. Let me know if you have any questions.”
Practice
Build data engineering vocabulary with the Data Engineering exercise set and the Data Engineer learning path.