How to Write a Data Contract in English
A complete guide for data engineers: what a data contract is, how to write one in English, the required sections, vocabulary, and ready-to-use templates.
A data contract is a formal agreement between the producer of a dataset and its consumers, defining the structure, quality, freshness, and semantics of the data. As data mesh and data platform architectures mature, data contracts have become a core communication tool — and writing them clearly in English is a professional skill every data engineer needs. This guide covers the full structure, vocabulary, and language patterns.
What Is a Data Contract?
A data contract is a formal specification that describes:
- What data a producer provides
- What schema (fields, types, formats) the data has
- What quality guarantees the producer makes
- What service level the consumer can expect (freshness, availability)
- Who owns the data and who to contact when things break
Data contracts codify the implicit agreement between teams into an explicit, versioned document — preventing the classic problem where schema changes break downstream consumers without warning.
“Before we let any team consume the orders topic in Kafka, they sign off on the data contract. Any schema change requires a new contract version and consumer notification.”
Why Data Contracts Matter
Without data contracts, breaking changes flow silently downstream:
- Column renamed → analytics dashboard breaks
- Timestamp format changes → ML feature pipeline fails
- New nullable field added → downstream null-handling breaks
With data contracts:
- Changes are versioned and announced
- Consumers know what stability guarantees they can rely on
- Producers know who will be affected before they change anything
Data Contract Structure
Section 1: Header / Metadata
```yaml
# orders-v2.data-contract.yaml
id: "orders-v2"
version: "2.1.0"
status: "active"   # draft | active | deprecated
title: "Customer Orders"
description: >
  Contract for the customer orders dataset. Contains all confirmed orders
  placed through the web and mobile channels, refreshed hourly.
owner: "data-platform-team@acme.com"
contact:
  slack: "#data-platform"
  oncall: "https://pagerduty.com/teams/data-platform"
created: "2024-06-01"
updated: "2026-03-15"
```
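A minimal sketch of how a CI check might validate this header before a contract is published. The field names mirror the YAML above; the helper name and the strict semver pattern are illustrative assumptions, not part of any standard tooling.

```python
import re

# Allowed lifecycle states and a strict MAJOR.MINOR.PATCH pattern
# (assumptions for illustration; adapt to your contract tooling).
VALID_STATUSES = {"draft", "active", "deprecated"}
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")

def validate_header(header: dict) -> list[str]:
    """Return a list of problems found in a contract header (empty = valid)."""
    problems = []
    for field in ("id", "version", "status", "owner"):
        if not header.get(field):
            problems.append(f"missing required field: {field}")
    if header.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {header.get('status')!r}")
    if header.get("version") and not SEMVER_RE.match(header["version"]):
        problems.append(f"version is not semver: {header['version']!r}")
    return problems

header = {"id": "orders-v2", "version": "2.1.0", "status": "active",
          "owner": "data-platform-team@acme.com"}
print(validate_header(header))  # []
```

Running a check like this on every pull request catches malformed contracts before consumers ever see them.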
Key terms:
Status — lifecycle state. draft means not yet stable; active means in use; deprecated means consumers should migrate.
Versioning (Semantic Versioning):
- Major version (2.0.0 → 3.0.0) — breaking change (field removed, type changed)
- Minor version (2.0.0 → 2.1.0) — backward-compatible addition (new optional field)
- Patch (2.0.0 → 2.0.1) — documentation or metadata update only
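The bump rules above can be sketched as a small helper (a hypothetical function name, shown only to make the three rules concrete):

```python
def bump(version: str, change: str) -> str:
    """Apply a semver bump: 'major' for breaking changes, 'minor' for
    backward-compatible additions, 'patch' for docs/metadata updates."""
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"   # breaking: reset minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"   # additive: reset patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change}")

print(bump("2.0.0", "minor"))  # 2.1.0
```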
“This is a breaking change — we’re removing the legacy_id field. This requires a major version bump from v2 to v3.”
Section 2: Schema Definition
The schema section describes every field in the dataset.
```yaml
schema:
  - name: order_id
    type: string
    required: true
    description: "Unique identifier for the order (UUID v4)"
    example: "550e8400-e29b-41d4-a716-446655440000"
  - name: customer_id
    type: string
    required: true
    description: "Foreign key referencing the customers table"
    example: "cust-12345"
  - name: status
    type: string
    required: true
    description: "Current order status"
    enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
  - name: total_amount
    type: decimal(10,2)
    required: true
    description: "Total order value in EUR, excluding VAT"
    constraints:
      minimum: 0.01
  - name: placed_at
    type: timestamp
    required: true
    description: "Timestamp of order placement in UTC (ISO 8601)"
    example: "2026-03-15T14:30:00Z"
  - name: metadata
    type: object
    required: false
    description: "Optional unstructured metadata from the order source"
    nullable: true
```
Schema vocabulary:
Required / nullable — whether the field must be present and non-null. “The placed_at field is required and non-nullable — any record missing it is a producer bug.”
Enum — an explicit list of valid values for a field. “The status field is an enum — any value not in this list indicates a data quality issue.”
Type precision — use precise types: decimal(10,2) instead of just “number”; timestamp with timezone instead of just “date”.
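To show how the required/enum/precision rules translate into a runtime check, here is a minimal validator sketch. The in-memory `SCHEMA` dict is a hypothetical, trimmed-down mirror of the contract's schema section, not a real contract-tooling API.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical in-memory mirror of the contract's schema section —
# just the rules needed to illustrate validation.
SCHEMA = {
    "order_id":     {"required": True},
    "customer_id":  {"required": True},
    "status":       {"required": True,
                     "enum": {"pending", "confirmed", "shipped",
                              "delivered", "cancelled"}},
    "total_amount": {"required": True, "minimum": Decimal("0.01")},
    "placed_at":    {"required": True},
    "metadata":     {"required": False},
}

def validate_record(record: dict) -> list[str]:
    """Check one record against the schema; return violations (empty = valid)."""
    errors = []
    for name, rules in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if rules.get("required"):
                errors.append(f"{name}: required field is missing or null")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} not in allowed enum values")
        if "minimum" in rules:
            try:
                if Decimal(str(value)) < rules["minimum"]:
                    errors.append(f"{name}: below minimum {rules['minimum']}")
            except InvalidOperation:
                errors.append(f"{name}: not a valid decimal")
    return errors
```

A record with `status: "refunded"` would fail the enum check here, which is exactly the “data quality issue” the vocabulary above describes.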
Section 3: Quality SLOs (Service Level Objectives)
This section defines what quality guarantees the producer commits to.
```yaml
quality:
  completeness:
    - field: order_id
      threshold: "100%"
      description: "Every record must have an order_id"
    - field: customer_id
      threshold: "99.9%"
      description: "Occasional nulls accepted for guest checkouts"
  freshness:
    lag_slo: "< 30 minutes"
    description: "New orders must appear in the dataset within 30 minutes of confirmation"
  volume:
    daily_min: 1000
    daily_max: 500000
    description: "Expected daily record volume range. Alerts fire outside this range."
  uniqueness:
    - field: order_id
      constraint: "Must be globally unique — no duplicates"
```
Quality vocabulary:
SLO (Service Level Objective) — a target metric that the producer commits to. “Our freshness SLO is 30 minutes — if data is older than 30 minutes, it’s an incident.”
SLA (Service Level Agreement) — a contractual commitment with consequences for breach. “The SLA says we’ll fix data quality breaches within 4 hours during business hours.”
Completeness — the proportion of records with non-null values in required fields.
Uniqueness constraint — a guarantee that a field (or combination of fields) contains no duplicate values.
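Completeness and uniqueness are straightforward to compute over a batch of records. A minimal sketch (the function names are illustrative; real pipelines would typically use a data quality framework instead):

```python
def completeness(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is present and non-null."""
    if not records:
        return 1.0
    non_null = sum(1 for r in records if r.get(field) is not None)
    return non_null / len(records)

def is_unique(records: list[dict], field: str) -> bool:
    """True if no two records share the same value for `field`."""
    values = [r.get(field) for r in records]
    return len(values) == len(set(values))

batch = [
    {"order_id": "a", "customer_id": "c1"},
    {"order_id": "b", "customer_id": None},  # guest checkout
]
print(completeness(batch, "customer_id"))  # 0.5
print(is_unique(batch, "order_id"))        # True
```

Comparing these numbers against the contract's thresholds (100% for order_id, 99.9% for customer_id) is what turns an SLO into an actionable alert.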
Section 4: Freshness and Availability
```yaml
freshness:
  schedule: "0 * * * *"   # every hour, on the hour
  lag_slo: "< 30 minutes"
  availability_slo: "99.5%"
history:
  retention: "3 years"
  description: "Rolling 3-year history retained in the warehouse"
```
Freshness vocabulary:
Lag — the delay between an event occurring in the source and it appearing in the downstream dataset. “Our current lag is 45 minutes — we’re in breach of the 30-minute SLO.”
Availability — percentage of time the dataset/endpoint is accessible and current.
Retention — how long historical data is kept. “The contract guarantees 3-year retention — don’t depend on data older than that being available.”
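A lag check like the one quoted above (“our current lag is 45 minutes”) boils down to a timestamp comparison. A minimal sketch, assuming event timestamps are stored in UTC as the contract requires:

```python
from datetime import datetime, timedelta, timezone

LAG_SLO = timedelta(minutes=30)  # from the contract's freshness section

def lag(event_time: datetime, now: datetime) -> timedelta:
    """Delay between the source event and its visibility downstream."""
    return now - event_time

def in_breach(event_time: datetime, now: datetime) -> bool:
    """True if the observed lag exceeds the 30-minute SLO."""
    return lag(event_time, now) > LAG_SLO

now = datetime(2026, 3, 15, 15, 15, tzinfo=timezone.utc)
placed = datetime(2026, 3, 15, 14, 30, tzinfo=timezone.utc)
print(in_breach(placed, now))  # True: 45 minutes > 30-minute SLO
```

In practice the "event time" would be the maximum `placed_at` in the latest partition; comparing timezone-aware datetimes avoids the classic UTC/local-time off-by-hours bug.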
Section 5: Ownership and Support
```yaml
support:
  owner: "data-platform-team"
  steward: "alice@acme.com"   # day-to-day contact
  oncall: "https://pagerduty.com/teams/data-platform"
  incident_slo:
    critical: "< 1 hour"    # data missing for > 1 hour
    major: "< 4 hours"      # quality SLO breach
    minor: "< 24 hours"     # documentation issue
```
Ownership vocabulary:
Data owner — the team accountable for the dataset’s quality and availability.
Data steward — the individual responsible for day-to-day data quality and consumer support.
Incident SLO — how quickly the team commits to acknowledging and resolving data issues by severity.
Section 6: Change Management
```yaml
versioning:
  policy: "Semantic versioning (semver)"
  deprecation_notice: "30 days"
  breaking_changes:
    - "Field removal"
    - "Type narrowing"
    - "Enum value removal"
    - "Previously optional field made required"
  backward_compatible:
    - "New optional field added"
    - "Enum value added"
    - "Description updated"
```
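The two lists above can drive an automated "which bump do I need?" check. A minimal sketch; the change-kind identifiers and the mapping to bump levels are illustrative assumptions based on the lists in this section:

```python
# Change kinds mirroring the contract's change-management lists
# (the identifiers themselves are illustrative).
BREAKING = {"field_removed", "type_narrowed", "enum_value_removed",
            "field_made_required"}
COMPATIBLE = {"optional_field_added", "enum_value_added",
              "description_updated"}

def required_bump(changes: set[str]) -> str:
    """Decide the semver bump a set of changes requires."""
    if changes & BREAKING:
        return "major"          # any breaking change forces a major bump
    if changes - COMPATIBLE:
        raise ValueError(f"unclassified changes: {changes - COMPATIBLE}")
    if not changes or changes == {"description_updated"}:
        return "patch"          # docs/metadata only
    return "minor"              # backward-compatible additions

print(required_bump({"optional_field_added"}))              # minor
print(required_bump({"field_removed", "enum_value_added"})) # major
```

Raising on unclassified changes is deliberate: a change the contract's policy doesn't name should be reviewed by a human, not silently waved through.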
Change management phrases:
“This field rename is a breaking change — we’ll give consumers 30 days’ notice and publish the v3 contract before deprecating v2.”
“Adding an optional field is backward-compatible — we’ll bump the minor version and notify consumers, but they don’t need to make any changes.”
“We’re setting v2 status to ‘deprecated’. Consumers have until Q3 to migrate to v3.”
Cover Note When Sharing a Contract
When emailing or messaging a data contract to a consumer team:
“Hi [Team], I’ve attached the data contract for the orders dataset (v2.1.0). Key points for your integration:
- Schema is in orders-v2.data-contract.yaml
- Freshness SLO: 30-minute lag guarantee
- The metadata field is nullable — please handle null values in your pipeline
- Breaking changes: we’ll notify you with 30 days’ advance notice
Please review and confirm you’re aligned with the contract before building your pipeline. Let me know if you have any questions.”
Practice
Build data engineering vocabulary with the Data Engineering exercise set and the Data Engineer learning path.