Data Contracts

Data Contracts are comprehensive working documents that facilitate the collaborative development of data platforms, ensuring clarity and consistency across teams.

WIP
This article is a work in progress

Data contracts are working documents that make it easier to build a data platform collaboratively. They give teams a shared basis for clear communication, ownership assignment, and consistent definitions.

The most important aspects of data contracts are:

  • Clearly defining ownership of the different components
  • Establishing definitions in both business and technical terms

This article explains one approach to implementing data contracts. Treat it as a starting point and tailor it to your organization's specific needs and culture.

If you are new to data contracts, start with source contracts and model contracts. Reporting contracts can quickly become outdated due to frequent changes in reports, potentially slowing down your progress.

If your organization already has a data platform, start with one or a few contracts and add more as you learn the process and find what fits your organization.

It's best practice to assign one primary owner for each component, along with multiple backup owners. This ensures that the data platform can continue to function even if the primary owner is unavailable. One effective way to assign backup owners is to refer to a team or department. For example, "John from the Data Team" names John as the primary owner and the Data Team as the backup.

A data contract is not an SLA between people or teams but a document that makes working together easier. It isn't meant to replace collaboration; it is meant to enhance it.

Source Contract

Source System

A source system is the original system or application where data is first created or stored, such as a CRM, ERP, or financial database.

Technical Details

This section should include information about the source system's technical details, such as the database type, API, or data storage format.

Timezone

The timezone in which the source system operates and the timezone of the underlying database. Record both, because they can differ.

Currency

The currency used in the source system.

PII

This section should list the personally identifiable information (PII) data stored in the source system.

PII Deletion

This section should outline when and how PII data is deleted from the source system.

Source System Owner

The Source System Owner is responsible for the system. This individual should be able to provide access and explain the business logic within the system.

In some cases, there may be multiple people with different system responsibilities. For example, in an accounting system, there could be:

  • A person responsible for the general ledger
  • A person responsible for accounts payable
  • A person who knows how to grant access to the API or database

Integration

An integration is a system or process that moves or syncs data from a source system to a data platform.

Integration System

The tool that runs the integration. Common choices include Fivetran, Airbyte, Meltano, and Azure Data Factory. Some organizations use custom-built solutions, typically developed in Python and often orchestrated with frameworks such as Airflow, Dagster, or Prefect.

Integration Owner

Main responsibilities:

  • Responding to alerts if the integration fails
  • Preventing cost overruns, both in the integration system and in the storage where the data lands

The Integration Owner may not necessarily be the person responsible for fixing integration issues if they arise. However, they should be able to delegate the task to the appropriate person or team.

Known Issues

This section should list any known issues with the source system or integration. It should also include any workarounds or solutions that have been implemented.

Model Usage (optional)

This section outlines which data models use the source system.

Skip this section when you are starting out with data contracts. Model usage tends to change frequently, so this part can quickly become outdated.

Source Contract Template

# Source Contract for <source system>
 
## Source System
 
<source system> is an <xyz> service offered by <provider> that tracks and reports <xyz>.
 
We primarily use it for monitoring <x> to our <y> and tracking <z>.
 
### Technical Details
 
<source system> is a <xyz> system that uses <xyz> to store data.
 
**Timezone:** <timezone>
 
**Currency:** <currency>
 
### PII
 
The source system contains the following PII data: <list of PII data>.
 
**PII deletion:** The source system deletes PII data after <time period> and when <an event happens>.
 
### System Owner
 
<name> from <department>
 
## Integration
 
**Integration System:** <integration system>
 
**Schedule:** The data is updated every <x> hours.
 
**Integration Owner:** <name> from <department>
 
## Known Issues
 
- <issue 1>
- <issue 2>
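
If contracts are kept as Markdown files, a small check can flag drafts that skip a section of the template above. This is a minimal Python sketch, not part of the approach itself: the function name `missing_sections`, the heading list, and the sample system `ExampleCRM` are all illustrative assumptions.

```python
# Assumption: contracts are stored as Markdown text that follows the template above.
REQUIRED_HEADINGS = [
    "## Source System",
    "### Technical Details",
    "### PII",
    "### System Owner",
    "## Integration",
    "## Known Issues",
]


def missing_sections(contract_markdown: str) -> list:
    """Return the required headings that do not appear as a line in the contract."""
    present = {line.strip() for line in contract_markdown.splitlines()}
    return [heading for heading in REQUIRED_HEADINGS if heading not in present]


# Hypothetical draft that has not yet filled in the PII and Known Issues sections.
draft = """# Source Contract for ExampleCRM

## Source System

### Technical Details

### System Owner

## Integration
"""

missing = missing_sections(draft)
print(missing)  # ['### PII', '## Known Issues']
```

A check like this only confirms that the sections exist; whether their content is accurate still depends on the owners named in the contract.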

Model Contract

Business

The business is responsible for defining the business logic within a data model.

Business Definitions

Each metric and dimension should have a clear business definition, expressed in straightforward terms rather than acronyms. These definitions should be described in business language and, where applicable, include the relevant calculations.

Table / Data Set:

What is one row in the table? For example, is it a customer, an order, or a product?

Dimensions / Filters:

What are the dimensions or filters in the data model? For example, is it a city, a date, or a department?

Metrics / Calculations:

What are the metrics or calculations in the data model? For example, is it revenue, profit, or customer lifetime value?

  • Customer Lifetime Value: A measure of the total economic value a customer brings over their lifetime.
  • Calculation: (monthly revenue per customer * gross margin) / monthly churn rate.
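
The Customer Lifetime Value calculation above can be written down as a small function, which makes the business definition easy to verify against the technical one. A minimal Python sketch; the function name, the validation rule, and the sample numbers are illustrative assumptions.

```python
def customer_lifetime_value(
    monthly_revenue_per_customer: float,
    gross_margin: float,
    monthly_churn_rate: float,
) -> float:
    """(monthly revenue per customer * gross margin) / monthly churn rate."""
    # Assumption: churn is given as a fraction between 0 (exclusive) and 1 (inclusive).
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in the range (0, 1]")
    return (monthly_revenue_per_customer * gross_margin) / monthly_churn_rate


# Illustrative inputs: 100 in monthly revenue, 50% gross margin, 5% monthly churn.
clv = customer_lifetime_value(100.0, 0.5, 0.05)
print(round(clv, 2))
```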

Business Owner

Key responsibilities include:

  • Providing clarity on business definitions
  • Identifying users of the data model
  • Determining if and when a data model can be retired or deleted

Source

The Source section provides the technical specifications needed to build the data model.

Source Definitions

Table

To support the business definitions, specify the level of detail for each table by identifying the key columns that define a single row.

Example:

  • Customer table: Level of detail is defined by customer_id.
  • Revenue table: Level of detail is defined by invoice_id, invoice_line_id.
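
A declared level of detail can be tested: if the key columns really define a single row, no combination of their values appears twice. A minimal Python sketch of such a check, assuming rows are available as dictionaries; the function name and the sample data are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return key tuples that occur more than once, i.e. violations of the level of detail."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical revenue rows; the last one repeats an existing invoice line.
revenue_rows = [
    {"invoice_id": 1, "invoice_line_id": 1, "amount": 100},
    {"invoice_id": 1, "invoice_line_id": 2, "amount": 50},
    {"invoice_id": 2, "invoice_line_id": 1, "amount": 75},
    {"invoice_id": 2, "invoice_line_id": 1, "amount": 75},
]

duplicates = find_duplicate_keys(revenue_rows, ["invoice_id", "invoice_line_id"])
print(duplicates)  # [(2, 1)]
```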

Columns / Calculations

Each column in the data model should have a precise technical description detailing how it is derived. This can be expressed using SQL, pseudocode, or by referencing specific columns in source tables.
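
As an illustration of a SQL-based column description, the sketch below derives a `line_amount` column from a source table and runs the query against an in-memory SQLite database using Python's standard `sqlite3` module. The table name, the column names, and the derivation itself are hypothetical and only show the level of precision a contract entry might aim for.

```python
import sqlite3

# Hypothetical source table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice_lines ("
    "invoice_id INTEGER, invoice_line_id INTEGER, quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO invoice_lines VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 10.0), (1, 2, 1, 5.5)],
)

# The derivation that would be recorded in the contract for the column line_amount.
LINE_AMOUNT_SQL = """
SELECT invoice_id, invoice_line_id, quantity * unit_price AS line_amount
FROM invoice_lines
ORDER BY invoice_id, invoice_line_id
"""

rows = conn.execute(LINE_AMOUNT_SQL).fetchall()
print(rows)  # [(1, 1, 20.0), (1, 2, 5.5)]
```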

Source Owner

The Source Owner is typically the development manager responsible for the source system, but it can also be a domain expert familiar with the data. Their role is to ensure the technical accuracy and integrity of the data model's source definitions.

Development

The Development section outlines the technical ownership and responsibilities for maintaining the data model.

Development Owner

The Development Owner is typically a member of the data team. This person is responsible for managing the technical aspects of the data model, including fixing bugs, adding new features or objects, answering technical questions, scheduling updates, and overseeing day-to-day operations.

Model Contract Template

# Model Contract for <data model>
 
## Business
 
This data model is used by <business unit> to <business purpose>.
 
### Business Definitions
 
- **<metric 1>**: <definition 1>
- **<metric 2>**: <definition 2>
- **<dimension 1>**: <definition 3>
- **<dimension 2>**: <definition 4>
 
### Business Owner
 
<name> from <department>
 
## Source
 
This data model is sourced mainly from <source system 1> but also from <source system 2>.
 
### Source Definitions
 
#### Table
 
**Level of detail:** <column 1>, <column 2>, <column 3>
 
#### Columns
 
- **<column 1>**: <definition 1>
- **<column 2>**: <definition 2>
- **<column 3>**: <definition 3>
- **<column 4>**: <definition 4>
 
#### Calculations
 
- **<metric 1>**: <definition 1>
- **<metric 2>**: <definition 2>
- **<dimension 1>**: <definition 3>
- **<dimension 2>**: <definition 4>
 
### Source Owner
 
<name> from <department>
 
## Development
 
This data model is developed by <development team> using <development tool>.
 
The source code is stored in <source code repository>.
 
### Development Owner
 
<name> from <department>
 
## Known Issues
 
- <issue 1>
- <issue 2>

Reporting Contract

Depending on the size and complexity of your organization, you might benefit from implementing a reporting contract.

However, a reporting contract can become outdated quickly. Reporting should stay agile and adaptable, with quick iterations and updates, and a contract that must be revised with every change works against that.

In some cases, a reporting contract is a legal requirement, for example in the financial or medical industry, where complete lineage of reporting is needed.

Getting templates into Google Docs, Notion, and GitHub