Data Contracts
Data contracts are working documents that support the collaborative development of a data platform and keep teams clear and consistent about what is being built.
They serve as a foundation for clear communication, ownership assignment, and a shared understanding across teams.
The most important aspects of data contracts are:
- Clearly defining ownership of the different components
- Establishing definitions in both business and technical terms
This article explains one approach to implementing data contracts. Use it as a starting point, then tailor it to your organization's specific needs and culture.
If you are new to data contracts, start with source contracts and model contracts. Reporting contracts can quickly become outdated due to frequent changes in reports, potentially slowing down your progress.
If you are implementing data contracts in an organization that already has a data platform, start with one or a few contracts. Add more as you gain experience with the process and learn what fits your organization.
It's best practice to assign one primary owner for each component, along with multiple backup owners. This ensures that the data platform can continue to function even if the primary owner is unavailable. One effective way to assign backup owners is to refer to a team or department. For example: John from the Data Team.
Data contracts are not an SLA between people or teams but a document that makes working together easier. They aren't meant to replace collaboration; they are meant to enhance it.
Source Contract
Source System
A source system is the original system or application where data is first created or stored, such as a CRM, ERP, or financial database.
Technical Details
This section should include information about the source system's technical details, such as the database type, API, or data storage format.
Timezone
The timezone in which the source system operates and the timezone of the underlying database.
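Recording the timezone matters because timestamps stored without an offset become ambiguous once they leave the source system. A minimal Python sketch, assuming a hypothetical source database that stores naive local timestamps in Europe/Oslo (the timezone name, the function name `to_utc`, and the sample value are illustrative only):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the source contract states that the database stores
# naive timestamps in this local timezone.
SOURCE_TIMEZONE = ZoneInfo("Europe/Oslo")


def to_utc(naive_local: datetime) -> datetime:
    """Attach the documented source timezone and convert to UTC."""
    return naive_local.replace(tzinfo=SOURCE_TIMEZONE).astimezone(timezone.utc)


# Hypothetical value as it might be read from the source database.
created_at = datetime(2024, 7, 1, 9, 30)
print(to_utc(created_at).isoformat())
```

Without the documented timezone, the integration would have to guess whether 09:30 means local time or UTC.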
Currency
The currency used in the source system.
PII
This section should list the personally identifiable information (PII) data stored in the source system.
PII Deletion
This section should outline when and how PII data is deleted from the source system.
Source System Owner
The Source System Owner is responsible for the system. This individual should be able to provide access and explain the business logic within the system.
In some cases, there may be multiple people with different system responsibilities. For example, in an accounting system, there could be:
- A person responsible for the general ledger
- A person responsible for accounts payable
- A person who knows how to grant access to the API or database
Integration
An integration is a system or process that moves or syncs data from a source system to a data platform.
Integration System
Common tools include Fivetran, Airbyte, Meltano, and Azure Data Factory. Some organizations use custom-built solutions, typically developed in Python and often run on orchestration frameworks such as Airflow, Dagster, or Prefect.
Integration Owner
Main responsibilities:
- Responding to alerts if the integration fails
- Ensuring there are no cost overruns, either in the integration system or where the data is stored
The Integration Owner may not necessarily be the person responsible for fixing integration issues if they arise. However, they should be able to delegate the task to the appropriate person or team.
Known Issues
This section should list any known issues with the source system or integration. It should also include any workarounds or solutions that have been implemented.
Model Usage (optional)
This section lists the data models that use the source system.
It is recommended not to define this when you are starting out with data contracts. The reason is that model usage tends to change frequently, and this part can quickly become outdated.
Source Contract Template
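As an illustration only, the sections above could be captured as structured data so that empty fields are easy to spot. This is a hypothetical sketch, not a prescribed template: the class names, field names, and sample values are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Owner:
    name: str
    team: str  # the team doubles as the pool of backup owners


@dataclass
class SourceContract:
    source_system: str
    technical_details: str
    timezone: str
    currency: str
    pii_fields: list[str]
    pii_deletion: str
    source_system_owner: Owner
    integration_system: str
    integration_owner: Owner
    known_issues: list[str] = field(default_factory=list)


def missing_fields(contract: SourceContract) -> list[str]:
    """Return the names of text fields that were left empty."""
    return [
        name
        for name, value in vars(contract).items()
        if isinstance(value, str) and not value.strip()
    ]


# All values below are made-up placeholders.
contract = SourceContract(
    source_system="CRM",
    technical_details="PostgreSQL database, read replica",
    timezone="UTC",
    currency="EUR",
    pii_fields=["email", "phone_number"],
    pii_deletion="",
    source_system_owner=Owner(name="John", team="Sales Operations"),
    integration_system="Airbyte",
    integration_owner=Owner(name="Jane", team="Data Team"),
)
print(missing_fields(contract))
```

Here the PII deletion policy has not been filled in yet, so the check reports `pii_deletion` as the open item to raise with the Source System Owner.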
Model Contract
Business
The business is responsible for defining the business logic within a data model.
Business Definitions
Each metric and dimension should have a clear business definition, written in plain business language rather than acronyms and, where applicable, including the relevant calculation.
Table / Data Set:
What is one row in the table? For example, is it a customer, an order, or a product?
Dimensions / Filters:
What are the dimensions or filters in the data model? For example, is it a city, a date, or a department?
Metrics / Calculations:
What are the metrics or calculations in the data model? For example, is it revenue, profit, or customer lifetime value?
- Customer Lifetime Value: A measure of the total economic value a customer brings over their lifetime.
- Calculation: (monthly revenue per customer * gross margin) / monthly churn rate
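The customer lifetime value calculation above can be written out as a small function to show how a business definition maps to something testable. This is a sketch with made-up input values; the function and parameter names are assumptions.

```python
def customer_lifetime_value(
    monthly_revenue_per_customer: float,
    gross_margin: float,
    monthly_churn_rate: float,
) -> float:
    """(monthly revenue per customer * gross margin) / monthly churn rate."""
    if monthly_churn_rate <= 0:
        raise ValueError("monthly_churn_rate must be greater than zero")
    return (monthly_revenue_per_customer * gross_margin) / monthly_churn_rate


# Made-up inputs: 50 in monthly revenue, 60% gross margin, 5% monthly churn.
clv = customer_lifetime_value(50.0, 0.60, 0.05)
print(round(clv, 2))
```

With these inputs the result is about 600: a monthly margin of 30 per customer, divided by a 5% monthly churn rate.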
Business Owner
Key responsibilities include:
- Providing clarity on business definitions
- Identifying users of the data model
- Determining if and when a data model can be retired or deleted
Source
The Source section provides the technical specifications needed to build the data model.
Source Definitions
Table
To support the business definitions, specify the level of detail for each table by identifying the key columns that define a single row.
Example:
- Customer table: Level of detail is defined by customer_id.
- Revenue table: Level of detail is defined by invoice_id, invoice_line_id.
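A documented level of detail can be verified by checking that the key columns identify exactly one row each. A minimal sketch using plain Python; the helper name `duplicate_keys` and the sample rows are made up for illustration.

```python
from collections import Counter


def duplicate_keys(rows: list[dict], key_columns: list[str]) -> list[tuple]:
    """Return key values that appear on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]


# Made-up revenue rows; the last row repeats an existing key.
revenue_rows = [
    {"invoice_id": 1, "invoice_line_id": 1, "amount": 100.0},
    {"invoice_id": 1, "invoice_line_id": 2, "amount": 40.0},
    {"invoice_id": 2, "invoice_line_id": 1, "amount": 75.0},
    {"invoice_id": 1, "invoice_line_id": 2, "amount": 40.0},
]

print(duplicate_keys(revenue_rows, ["invoice_id", "invoice_line_id"]))
```

An empty result means the stated key columns do define a single row. Any key returned points to either a data problem or a level of detail that was documented incorrectly.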
Columns / Calculations
Each column in the data model should have a precise technical description detailing how it is derived. This can be expressed using SQL, pseudocode, or by referencing specific columns in source tables.
Source Owner
The Source Owner is typically the development manager responsible for the source system, but it can also be a domain expert familiar with the data. Their role is to ensure the technical accuracy and integrity of the data model's source definitions.
Development
The Development section outlines the technical ownership and responsibilities for maintaining the data model.
Development Owner
The Development Owner is typically a member of the data team. This person is responsible for managing the technical aspects of the data model, including fixing bugs, adding new features or objects, answering technical questions, scheduling updates, and overseeing day-to-day operations.
Model Contract Template
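As an illustration only, a model contract could be kept as a simple structure with one owner per section, together with a check that no owner role is left empty. This is a hypothetical sketch: the keys, names, and the SQL-like expression are assumptions, not a prescribed template.

```python
REQUIRED_OWNERS = ("business_owner", "source_owner", "development_owner")

# All names and definitions are made-up placeholders.
model_contract = {
    "model": "revenue",
    "row_definition": "One row per invoice line",
    "key_columns": ["invoice_id", "invoice_line_id"],
    "business_owner": "Head of Finance",
    "source_owner": "ERP development manager",
    "development_owner": "",
    "metrics": {
        "revenue": {
            "business_definition": "Invoiced amount excluding tax",
            # Hypothetical source table and column.
            "technical_definition": "SUM(invoice_lines.amount_excl_tax)",
        },
    },
}


def unassigned_owners(contract: dict) -> list[str]:
    """Return the owner roles that have no name filled in."""
    return [role for role in REQUIRED_OWNERS if not contract.get(role, "").strip()]


print(unassigned_owners(model_contract))
```

In this example the Development Owner is still missing, which is the kind of gap the contract is meant to surface before the model goes into use.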
Reporting Contract
Depending on the size and complexity of your organization, you might benefit from implementing a reporting contract.
However, be aware that a potential drawback of a reporting contract is that it can become outdated quickly. Reporting processes should be agile and adaptable, allowing for quick iterations and updates as needed.
In some cases, reporting contracts are a legal requirement, for example in the financial or medical industry, where you need complete lineage for your reporting.