What is referential integrity and what does it look like in practice?

Everyone knows garbage in, garbage out when it comes to databases and tables, but even if your data is good at source, you can still hit some bumps in the road when it comes to ensuring the accuracy of the information you and your stakeholders are receiving. That’s where referential integrity comes in.

What is referential integrity?

Referential integrity refers to the accuracy and consistency of the relationship between tables. For example, in its simplest form, two columns that are linked between two tables, such as product identifiers in sales records, also need to be found in the product table.

Within a database, each table must have a primary key. However, this primary key can also appear in other tables because it has a relationship to data within those tables too. When a primary key from one table appears in another table, it is called a foreign key. Whenever a foreign key value is used, it must reference a valid existing primary key in the dimension or parent table.

Referential integrity within your database helps enforce this consistency. It also ensures that relationships between records in related tables are valid and that you don’t accidentally delete or incorrectly change related data.

Referential integrity vs. database normalization

Referential integrity differs from normalization, although they are interconnected. Normalization is the process of efficiently organizing data in a database, such as eliminating redundant data and ensuring data dependencies make sense. Essentially, it’s the process of specifying and defining tables, keys, columns, and relationships to create an efficient database. Referential integrity, on the other hand, refers to ensuring the quality of the relationships between the tables.

Why is referential integrity important?

While normalization is considered critical and, therefore, standard practice — any database would struggle to operate without it — referential integrity is sometimes (incorrectly) considered optional. However, a lack of referential integrity could lead to data inconsistencies, such as incorrect records being deleted, added, or modified. This leads to incomplete data being returned, often without any knowledge of why, when, or even if an error has occurred. Records then become “lost” in the database because they never appear in queries or reports.

The result of this could be far-reaching — for example, on a basic level, it could lead to anomalies appearing in reports that influence business decision-making. Far more critically, it could lead to customers not receiving goods they’ve paid for, or even patients receiving incorrect medication.

What are the conditions required to enforce referential integrity?

There are some occasions when referential integrity isn’t the right fit. For example, some standard implementations of referential integrity under database management systems don’t work for data warehouses. There are three reasons for this:

The amount of data is simply too huge for rules to be consistently applied.
Records are not necessarily altered within the database environment.
Current referential integrity implementations don’t consider the timing of a relationship. For example, there could be a relationship between two records at one point in time, but no relationship at another point in time, such as when a store’s address changes.

This can be overcome by opting for “bounded referential integrity”, where traditional referential integrity is applied to a subset of a table or tables within the data warehouse. It is applied after the data has been loaded into the warehouse, not as it’s being loaded as per traditional referential integrity.

How to enforce referential integrity

So now we know what referential integrity is, why it’s important, and how it applies to your data. The next logical question is: how can you enforce referential integrity?

Foreign key constraints

One of the most common ways to enforce referential integrity is to create foreign key constraints. What this means is rules are in place that govern how foreign keys are added to a table. For example, foreign key constraints in table 1 link to a column with unique values in table 2 and say that a value in the 1s column is only valid if it also exists in the 2s column. A foreign key constraint specifies that the key can only contain values that are in the referenced primary key, and thereby ensures the referential integrity of data that is joined on the two keys.

Essentially, if a customer places an order, then foreign key constraints say that customer must have a unique customer ID in the database and be ordering from an SKU number that also exists in the database.

Data contracts

An alternative option (or even an additional option if you are aiming for top levels of data integrity) to foreign key constraints that works especially well for data warehouses is having data contracts.

A data contract is an interface or schema that governs the way data is used throughout its lifecycle by forming agreements between the different parties involved in data management. With data contracts in place, each element of the data chain, from pipelines through to stakeholders who require the output, is governed to ensure transparency and cross-team collaboration.

Data contracts can be used to create an abstraction between departments or teams. For example, a data contract can be between a data engineer and a data scientist. To illustrate, let’s say the data scientist uses 10 features for a machine learning model. The contract here is that the data engineer ensures the 10 used features or columns will not change, even if there is a schema change or evolution in the underlying table.

Data contracts should be organic and can change depending on the data your stakeholder is seeking, being both forward and backward compatible.

Referential integrity and Y42

Data contracts can help enforce referential integrity in data warehouses. At Y42, we understand the need for high-quality, trustworthy data, and that’s why we’ve integrated data contracts into our technology. Contracts are quick to set up and enforce within the Y42 platform and will help to establish communications between everyone who handles or requires data while simultaneously ensuring that your data is accurate and consistent.

Building collaboration between the data teams is essential to maintaining overall data integrity, and with Y42’s inbuilt data contracts element, you can ensure that your stakeholders will receive the right data, at the right time, and in the right way.