
Introduction

The data industry offloaded data quality onto engineers, expecting them to write tests that cover as many edge cases as possible before data reaches production.

After years of trying out this approach – building numerous projects that run hundreds of models and thousands of tests – we've come to two clear conclusions. First, running a high number of tests is expensive. Second, despite our efforts, integration runs are often delayed, causing bad data to sneak in and leaving users frustrated. This approach has marginally improved the user experience, but it has significantly raised our warehouse costs.

What we really need are better tools, not more tests. If you are relying solely on traditional data tests to ensure data quality, you may be doing more work than necessary.

This article discusses why smarter, more efficient tooling can catch errors early and reduce the risks of issues going live. Read on to discover a proactive approach to improving data quality.

Data assets as living entities

Data assets do not live in isolation. They connect a source model to a downstream use case (BI, ML, data apps, etc.). However, a data asset is way more than a link between objects.

Think of a data asset as a living entity. It has a state and its state updates with every pipeline run, code update, or change in upstream assets.

Every asset follows its own journey, from initial planning and design, through materialization in the data warehouse, to deployment and activation. An asset isn't just a table; it's a collection of data built up over time – a snapshot of our understanding of the world – along with its associated metadata. As the asset evolves, it accumulates more and more metadata.

The need for better tooling

What we need is a system capable of navigating this metadata layer, deeply understanding how data assets are connected, and automatically making certain decisions on our behalf. These decisions include:

  • Preventing bad code merges: A system should catch syntax errors or invalid model references ( {{ ref('inexistent_model') }} ) without running the asset or its tests. This has to happen at compile time.
  • Managing downstream changes: Changes that affect downstream assets should be flagged as soon as they occur in the development lifecycle.
  • Automatically reverting bad production states: If an asset fails in production, the system should immediately revert to the asset's last healthy state. Bad data is worse than no data. To do this, we need a system that keeps track of all asset materializations and knows how to automatically reroute to the appropriate materialization.
  • Instant rollbacks: Even when an asset runs successfully, we need a way to revert it to a previous state as quickly as possible – preferably instantly – and without extra compute cost: a system that knows how to reroute any asset definition (i.e. code) to a particular materialization (i.e. table) of that asset in the data warehouse. Why recompute something that is already available in the warehouse? (See the sketch after this list.)
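One way to picture how this rerouting could work – a minimal sketch with made-up schema and table names, not Y42's actual internals – is that every run lands in its own immutable, versioned table, and the asset's public name is just a view over one of those versions:

-- Hypothetical layout: every run writes to its own immutable, versioned table,
-- e.g. internal.races__v41 (previous healthy run) and internal.races__v42 (latest run).

-- The asset's public name is a view over one specific version.
create or replace view analytics.races as
select * from internal.races__v42;

-- Reverting to the last healthy state is a single view swap, with no recompute:
create or replace view analytics.races as
select * from internal.races__v41;

Because the swap only touches metadata, it is effectively instant and costs no additional compute.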

The ideal system: A stateful data compiler

A data compiler that can check syntax, catch errors, and verify that all references are valid before any code runs. This system should be stateful: it should know the table history of every asset and how to connect each table version to a specific version of the code. It should let us quickly push development work into production, roll back changes, and automatically switch to the last healthy version if something goes wrong.

The system should let us work safely with production data during development without affecting the live environment. Too many mistakes happen when we're limited to sample data in development environments and edge cases cannot be validated.

If we had a system like this, we'd catch most errors before we even write our first data test. Plus, it would handle the initial steps to fix any production issues by reverting to a healthy state, giving us the extra time to find the best solution instead of rushing and making things worse.

How great would that be?

The downside of (too many) data tests

While data quality tests are useful, they focus primarily on basic data assumptions such as volume changes, null values, or referential integrity. They cannot understand the broader context of your project: how an asset relates to the other assets in the project, and how it relates to its own previous versions. Essentially, data tests evaluate only the current state of an asset and are not aware of temporal or upstream changes to that asset.
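For example, a typical singular test boils down to a query over the asset's current rows (the table and column names below are generic placeholders, not taken from any particular project):

-- Fails if any current row violates the assumption. The test only sees
-- today's rows; it knows nothing about yesterday's values or upstream changes.
select order_id
from analytics.orders
where customer_id is null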

The limits of data tests are excellently summarized by Benn Stancil in this article:

Don’t tell me that the tests are passing; even the best analysts can’t come close to anticipating all of the ways that a “simple” problem can go wrong. Don’t tell me a report is verified, or is owned by someone I trust [...] Instead, tell me that the dashboard that currently says we made $3.7 million on the weekend of August 1, 2003, said the same thing yesterday, and the day before that, and in September of 2003. [...]

If we want to build faith in our analytical institutions, we shouldn’t be telling people what’s working or broken, or what’s right or wrong; instead, we should tell people what’s changed [...]

When we define key entities, track how they evolve. If there’s a table of purchases or customers, let me define which columns should be immutable—purchase date, contract amount, initial customer segment, etc.—and tell me when they’re different today than they were yesterday. Don’t just validate source schemas and run unit tests on dummy data, and assume that the combination will produce the correct calculations on production data.
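A check in the spirit Stancil describes could be written as a comparison against a snapshot of the same table – a hedged sketch with assumed snapshot table names, not a feature of any specific tool:

-- Flag purchases whose supposedly immutable columns changed since yesterday.
-- `purchases_snapshot_yesterday` is an assumed snapshot table kept for comparison.
select today.purchase_id
from analytics.purchases as today
join analytics.purchases_snapshot_yesterday as yesterday
  on today.purchase_id = yesterday.purchase_id
where today.purchase_date <> yesterday.purchase_date
   or today.contract_amount <> yesterday.contract_amount
   or today.initial_customer_segment <> yesterday.initial_customer_segment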

Given the narrow focus of each test – such as checking for null or unique values – you will end up adding a large number of tests to achieve good coverage. This often leads to alert fatigue, as outlined in this blog post:

Many data teams quickly grow their data warehouse to hundreds of tables. Then, inevitably, things start to break, and they follow up by adding hundreds of dbt tests. Consequently, with code changes and data updates, your data team may quickly experience alert fatigue with failing tests.

Lastly, continuously adding and running these tests also consumes a lot of computing resources. In some cases, dbt tests represent half of the total dbt cost.

These points underline the need for a broader, more comprehensive, and more efficient approach to data quality – one that transcends individual tests.

A new generation of data quality tools

Such a framework already exists: a framework that treats your tables as assets, understands the relationships between them, and understands how each asset evolves over time. This framework enables you to develop on production data instead of samples, reuse assets across different environments, and perform preliminary checks before you commit changes.

Let's see it in action. We'll use the Formula 1 repo that we've just imported into Y42 as an example to demonstrate how the framework protects us from making bad code changes.

Y42's Data Lineage View

Preventing bad code merges

We’ll start by making a change to one of the assets – stg_f1_dataset__races – updating one of its upstream references to an invalid one, and attempting to commit:

Updating the model's definition and committing the changes.
Compilation Error: Model stg_f1_dataset__races depends on a source named 'f1_dataset.face1' which was not found.
The model's logs surface the above error when committing.
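For context, the offending reference inside the staging model would look roughly like this (a sketch – the article doesn't show the actual model code):

-- stg_f1_dataset__races (sketch): 'face1' is not a table in the f1_dataset source,
-- so compilation fails before anything runs in the warehouse.
select *
from {{ source('f1_dataset', 'face1') }}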

So the system prevented us from referencing a nonexistent source. In a similar vein, Y42 can detect syntax errors and notify the user before any changes are committed.

Managing downstream changes

Let's update the reference and return to the Lineage view. We now see that the existing model and its downstream models are marked yellow. This indicates that both the current asset and the downstream assets are stale—the existing definition of the model doesn't match the latest materialization (table version) in the data warehouse.

Furthermore, any change to the current model might impact downstream models as well, so they need to be refreshed. For instance, if a column is removed from model A and model B depends on that column, model B might break in production if the change isn't addressed downstream.

Cascading dependencies.
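To make the column example concrete, here is a minimal sketch with generic model names that are not part of the Formula 1 project:

-- model_a.sql (new version): the `position` column has been removed.
select race_id, driver_id
from {{ ref('stg_results') }}

-- model_b.sql: still selects `position` from model_a, so it would break
-- in production unless it is refreshed and fixed alongside the change.
select race_id, driver_id, position
from {{ ref('model_a') }}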

We can use the --stale flag to recompute all stale assets, including dependencies.

y42 build --stale
Rebuilding all stale assets.

All our assets’ materializations are up to date with their code definitions. ✅

We can now safely merge our feature branch into the main branch and push the changes to production.

Merging to main

Because both the asset definitions and the state of the tables/materializations in the data warehouse are up to date, we can leverage the Virtual Data Builds mechanism to reuse the assets from the feature branch in production, rather than recomputing them after the merge. To merge the feature branch into main and reuse the assets, we can either push directly to main or create a Pull Request.

Creating a Pull request to merge changes in.
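Conceptually, the reuse works because the feature branch and production can expose the same physical table. The statements below are an illustrative sketch with assumed names, not Y42's actual implementation:

-- The table was materialized once while developing on the feature branch.
-- After the merge, production simply points at that same table; nothing is rebuilt.
create or replace view feature_branch.races as select * from internal.races__v42;
create or replace view prod.races as select * from internal.races__v42;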

Alternatively, we can publish and materialize any asset from any branch using the Branch Environments option.

The Branch Environments screen, which allows publishing datasets or setting up orchestration.

Automatically reverting bad production states

Let’s now simulate a pipeline failure by altering the races input file to introduce an incorrect date format. While the source task will load without issue, the stg_f1_dataset__races model will fail during the date casting. To run everything downstream of the source, we can use the following build command:

y42 build -s source:f1_dataset.races+

We can also preview what the command will build before running it using the Build Preview option.

Previewing the above build command.

As expected, the model failed.

Model failed as expected.
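The failing step would be an ordinary date cast in the staging model, roughly as sketched below (the column names are assumptions, since the article doesn't show the model's code):

-- stg_f1_dataset__races (sketch): the cast fails once the input file
-- contains a date value in an unexpected format.
select
    race_id,
    cast(race_date as date) as race_date
from {{ source('f1_dataset', 'races') }}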

However, the system recognizes the faulty run and automatically retains the previous successful build of the asset. This ensures that end users won’t see incorrect or partial data, as the asset will always roll back to a healthy state.

The asset still references the previous healthy state.

Instant zero-copy rollbacks

Finally, to revert our changes to the initial state, we can select the desired code version for the asset. The system then uses the Virtual Data Builds mechanism to automatically restore the asset's materialization—i.e., the table version associated with that specific code definition—eliminating the need to recompute and create a new table in the data warehouse.

Instant zero-copy rollbacks.

Conclusions

Traditional approaches to maintaining data quality have relied heavily on individual developers writing extensive tests to cover all edge cases. While important, this reactive approach is expensive and inefficient, as it often catches issues too late, usually after the data has already hit production.

The solution isn’t more tests, but smarter, more efficient tooling that can catch problems early in the development cycle (shift-left testing). The Systems Sciences Institute at IBM has reported that:

The cost to fix an error found after product release was four to five times as much as one uncovered during design, and up to 100 times more than one identified in the maintenance phase. So the cost of a bug grows exponentially as the software progresses through the Software Development Lifecycle.

The Y42 platform acts as a data compiler capable of understanding the asset definition, its relation to the asset materialization, and the other connected assets in the project. It makes real-time decisions, such as notifying users of syntax errors or invalid references, flagging stale assets whenever the asset or an upstream asset definition changes, reverting to a healthy state in case of failures, and performing efficient zero-copy deployments and rollbacks.

