Write-Audit-Publish is a data engineering design pattern that prevents bad data from going live. Learn how GitOps for Data enables it by default across all your data pipelines.

In the first part, we looked at what GitOps for Data is, the challenges of keeping code and data separate, and how it works.

In addition to using Git as a key mechanism to track code changes, GitOps for Data also involves version controlling your data. It unifies code and data under a single tracking system: from logical changes to your codebase, to the physical materialization of those changes in the data warehouse, and the handling of operational changes across multiple environments, everything is guided by a single system using Git only.

In this second part, we’ll look at some practical examples of applying GitOps to Data, the benefits of managing code and data together, and how it enables the Write-Audit-Publish (WAP) design pattern out of the box for your data pipelines, so no bad data passes into production.

GitOps examples

Imagine a team that has identified a bug in one of its data assets and is also tasked with a feature request: adding new fields to the asset so it can be debugged more easily in the future.

Always working on production data in an isolated environment

Sarah begins working on the feature request in the development environment. She can either work with sample data from dev, at the cost of not covering all edge cases, or ask for a fresh production copy to be generated in a new development schema.

In contrast, with a system that bundles data and code together, creating a new feature branch instantly gives you the latest copy of production data in an isolated environment, out of the box.

You always work on the latest production data in an isolated environment when creating new branches.
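
To make this concrete, here is a minimal, hypothetical sketch of what branch-based environments could look like on a warehouse that supports zero-copy cloning (Snowflake syntax shown). The connection object, schema names, and branch naming are illustrative assumptions, not Y42’s actual implementation.

```python
# Hypothetical sketch: provisioning an isolated dev environment for a new
# branch on a warehouse with zero-copy cloning (Snowflake syntax). The
# DB-API-style `conn` and all names are assumptions for illustration.

def create_branch_environment(conn, branch: str) -> None:
    # A zero-copy clone copies only metadata, so the new schema reflects
    # the latest production data instantly, with no data duplicated.
    conn.cursor().execute(f"CREATE SCHEMA dev_{branch} CLONE analytics")
```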

Collaboration

Meanwhile, another member of the team, Bob, gets assigned the other issue: the bug. Bob finds a quick solution and submits a pull request (PR) – the culprit was a wrongly applied filter. All CI checks pass and, given its urgency, the fix is merged into production immediately.

Sarah's pull request now needs to pull in the latest changes that were merged into production.

Sarah is notified of the new changes merged into production.

Zero-compute deployment

Everything looks good with the pull request Sarah created, so it’s time to merge it as well. When the code is merged, we have two options:

  • Rebuild the asset according to the new definition.
  • Repoint the asset to the version materialized during development. This is possible using the Virtual Data Builds mechanism. In other words, you don’t need to generate the same data twice – once for development and once for production. Another benefit of reusing the asset is that the deployment is instant.
Deployment becomes a zero-compute, instant operation by reusing assets created during development.
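
As a hedged illustration of the second option, the repointing could look like this on a warehouse with metadata-only cloning (Snowflake syntax; the schema and asset names are assumptions, and this is not Y42’s internal Virtual Data Builds code):

```python
# Hypothetical sketch of zero-compute deployment: on merge, production
# adopts the asset version already built (and audited) on the feature
# branch instead of rebuilding it. Names are assumptions.

def deploy_on_merge(conn, branch: str, asset: str) -> None:
    # No transformation query is re-run: the production table simply
    # adopts the existing dev materialization, so deployment is instant.
    conn.cursor().execute(
        f"CREATE OR REPLACE TABLE analytics.{asset} CLONE dev_{branch}.{asset}"
    )
```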

Preventing faulty builds from going into production

The pipeline is triggered again the next day. John, an analyst, gets a notification that the pipeline is now broken. Wouldn’t it be great if we could have prevented the faulty build from getting into production in the first place?

Y42 protects the validity of the production system by preventing the faulty build from going live.

Restore previous version instantly

Bob looks at the faulty pipeline build and realizes the filter applied was too broad, which allowed unexpected values to enter the production system over the last 24 hours. He decides to revert the commit that introduced the new filter. Bob faces two options:

  • Revert the code, and reprocess/backfill the data according to the reverted definition.
  • Use a system that ties code and data together, and when the code is reverted, the data associated with the commit is also restored automatically.
In Y42 you get a full commit history of all your changes and can revert to any commit you want. The data warehouse state is also reverted to match the codebase state.
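
The second option can be sketched as follows, assuming the system keeps a mapping from each commit SHA to the physical table it materialized (the mapping and the view swap are illustrative assumptions, not Y42’s internal API):

```python
# Hypothetical sketch of reverting data together with code.

def revert_asset(conn, commit_to_table: dict[str, str], asset: str, commit_sha: str) -> None:
    # Repoint the consumer-facing view at the materialization recorded for
    # the target commit; no reprocessing or backfill is required.
    table = commit_to_table[commit_sha]
    conn.cursor().execute(
        f"CREATE OR REPLACE VIEW analytics.{asset} AS SELECT * FROM {table}"
    )
```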

Benefits of GitOps for Data

As showcased above, using a system that unifies code and data provides multiple benefits:

  • Less mental overhead. Code becomes the single source of truth. Every code change introduced is automatically reflected in the data warehouse.
  • Full traceability. By coupling code and data, you can preview the different states a data asset went through and the code commit that introduced the new materialization in the data warehouse.
  • Efficient and instant deployments and rollbacks. By pinpointing a specific commit to a specific materialization of a data asset in the data warehouse, you can reuse assets from development environments in your production environment.
  • Reduced data warehouse costs. Reusing assets means we can significantly cut down compute costs. Instead of rebuilding assets whenever we deploy, we can clone assets from other branches and swap their references.
  • Developing on production data in an isolated environment. You can branch off from the production environment and develop new features in an isolated environment, with zero compute cost to replicate production.
  • No more bad data. Pipelines can fail due to three possible causes:
    • Code errors: If your production is gated and the only way to change its state is through Git, you can prevent bad code merges by running specific linters that check for syntax errors or invalid model references.

    • Input data errors: Another reason is having bad data generated upstream. In such cases, we want to halt execution as soon as possible. Only the first-line assets should be affected, while any downstream asset should be marked as having stale data. For the first-line assets, we could restore the most recent healthy state until the problem is fixed.

    • Logic errors: An incorrect join or a metric formula needing to be reverted to a previous state. In such cases, we would want to restore the previous state of an asset.

      In all three cases, using Git as a proxy for all changes pushed to the data warehouse, and tying each commit to a state of the data warehouse, allows us to prevent these issues or, as in the third case, correct them immediately. A minimal sketch of a pre-merge gate for the first case follows.
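As an example of gating the first cause, a CI job can refuse the merge unless the project still compiles (dbt is shown as an example toolchain; the project layout is an assumption):

```python
# Hypothetical sketch of a pre-merge gate for code errors.
import subprocess

def code_gate() -> bool:
    # `dbt compile` fails on SQL syntax errors and invalid model
    # references, so a broken change can be blocked before it reaches
    # the gated production branch.
    result = subprocess.run(["dbt", "compile"], capture_output=True, text=True)
    return result.returncode == 0
```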

Write-Audit-Publish pattern

Write-Audit-Publish (WAP) is a data quality pattern popularized by Netflix that gives data engineering teams greater control over data quality. It achieves this by running quality checks after the data is processed but before it’s made available to downstream consumers (BI, data apps, ML, etc.).

As the name suggests, Write-Audit-Publish involves three steps, sketched in code after the list:

  1. Write: The data is processed into an isolated environment, where consumers don’t have access to it.
WAP - Write step
  2. Audit: Data quality checks are performed on the data in the isolated environment.
WAP - Audit step
  3. Publish: If all quality checks pass, the data is then made available to downstream consumers.
WAP - Publish step
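
Here is a minimal sketch of the three steps implemented by hand against a SQL warehouse. The table names, the quality checks, and the DB-API-style connection are illustrative assumptions; the atomic publish uses Snowflake’s `ALTER TABLE ... SWAP WITH`.

```python
# Hypothetical sketch of Write-Audit-Publish against a SQL warehouse.

def write(conn) -> None:
    # Write: materialize new data into a staging schema that downstream
    # consumers cannot see.
    conn.cursor().execute(
        "CREATE OR REPLACE TABLE staging.orders AS SELECT * FROM transformed.orders"
    )

def audit(conn) -> bool:
    # Audit: run quality checks against the staged data only.
    checks = [
        "SELECT COUNT(*) FROM staging.orders WHERE order_id IS NULL",  # no null keys
        "SELECT COUNT(*) FROM staging.orders WHERE amount < 0",        # no negative amounts
    ]
    return all(conn.cursor().execute(c).fetchone()[0] == 0 for c in checks)

def publish(conn) -> None:
    # Publish: atomically swap the audited table into the consumer-facing schema.
    conn.cursor().execute("ALTER TABLE analytics.orders SWAP WITH staging.orders")

def run_wap(conn) -> None:
    write(conn)
    if audit(conn):
        publish(conn)
    else:
        # Production stays untouched; the staged build is kept for debugging.
        raise RuntimeError("Audit failed: staging.orders quarantined for inspection")
```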

By using this intermediate layer to validate data, the WAP pattern prevents bad data from entering production systems.

The Write-Audit-Publish (WAP) pattern

This contrasts with the more commonly used Write-(Publish)-Audit pattern, where data is processed and made available to all consumers before being tested, by which point the data might already be incorrect. It’s a 'deploy and hope everything is fine' process that often leaves engineers uneasy every time the data pipeline runs.

The Write-Publish-Audit (WPA) pattern

However, despite its advantages, implementing the WAP pattern across your data pipelines requires some adjustments and must be done explicitly for each pipeline. Furthermore, it involves additional computation that leads to higher costs, due to the need to move data from the temporary/audit area to the final reporting/consumer area instead of writing it once.

Write-Audit-Publish pattern in Y42, enabled by default

How can we automate the process of deploying new code changes with confidence, and ensure the data flows reliably at every run before publishing it to end users? The solution involves three foundational components:

  1. Tracking all data changes for an asset, similar to how Git works for code.
  2. Implementing quality gates.
  3. Automatically reverting to any previous state of an asset that doesn’t successfully pass all quality gates.

Let’s see how. To track all data changes for an asset, we first need a unified system that oversees both code and data. This system tracks all code and data changes together and maintains their lineage in a single graph – similar to Git’s tracking mechanism for code changes.

The WAP pattern in Y42 - tracking all asset states.
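
As a rough illustration, such a unified record could link each commit to the physical table version it produced, along with its lineage (field names are assumptions, not Y42’s internal schema):

```python
# Hypothetical sketch of a unified code-and-data lineage record.
from dataclasses import dataclass, field

@dataclass
class AssetVersion:
    commit_sha: str       # the code change that produced this version
    asset: str            # logical asset name, e.g. "analytics.orders"
    physical_table: str   # physical materialization in the warehouse
    upstream: list["AssetVersion"] = field(default_factory=list)  # lineage edges
```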

Secondly, the system embeds quality gates – assertion tests, anomaly detection tests, access control via grants and tags, data-diffs, and unit tests – to determine the validity of each asset version.

The WAP pattern in Y42 - adding quality gates.

Lastly, the system automatically protects the integrity of the production environment by preventing faulty builds – builds that fail to pass all quality checks – from going live. Faulty builds can be inspected in a separate schema for debugging purposes.

The WAP pattern in Y42 - an intelligent asset-based orchestrator that builds the pipeline and publishes only the fully audited versions.

By integrating these three concepts, we provide a solution that enables the Write-Audit-Publish pattern out of the box for all data pipelines. The data is written, audited, and if it fails to meet the quality standards, the system prevents the new asset version from going live. This protects downstream consumers from seeing incorrect data, while allowing for debugging of the faulty asset in an isolated environment.

The WAP pattern in Y42.

Summary

We have explored how applying the GitOps methodology to Data enables the Write-Audit-Publish (WAP) design pattern by default for data pipelines with no additional code changes, so no bad data passes into production.

The WAP pattern prevents bad data from entering the production system holistically, by addressing data quality issues before they reach production.
