Explore how GitOps for Data helps us guardrail the data warehouse from usage outside of Git and prevent bad data from going live.

This is a two-part blog series. In the first part, we’ll cover what GitOps for Data is, the challenges of keeping code and data separate, and how it works.

In the second part, we’ll explore some practical examples of applying GitOps to Data, the benefits of managing code and data together, and how it enables the Write-Audit-Publish (WAP) design pattern out of the box for your data pipelines, so no bad data passes into production.

What is GitOps for Data?

GitOps for Data is a framework for managing data assets, leveraging Git as the single source of truth to deliver analytics as code. Every code change is validated during the Continuous Integration (CI) process, preventing syntax errors or invalid model references from being merged into the production branch.
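As a rough illustration, a CI job could run a check like the one below before a PR is allowed to merge. The `models/` directory layout and the `ref('...')` reference convention are assumptions made for the sake of the example, not a prescription of any particular tool:

```python
import re
import sys
from pathlib import Path

# Hypothetical layout: every model is a SQL file under models/, and models
# reference each other with a dbt-style ref('model_name') expression.
MODELS_DIR = Path("models")
REF_PATTERN = re.compile(r"ref\(\s*['\"](\w+)['\"]\s*\)")

def check_model_references() -> int:
    """Return the number of references that point to models that don't exist."""
    model_files = list(MODELS_DIR.rglob("*.sql"))
    known_models = {p.stem for p in model_files}
    errors = 0
    for path in model_files:
        for ref in REF_PATTERN.findall(path.read_text()):
            if ref not in known_models:
                print(f"{path}: reference to unknown model '{ref}'")
                errors += 1
    return errors

if __name__ == "__main__":
    # A non-zero exit code fails the CI job and blocks the merge.
    sys.exit(1 if check_model_references() else 0)
```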

In addition to using Git as the key mechanism to track code changes, GitOps for Data also involves version controlling your data. It unifies code and data under a single tracking system: from logical changes in your codebase, to the physical materialization of those changes in the data warehouse, to the handling of operational changes across multiple environments, everything is governed by Git alone.

Version control code and data together.

GitOps for Data ensures that data projects are reproducible based on the state of the Git repository. Pull requests (PRs) modify the state of the Git repository; once approved and merged, they automatically trigger updates to both the codebase and the data warehouse state. In other words: once the code change goes live, so does your data.

The challenges of keeping data and code separate

The alternative, and often the status quo, is to maintain two versions of the truth: one in the codebase, and one in the data warehouse. This leads to a wide range of issues that only compound over time.

Once the data warehouse state drifts apart from the codebase state, you face inconsistencies that are hard to remediate. What do you do when a request comes in for a new code change? Do you take the current code version and make the necessary changes? Or do you start from the data warehouse version, work out why it's out of sync with the codebase, update the code to match the data warehouse state, and only then apply the new changes? Or do you do it the other way around, syncing the data warehouse from the codebase? How do you know which one holds the latest version? Debugging becomes a nightmare. What was a simple feature request turns into a conflict resolution exercise between the code state and the data warehouse state.

Furthermore, how can we ensure everyone is developing on the latest version of the codebase when cloning the production environment in development? What if we need to roll back a change on an out-of-sync project?

All these issues stem from running two isolated systems in parallel. A stateful system – where the state of the codebase is always in sync with the state of the data warehouse – is the best solution to this problem.

Stateful is a term used to describe a system or object that retains information, or “state”, across interactions over time, from one operation to the next. This memory of past events can influence future operations. This is in contrast to stateless systems, which do not retain any information between interactions.

In the context of data engineering and pipelines, stateful refers to the fact that data assets can retain information across the different stages of the asset lifecycle, transformations, or over time. This stateful nature of data assets means that the output at any stage of the pipeline may be influenced by the accumulated state of the data up to that point.
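A minimal sketch of the difference, using an incremental pipeline stage as the stateful example (the function names and fields are made up for illustration):

```python
# Stateless: the output depends only on the current input batch.
def clean_batch(batch: list[dict]) -> list[dict]:
    return [row for row in batch if row.get("amount") is not None]

# Stateful: the output also depends on state accumulated from earlier runs,
# e.g. a high-water mark used for incremental loads.
class IncrementalLoader:
    def __init__(self) -> None:
        self.last_loaded_at = None  # state retained between runs

    def load(self, batch: list[dict]) -> list[dict]:
        new_rows = [
            row for row in batch
            if self.last_loaded_at is None or row["loaded_at"] > self.last_loaded_at
        ]
        if new_rows:
            self.last_loaded_at = max(row["loaded_at"] for row in new_rows)
        return new_rows
```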

How GitOps for Data works

Everything as Code

At the core of the GitOps for Data mechanism is Git. Every user action translates into a corresponding version-controlled YAML config file, from sources and models to snapshots, seeds, tests, schedulers, and alerts.

Everything as Code.

At Y42, we follow the Everything as Code mantra. You can find what this entails, along with an end-to-end example of how analytics as code looks in practice, here.
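For illustration only, a model definition in such a setup might be captured in a small YAML file that lives in the repository next to the SQL. The exact fields below are hypothetical, not Y42's actual schema; the point is that the config is plain text and every edit to it goes through a commit and a PR:

```python
import yaml  # PyYAML

# A hypothetical, version-controlled model config.
MODEL_CONFIG = """
name: orders_daily
type: model
depends_on:
  - stg_orders
materialization: table
tests:
  - not_null: order_id
  - unique: order_id
schedule: "0 6 * * *"
alerts:
  - on: failure
    channel: "#data-alerts"
"""

config = yaml.safe_load(MODEL_CONFIG)
print(config["name"], config["depends_on"], config["schedule"])
```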

Stateful systems

Secondly, to extend the Git concept from code to data and easily access previous states of the data warehouse, we need a stateful system. This system is responsible for keeping the state of the codebase in sync with the state of the data warehouse. With this approach, we can work solely on the codebase while any changes we make are automatically reflected in the data warehouse. For example, reverting a commit in the codebase rolls the data warehouse back to a previous state. Similarly, merging changes from a feature branch into the main branch automatically updates the production environment so it stays in sync with the merge. And whenever we create a branch, we can work in an isolated environment using production-like data rather than stale or sample data.
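A toy model of that behavior: each commit maps to a snapshot of the warehouse state, so Git operations such as revert or merge translate directly into warehouse operations. All names and structures here are illustrative only, not the actual implementation:

```python
# Toy stateful system: each commit hash maps to the warehouse state
# (here, just a mapping of model name -> physical table id).
warehouse_states = {
    "a1f3": {"orders": "orders__a1f3", "customers": "customers__a1f3"},
    "b7c2": {"orders": "orders__b7c2", "customers": "customers__a1f3"},
}

# Each branch (environment) points at exactly one commit.
branches = {"main": "b7c2", "feature/orders-fix": "a1f3"}

def checkout(branch: str) -> dict:
    """Return the warehouse state the environment should point at."""
    return warehouse_states[branches[branch]]

def revert(branch: str, commit: str) -> dict:
    """Reverting moves the branch, and its environment, back to an older state."""
    branches[branch] = commit
    return checkout(branch)

def merge(source: str, target: str) -> dict:
    """Merging a feature branch updates the target environment in lockstep."""
    branches[target] = branches[source]
    return checkout(target)

print(revert("main", "a1f3"))              # warehouse rolls back with the code
print(merge("feature/orders-fix", "main"))  # production follows the merge
```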

By keeping our codebase and data warehouse in sync, we create a way to navigate through the states of our data warehouse simply by operating on the codebase. The codebase becomes the single source of truth for all data warehouse operations.

Code as the single source of truth for managing the data warehouse ops.

You can read more about the differences between stateful and stateless systems and how stateful systems make it easier to develop and maintain pipelines with confidence in this article.

Virtual Data Builds

However, version controlling data at scale is challenging – you cannot store terabytes of data in Git, nor can you update the repository every time data changes.

An alternative, more efficient approach involves linking code commits to tables in the data warehouse. But we don’t want to rename materialized tables every time we make a change just to preserve history. Therefore, we need two schemas:

  • An internal schema, where every commit's materializations are stored as tables
  • An external, customer-facing schema, where views or clones point to the appropriate materialization in the internal schema, based on where you are in the commit tree.

This two-layer approach gives us the best of both worlds. The data is fully version-controlled, allowing us to painlessly revert to a previous state of our codebase. At the same time, we have appropriate customer-facing assets that point to the right materializations.
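A rough sketch of how such a two-layer setup could look, with commit-addressed tables in the internal schema and views in the customer-facing schema. The schema names, naming scheme, and SQL are assumptions for illustration, not the actual mechanism:

```python
# Internal schema: one physical table per (model, commit) pair.
# External schema: a view per model that points at the table for the
# commit currently checked out in the environment.

def internal_table(model: str, commit: str) -> str:
    return f"internal.{model}__{commit}"

def publish_views(models: list[str], commit: str) -> list[str]:
    """Generate DDL that repoints the customer-facing views to the
    materializations produced for a given commit."""
    return [
        f"CREATE OR REPLACE VIEW analytics.{model} AS "
        f"SELECT * FROM {internal_table(model, commit)};"
        for model in models
    ]

# Switching the environment to another commit just re-runs this:
# no data is copied, only view definitions change.
for stmt in publish_views(["orders", "customers"], commit="b7c2"):
    print(stmt)
```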

This underlying mechanism, which maintains the whole system and keeps the state of the data warehouse in sync with the codebase, is called Virtual Data Builds. You can read more about it here.

How Virtual Data Builds mechanism works.

Summary

GitOps for Data is a powerful workflow mechanism for managing data assets. It connects the state of the codebase with the state of the data warehouse, allowing us to guardrail the data warehouse from usage outside of Git. This way, Git becomes the control center for our data warehouse.

In the second part, we’ll look into the benefits of using GitOps for Data, such as complete traceability and control over all data operations (you can navigate to any previous state through code commits), preventing bad data from entering the data warehouse, and zero-copy deployments and rollbacks. Moreover, we’ll look at some concrete examples of applying the GitOps for Data paradigm and how it unlocks the Write-Audit-Publish (WAP) pattern out of the box for your data pipelines, ensuring that no bad data passes into production.
