How to effectively use Git flows for version control in your data products

Imagine you have just spent the week updating a data model to fit new naming standards that were put in place. Eager to finish, you push the code straight to production. You go to test the production data pipeline, but it turns out it’s broken. Now you need to spend even more time going through all the changes you made and testing each one in the hope of finding the change that broke the data model.

Most of us have experienced something similar to this. It’s extremely stressful and a situation we try to avoid at all costs. Luckily, testing code changes in different environments before pushing them to production can help prevent this from happening. Environments let us keep multiple versions of the same code side by side, each holding a different set of changes, so we can test code at several levels before it can impact production.

The trouble is that working on pipelines in parallel is challenging. However, several DevOps practices address the challenges we often face when using different environments. Git workflows, with their branching strategies and CI/CD conventions, have made it easier to work on pipelines in parallel. In this article, we will look at these Git flows and how they affect version control for our data products.

Brief history of Git workflows for code

Let’s first discuss how Git workflows and their various branching strategies came to exist and the challenges they solve.

GitFlow

GitFlow was originally created over 10 years ago. This flow is the one that probably comes to mind when you think of Git because it’s typically the first strategy that you learn as a software or data engineer. GitFlow consists of a main branch and a develop branch. The main branch is for production code, while the develop branch is for pre-production code. Code on the develop branch eventually gets merged to main when it’s pushed to production.

Three other branches are typically used during the development process: feature, hotfix, and release. The feature branch is used to develop specific features related to the upcoming production release. A hotfix branch is created when there is a production issue, and the code here is a reaction to whatever went wrong in production. A release branch is created to prepare for a production release and typically involves small bug fixes.
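The branch types above can be summarized as a small lookup table. The sketch below is illustrative only: it assumes the conventional GitFlow naming scheme (a "feature/", "release/", or "hotfix/" prefix) and the conventional base and merge targets, and the names BranchRule, GITFLOW_RULES, and rule_for are made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BranchRule:
    base: str                    # branch this type is created from
    merge_into: Tuple[str, ...]  # branches it is merged back into


# Conventional GitFlow rules, keyed by branch-name prefix (an assumption,
# not something Git itself enforces).
GITFLOW_RULES = {
    "feature": BranchRule(base="develop", merge_into=("develop",)),
    "release": BranchRule(base="develop", merge_into=("main", "develop")),
    "hotfix": BranchRule(base="main", merge_into=("main", "develop")),
}


def rule_for(branch: str) -> BranchRule:
    """Look up the rule for a branch named like 'feature/rename-columns'."""
    prefix, sep, name = branch.partition("/")
    if not sep or not name or prefix not in GITFLOW_RULES:
        raise ValueError(f"not a GitFlow supporting branch: {branch!r}")
    return GITFLOW_RULES[prefix]
```

For example, rule_for("hotfix/fix-null-keys") reports that the branch starts from main and merges back into both main and develop, which is what keeps an emergency production fix from being lost in the next release.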

While these branches create an easy-to-follow pattern that makes sense in the grand scheme of things, they also create a history that is hard to read. Most importantly, forking release branches from the develop branch makes a CI/CD approach more challenging: code written off the develop branch may have to change before it can merge into main if develop has drifted from what’s in production.

GitHub Flow

This workflow was created just a year after the original. The idea behind it is that anything on the main branch can be deployed to production. Instead of having to create a branch with one of the three names, you can create a branch from main with a descriptive name that explains the code on that branch.

You then regularly push your changes to this branch. If you need help or want to merge what you have to main, you can create a pull request. This is essentially a code review where someone can see the changes you made and choose whether to suggest corrections or approve the code for merging. Once merged to main, the code should be immediately pushed to production.

Unlike GitFlow, GitHub Flow allows for easier adoption of CI/CD. However, because every merge to main goes straight to production, frequent code changes can make the production environment unstable. It also does not address different environments, releases, or features.

GitLab Flow

In 2014, GitLab Flow was created. This is essentially a new and improved version of GitHub Flow that allows the use of environments. With this flow, every commit is tested on every branch, and deployments are automatic, based on the branch. Everyone must create a pull request (called a merge request in GitLab), which needs to be approved before merging to main. Commit messages must be descriptive. Every branch starts from main, and bugs are also fixed directly on main.
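To make "tested on every branch, deployed based on the branch" concrete, here is a minimal sketch of how a CI job might decide what to do for a given branch. The branch and environment names in BRANCH_ENVIRONMENTS are assumptions chosen for illustration, not a fixed part of GitLab Flow, and the functions are hypothetical helpers rather than any CI system's API.

```python
from typing import List, Optional

# Hypothetical mapping of long-lived branches to deployment targets.
BRANCH_ENVIRONMENTS = {
    "main": "staging",
    "pre-production": "pre-production",
    "production": "production",
}


def deploy_target(branch: str) -> Optional[str]:
    """Return the environment a push to `branch` deploys to, or None."""
    return BRANCH_ENVIRONMENTS.get(branch)


def pipeline_steps(branch: str) -> List[str]:
    """List the CI steps for a push to `branch`."""
    steps = ["test"]  # every commit on every branch is tested
    target = deploy_target(branch)
    if target is not None:
        steps.append(f"deploy:{target}")  # only mapped branches deploy
    return steps
```

Under these assumptions, a push to a short-lived branch such as "feature/new-naming" only runs tests, while a push to "main" runs tests and then deploys to staging.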

You can build CI/CD pipelines for all branching strategies, but some workflows make it easier than others. The increased complexity of the flow and branches used here results in a process that is safer and keeps code organized.

The data product and version control

These branching strategies work well for software products. However, when building data products, branching for code changes is no longer enough. Data products include multiple data models, each of which produces a different data table that needs to be version controlled. Rather than just testing the code changes, we also need to test the data product itself to ensure that the resulting data tables are as expected.

Git branching in conjunction with environments allows you to not only version control the code itself, but also the data tables produced.

When using, for instance, a GitLab workflow with your data products, it is best to have three environments: development, staging (pre-production), and production. Development allows you to create and test new features in your code. Staging allows you to test those changes against data that more closely resembles production, and fixing the bugs that come up in this environment will help prevent bugs in production. Lastly, when the code has been thoroughly tested in each environment, you can merge it to the main branch with confidence.
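Testing the data product itself means running checks against the tables each environment produces, not only against the code. The sketch below shows one way this could look, using in-memory SQLite databases as stand-ins for per-environment warehouses. The specific checks (non-empty table, no null or duplicate keys) and the function names are assumptions for illustration; real projects will have their own expectations.

```python
import sqlite3


def check_table(conn, table, key):
    """Return a list of problems found in `table`; empty means it passed."""
    problems = []
    # Identifiers are interpolated directly, so only pass trusted names.
    total, = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if total == 0:
        problems.append("table is empty")
    nulls, = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL"
    ).fetchone()
    if nulls:
        problems.append(f"{nulls} null {key} value(s)")
    # COUNT(DISTINCT ...) ignores NULLs, so compare against non-null rows.
    distinct, = conn.execute(
        f"SELECT COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    if distinct < total - nulls:
        problems.append(f"duplicate {key} values")
    return problems


def first_failing_environment(connections, table, key):
    """Check environments in order; return (name, problems) or None."""
    for name, conn in connections.items():
        problems = check_table(conn, table, key)
        if problems:
            return name, problems
    return None
```

Passing a dictionary ordered as development, then staging, mirrors the promotion path: a change is only a candidate for main once first_failing_environment returns None.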

Environment-aware branching with Y42

Regardless of which Git branching strategy you choose, all branches should be environment-aware. The typical approach to table materialization requires dropping and replacing tables in the same namespace. Y42 is a tool that offers an alternative.

Y42’s materialization logic creates content-addressable tables in the data warehouse, thereby empowering its execution engine to determine if a data product on a new branch should create a new table or point to an existing one. This allows you to save on computation and storage costs.
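The general idea of content-addressable tables can be sketched in a few lines. This is not Y42's implementation, whose internals are not described here; it is a generic illustration that assumes a table's address is a hash of the model's SQL plus the addresses of its upstream tables. The function names and the naming format are hypothetical.

```python
import hashlib


def table_address(model_name, sql, upstream_addresses=()):
    """Derive a table name from the model's SQL and its inputs' addresses."""
    digest = hashlib.sha256()
    digest.update(sql.strip().encode("utf-8"))
    for address in sorted(upstream_addresses):
        digest.update(b"\x00" + address.encode("utf-8"))
    return f"{model_name}__{digest.hexdigest()[:12]}"


def materialize(model_name, sql, existing_tables, upstream_addresses=()):
    """Return (table_name, built), reusing a table if its address exists."""
    name = table_address(model_name, sql, upstream_addresses)
    if name in existing_tables:
        return name, False  # identical content already built; point at it
    existing_tables.add(name)  # stand-in for running the query
    return name, True
```

With this scheme, a new branch that leaves a model and its inputs unchanged resolves to the same address as main and triggers no rebuild, while any edit to the SQL or to an upstream model yields a new address and a new table.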

Y42 offers all this in a way that’s fast and easy to implement. It works great when setting up advanced branching strategies in conjunction with zero-configuration environment management and makes these workflows accessible to analytics engineers and data analysts.