Madison Schott
Guest Writer (Analytics Engineer)
03 Feb, 2023 · 6 min read

How to effectively use Git flows for version control in your data products


Imagine you have just spent the week updating a data model to fit new naming standards that were put in place. Eager to finish, you push the code straight to production. You go to test the production data pipeline, but it turns out it’s broken. Now you need to spend even more time going through all the changes you made and testing each one in the hope of finding the change that broke the data model.

Most of us have experienced something similar to this. It’s extremely stressful and a situation we try to avoid at all costs. Luckily, testing code changes in different environments before pushing them to production can help prevent this from happening. Environments allow us to maintain multiple versions of the same code, each isolating a different set of changes, so we can test code at several levels before it can impact production.

The trouble is that working on pipelines in parallel is challenging. But there are DevOps practices that address the challenges we often face when using different environments. Git makes it easier to work on pipelines in parallel through the branching and CI/CD workflows that have been built around it. In this article, we will look at these Git flows and how they affect version control for our data products.

Brief history of Git workflows for code

Let’s first discuss how Git workflows and their various branching strategies came to exist and the challenges they solve.

GitFlow

GitFlow was originally created in 2010, over a decade ago. This flow is the one that probably comes to mind when you think of Git because it’s typically the first strategy that you learn as a software or data engineer. GitFlow consists of a main branch and a develop branch. The main branch is for production code, while the develop branch is for pre-production code. Code on the develop branch eventually gets merged to main when it’s pushed to production.

Three other branches are typically used during the development process: feature, hotfix, and release. The feature branch is used to develop specific features related to the upcoming production release. A hotfix branch is created when there is a production issue, and the code here is a reaction to whatever went wrong in production. A release branch is created to prepare for a production release and typically involves small bug fixes.
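To make the pattern concrete, here is a minimal sketch of the GitFlow commands; the branch names and version number are illustrative, not taken from any particular project.

```bash
# A minimal GitFlow sketch; branch names are illustrative

# The long-lived develop branch holds pre-production code
git checkout -b develop main

# Develop a feature for the upcoming release off develop
git checkout -b feature/rename-models develop
# ...commit changes, then fold them back into develop...
git checkout develop
git merge --no-ff feature/rename-models

# Cut a release branch to prepare and stabilize a production release
git checkout -b release/1.4.0 develop
git checkout main
git merge --no-ff release/1.4.0     # ship to production
git checkout develop
git merge --no-ff release/1.4.0     # carry release fixes back to develop

# React to a production issue with a hotfix off main
git checkout -b hotfix/broken-model main
# ...fix, then merge into both main and develop...
```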

While these branches create an easy-to-follow pattern that makes sense in the grand scheme of things, they also create a history that is hard to read. Most importantly, the use of release branches forked from the develop branch makes a CI/CD approach more challenging: because release code is written off of develop, it can drift from what is in production and needs to be reconciled before it can merge into the main branch.

GitHub Flow

This workflow was created just a year after GitFlow. The idea behind it is that anything on the main branch can be deployed to production. Instead of having to create a branch with one of GitFlow’s prescribed names, you can create a branch from main with a descriptive name that explains the code on that branch.

You then regularly push your changes to this branch. If you need help or want to merge what you have to main, you can create a pull request. This is essentially a code review where someone can see the changes you made and choose whether to suggest corrections or approve the code for merging. Once merged to main, the code should be immediately pushed to production.
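A minimal sketch of that loop might look like this; the branch name is illustrative, and the pull-request step assumes the GitHub CLI is installed.

```bash
# A minimal GitHub Flow sketch; the branch name is illustrative
git checkout main
git pull origin main

# Branch off main with a descriptive name
git checkout -b rename-customer-models
# ...commit and push regularly...
git push -u origin rename-customer-models

# Open a pull request for review (assumes the GitHub CLI is installed)
gh pr create --base main --fill

# Once approved and merged to main, the code ships to production
```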

Unlike GitFlow, GitHub Flow allows for easier adoption of CI/CD. However, it can also make production less stable, since every change merged to main is deployed right away. It also does not address the need for separate environments, releases, or features.

GitLab Flow

GitLab Flow was created in 2014. It is essentially a new and improved version of GitHub Flow that allows the use of environments. With this flow, all commits are tested on all branches, and deployments happen automatically based on the branch. Everyone must create a merge request (GitLab’s term for a pull request) that needs to be approved before merging to main. Commit messages must be descriptive. Every branch starts from main, and bugs are also fixed directly on main.
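Here is a minimal sketch of how this can look with environment branches; the branch names and the assumption that CI deploys each branch to its own environment are illustrative.

```bash
# A minimal GitLab Flow sketch; branch and environment names are illustrative

# Every change starts from main
git checkout -b fix-null-customer-ids main
# ...commit, push, and open a merge request into main...

# After the merge request is approved and merged, main is promoted
# downstream; CI deploys each environment branch automatically
git checkout pre-production
git merge main              # deploys to the pre-production environment
git checkout production
git merge pre-production    # deploys to the production environment
```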

You can build CI/CD pipelines for all branching strategies, but some workflows make it easier than others. The increased complexity of the flow and branches used here results in a process that is safer and keeps code organized.

The data product and version control

These branching strategies work well for software products. However, when building data products, branching for code changes is no longer enough. Data products include multiple data models, each of which produces a different data table that needs to be version controlled. Rather than just testing the code changes, we also need to test the data product itself to ensure that the resulting data tables are as expected.

Git branching in conjunction with environments allows you to not only version control the code itself, but also the data tables produced.

When using, for instance, a GitLab workflow with your data products, it is best to have three environments: development, staging/pre-production, and production. Development allows you to create and test new features in your code. Staging/pre-production allows you to test those changes against data that more closely resembles production; fixing the bugs that surface in this environment helps prevent bugs in production. Lastly, when the code has been thoroughly tested in each environment, you can merge it to the main branch with confidence.
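One way this could look in practice is sketched below; the schema names and the assumption that CI rebuilds the affected models on every merge are illustrative, not a prescribed setup.

```bash
# A sketch of mapping branches to data environments (names are illustrative)
#   feature branches -> models built into a development schema, e.g. analytics_dev
#   staging branch   -> models rebuilt against near-production data, e.g. analytics_staging
#   main             -> the production tables your consumers query

git checkout -b feature/new-revenue-model main
# ...develop and test the model against development data...

git checkout staging
git merge feature/new-revenue-model   # CI rebuilds and tests the tables in staging

git checkout main
git merge staging                     # promote only code that passed in every environment
```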

Environment-aware branching with Y42

Regardless of which Git branching strategy you choose, it is important to note that all branches should be environment-aware. While the typical approach to table materialization requires dropping and replacing tables in the same namespace, a tool offering a new alternative has emerged — Y42.

Y42’s materialization logic creates content-addressable tables in the data warehouse, thereby empowering its execution engine to determine if a data product on a new branch should create a new table or point to an existing one. This allows you to save on computation and storage costs.

Y42 offers all this in a way that’s fast and easy to implement. It works great when setting up advanced branching strategies in conjunction with zero-configuration environment management and makes these workflows accessible to analytics engineers and data analysts.

If you want to learn more about how branching works in Y42, check out their interactive demo.

About the author:

Madison is an analytics engineer with a passion for data, entrepreneurship, writing, education, and wellness. Her goal is to teach in a way that everyone can understand — whether you’re just starting out in your career or you’ve been working in engineering for 20 years. She is an avid writer on Medium and shares her thoughts on analytics engineering in her weekly newsletter.

About Y42

Y42 is an Integrated Data Development Environment (IDDE) purpose-built for analytics engineers. It helps companies easily design production-ready data pipelines (integrate, model, orchestrate) on top of their Google BigQuery or Snowflake cloud data warehouse. In addition to interactive, end-to-end lineage and embedded, dynamic documentation, DataOps best practices such as Virtual Data Builds are baked in to ensure true pipeline scalability.

It's the perfect choice for experienced data professionals who want to reduce their tooling overhead, collaborate with junior data staff, or (re)think their data stack from scratch.
