Learn how Virtual Data Builds can help you reduce warehouse costs by up to 68%, depending on your setup.

Since introducing the concept of Virtual Data Builds: A data warehouse environment for every Git commit, we have received many inquiries about the actual savings on data warehouse costs when enabling Virtual Data Builds.

In this article, we aim to shed some light on the topic and conduct a cost analysis to evaluate the savings potential of utilizing Virtual Data Builds (VDBs) in your analytics workflows for different organizational setups. To make it interactive, we've also built a simple application that you can use to estimate the cost savings based on your setup.

→ ACCESS THE COST SAVINGS CALCULATOR HERE

What are Virtual Data Builds?

Virtual Data Builds (VDBs) allow you to manage code changes and the state of your data warehouse from a single system driven by Git. VDBs facilitate the creation of virtual data warehouse environments for each Git commit in your codebase without the need to duplicate data or set up new physical environments. This is achieved through a smart view pointer swap mechanism at the data warehouse level, which keeps compute costs down.

This deep integration between your codebase and your data warehouse state brings several benefits:

  • Work on production data in your feature branch without duplicating data or impacting the production environment/branch, at no additional cost.
  • Instant deployment and rollback of your code changes. Your asset’s view definition can be updated to reference any previous materialization (from any branch).
  • Less mental overhead: VDBs convert your code operations into data warehouse assets, saving you the pain of managing environments and promoting code throughout the pipeline.
  • Reliable data: Using VDBs, the system always displays the last successful build of your assets, preventing end users from reporting based on inaccurate data.
  • Significant savings on warehouse costs, ranging from 37% to 68% depending on your setup, as we will explore further.

You can read more about the advantages of Virtual Data Builds in this article.

Benchmarking setup - establishing 3 data team scenarios

To establish a tangible benchmark, we categorized teams into small, medium, and large sizes, each possessing unique characteristics that influence various aspects of a data project, from the number of code changes to the number of environments.

  • Small teams (1-10 members)
    • Data models: 20-500 [1]
    • Total data volume: Up to 1TB [2]
    • Code changes: 10-50 commits per month
    • Number of environments: 1-2 (e.g., development and production)
    • Rollbacks: Up to 1 time per month.
  • Medium teams (10-50 members)
    • Data models: 500-3000
    • Total data volume: 1TB-10TB
    • Code changes: 50-400 commits per month
    • Number of environments: 4 (e.g., development, qa, staging and production)
    • Rollbacks: 2-5 times per month.
  • Large teams (50+ members)
    • Data models: 3000+
    • Total data volume: 10TB+
    • Code changes: 400+ commits per month
    • Number of environments: 5+ (multiple stages including testing, QA, staging, pre-production, production)
    • Rollbacks: 5+ times per month.

Customizing for your team

Your team setup might not fit exactly into any of the three buckets. Use these parameters as a guideline and adjust them to match your setup's specifics.

Benchmarking methodology

We conducted an analysis breaking down the total data warehouse cost into storage and compute costs, and comparing scenarios with VDBs enabled and disabled. Initially, we considered examining different data warehouse solutions, including BigQuery and Snowflake, and providing precise dollar amounts. However, due to the heterogeneous nature of different setups, we decided to standardize the measurement using compute and storage units.

Now, let's explore the math behind the compute and storage costs under both scenarios:

Compute cost

Compute cost is a function of code changes, rollbacks, and the number of environments.

Compute costs without VDBs

Compute cost = (Code changes + Rollbacks) × Number of environments

Whenever you make a code change, you need to promote it through all environments (dev, qa, staging, production). Similarly, any rollback requires a reverse-order downgrade (production, staging, qa, dev) before you can move on to the next item.

*The assumption is that all code changes are equally weighted.

Compute costs with VDBs

Compute cost = Code changes

With VDBs enabled, rollbacks do not add to the overall compute cost, as they are managed by reverting the asset definition to a prior materialization.

This principle also applies to promoting code changes to higher environments. Whenever you want to promote a change to a higher environment, the system reuses the materialization from the lower environment and switches the higher environment's asset definition to point to it. In other words, when you push a change from dev (lower environment) to staging (higher environment), the staging asset reuses the dev materialization (table) without recreating it in the staging environment.
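
To make the two formulas concrete, here is a minimal Python sketch of the compute-unit model described above (the function and parameter names are ours, purely for illustration):

def monthly_compute_units(code_changes: int, rollbacks: int, environments: int, vdb_enabled: bool) -> int:
    """Compute units consumed per month under the benchmark assumptions."""
    if vdb_enabled:
        # Promotions and rollbacks only swap view definitions, so only the
        # initial build of each code change consumes compute.
        return code_changes
    # Without VDBs, every change and rollback is re-run in every environment.
    return (code_changes + rollbacks) * environments

# Example (medium-team parameters from the benchmark below):
print(monthly_compute_units(300, 3, 3, vdb_enabled=False))  # 909
print(monthly_compute_units(300, 3, 3, vdb_enabled=True))   # 300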

Storage cost

Storage cost is a function of the number of data models and overall data project size (volume).

Storage costs without VDBs

Storage cost = Data volume

In this scenario, storage doesn't increase with code changes. Although one might argue that storage could expand if new code changes modify the data model or integrate new data sources, this is not directly a consequence of the code modifications themselves. Instead, it is influenced by data in upstream sources or connected models. For example, adding a new column to your data model has minimal impact if the column exists upstream but hasn't been populated yet. In data warehouse setups employing columnar storage and compaction algorithms, the introduction of an empty column hardly affects the existing storage footprint. However, the storage will experience an increase once the new column starts being populated with data from upstream sources.

Storage costs with VDBs

Storage cost = Data volume + New materializations resulting from code changes

With VDBs enabled, each code modification leads to the creation of a new materialization (table) in your data warehouse. For instance, if you have a data model with 100 tables and you perform 20 code changes (assuming each change affects one table at a time), you will end up with 120 tables in the data warehouse and 100 views (each view corresponding to a data asset).
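
The same example can be expressed as a short sketch (illustrative names, assuming each code change rebuilds exactly one model, as above):

def vdb_object_counts(models: int, code_changes: int) -> tuple[int, int]:
    """Tables and views present after the changes, with VDBs enabled."""
    tables = models + code_changes  # each change adds one new materialization
    views = models                  # one public view per data asset, unchanged
    return tables, views

def vdb_storage_tb(data_volume_tb: float, models: int, code_changes: int) -> float:
    """Base data volume plus one average-sized materialization per change."""
    return data_volume_tb + (data_volume_tb / models) * code_changes

print(vdb_object_counts(100, 20))     # (120, 100) -> 120 tables, 100 views
print(vdb_storage_tb(10, 1000, 300))  # 13.0 TB (the medium-team example below)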

Virtual Data Builds mechanism explained

When you make a change, a new materialization (table) is created in the internal layer (lower) and the public view reference gets updated.

Virtual Data Builds mechanism explained - new materialization created.

When you merge the change, the public view no longer references the old materialization (the yellow table in the lower layer on the left), but the materialization itself is kept in case you want to revert your data asset to it. Unreferenced materializations are discarded from the system after a period of 30 days (customizable).

Compute-to-storage ratio

In order to calculate the total cost and see the benefit of VDBs, we must relate compute and storage units through a compute-to-storage ratio. This isn't a standardized value and varies greatly depending on your setup. For our calculations, we used data from two products in the space, Snowflake and Hightouch, as a basis. However, we encourage you to adjust this according to your actual data warehouse expenditure ratio.

  • Snowflake blog:

    • Monthly compute cost: 4,464 credits (@ $2/credit) = $8,928
    • Monthly storage cost: 65 TB (@ $23/TB) = $1,495
    • Compute to storage ratio: 6
  • Hightouch blog:

    • Monthly compute cost: 524 credits (@ $2.60/credit x 0.95 (5% discount on capacity pricing)) = $1,294
    • Monthly storage cost: 6TB x $23 (per TB per month) = $144
    • Compute to storage ratio: 8.6
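
If you want to derive the ratio from your own bill, it is simply monthly compute spend divided by monthly storage spend; a quick sketch using the Snowflake figures above:

def compute_to_storage_ratio(monthly_compute_usd: float, monthly_storage_usd: float) -> float:
    """Ratio of compute spend to storage spend for a given month."""
    return monthly_compute_usd / monthly_storage_usd

print(round(compute_to_storage_ratio(8928, 1495), 1))  # ~6.0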

Actual numbers

Now that we have gathered the three components - compute cost, storage cost, and the compute-to-storage ratio when VDBs are disabled - we can calculate the total cost for the small, medium, and large team setups under both scenarios: with VDBs enabled and disabled.

For our calculations, we'll use a compute-to-storage ratio of 7 and Snowflake's standard pricing of $23 per TB.

Small teams (1-10 members)

  • Data models: 200 [1]
  • Data volume: 100GB [2]
  • Code changes: 40 commits/month
  • Number of environments: 2 (e.g., development and production)
  • Rollbacks: 1/month

Without VDBs

  • Storage cost (SC): 100 GB * $23/TB = $2.30/month
  • Compute-to-storage ratio = 7 => Compute cost (CC) = $2.30 * 7 = $16.10/month
  • Compute:
    • Number of compute units (CU) = (40 code changes + 1 rollback) * 2 environments = 82 CU
    • Cost per compute unit (CPCU) = $16.10 / 82 ≈ $0.196
  • Total: $2.30 (SC) + $16.10 (CC) = $18.40/month

With VDBs

  • Storage cost (SC): 100 GB + (100 GB / 200 models) * 40 changes = 120 GB * $23/TB = $2.76/month
  • Compute cost (CC): 40 code changes = 40 compute units (CU) * $0.196 ≈ $7.85/month
  • Total: $2.76 (storage) + $7.85 (compute) ≈ $10.61/month

💲 Savings percentage: 100 * ($18.40 - $10.61) / $18.40 ≈ 42.3%, or $7.79/month

💡 With this setup, the benefits of Virtual Data Builds alone would not yield sufficient absolute cost savings to justify a significant tooling spend. Additional cost, use-case, or productivity benefits would be needed.
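
To make these calculations reproducible, here is a minimal Python sketch of the full cost model used in the worked examples; the medium- and large-team parameters below plug into the same function. The $23/TB price and ratio of 7 mirror the assumptions stated above, and all function and variable names are ours. Totals may differ from the hand-rounded figures by a dollar or so because the sketch does not round intermediate values.

STORAGE_PRICE_PER_TB = 23.0     # Snowflake standard pricing, $/TB/month
COMPUTE_TO_STORAGE_RATIO = 7.0  # assumed ratio; adjust to your own bill

def monthly_costs(data_volume_tb, models, code_changes, rollbacks, environments):
    # Baseline: VDBs disabled
    storage_without = data_volume_tb * STORAGE_PRICE_PER_TB
    compute_without = storage_without * COMPUTE_TO_STORAGE_RATIO
    compute_units = (code_changes + rollbacks) * environments
    cost_per_unit = compute_without / compute_units

    # VDBs enabled: one extra average-sized materialization per change,
    # and only the initial build of each change consumes compute
    storage_with = (data_volume_tb + (data_volume_tb / models) * code_changes) * STORAGE_PRICE_PER_TB
    compute_with = code_changes * cost_per_unit

    total_without = storage_without + compute_without
    total_with = storage_with + compute_with
    savings_pct = 100 * (total_without - total_with) / total_without
    return round(total_without, 2), round(total_with, 2), round(savings_pct, 1)

# Small team: 0.1 TB, 200 models, 40 changes, 1 rollback, 2 environments
print(monthly_costs(0.1, 200, 40, 1, 2))    # (18.4, 10.61, 42.3)
# Medium team
print(monthly_costs(10, 1000, 300, 3, 3))   # (1840.0, 830.35, 54.9)
# Large team
print(monthly_costs(100, 3000, 600, 5, 5))  # (18400.0, 5953.39, 67.6)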

Medium teams (10-50 members)

  • Data models: 1000
  • Data volume: 10TB
  • Code changes: 300 commits/month
  • Number of environments: 3
  • Rollbacks: 3/month

Without VDBs

  • Storage cost (SC): 10TB * $23/TB = $230/month
  • Compute-to-storage ratio = 7 => Compute cost (CC) = $230 * 7 = $1610/month
  • Compute:
    • Number of compute units (CU) = (300 code changes + 3 rollbacks) * 3 environments = 909 CU
    • Cost per compute unit (CPCU) = $1610/ 909 = $1.77
  • Total: $230 (SC) + $1610 (CC) = $1840/month

With VDBs

  • Storage cost (SC): 10TB + (10TB / 1000 models) * 300 changes = 13TB * $23 / TB = $299/month
  • Compute cost (CC): 300 code changes = 300 compute units (CU) * $1.77 = $531/month
  • Total: $299 (storage) + $531 (compute) = $830/month

💲 Savings percentage: 100 * ($1840 - $830) / $1840 ≈ 54.9%, or $1010/month

💡 With this setup, a tooling investment of $1,000/month would fully amortize itself as of month one and even be slightly ROI-positive.

Large teams (50+ members)

  • Data models: 3000
  • Data volume: 100TB
  • Code changes: 600 commits/month
  • Number of environments: 5
  • Rollbacks: 5/month

Without VDBs

  • Storage cost (SC): 100TB * $23/TB = $2300/month
  • Compute-to-storage ratio = 7 => Compute cost (CC) = $2300 * 7 = $16100/month
  • Compute:
    • Number of compute units (CU) = (600 code changes + 5 rollbacks) * 5 = 3025 CU
    • Cost per compute unit (CPCU) = $16100 / 3025 = $5.32
  • Total: $2300 (SC) + $16100 (CC) = $18400 / month

With VDBs

  • Storage cost (SC): 100TB + (100TB / 3000 models) * 600 changes = 120TB * $23 / TB = $2760/month
  • Compute cost (CC): 600 code changes = 600 compute units * $5.32 = $3192/month
  • Total: $2760 (storage) + $3192 (compute) = $5952/month

💲 Savings percentage: 100 * ($18400 - $5952) / $18400 ≈ 67.6%, or $12448/month

💡 With this setup, a tooling investment of, for instance, $6,000/month would fully amortize itself as of month one and drive a significant ROI of 107%.

Visualized Results

If we plot the data, using the X-axis to represent the number of code changes and the Y-axis to represent the number of environments, the savings-percentage chart for the three scenarios would look like this:

Adding a few more data points shows the link between the number of code changes, the number of promotion environments, and the savings from enabling Virtual Data Builds. The marginal utility starts to level off at around 68% cost savings.

Cost savings as a function of code changes and environments.

You can use the following calculator we developed to customize the parameters according to your own setup.

Virtual Data Builds ROI calculator.

What about dbt?

dbt has been instrumental in streamlining data transformations. However, it's worth noting that dbt’s execution model is very resource-intensive.

During the development of a new feature in a dev environment, you need to refresh all upstream models before integrating your changes, and additionally, you need to refresh downstream models to ensure that your changes don't cause unintended behaviors downstream.

Once you are prepared to merge those changes into higher environments (like prod), the downstream models must be run again, along with the models to which you made changes, to prevent any unexpected outcomes. This translates to utilizing double the compute resources instead of leveraging existing materializations.

Navigating the complexity of the defer feature

dbt introduced the --defer flag, which can be used in tandem with the state:modified selector to decrease execution time by selecting only the models that were modified (state:modified) and referencing upstream models from another build (using --defer --state /path/to/prod-run-artifacts). The command looks like this:

dbt run --target dev --full-refresh --select state:modified+ --defer --state /path/to/prod-run-artifacts

While this approach reduces the number of models being executed, users need to manually select the correct state file during the development cycle. Moreover, if the user's local state has changed and is no longer in sync with the referenced manifest file, dbt reprocesses the entire DAG.

The Y42 execution model overcomes these limitations, eliminating the need to manually select the appropriate manifest file. Y42 promotes the reuse of models across environments, automatically detecting which models have been modified, along with their downstream dependencies; these can be rebuilt with a single command:

y42 build --stale

Whenever you create a branch, promote a change, or initiate a rollback, the system automatically updates all asset references according to your code change and ensures you always work with the latest successful build: if the last run was unsuccessful, Y42 falls back to the most recent successful run of your build when reporting data. Moreover, as part of the out-of-the-box CI checks, Y42 prevents merging into main if the model is stale (indicating that either the last run was unsuccessful or the latest code change hasn't been materialized yet), or if the model tests have failed.

Conclusion

As we can observe from the analysis above, enabling the Virtual Data Builds mechanism results in a marginal increase in the storage bill, while the compute cost decreases significantly, for an overall reduction of 37-68% in combined storage and compute costs.

The more complex your setup is (i.e., a higher number of environments and code changes), the more savings you gain by adopting VDBs. Of these, the number of environments is the most impactful variable.

Moreover, the complexities found in a dbt setup – which involves high compute resource usage and the need for careful state file management – are avoided with a VDB configuration.

To recap, the adoption of VDBs allows you to:

  • Work with production data in your dev environment or feature branch without duplicating data or affecting the production environment, at no extra cost. Build your assets once and then reuse them across different environments and branches.
  • Cut total data warehouse costs by 37-68%, depending on your setup. For reference: 37% for one-man setups (50 models, 15 commits per month, 2 environments); 68% for large setups (3000 models, 600 commits per month, 5 environments)
  • Instant deployment and rollback of code changes. When merging into main, your asset's view definition is updated to reference the materialization (table) from your feature branch by swapping the materialization name within the view definition.
  • Reduced complexity. In contrast with a dbt setup, VDBs simplify state file management and resource allocation, reducing the possibility of errors and compute-resource overuse.
  • Reliable data. Using VDBs, the system always displays the last successful build of your assets, preventing end users from reporting on inaccurate data.
