Learn how to stop bad data from going live with CI/CD.

Register for masterclass
hello
Find out the most common mistakes made when migrating to data warehouses like Snowflake and the best practices you must follow in order to avoid them.

I still remember my first data migration like it was yesterday. We needed to switch from one region of Snowflake to another, in order to take full advantage of an external tool we were using. We already had a ton of data being used in the one region, with all of our production pipelines pointing there.

Data warehouse migrations are often stressful due to the time they take and the quality of your data being at stake. However, they are often necessary (who wants to migrate data if they don’t absolutely have to?).

Maybe you are moving from on-premise to the cloud. Or, you could be changing data warehouse providers. You may even be migrating for the same reason I first did - to integrate your warehouse with a specific tool.

When I first migrated data from one Snowflake region to another, there were many checks and balances I implemented along the way. It’s better to move slowly and take the time to understand each best practice, as it's all of your data on the line. That being said, there were still mistakes that I learned along the way.

Here are the most common mistakes made when migrating data warehouses like Snowflake and the best practices you must follow in order to avoid them.

1. Not making copies of your data.

The number one mistake I see when people migrate from one data warehouse to another is not making copies of your data. When migrating, there is always a large chance of data loss.

Assume the worst and hope for the best.

This means you need to think about what you would need if you did lose all of your data. How can you fix the issue of data loss given this happens? You can make copies in advance.

Having copies of all of the raw data that you are migrating will allow you to compare the original data to the data in your new data warehouse. This will ensure there are no gaps in your data that you may have missed during the migration process. And, in the case that there is, it will allow you to easily debug what went wrong and how you can mitigate future chances of error.

2. Not considering the best process for the warehouses you are migrating to and from.

Every data migration process will be a bit different depending on what warehouses you are migrating between. If you are migrating from on-prem to the cloud, for example, your strategy will be much different than migrating from one cloud data warehouse to another. It may also vary depending on the specific data you are trying to migrate, like PII data.

In my case, I was migrating from one Snowflake region to another. It made the most sense to use internal Snowflake tooling for this rather than an external data migration tool. After all, Snowflake understands Snowflake the best! Snowflake’s Replication/Failover feature was created for that specific use case, so it make the most sense to use that.

Make sure you heavily research your different options before diving straight into a migration. This will give you the highest success rate possible and minimize any bumps in the road along the way.

3. Migrating all of your data at once.

Many people think you should migrate all of your data at one time, then wait until it’s all in the new warehouse to begin validation. This idea could not be more wrong. You want to move your data as slowly as possible, checking it for quality along the way.

Migrating slowly will allow you to catch any mistakes in the process and then apply your learnings from those to the migration of the next chunk of data. Moving your data all at once is overwhelming and makes it harder to pinpoint certain problems.

When deciding which pieces of data to move first, I recommend starting with the lowest-risk tables. This way, if something goes wrong, it’s not the end of the world since those tables aren’t used as frequently. Starting with ones that don’t impact the business will help you keep quality high for the tables that matter most.

Once you’ve mastered the process, and can ensure your data is correctly making it to its destination in the way that you expect, move on to your important core tables. While working through core tables, start with the base of the dependency chain and move your way through to the end.

4. Not utilizing different environments and not keeping both data warehouses active for some time.

When migrating your data from one warehouse to another, you may be tempted to point all production pipelines to the new warehouse right away. However, this increases the chance of errors and makes it way harder to debug data quality issues. Imagine switching everything over and then stakeholders across the business start reporting fishy patterns in the data. That would be a nightmare because you wouldn’t even know where to start!

I recommend keeping both data warehouses running for some time, until you can validate that data is flowing into the new data warehouse as expected and that the historical data between warehouses also matches. This means you will have to duplicate your current data pipeline so that one points to production in your old data warehouse and the other points to development in your new data warehouse.

When duplicating your pipeline, the same data syncs and data transformations will be happening twice, but within two different warehouses. I know what you’re thinking… this sounds expensive and load-intensive. You’re right! But you know what’s more expensive? The business not having the data it needs because something went very wrong in the data pipeline. I can’t think of much worse than stakeholders getting on your case while you are trying to debug every data record in every schema of your databases. Even worse, imagine having to redo the entire migration because of a process going wrong from the very start.

You don’t have to have the two data pipelines running simultaneously in their entirety. Set the second pipeline up for only a few select database sources, or a few schemas in one source if you have something like a Postgres database being synced. Once you’ve validated that the pipelines produce the same result in both the new data warehouse and the old one, you can reconfigure the old one to no longer sync this source and its dependencies, and switch entirely to the new one.

5. Not validating your data post-migration.

Validating your data between the two data warehouses allows you to identify gaps in the data. Is data from an entire day missing? Were records where a certain field is NULL not loaded? Comparing the results of two seemingly identical tables will give you full confidence in the success of your migration.

Unfortunately, you can’t assume that everything is perfect. There will be slight differences between different data warehouse vendors that you need to account for. But first, you need to find what these are. To do this, you can compare counts and exact records.

Comparing counts of tables is fast and effective. It gives you a good idea if all records made it to the new warehouse, allowing you to pinpoint big issues right away.

This can be as simple as an aggregated count query:

SELECT 
	data_trunc(‘year’, created_at) AS year_created, 
	count(*) AS n_records
FROM dev.raw_data.accounts 
GROUP BY data_trunc(‘year’, created_at)

The one problem with using counts is that it does not highlight differences in fields within the same records.

In order to understand how records within a field compare to one another you can use table hashes. This is also fast and will tell you the count of records whose fields are all the same, rather than if they simply exist. However, this won’t tell you which records and values are different, making it difficult to correct the issue.

Lastly, you can compare record hashes, which will give you all of the data you need to understand the differences between specific records and their fields. However, this is extremely slow and not ideal for a huge database migration.

For the best of all of these methods, I highly recommend checking out Datafold’s data-diff, an open-source tool to compare records within two data warehouses. This does a great job of finding the differences between records in a short amount of time. Y42 automatically provides data-diff at the asset level with each pull request submission, integrating it into the CI checks.

Now, after doing this, if you do find differences (and you always will), you need to figure out why this is. Remember what I said about different data warehouse vendors - they will all handle data differently. When you find the differences, see if you can identify a pattern. Then, compare that pattern to what is normal for each data warehouse. This will help you discover the expected differences between the same data stored in different warehouses.

Conclusion

Migrations don’t have to be stressful if you prepare ahead and take the process slowly. Make sure you give yourself enough time to switch between data warehouses, ensuring contracts aren’t expiring soon or there are no other large data engineering projects planned.

Data migrations are a marathon, not a sprint. Luckily, you are taking the first step you need to be successful - learning what not to do. If you avoid these mistakes, and follow the best practices outlined, you should protect yourself against the worst-case scenarios. Good luck!

Category

Data Insights

In this article

Share this article

More articles