Testing is one of the most important practices in the engineering world. In software engineering, testing prevents bugs from being carried through to production. In data engineering, testing ensures data is captured and moved around properly. In analytics engineering, it ensures your data is high-quality and ready to be used by business teams.
Testing data models catches problems in your data before they manifest in downstream models. Without testing, issues can go undetected for days, weeks, or even months. I’ve run into scenarios where no data is being collected on the backend of a website due to expired tokens, leading to two weeks’ worth of lost data. I’ve also had tables that were outdated due to a schema or data type change at the source, causing stakeholders to use stale data. Testing is one of the only ways to ensure you are being proactive about your data rather than reactive.
As a key data governance practice, testing can be used to uphold high standards put in place to control access to data and mask customer PII. This helps keep the data within your data warehouse secure and accurate.
Testing at the source and model levels is key to covering all of your bases in terms of data quality. It’s also important to set up alerts that will notify the dataset’s owner (or whoever is responsible for data quality) when these tests fail. In this article, we will talk about the three components of good testing and the specific attributes of sources and models that you want to test.
Testing at the source, meaning the first point at which your data loads into your warehouse, is essential for creating a strong testing environment. You always want to test the data that loads directly from an external source or ingestion tool because this is the earliest point at which you can catch any problems.
Catching issues during this first data-loading step will allow you to fix them before they affect downstream data models or dashboards. It’s like a leak in your faucet — you’d want to catch it after a few drips rather than when the whole kitchen is flooded. Testing at the source will help you do just this.
Now, with testing, there are different attributes for you to look at. Some are on the table level, where the tests look at the characteristics of the table as a whole. Others are on the column level, where they look at the values in a certain column. Let’s discuss each of these.
The two main metrics that you should track at the table level are number of columns and freshness.
While the number of columns is pretty basic, it’s important to load all of the columns that you expect to see. If you find a column is missing, this can indicate an issue in your ingestion pipeline or a schema change for that external data source. Being alerted to a missing column will allow you to investigate why it is missing before downstream models that reference the missing column break.
In order to find the number of columns in a table using SQL, you can run the following query in Snowflake:

SELECT COUNT(column_name) AS number_of_columns FROM information_schema.columns WHERE table_name = 'AD_CAMPAIGNS'

This will return the number of columns in the table “ad_campaigns”. (Snowflake stores unquoted identifiers in uppercase, which is why the filter uses 'AD_CAMPAIGNS'.) You can use a query like this to compare the actual number of columns to the number you were expecting and ensure they match.
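The comparison can be sketched in Python. This example uses an in-memory SQLite database and a hypothetical three-column schema purely for illustration; in practice you would point your warehouse connector at information_schema.columns instead of SQLite's PRAGMA table_info:

```python
import sqlite3

# Hypothetical schema for illustration only; in production, query
# information_schema.columns through your warehouse connector.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ad_campaigns "
    "(campaign_id INTEGER, campaign_type_id INTEGER, campaign_created_at TIMESTAMP)"
)

EXPECTED_COLUMNS = 3  # the number of columns you expect to load

# PRAGMA table_info returns one row per column in the table.
actual_columns = len(conn.execute("PRAGMA table_info(ad_campaigns)").fetchall())

if actual_columns != EXPECTED_COLUMNS:
    raise ValueError(f"Expected {EXPECTED_COLUMNS} columns, found {actual_columns}")
print("column count check passed")
```

A check like this can run on a schedule after each ingestion job, failing loudly when a column disappears.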
As for freshness, this refers to how up-to-date the data in the table is. If data is ingested every day, the latest data row should have been ingested no more than 24 hours ago. If data is ingested weekly, the latest data row should have been ingested no more than seven days ago. While freshness varies depending on an organization’s needs, the most important thing is that the table’s freshness matches your expectations.
The easiest way to check for freshness is to look at a timestamp column that best represents when a row of data in the table was created. For example, for an events table, this would be “event_created_at”. For the “ad_campaigns” table we used in the example above, it would be “campaign_created_at”. Make sure this is a timestamp column and not a date!
Now, you can see how fresh a table is by looking at the most recent value in that timestamp column. You can use the following SQL query to find this:
SELECT MAX(campaign_created_at) AS latest_created_at FROM ad_campaigns
You can then use this value to determine the difference between that time and the current time, and compare this to your acceptable threshold.
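That comparison can be sketched in Python. The 24-hour threshold and the sample timestamp are illustrative assumptions; in practice, the latest timestamp would come from the query against your warehouse:

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: data ingested daily should be under 24 hours old.
FRESHNESS_THRESHOLD = timedelta(hours=24)

# Stand-in value; in practice this comes from MAX(campaign_created_at).
latest_created_at = datetime.now(timezone.utc) - timedelta(hours=3)

staleness = datetime.now(timezone.utc) - latest_created_at
is_fresh = staleness <= FRESHNESS_THRESHOLD
print(f"stale by {staleness}, fresh: {is_fresh}")
```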
There are a few key attributes you should test your columns for, depending on their purpose. First, always test the uniqueness of the primary keys in your table. Primary keys are supposed to be unique identifiers of rows in your table. If they are not unique within a table, you may have unexpected duplicate rows or an even deeper issue.
An easy way to check for this is using a simple COUNT function.
SELECT COUNT(campaign_id) FROM ad_campaigns
Now find the count of the distinct campaign ids.
SELECT COUNT(DISTINCT campaign_id) FROM ad_campaigns
If there are no duplicates, the two counts should be equal.
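The whole check can be run as a single query. This sketch uses an in-memory SQLite database with made-up sample rows to show the logic; the same SELECT works against your warehouse:

```python
import sqlite3

# Sample data for illustration; the value 3 is deliberately duplicated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_campaigns (campaign_id INTEGER)")
conn.executemany("INSERT INTO ad_campaigns VALUES (?)", [(1,), (2,), (3,), (3,)])

# Compare the total count to the distinct count in one pass.
total, distinct = conn.execute(
    "SELECT COUNT(campaign_id), COUNT(DISTINCT campaign_id) FROM ad_campaigns"
).fetchone()

duplicates_found = total != distinct
print(f"total={total}, distinct={distinct}, duplicates={duplicates_found}")
```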
Next, you need to test for NULL values. While NULL values aren’t necessarily a bad thing, they are when you are expecting a value. Testing for NULLs will prevent incorrect results in your downstream models. NULLs are infamous for messing up aggregations, so it’s best to catch this problem at the source.
A simple SQL query like this will give you the number of NULL values for a particular column. Note that you need COUNT(*) here: COUNT(ad_id) skips NULLs, so it would always return zero.

SELECT COUNT(*) FROM ad_campaigns WHERE ad_id IS NULL
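Here is the same NULL check as a runnable sketch, again using SQLite and made-up rows purely for illustration:

```python
import sqlite3

# Sample data for illustration; one ad_id is deliberately NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ad_campaigns (ad_id INTEGER)")
conn.executemany("INSERT INTO ad_campaigns VALUES (?)", [(10,), (None,), (11,)])

# COUNT(*) with an IS NULL filter counts the offending rows.
null_count = conn.execute(
    "SELECT COUNT(*) FROM ad_campaigns WHERE ad_id IS NULL"
).fetchone()[0]
print(f"{null_count} NULL ad_id rows")
```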
Testing your columns for accepted values is particularly helpful when values are supposed to map back to values in a mapping table. For example, the “ad_campaigns” table has a column named “campaign_type_id”. This column then maps to a “campaign_types” table which has two columns: “campaign_type_id” and “campaign_type_name”. Every value present in the “campaign_type_id” column in the “ad_campaigns” table must be present in the “campaign_types” table.
You can specify that the only accepted values for this column are 1–5 since those are the only values present in the corresponding mapping table. If a test were to find a “campaign_type_id” of 6, you would then be alerted to investigate. While these problems may not affect data quality initially, you will begin to see values that are unaccounted for, causing incorrect data down the line.
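One way to implement this kind of check is an anti-join that surfaces any values missing from the mapping table. The sketch below uses SQLite and invented sample rows (including a deliberately bad campaign_type_id of 6) to demonstrate the pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Mapping table with the five accepted campaign type ids.
conn.execute(
    "CREATE TABLE campaign_types (campaign_type_id INTEGER, campaign_type_name TEXT)"
)
conn.executemany(
    "INSERT INTO campaign_types VALUES (?, ?)",
    [(i, f"type_{i}") for i in range(1, 6)],
)

# Source table; campaign 101 references a type (6) that doesn't exist.
conn.execute("CREATE TABLE ad_campaigns (campaign_id INTEGER, campaign_type_id INTEGER)")
conn.executemany("INSERT INTO ad_campaigns VALUES (?, ?)", [(100, 2), (101, 6)])

# Anti-join: rows with no match in the mapping table are the violations.
unmapped = conn.execute("""
    SELECT DISTINCT a.campaign_type_id
    FROM ad_campaigns a
    LEFT JOIN campaign_types t USING (campaign_type_id)
    WHERE t.campaign_type_id IS NULL
""").fetchall()
print(f"unaccepted values: {unmapped}")
```

An empty result means every value is accounted for; anything else is a reason to investigate.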
Lastly, check your columns for the correct data type. If a column isn’t the expected data type, calculations or castings downstream can error out. It is common for columns to need to be a certain data type in order to perform a particular function on them.
I’ve also experienced issues where a timestamp column was a different type of timestamp than I expected (in Snowflake, for example, TIMESTAMP_NTZ versus TIMESTAMP_TZ). Surprisingly, this caused a discrepancy in the comparison statements I had in my SQL code. Now, I always write tests to check that a column is the right type of timestamp.
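A data type check can follow the same shape as the column count check. This sketch compares declared column types against an expected mapping; it uses SQLite's PRAGMA table_info for illustration, whereas in Snowflake you would read the data_type column from information_schema.columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ad_campaigns (campaign_id INTEGER, campaign_created_at TIMESTAMP)"
)

# The types you expect each column to have (illustrative).
EXPECTED_TYPES = {"campaign_id": "INTEGER", "campaign_created_at": "TIMESTAMP"}

# PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
actual_types = {row[1]: row[2] for row in conn.execute("PRAGMA table_info(ad_campaigns)")}

# Any column whose declared type differs from expectations is a mismatch.
mismatches = {col: t for col, t in actual_types.items() if EXPECTED_TYPES.get(col) != t}
print(f"type mismatches: {mismatches}")
```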
Intermediate and core data models are the models that transform your source data. It is important to not only test at the source but to test each model as well. This will help you pinpoint the exact location where an error occurred.
I often see people leave out testing in their intermediate models and only include it in the end model. This can be problematic because you don’t know exactly where the pipeline failed. Adding tests to intermediate models will allow you to identify the last model that ran successfully before the error, eliminating the need to check every previous model in the process.
Debugging is already a huge pain for analytics engineers. Testing the smallest pieces of code possible will help make debugging less of a hassle.
With intermediate and core data models, you will want to test many of the same attributes you tested in your source data. On the table level, you should test for the number of columns and freshness. On the column level, it’s a good idea to test for the uniqueness of primary keys, null values, accepted values, and data types.
Testing your primary keys for uniqueness is particularly important with intermediate and core models because of the joins that are often used. When incorrect joins are used in a model’s SQL code, this usually causes duplicate rows to be created. Testing for uniqueness will help you catch when the wrong join is used, allowing you to fix it before continuing to build the model.
Another way to ensure the proper join is being used in your models is to count the number of rows and columns in your table, as mentioned earlier. To take this a step further, compare these numbers with those of a model’s source tables. Comparing the count of rows between models and sources will help you better understand how the data is being transformed.
Should the rows have a one-to-one relationship? Or were more rows meant to be created when the source tables were joined? Asking yourself these questions will help you better understand your model and its purpose. This also ensures that you are testing for the right qualities.
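For a one-to-one relationship, this kind of check can be sketched as a row-count comparison between the joined model and its source. The table names, sample rows, and the one-to-one expectation below are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Source table and its mapping table, with sample rows for illustration.
conn.execute("CREATE TABLE ad_campaigns (campaign_id INTEGER, campaign_type_id INTEGER)")
conn.executemany("INSERT INTO ad_campaigns VALUES (?, ?)", [(1, 1), (2, 2)])
conn.execute(
    "CREATE TABLE campaign_types (campaign_type_id INTEGER, campaign_type_name TEXT)"
)
conn.executemany("INSERT INTO campaign_types VALUES (?, ?)", [(1, "search"), (2, "display")])

# If the join is one-to-one, the model's row count should equal the source's.
model_rows = conn.execute("""
    SELECT COUNT(*) FROM ad_campaigns a
    JOIN campaign_types t USING (campaign_type_id)
""").fetchone()[0]
source_rows = conn.execute("SELECT COUNT(*) FROM ad_campaigns").fetchone()[0]

print(f"model={model_rows}, source={source_rows}, match={model_rows == source_rows}")
```

A mismatch here is often the first sign that a join is fanning out and creating duplicate rows.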
What good is testing if you’re not alerted when your tests fail? You can implement as many tests as you want, but if you’re not notified about them failing, you have to manually check your models every day, regardless of whether something went wrong or not. When deciding on the types of tests to use on your data models, make sure you use those with strong alerting capabilities.
I highly recommend looking for tests that work with an alerting feature. You want to make sure you can set up alerts to notify the right people in case your tests fail. You should also make sure you can set alerts to notify you on the medium where you most often check your messages. Where does your team communicate back and forth? What do you look at most frequently? Personally, I’m always talking to my team through Slack. This is the best place to reach me during the work day, so it is also the best place for my testing alerts to be sent.
You’d ideally reduce as much friction as possible with any type of data quality initiative. You don’t want to have to remember to check a product’s UI every day to make sure everything is working as expected. If you haven’t received an alert, you should be able to assume that everything is working correctly.
It’s also important to set alerts to notify the person responsible for data quality, or the person who owns a particular data asset. Ensure this is being seen by the team that uses that asset most so they can plan accordingly. An alert sent to someone who has no context on that particular asset is essentially useless.
For example, if a Salesforce integration fails because the sales team added a new column to the dataset and now your dataset doesn’t meet the expectations you defined in a test, then the sales team should be notified to fix this. In this case, you are notifying the dataset’s owner, the person with the most business context. You can also choose to notify the analytics engineer, or whoever owns data quality, to fix an issue in the dataset. I recommend alerting both!
Platforms like Y42 allow you to create alerts and add descriptions and tags to keep them well organized. You can select from a few different types of triggers and set a rule strategy for what you want to be alerted of. Extensive features like these are key to creating a proactive data quality environment.
When you properly test your sources and models, you can be confident that downstream data models and dashboards are up-to-date and accurate when you make code changes upstream. This makes a team confident in the value they produce and unafraid to make changes. You know that if you make a wrong change, your tests will fail and alert you that there’s a problem to solve.
Remember, you want to test your data for all different types of attributes in order to get a clear picture of it and ensure it meets your expectations. Proper testing includes testing for uniqueness, nulls, data types, and accepted values on columns, as well as freshness and counts on tables. Implementing these tests is the next best step you can take to creating a high-quality data environment.
Madison is an analytics engineer with a passion for data, entrepreneurship, writing, education, and wellness. Her goal is to teach in a way that everyone can understand — whether you’re just starting out in your career or you’ve been working in engineering for 20 years. She is an avid writer on Medium and shares her thoughts on analytics engineering in her weekly newsletter.