Oh no! Your dbt model is failing. Not only are there a bunch of other models that depend on it, but there’s also an external reverse ETL tool that syncs at the same time each day, regardless of the model’s status.

You probably feel this pain very deeply if you’re an analytics engineer. Running downstream data models and external jobs that will have to be re-run anyway is causing you to waste resources like time, money, and computing power. You also need to alert the business not to use certain dashboards and reports because they are currently unreliable. On top of all that, of course, you also need to figure out why the model failed in the first place!

Data pipeline failures like these are most commonly caused by dependencies being improperly handled.

What is a dependency?

A dependency is something that requires another action to be completed before it can also be completed. In the data world, this is typically a type of job that relies on another job finishing. It could be a data model that needs to run after data is finished syncing or a reverse ETL job that runs once a data model has finished running.

If you are a dbt user, when you see a reference (denoted by {{ ref('') }}) within a dbt model, the model being referenced is a dependency of the current model.

Let’s look at this fb_spend_by_day model:

select
   month(created_at) as spend_month,
   year(created_at) as spend_year,
   created_at as spend_date,
   sum(spend) as spend
from {{ ref('stg_facebook_ads') }}
where spend != 0
group by
   created_at,
   month(created_at),
   year(created_at)

Here we can see that the model depends on another dbt model called stg_facebook_ads; in other words, stg_facebook_ads is a dependency of fb_spend_by_day.

The dependency problem

So, why do dependencies cause so many issues in data pipelines? Well, they cause errors, data gaps, and poor data quality when incorrectly handled. When a model that references another model starts running before the other one is finished, you have yourself a dependency problem.

Some warehouses, like Redshift, don’t support different tools querying the same data at the same time, and both systems trying to access it will error out. If, for instance, dbt is still transforming a data source and Census is trying to read from that same source, both tools will fail. The only way to fix this is to re-run the reverse ETL sync after the data model has finished running.

Other times, both data models will run, but because the upstream model hasn’t finished, the dependent model will only use the data that is present in the table at that moment. In this case, you will see records in the upstream model for a time period that don’t exist in the downstream model. You might be wondering why a source table has 200 more rows than the final core model, or why a certain primary key is present at the source but not in the core model, only to realize these gaps occurred because your dependencies were running at the same time.

Different pieces of the problem

There are two main reasons why dependency problems can go unnoticed: having cron jobs as the only run-time option and a lack of transparent data lineage.

Cron jobs as the only run option

Cron jobs are jobs that are scheduled to run at a set time (for example, every day at 6:00 a.m.). Rather than being triggered by the jobs before them finishing or failing, they run on the clock regardless. This is problematic when an upstream job takes much longer than usual to run: if it is still running when the job that depends on it is scheduled to start, gaps will be created in your data or your data warehouse will error out.

It’s important to have a dynamic system where whether a downstream job runs depends on the outcome of the models or syncs before it. For instance, if a data sync for your Facebook source fails, the fb_spend_by_day model shouldn’t run. If fb_spend_by_day takes longer to run than usual, then the Tableau dashboard that depends on it shouldn’t be refreshed until the model finishes.
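
dbt itself gives you a partial guard for the first scenario: if you declare the Facebook tables as a dbt source with a freshness check, a stale or failed sync can be caught (via dbt source freshness) before fb_spend_by_day runs. A minimal sketch, where the source name, schema, table, and _synced_at column are assumptions for illustration:

version: 2

sources:
  - name: facebook_ads              # assumed source name
    schema: raw_facebook_ads        # assumed schema the sync loads into
    loaded_at_field: _synced_at     # assumed timestamp column written by the sync tool
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}   # treat a sync more than a day behind as a failure
    tables:
      - name: ads_insights          # assumed raw table behind stg_facebook_ads

Running the freshness check as the first step of the job means a broken sync stops the pipeline instead of silently producing a half-empty fb_spend_by_day.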

dbt Cloud only uses cron jobs to run its data models, which is extremely limiting when using external tools like Airbyte, Fivetran, or Census. Ideally, you should be using an orchestration tool that is capable of scheduling jobs other than just cron.
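
To make the contrast concrete, here is a purely illustrative, tool-agnostic sketch of the two approaches. This YAML is not any real product’s configuration format, and the job names are made up:

# Time-based: the model runs at 06:00 every day,
# whether or not the Facebook sync has finished or succeeded.
time_based:
  fb_spend_by_day:
    schedule: "0 6 * * *"      # cron: minute 0, hour 6, every day

# Dependency-based: each step runs only after the step before it succeeds.
dependency_based:
  facebook_ads_sync:
    schedule: "0 5 * * *"      # only the first step needs a clock
  fb_spend_by_day:
    run_after: facebook_ads_sync
  tableau_dashboard_refresh:
    run_after: fb_spend_by_day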

Lack of data lineage

If you don’t have proper data lineage set up for all of your data models, there’s no way you’ll be able to see the dependency chain of models. You can only really see one dependency at a time when looking at a data model’s code. You can’t see the relationships between models, nor can you see the data syncs a staging model depends on.

DAGs (directed acyclic graphs), usually rendered as data lineage diagrams, help you see all the dependencies from the data source to the final data model. Some tools, like Y42, will even show you dependencies that exist outside of your data modeling environment, making it much easier for you to see your true dependencies across your entire data pipeline. This ability does exist in dbt in the form of exposures, but you need to set them up and do the hard work of tracing down all external dependencies manually. I like to think of this as “one and done” type of work, which it can be with the right tool.
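
For reference, a dbt exposure is just a YAML entry that names the external consumer and what it depends on. A minimal sketch for the Tableau dashboard from earlier, where the exposure name, URL, and owner are assumptions:

version: 2

exposures:
  - name: facebook_spend_dashboard    # assumed name for the Tableau dashboard
    type: dashboard
    url: https://tableau.example.com/facebook-spend   # assumed URL
    description: Daily Facebook Ads spend for the marketing team
    depends_on:
      - ref('fb_spend_by_day')
    owner:
      name: Analytics team            # assumed owner
      email: analytics@example.com

Once declared, the dashboard appears in dbt’s lineage graph downstream of fb_spend_by_day, so the external dependency is at least visible, even though dbt still won’t trigger the refresh for you.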

In addition to offering transparency into external dependencies, it’s an added bonus for data lineage tools to have column-level lineage. This way, you know exactly which column and data source a downstream field is being pulled from. This helps to ensure the correct data is being passed through every model when the transformations within your data models become complex. Unfortunately, dbt does not offer column-level lineage.

Luckily, there are tools like Y42 that are working on solving these problems while offering transparent tooling into your data stack. These features will help you make changes in your data pipelines while minimizing data gaps, errors, and data quality issues.

Y42's data lineage view.

Conclusion

If you want to minimize data pipeline issues and keep your data quality high, you need to have the right tools in place to avoid the dependency problem. If an orchestration tool can’t check all the boxes, make sure it at least provides you with comprehensive data lineage, enabling you to be as proactive as possible with your dependencies.

Instead of reacting to problems, you can prevent them from happening in the first place by building transparency into your stack. Now, before a model fails, you can check the dependencies and properly schedule them so that issues don’t occur when you push the model to production.
