
When I’m writing data transformations, I often notice an inconsistency along the way. Perhaps a column in a data source doesn’t follow the naming convention I put in place, or a timestamp hasn’t been cast to the right data type. Instead of waiting to fix these issues further down the line, I make the changes as I see them.

However, the next day, I notice that the production data pipeline is failing. A few of the downstream data models are broken because of that “simple” name change I made yesterday. Now, I need to scramble to change all the references to that column downstream in the failing data models.

Unfortunately, this can be quite common if you don’t have a tool that shows you the flow of your data. You’re forced to be reactive, waiting until something breaks in order to know what changes are needed. Luckily, data lineage tools can help us avoid this situation and be proactive about dependencies.

What is data lineage?

Data lineage shows the path your data follows as it moves through the pipeline. It highlights where the data comes from, how it is transformed, and which visualizations it ends up in.

Data lineage is helpful for knowing which data sources are used in a particular model and whether any other models are referenced in the process. You can use it to quickly spot dependencies. It builds transparency into your stack, a key quality when it comes to producing high-quality data.
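
To make this concrete, here is a minimal sketch of lineage as a dependency graph, assuming it is stored as a plain Python dictionary. The model and source names, and the helpers `upstream_of` and `sources_of`, are invented for illustration and do not come from any particular lineage tool.

```python
# Hypothetical lineage graph: each node maps to the nodes it reads from directly.
# All names are made up for illustration.
UPSTREAM = {
    "raw.orders": [],
    "raw.customers": [],
    "stg_orders": ["raw.orders"],
    "stg_customers": ["raw.customers"],
    "customer_orders": ["stg_orders", "stg_customers"],
}


def upstream_of(node, graph):
    """Return every node that `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def sources_of(node, graph):
    """Return the upstream nodes that have no parents of their own (the sources)."""
    return {n for n in upstream_of(node, graph) if not graph.get(n)}


print(sorted(sources_of("customer_orders", UPSTREAM)))
# prints ['raw.customers', 'raw.orders']
```

A real lineage tool would build this graph for you from your models; the point is only that "which sources feed this model" is a graph traversal.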

How to build transparency using data lineage

Data lineage helps to build transparency across your business when used correctly. Here are the top three reasons I recommend using a data lineage tool.

Data lineage shows exactly which data sources are being used in production models

Data lineage tools are perfect for showing you which data models depend on which data sources. Column-level lineage tools are even more powerful because they show you the exact columns upstream that generate columns in a downstream data model. This is especially helpful if columns have been renamed across different sources and models.
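
As a rough illustration of column-level lineage, the sketch below traces a downstream column back to the source column it was built from, even when it was renamed along the way. The mapping `COLUMN_PARENTS`, the helper `trace_to_source`, and every model and column name are hypothetical, and the sketch assumes the lineage contains no cycles.

```python
# Hypothetical column-level lineage:
# (model, column) -> the (model, column) pairs it is built from.
COLUMN_PARENTS = {
    ("stg_orders", "order_id"): [("raw.orders", "id")],
    ("stg_orders", "ordered_at"): [("raw.orders", "created")],
    ("customer_orders", "first_order_at"): [("stg_orders", "ordered_at")],
    ("customer_orders", "order_count"): [("stg_orders", "order_id")],
}


def trace_to_source(column, parents):
    """Follow a column upstream until reaching columns with no recorded parents.

    Assumes the lineage is acyclic, which holds for a normal pipeline.
    """
    direct = parents.get(column, [])
    if not direct:
        return {column}
    origins = set()
    for parent in direct:
        origins |= trace_to_source(parent, parents)
    return origins


print(trace_to_source(("customer_orders", "first_order_at"), COLUMN_PARENTS))
# prints {('raw.orders', 'created')}
```

Here `first_order_at` resolves to `created` in the raw table despite two renames, which is exactly the case where table-level lineage alone would leave you guessing.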

Column-level lineage tools are particularly useful when data engineers make changes that affect the ingestion of source data. They can reference the lineage of downstream data models to see what schema changes will have a major impact across the business. This will then allow them to work with the analytics engineer to minimize downtime for business users.

Lineage also lets analytics engineers prioritize documentation and testing for the most important data sources and models. Without lineage tools, you are guessing at what will have the biggest impact. With them, you can focus on the data that is driving the most business decisions.

Lineage tools can also be helpful for business teams because they allow them to see which data sources the data teams use most. I’ve seen a lot of outdated Google Sheets cluttering up data environments because nobody knows whether they are important or not. Being able to see the lineage of tables from source to insight will allow business users to weigh their importance while keeping their data assets clean and up to date.

Data lineage allows you to see the models being used by analysts in dashboards and reports

Data lineage tools give you insight into which data sources and data models are being used to power key dashboards and reports used by the business. This will help analytics and data engineers get a better idea of the models being used every day to make key business decisions. When everyone is aware of the most critical data sources, things like schema changes, data type casting, and renaming can become more strategic. This helps everyone understand the full impact and potential downtime involved.

When data and engineering teams understand how a data pipeline change impacts the business, they are more likely to communicate and be proactive about solutions. For example, if a data engineer changes a source schema without a data lineage tool, they may have no idea that it will break the marketing ROI dashboard. However, if the engineer has a lineage tool to see what depends on this source, they can notify business teams of the downtime.
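
The check described above can be sketched as a walk downstream from the changed source. This is a minimal example under stated assumptions: the graph `DOWNSTREAM`, the `dashboard:` naming prefix, and the helper `affected_dashboards` are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes that read from it.
# Dashboards are marked with a "dashboard:" prefix purely for this example.
DOWNSTREAM = {
    "raw.ad_spend": ["stg_ad_spend"],
    "stg_ad_spend": ["marketing_roi"],
    "raw.orders": ["stg_orders"],
    "stg_orders": ["marketing_roi", "daily_revenue"],
    "marketing_roi": ["dashboard:marketing_roi"],
    "daily_revenue": ["dashboard:finance_overview"],
}


def affected_dashboards(changed_node, graph):
    """Breadth-first walk downstream, collecting every dashboard that is reached."""
    seen = {changed_node}
    queue = deque([changed_node])
    dashboards = []
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            if child.startswith("dashboard:"):
                dashboards.append(child)
    return sorted(dashboards)


print(affected_dashboards("raw.ad_spend", DOWNSTREAM))
# prints ['dashboard:marketing_roi']
```

With a list like this in hand, the engineer knows which business teams to notify before the schema change ships.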

Open communication helps to build trust between data and business teams, which is essential for a data-driven organization. Better choices can be made when both teams trust each other. The business will have faith in the data stack and be able to easily verify the results it sees. What’s more, the data team can confidently make changes to the pipeline knowing they won’t break things, helping to drive the business ahead.

Data lineage sheds light on data bottlenecks

Data lineage tools help you optimize your data pipelines by drawing attention to the bottlenecks that occur as data flows from its source to the final product. Instead of wondering why certain transformations take so long to run, or why they cost so much money, you can see the exact flow of your data.

For example, you may notice that two data models are taking unusually long to run given the size of the data and the complexity of the query. You look at the lineage for these models and notice they reference the same source dataset. That shared source is the first place to look, and inspecting it will often reveal the issue.
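
One way to picture this investigation is as a set intersection over the upstream nodes of the slow models. The sketch below is illustrative only: the graph, the run times, the 600-second threshold, and the helpers `ancestors` and `shared_upstream` are all assumptions rather than output from a real tool.

```python
# Hypothetical lineage: each node maps to the nodes it reads from directly.
UPSTREAM = {
    "raw.events": [],
    "raw.users": [],
    "raw.payments": [],
    "stg_events": ["raw.events"],
    "sessions": ["stg_events", "raw.users"],
    "funnel": ["stg_events"],
    "revenue": ["raw.payments"],
}

# Hypothetical last run time per model, in seconds.
RUNTIME_SECONDS = {"sessions": 1800, "funnel": 1500, "revenue": 40}


def ancestors(node, graph):
    """Return every node upstream of `node`, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def shared_upstream(models, graph):
    """Return the nodes that sit upstream of every model in `models`."""
    ancestor_sets = [ancestors(m, graph) for m in models]
    return set.intersection(*ancestor_sets) if ancestor_sets else set()


# Treat anything slower than 10 minutes as slow (an arbitrary threshold).
slow_models = [m for m, secs in RUNTIME_SECONDS.items() if secs > 600]
print(sorted(shared_upstream(slow_models, UPSTREAM)))
# prints ['raw.events', 'stg_events']
```

The shared nodes are candidates, not a diagnosis; they tell you where to start looking.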

Lineage tools allow you to understand how the dependencies between different parts of your pipeline work and how you can improve them. Seeing how your data flows through different models makes it easier to identify which models can be optimized. This allows you to create a faster, cheaper, and more reliable data pipeline.

Data lineage as a proactive tool

Before you update one of your source data models, you should always consult your data lineage tool. Look to see what downstream data models are using that source. Which columns from that source are being used? This is where a column-level lineage tool, like the one offered by Y42, comes in handy. Once you understand how changes to your data sources impact downstream models and dashboards, you can feel confident in making them.
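
The two questions above can be sketched as simple lookups, assuming you have a record of which source columns each model reads. The structure `MODEL_READS`, the helpers `models_reading` and `columns_in_use`, and all names are hypothetical; this is not how Y42 or any specific tool exposes lineage.

```python
# Hypothetical record of which (source, column) pairs each downstream model reads.
MODEL_READS = {
    "stg_orders": {("raw.orders", "id"), ("raw.orders", "created")},
    "daily_revenue": {("raw.orders", "amount"), ("raw.orders", "created")},
    "stg_customers": {("raw.customers", "id")},
}


def models_reading(source, column, model_reads):
    """Return the models that read `column` from `source`, sorted by name."""
    return sorted(
        model for model, cols in model_reads.items() if (source, column) in cols
    )


def columns_in_use(source, model_reads):
    """Return the columns of `source` that at least one model reads."""
    return sorted(
        {col for cols in model_reads.values() for src, col in cols if src == source}
    )


# Before renaming raw.orders.created, list the models that would need updating.
print(models_reading("raw.orders", "created", MODEL_READS))
# prints ['daily_revenue', 'stg_orders']
print(columns_in_use("raw.orders", MODEL_READS))
# prints ['amount', 'created', 'id']
```

An empty result from `models_reading` would suggest the rename is safe, at least as far as the recorded lineage goes.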

Now, when I rename a column in my data source, I always use a lineage tool to see how the change propagates downstream. As a result, I no longer get unexpected production pipeline failures the next time my models are scheduled to run.

Data lineage tools are powerful because of the transparency they bring to your data stack, your data team, and your collaboration with business stakeholders. Making changes is no longer a guessing game, because you can see their impact before you make them.

