The recent advent of cloud data warehousing (such as BigQuery and Snowflake) has democratized working with large volumes of data for all types of companies. What used to be possible only for big corporations is now accessible to a wider audience thanks to pay-as-you-need cloud computing and storage solutions. These solutions tend to be more cost-effective and performant than the alternative of purchasing and maintaining an in-house data warehouse infrastructure.
As an early adopter of data warehouses who had already implemented BigQuery in my previous data analytics venture, I’ve always been excited to see the industry move toward more democratized and efficient data use. I was convinced that these innovations would help data teams move closer to the ultimate goal:
Having access to reliable, high-quality, and high-performing data (tables) at a low cost ensures that organizations make the best possible business decisions.
But as time progressed, the initial hype turned into a sober realization that this new wave of disruption was still in its infancy. Implementing a data warehouse wouldn’t bridge the gap to make data pipelines accessible, reliable, and production-ready. At least not on its own.
Just to clarify, a data pipeline is production-ready when it has been fully tested for errors, continuously monitored, and optimized for cost. If you can tick those boxes, then your pipeline is ready to be deployed and used by any user or tool.
Why are production-ready data pipelines a necessity though? Well, the use case for data has moved beyond ad-hoc reporting (i.e., creating reports as needed) to become a company’s lifeblood, which today includes key decision-making, business-critical machine learning, and operational applications. Data pipelines built ad-hoc inevitably break over time and lead to an overflow of fire-fighting requests and, ultimately, mistrust in data.
In an attempt to tackle this issue, there has been a surge in specialized data tooling in recent years, with tools that cater to each stage of the data pipeline (integration, transformation, orchestration, visualization, etc.). Combined with the data warehouse, this tooling forms the so-called modern data stack (MDS).
However, after interviewing and working with hundreds of data teams, from single data analysts starting out on their data journey to established teams at big corporations, we identified three core pitfalls with the current state of the data industry when it comes to building production-ready data pipelines: accessibility of the modern data stack, lack of data governance, and broken collaboration.
To build a decent data infrastructure and an automated data pipeline with the modern data stack, companies need to stitch together at least five different data tools. A typical data stack could look like this:
Modeling and transformation: dbt
Monitoring and observability: Monte Carlo
While these tools are truly best-in-class within their respective disciplines, gluing them together and maintaining them results in serious costs for an organization:
Companies need to hire highly sought-after (i.e., expensive and hard-to-find) data engineers to build and maintain the entire stack.
The process of researching, testing, purchasing, and eventually learning how to use each of the tools is a big time investment for data teams.
Paying for the managed version of each individual tool adds up.
Silos are inevitable, as each individual tool is operated by different teams or users.
Best practices are not baked into the tools, so you need to build workarounds.
This means that only highly tech-savvy companies with enough resources can leverage the full power of the MDS, which leaves it inaccessible to the majority of the market.
Data governance refers to the enforcement of rules that reduce the margin of error for working processes, transforming a team’s working style from reactive to proactive.
Without implementing proper data governance processes, the whole data architecture becomes sub-optimal at best and completely unmaintainable at worst. With a dispersed data stack like the MDS, keeping the whole tooling landscape in sync and accounted for becomes extremely difficult. Here are some of the resulting problems:
Lack of clear ownership: This leads to issues not getting fixed, blocking the entire organization.
A stack out of sync: The version control, asset ownership, access control, and docs for different stack parts are managed separately, often leading to contradictory, invalid states.
Expensive, unmaintainable code: A chain of thousands of inefficient SQL files increases your warehouse costs.
Undiscoverable and duplicated data: Lack of discoverability across the whole stack leads to duplicate metrics, business logic, and data across tools.
Access control unsolved: Data assets cannot be protected across the entire stack, which leads to serious downstream compliance and productivity issues.
Unreliable and useless data: This occurs due to disjointed data workflows.
Data has become indispensable for better decision-making in organizations. Introducing collaboration best practices fosters knowledge sharing and deep alignment, supporting the co-creation of high-quality data pipelines.
However, there’s still a huge gap when it comes to how cross-functional teams work together to deliver business value using data. Here are the issues we’ve witnessed firsthand that have truly crippled organizations’ data functions:
Data professionals lack business context: This leads to the slow turnaround of data products and multiple iteration cycles due to constant misalignment with business users.
Business users lack technical data context: This leads to unreasonable requests and friction with the data team.
Lack of access for end-users: Data needs to be accessible for every business function, such as marketing, sales, product, and HR. This is not currently possible with the MDS.
Shaky collaboration among data professionals: Collaboration among data engineers, data analysts, data scientists, and analytics engineers is still a somewhat unsolved problem. This is reflected in the lack of documentation and enforcement of documentation standards, among other things.
Tooling accessibility, data governance, and collaboration surface as problems for companies when data use cases and their respective complexity increase. Naturally, the business value of these use cases needs to be higher than the cost of designing and maintaining the processes to execute them. However, process complexity and cost grow disproportionately as companies add data use cases. And the more use cases they want to tackle, the more data professionals they need to hire.
So, as soon as data teams grow from one team member to two, three, or more, organizations are faced with convoluted lines of communication (as described by Brooks’s Law) and complex process dependencies.
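To make the growth in lines of communication concrete, here is a minimal sketch of the pairwise-channels formula n(n-1)/2 that is commonly used to illustrate this effect. It assumes that every team member may need to coordinate with every other one, which is a simplification, and the function name is hypothetical.

```python
def communication_channels(team_size: int) -> int:
    """Count distinct pairwise lines of communication in a team.

    Assumption: every member may need to talk to every other member,
    so the count is n * (n - 1) / 2.
    """
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


# Headcount grows linearly, while the number of channels grows much faster.
for size in (1, 2, 3, 5, 10):
    print(f"{size} people -> {communication_channels(size)} lines of communication")
```

Under this assumption, going from three to ten people takes a team from 3 to 45 potential lines of communication.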
In order to overcome these problems, as well as the accessibility, governance, and collaboration issues, implementing software and product management best practices becomes a necessity for companies with more than two people in their data team.
But why should we follow software engineering best practices? Simply put, they help create extremely high-quality and maintainable processes by minimizing failure, automating systems, and following certain failure recovery steps. Software engineering best practices include:
Git — version control
DRY (don’t repeat yourself) philosophy
Tests (minimize the chance of failure)
Observability (monitoring, alerts, rollbacks)
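As a rough illustration of how the DRY and testing practices above might combine in a data context, the sketch below defines each check once and reuses it across columns. It is not tied to any particular tool: the helper names (not_null, unique, run_checks) and the sample rows are hypothetical, and the checks assume column values are hashable.

```python
from typing import Callable

Row = dict[str, object]
Check = Callable[[list[Row]], list[str]]


def not_null(column: str) -> Check:
    """Build a reusable check that reports rows where `column` is missing or None."""
    def check(rows: list[Row]) -> list[str]:
        return [
            f"row {i}: {column} is null"
            for i, row in enumerate(rows)
            if row.get(column) is None
        ]
    return check


def unique(column: str) -> Check:
    """Build a reusable check that reports repeated values in `column`."""
    def check(rows: list[Row]) -> list[str]:
        seen: set[object] = set()
        failures: list[str] = []
        for i, row in enumerate(rows):
            value = row.get(column)
            if value in seen:
                failures.append(f"row {i}: duplicate {column}={value!r}")
            seen.add(value)
        return failures
    return check


def run_checks(rows: list[Row], checks: list[Check]) -> list[str]:
    """Run every check and collect all failure messages in order."""
    failures: list[str] = []
    for check in checks:
        failures.extend(check(rows))
    return failures


# Hypothetical sample data with one null and one duplicate key.
orders: list[Row] = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": None},
    {"order_id": 2, "customer_id": 12},
]

failures = run_checks(orders, [not_null("customer_id"), unique("order_id")])
for message in failures:
    print(message)
```

In a real pipeline, a non-empty list of failures would typically block the deployment or trigger an alert, which is where testing meets observability.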
However, data management efficiency is also a matter of architecture, and that’s where product management best practices come in. They help to center processes around human collaboration, enabling goal-setting alignment between cross-functional teams. These activities result in a scalable architecture, as goals and expected business value are not subject to constant change. The learnings to take from product management are:
Goal alignment through collaboration enhancement (ideation canvas, chat communication, code / no code, etc.)
Knowledge sharing (documentation, catalog)
Governance (access control, audit, lineage)
So how are current data management solutions navigating these efficiency dilemmas? The modern data stack only covers software engineering best practices, like data tests or monitoring. But it’s falling behind in terms of enabling collaboration, implementing governance practices, and offering things like data catalogs or data contracts. This means no modern data solution out there is taking advantage of the power of product management.
So, in order to achieve the desired goal of building production-ready data pipelines that deliver reliable and high-quality data to every downstream user or application, the data industry needs to embrace software, collaboration, and governance best practices simultaneously, as first-class citizens. A world where data, software, and product collide.
The question that remains is, who’s going to take the data industry there?