Over the past decade, the data industry has seen an explosion of roles. What was once the space of SQL Developers, Analysts, and Architects has lately expanded to include Data Engineers, Analytics Engineers, Data Scientists, BI Engineers, ML Engineers, Platform Engineers, and many more roles.

In this article, we'll explore what this means for us both as an industry and as individual practitioners. Has the surge in specialized roles fragmented the workflow and, consequently, led to slower development cycles? How can we empower and give data practitioners more autonomy?

One option is upskilling, but it’s often slow and impractical. Just look at Slack's recent decision to halt business operations for a week to focus on internal training. This points to the need for solutions that balance learning with support from data tools and platforms, so data practitioners can work at their full potential. After all, not all companies can afford to pause business for a week to focus on training.

To address these challenges, we first need to understand what fueled this proliferation of data roles:

  • The importance of data analytics, and
  • Tool sprawl

Importance of data analytics

According to Gartner, the top areas of investment for 2023 include cyber and information security (66%), business intelligence/data analytics (55%) and cloud platforms (50%).

Leading CIOs are more likely to leverage data, analytics, and AI to detect emerging consumer behavior or sentiment that might represent a growth opportunity. The 55% figure for business intelligence and data analytics marks a significant increase from 2019, when combined investment in data analytics and business intelligence stood at just 28%, and from 25% in 2016. More investment in the analytics space attracted more VCs, eager to fund startups in the space and capitalize on the growing market. While this influx of capital seems beneficial in theory, it made matters worse for practitioners and buyers: you now have far too many options to choose from when building a data stack – to be more precise, 10x more tools than a decade ago.

Tool sprawl

The sheer number of tools has led the modern data stack to embrace modularity, allowing companies to pick and choose specialized tools that best suit their needs. However, this freedom to select 'best-in-class' tools for specific problems comes with serious drawbacks for practitioners: integrating these disparate tools is challenging, as each one comes with its own UI and requires specialized skills to use, maintain, monitor, and debug. This results in a) increased costs for hiring and training practitioners, and b) complex, hard-to-debug workflows.

More generally, the greater the number of individuals, tools, and processes involved in solving a problem, the longer your development lifecycle becomes, and with it the overall time it takes to extract value from the problem at hand.

Erik Bernhardsson talks about the pitfalls of excessive specialization in his 2021 article on finding the right level of specialization:

What are some drawbacks of specialization?

  • Resource allocation. If you have a chef who only chops onions, they are probably idle most of the time. That sounds bad! If they are more versatile, they can jump around and do a larger set of things, depending on what's needed at the moment.
  • Reduction of transaction cost. If every project involves coordinating 1,000 specialists, and each of those specialists have their own backlog with their own prioritization, then (a) cycle time would shoot up, with a lot of cost in terms of inventory cost and lost learning potential (b) you would need a ton more project management and administration to get anything done.
Erik Bernhardsson's tweet about the pitfalls of excessive specialization.

Shifting left roles

The shift left movement advocates for pushing testing as early as possible in the development lifecycle. It is the first half of the test early and often paradigm.

Here is the dilemma in software development: defects are expensive, but eliminating defects is also expensive. However, most defects end up costing more than it would have cost to prevent them. What if we extended this principle to data roles? Why wait for someone else to deploy a feature if you, as an analyst, have already validated the results, all tests pass, and the CI checks are green? Why wait for someone to codify a business rule when you already know how to translate it into SQL? What kind of guardrails do we need to establish to ensure this empowerment doesn't create additional burdens for others? Is the solution to technically upskill everyone, or is it a matter of developing better tools? Let's explore both options.

The full data value chain would also include an ML stream, composed of MLOps Engineers, Data Scientists, and ML Engineers, but for simplicity we will focus only on the core data platform, from extracting raw data out of sources up to building dashboards and data apps.

Upskilling

The data landscape is populated by a variety of roles, each with its own set of skills and tooling. Data engineers, analysts, analytics engineers, scientists, product managers, and business analysts all contribute to the data workflow. Yet this diversity brings a challenge: the range of tools is equally broad, spanning SQL, Python, Kubernetes, Tableau, Looker, dbt, Airflow, and R, to name a few. Mastering all of these tools, or even just a handful of them, is a tall order for any individual: it is both time-consuming and challenging.

Better tooling

Instead, the focus should shift to us, the data community, to develop tools that are more user-friendly and safe. Drawing upon our collective experiences, we should aim to create platforms with built-in guardrails for the most common scenarios. This will enable everyone to safely be more autonomous without causing bottlenecks or additional work for others.

Shifting left isn't about forcing everyone to learn something new and making them uncomfortable in their roles. It's about building the right tools so that everyone can accomplish their tasks with minimal dependencies on others and without creating additional work for them.

Risks of shifting left

Before we move forward with the concept of shifting left, let's consider some common cases where certain activities are restricted to a select few individuals.

  • Develop: Without a good understanding of the data model, you can run into issues like duplicating assets, writing unoptimized queries, or skipping tests altogether. There is also the challenge of configuring upstream dependencies in a development environment so you have data to work with.
  • Deploy: Deploying to production is risky because there's often no easy way to undo a change, assess how it affects other parts of the stack, or recover if it fails while backfilling. Techniques like blue/green deployments offer some safety but are not foolproof (a minimal sketch of the pattern follows this list).
  • Govern: It's not just about making changes; you also need to document and own your changes post-deployment. If things go south, your documentation and/or inline script comments should be clear enough for others to understand why the change was made. Access to models should also be secured: incorrectly setting permissions or failing to implement data masking policies can expose sensitive information and lead to security issues.
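Since the deploy point above mentions blue/green deployments, here is a minimal sketch of the pattern applied to an analytics table. It assumes a generic DB-API-style connection and SQL dialect as placeholders, not any specific product's API:

```python
# Minimal sketch of a blue/green-style swap for a table that dashboards read from.
# `conn` stands in for whichever warehouse client you use; names are placeholders.
from contextlib import closing

def blue_green_swap(conn, asset: str, new_build_sql: str) -> None:
    staging = f"{asset}__green"  # candidate build, hidden from readers until the swap
    with closing(conn.cursor()) as cur:
        # 1. Build the new version side by side with the one currently being served.
        cur.execute(f"CREATE OR REPLACE TABLE {staging} AS {new_build_sql}")
        # 2. Sanity-check the candidate before exposing it to consumers.
        cur.execute(f"SELECT COUNT(*) FROM {staging}")
        if cur.fetchone()[0] == 0:
            raise RuntimeError(f"{staging} is empty - aborting swap")
        # 3. Consumers query the view, so the cutover is a single metadata operation,
        #    and rolling back means repointing the view at the previous build.
        cur.execute(f"CREATE OR REPLACE VIEW {asset} AS SELECT * FROM {staging}")
```

As noted above, this is not foolproof: a build that passes a basic check can still be wrong, and the swap does nothing for failures that happen mid-backfill.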

What do we need?

We need one platform that can address the needs highlighted above:

  • Enables instant rollback of changes.
  • Offers a seamless path to deployment, without jumping through hoops such as updating metadata and schemas or backfilling data manually.
  • Helps you clearly assess the downstream impact of your changes.
  • Shows what has already been developed, so that while you develop you don't duplicate assets or add attributes and measures in the wrong place.
  • Provides real-time asset health status, so you can tell whether data is stale according to the asset's definition.
  • Offers built-in code formatting.
  • Provides usage and billing information at the query and asset level so you can optimize both.
  • Treats documentation as part of the development process rather than an afterthought, with clear attributes to fill in: column-level documentation, asset classification (verified, deprecated), ownership, notification settings, past comments, and the ability to view an asset's lineage history.
  • Simplifies the addition of data and unit tests.
  • Allows permissions to be defined at the asset level and cascaded through dependencies.
  • Prevents merging code changes that haven't been materialized yet in the data warehouse: if an asset's code definition has changed but was never materialized, that asset cannot be pushed to production or any other branch (a sketch of such a check follows below).
  • Retains an asset's last successful state in case of failure, so your stakeholders don't operate on incorrect or incomplete data.

Current tools often specialize in specific tasks like data ingestion, transformation, or data quality, and lack a holistic view of your stack. A data quality solution, for instance, might alert you when a transformation fails, but it cannot fix the issue; you have to go back to the transformation tool to address it. What we need is a platform that acts as a nervous system for analytics.

That means an end-to-end platform that provides complete visibility across the entire stack and enables you to express your logic through code. But more critically, a platform that ensures your data warehouse is in perfect sync with your codebase. This is the key to unlocking all of the capabilities listed above. By integrating your data operations with Git actions, your data warehouse becomes a live reflection of your codebase, always in sync with every Git operation you make.
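As a rough illustration of that sync guarantee, and of the "never merge an unmaterialized asset" guardrail from the list above, here is a minimal sketch that fingerprints each asset's code definition and compares it with the hash recorded at its last materialization. The helper names and hashing scheme are assumptions for illustration, not Y42's actual implementation:

```python
# Minimal sketch: detect assets whose current code has never been materialized.
import hashlib

def definition_hash(sql_text: str) -> str:
    """Fingerprint an asset's code definition."""
    return hashlib.sha256(sql_text.encode("utf-8")).hexdigest()[:12]

def unmaterialized_assets(code_definitions: dict,
                          warehouse_hashes: dict) -> list:
    """Return assets whose code in the branch differs from what was last built."""
    return [
        name for name, sql in code_definitions.items()
        if warehouse_hashes.get(name) != definition_hash(sql)
    ]

# A merge gate can refuse to promote a branch while this list is non-empty:
stale = unmaterialized_assets(
    {"orders": "SELECT * FROM raw.orders WHERE NOT is_test"},  # current code
    {"orders": "0123456789ab"},  # hash recorded at the last materialization
)
assert stale == ["orders"]  # code changed but never rebuilt -> block the merge
```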

Y42's Turnkey Data Orchestration Platform

Think of it as an operating system comparable to iOS or Android. You can install your own apps, set permissions, and the system has built-in guardrails to prevent misuse. It should be able to aggregate metadata from all these "apps," make sense of it, and take action—whether that's backing up data or analyzing usage.

In the analytics world, our apps are data assets. These aren't static tables; they're dynamic entities that evolve over time. They have dependencies and carry metadata that updates as new data flows in or as the logic changes.
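To make that concrete, here is a minimal sketch of how such an asset might be modeled as a first-class object, with ownership, dependencies, and a freshness SLA. The fields and names are illustrative assumptions, not Y42's actual schema:

```python
# Minimal sketch of a data asset as a first-class, evolving object.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class DataAsset:
    name: str
    owner: str
    depends_on: list = field(default_factory=list)     # upstream assets
    freshness_sla: timedelta = timedelta(hours=24)      # how stale is acceptable
    last_materialized: Optional[datetime] = None
    classification: str = "unverified"                  # e.g. verified, deprecated

    def is_stale(self, now: datetime) -> bool:
        """Stale if never built, or built longer ago than the freshness SLA allows."""
        if self.last_materialized is None:
            return True
        return now - self.last_materialized > self.freshness_sla

orders = DataAsset("orders", owner="analytics", depends_on=["raw_orders"],
                   last_materialized=datetime(2023, 10, 1, 6, 0))
print(orders.is_stale(datetime(2023, 10, 2, 12, 0)))  # True: 30h old, 24h SLA
```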

Right now, what's missing is this centralized nervous system for analytics: a control center that provides a unified view of your assets. Ananth Packkildurai captures this idea well:

The data community often compares the modern tech stack with the Unix philosophy. However, we are missing the operating system for the data. We need to merge both the model and task execution unit into one unit. Otherwise, any abstraction we build without the unification will further amplify the disorganization of the data. The data as an asset will remain an aspirational goal.

What's in it for me as an analyst?

Stepping back, the above features create a secure development environment. They offer a way to conduct A/B tests on production data in isolated environments without duplicating data. They allow merging of changes into the production environment and the ability to instantly revert if needed. They enable autonomous model development, keeping best practices in mind: code testing, formatting, understanding both upstream dependencies and downstream impact, and assessing the costs of any changes made.

What's in it for me as an analytics engineer?

The platform frees you from becoming a bottleneck for your analyst peers. You can focus on core models without worrying that actions taken by others will have a negative impact: any changes can be rolled back instantly and at no cost. When some pipelines fail, the system automatically restores to the last successful build of those models.

What's in it for me as a data engineer?

Say goodbye to 24/7 pipeline maintenance. With a managed platform, that aspect is covered, freeing you to think about the bigger picture: source integrations, shifting pipelines from batch to streaming, and implementing more holistic permission models rather than tagging individual columns, table by table.

Y42 - an all-in-one git-powered platform

Such a platform does exist and ticks all of the boxes:

  • Guardrails while developing: build assets against production data in a separate environment without duplicating data, lint SQL via SQLFluff, run column-level impact analysis, and understand a model's upstream dependencies.
  • Prevents deploying assets that haven’t been materialized yet, ensuring the codebase state is always in sync with the data warehouse state.
  • Instant deployment and rollback with Virtual Data Builds.

Y42 brings git to data. Every feature branch you work on becomes a zero-copy clone of your production environment. Whenever you change your code through git operations such as merging, reverting, committing, or creating branches, Y42 updates the data warehouse so that the codebase and the warehouse are always in sync.

This approach leverages the idea of data snapshots, which simplifies workflows. It extends earlier ideas showcased in Functional Data Engineering, where Maxime Beauchemin, creator of Airflow and Superset, argues that instead of using complex methods like Slowly Changing Dimension Type 2 (SCD2) to track changes, creating new snapshots is a more efficient and more understandable way to handle data. With storage costs continually decreasing, the same approach is starting to be applied in the transactional (OLTP) space as well, as noted by Pete Hunt:

It’s often better to add tables than alter existing ones. This is especially true in a larger company. Making changes to core tables that other teams depend on is very risky and can be subject to many approvals. This reduces your team’s agility a lot.

Pete Hunt's tweet about new best practices when it comes to OLTP.
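To make the snapshot idea concrete: instead of mutating a dimension in place, as SCD2 would, each run writes a brand-new, immutable table keyed by run date. A minimal sketch, using a generic `execute` callable as a stand-in for whatever warehouse client you use (all names are placeholders):

```python
# Minimal sketch of the snapshot pattern from Functional Data Engineering.
from datetime import date

def build_customer_snapshot(execute, ds: date) -> str:
    # A new, immutable table per run instead of updating the dimension in place.
    snapshot = f"analytics.customers_snapshot_{ds:%Y%m%d}"
    execute(
        f"CREATE TABLE IF NOT EXISTS {snapshot} AS "
        f"SELECT customer_id, status, plan, CURRENT_TIMESTAMP AS snapshot_at "
        f"FROM raw.customers"
    )
    return snapshot  # downstream models read the snapshot matching their run date

# Using print as a stand-in client simply shows the generated statement.
print(build_customer_snapshot(print, date(2023, 10, 1)))
```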

In the analytics space, Y42 implemented a mechanism similar to what Pete Hunt describes in the transactional space: [Virtual Data Builds](/blog/virtual-data-builds-one-data-warehouse-environment-for-every-git-commit/). Each change a user makes is automatically materialized as a new table in the data warehouse, and a data warehouse access layer reroutes the user to the appropriate materialization, determined by the current state of the codebase and the active branch.
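Conceptually, the rerouting can be pictured like the minimal sketch below: every code state maps to its own physical table, and a per-branch view points readers at the right one. The naming scheme and `execute` callable are illustrative assumptions, not Y42's actual mechanism:

```python
# Minimal sketch of branch-aware rerouting over per-commit materializations.
def materialization_name(asset: str, commit_sha: str) -> str:
    # Each code state gets its own physical table, named after the commit it was built from.
    return f"builds.{asset}__{commit_sha[:8]}"

def reroute(execute, asset: str, branch: str, commit_sha: str) -> None:
    physical = materialization_name(asset, commit_sha)
    # Readers on this branch always query the view; repointing it is metadata-only,
    # so checking out another branch or reverting a commit copies no data.
    execute(f"CREATE OR REPLACE VIEW {branch}.{asset} AS SELECT * FROM {physical}")

# Using print as a stand-in client shows the statement a merge into main might trigger,
# repointing main at a table that was already built and validated on the feature branch.
reroute(print, asset="orders", branch="main", commit_sha="9f3c2ab41d")
```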

Conclusion

We believe managing both data and code through git is the missing link to breaking role silos in today's analytics landscape. It empowers everyone to build and deploy pipelines and data products with confidence, thanks to the guardrails unlocked by a central system that manages data and code together. Git has stood the test of time as a solid foundation for any software project and offers the right level of protection against introducing undesired code changes. That's why we built the Y42 data platform around git interactions: any change you make in your codebase is reflected in the data warehouse. We extend the power of git to data, in addition to code.

This aligns with dbt's purple people philosophy, which identifies analytics engineers as the intersection between business (red) and engineering (blue) teams. However, our approach diverges by focusing not on upskilling, but on a robust platform built on top of git that simplifies data operations and provides the necessary guardrails for everyone to develop and deploy safely.

