When I first began my career as an analytics engineer, I was tasked with rewriting data models that lived in our business intelligence tool. When I started reading through them, I was shocked to find that they were impossible to understand, poorly organized, and badly written. No wonder some of them were taking over a day to run! It’s important to build your data models right the first time so that you never have to go through the process of rewriting them.

When building a high-quality data model, you need to keep modularity, readability, and speed in mind. Writing your models while prioritizing these three qualities will help you to write the best data models possible. The goal is to have them last forever, or as long as the business logic remains unchanged, so you can continue to focus on other areas of the company to improve the use of data.

Key components of a well-written data model

Modularity

When your model is modular, it can exist as separate pieces of code that can be reused and referenced in other places. This means you write less duplicated code, saving you both development time and computing power.

When writing modular models, you want to follow the concept of DRY (Don’t Repeat Yourself) code. This means you write as little code as possible while still performing the function you need: do not repeat code. Your logic stays concise, clear, and easy to read. DRY code and modular data models go hand in hand. When writing modular models, you need to ask yourself three questions:

What code is repeated in multiple models?

It’s helpful to do a full inventory of your data models before rewriting them, or, if you’re starting from scratch, to think about which elements repeat themselves. This will help you determine which pieces of code to turn into their own data models. For example, if you are mapping users to an anonymous browser cookie and using this mapping in all your models, that mapping should exist as a single model that can be referenced throughout your other models.

Once you’ve figured this out, you need to ask yourself, how can this repeated code be split up into its own model that can be reused in multiple models?

You need to figure out how a piece of code can be universal to multiple use cases. This could mean you need to remove any filters that are specific to one model, or that you need to include more columns that are used in some, but not all, of the models.

For example, if a subscriptions data model filters for only active users within its user mapping code, you would need to remove that filter from the user mapping model you are making modular. Then, when you reference the user mapping model in the subscriptions model, you can add the filter for active users there. This way, a marketing model that also uses the user mapping model, but needs it for all types of users, can still reference the same modular data model.
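
As a rough sketch, assuming a shared user_mapping model with a user_status column (these names, and the campaigns table, are illustrative), each downstream model applies its own filter on top of the shared mapping:

/* subscriptions model: references the shared mapping and keeps only active users */
SELECT
    user_mapping.user_id,
    subscriptions.subscription_id
FROM user_mapping
LEFT JOIN subscriptions
    ON user_mapping.user_id = subscriptions.user_id
WHERE user_mapping.user_status = 'active';

/* marketing model: references the same mapping with no user filter */
SELECT
    user_mapping.user_id,
    campaigns.campaign_id
FROM user_mapping
LEFT JOIN campaigns
    ON user_mapping.user_id = campaigns.user_id;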

Lastly, you need to ask yourself if this modular model can truly serve the same purpose as code already built directly into the model. Does making a piece of code from a larger model modular change the model’s outcome?

The goal is to create modular data models that don’t affect your code’s output. The resulting dataset should always be the same, no matter how the code is separated. If this changes the logic, maybe it’s not meant to be its own data model. You don’t need to force every piece of code to be its own model, only the code that is truly repeatable across all your business logic.

Let’s look at two different data models. The first maps a user’s first visit to the website back to all their personal information.

WITH users_mapped_to_cookies AS (
    SELECT
        sessions.user_id,
        browser_cookies.cookie_id,
        browser_cookies.first_url
    FROM browser_cookies
    LEFT JOIN sessions
        ON browser_cookies.event_timestamp
            BETWEEN sessions.session_started_at AND sessions.session_ended_at
),

mapped_users_joined_user_details AS (
    SELECT
        users_mapped_to_cookies.user_id,
        users_mapped_to_cookies.first_url,
        users.name,
        users.email,
        users.phone
    FROM users_mapped_to_cookies
    LEFT JOIN users
        ON users_mapped_to_cookies.user_id = users.user_id
)

SELECT * FROM mapped_users_joined_user_details

The second data model maps a user’s first page visit to their orders.

WITH users_mapped_to_cookies AS (
    SELECT
        sessions.user_id,
        browser_cookies.cookie_id,
        browser_cookies.first_url
    FROM browser_cookies
    LEFT JOIN sessions
        ON browser_cookies.event_timestamp
            BETWEEN sessions.session_started_at AND sessions.session_ended_at
),

mapped_users_joined_orders AS (
    SELECT
        users_mapped_to_cookies.user_id,
        users_mapped_to_cookies.first_url,
        orders.order_id /* order columns shown here are illustrative */
    FROM users_mapped_to_cookies
    LEFT JOIN orders
        ON users_mapped_to_cookies.user_id = orders.user_id
)

SELECT * FROM mapped_users_joined_orders

Instead of having this first user-to-cookie mapping CTE in both data models, we would pull the code out to create a separate model. Then, we would reference this new model in each of the other models’ code, like so:

/* user details model */
SELECT
    users_mapped_to_cookies.user_id,
    users_mapped_to_cookies.first_url,
    users.name,
    users.email,
    users.phone
FROM users_mapped_to_cookies
LEFT JOIN users
    ON users_mapped_to_cookies.user_id = users.user_id;

/* orders model */
SELECT
    users_mapped_to_cookies.user_id,
    users_mapped_to_cookies.first_url,
    orders.order_id /* order columns shown here are illustrative */
FROM users_mapped_to_cookies
LEFT JOIN orders
    ON users_mapped_to_cookies.user_id = orders.user_id;

And, don’t forget, the users_mapped_to_cookies code exists as its own data model!
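
For reference, that extracted model is simply the mapping logic from the original CTE, saved on its own:

/* users_mapped_to_cookies model: maps each user to their browser cookie and first URL */
SELECT
    sessions.user_id,
    browser_cookies.cookie_id,
    browser_cookies.first_url
FROM browser_cookies
LEFT JOIN sessions
    ON browser_cookies.event_timestamp
        BETWEEN sessions.session_started_at AND sessions.session_ended_at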

Readability

If code is truly readable, then others should be able to read it and understand exactly what it does without any resources except the code itself. If your colleague has to continually ask you questions about the code you’ve written, chances are it can be written more clearly. If you do find your code is unreadable, you’ll most likely have a good opportunity to optimize it for performance and cost as well.

Code can also be made more readable for non-technical users by utilizing tools such as data catalogs, descriptions, comments, and lineage. These all help the user understand the full context of the data model.

When writing readable code for technical users, there are three things to keep in mind:

Always comment on code that cannot be intuitively understood.

If a reader requires more knowledge to understand what your code is doing, put that knowledge in a comment. Extensive research often goes into understanding business logic and how to properly code your models. Make sure you capture this tribal knowledge in the comments of your code. This way, you will remember why you did something, and it will help explain to others why you wrote your code the way you did.

I personally like to comment on the meaning of values I’m filtering out, why I used a certain join, and even what the query is doing in the greater sense of the model. These are all things that can be difficult to understand when someone is reviewing code. By adding these comments, you are also ensuring your models aren’t dependent on the knowledge of the person who built them. This is particularly helpful when the person who wrote the original code leaves the business and you can no longer ask them questions.
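
As a small hypothetical illustration (the orders and refunds tables and the status code are made up), comments like these capture that knowledge right next to the logic they explain:

SELECT
    orders.order_id,
    orders.user_id,
    orders.order_total
FROM orders
/* LEFT JOIN so orders without a matching refund are still kept */
LEFT JOIN refunds
    ON orders.order_id = refunds.order_id
/* status 3 means the order was cancelled before shipping; finance excludes these from revenue */
WHERE orders.order_status != 3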

Use CTEs instead of subqueries.

Subqueries are notorious for making code hard to read. It’s difficult to understand their context in the larger model because they make the code messy and hard to follow. By using CTEs, you break up the code into smaller steps that produce their own output.

Smaller blocks of code make it easier to debug if needed. The only way to debug a subquery is by turning it into a CTE or pulling it out of the model entirely. If a piece of code can’t easily be debugged on its own without making changes, it shouldn’t be used in the first place.

People usually go for subqueries over CTEs because they think subqueries are more sophisticated. This isn’t the case. CTEs and subqueries have a similar runtime, but subqueries make your code more complex for no reason. Readable code is always superior to unnecessary complexity.
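
As a quick sketch (the order_total column and the aggregation are illustrative), here is the same logic written first with a derived-table subquery and then with a CTE:

/* subquery version: the aggregation is buried inline as a derived table */
SELECT
    users.name,
    user_order_totals.total_spent
FROM users
LEFT JOIN (
    SELECT
        orders.user_id,
        SUM(orders.order_total) AS total_spent
    FROM orders
    GROUP BY orders.user_id
) AS user_order_totals
    ON users.user_id = user_order_totals.user_id;

/* CTE version: the same step gets a descriptive name and can be run and debugged on its own */
WITH user_order_totals AS (
    SELECT
        orders.user_id,
        SUM(orders.order_total) AS total_spent
    FROM orders
    GROUP BY orders.user_id
)
SELECT
    users.name,
    user_order_totals.total_spent
FROM users
LEFT JOIN user_order_totals
    ON users.user_id = user_order_totals.user_id;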

Use descriptive names.

Lastly, for your models to be readable, you need to use descriptive names for tables, columns, and aliases within your CTEs. Those reviewing your code should understand exactly what they mean without having to search through the rest of your code or ask you. This will make your models easier to debug and understand. The more specific you are with your naming, the easier it will be to use your data models in the future.

For example, if you are joining the tables “users” and “addresses” in a CTE, you would want to name it “users_joined_addresses” instead of “user_addresses”. This tells the person reading your code that you are joining the two tables. “user_addresses” tells you that the users and addresses tables are being used but not how they are being used.

Now let’s look at an example of a well-written data model that’s very readable due to alias names, column names, and code comments.

WITH active_users AS (
    SELECT
        name AS user_name,
        email AS user_email,
        phone AS user_phone,
        subscription_id
    FROM users
    WHERE subscription_status = 1 /* status of 1 means a subscription is active */
),

active_users_joined_subscriptions AS (
    SELECT
        active_users.user_name,
        active_users.user_email,
        subscriptions.subscription_id,
        subscriptions.start_date,
        subscriptions.subscription_length
    FROM active_users
    LEFT JOIN subscriptions
        ON active_users.subscription_id = subscriptions.subscription_id
)

SELECT * FROM active_users_joined_subscriptions

You can read this model and understand exactly what it does because of the clear naming. When the naming isn’t so clear, the comment tells the reviewer exactly what the status of 1 is referring to.

Speed

One of the main purposes of writing data models is to speed up the delivery of datasets to your data warehouse. Having automated data models to produce datasets that are ready to be used by the business makes your data more reliable and available when needed.

Your data models are a lot less useful if they are slow and take hours to run. Oftentimes, this creates a bottleneck for business teams: they will be slowed down if they try to make decisions based on a dashboard powered by one of your data models that takes all day to refresh. Or worse, they won’t be able to use data to make their decision at all.

Speeding up your data models can be as simple as removing duplicates or filtering out null values at the beginning of your model. It can also be more complex, such as replacing complex code with window functions. As I mentioned at the beginning, breaking up your models into smaller, modular models also helps with this.
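
As a minimal sketch, assuming a raw events table with duplicate rows and null user IDs (all of these names are illustrative), that cleanup can live in the first CTE so every later step works with a smaller, cleaner dataset:

WITH cleaned_events AS (
    SELECT DISTINCT /* remove duplicate rows up front */
        events.event_id,
        events.user_id,
        events.event_type,
        events.event_timestamp
    FROM events
    WHERE events.user_id IS NOT NULL /* drop events that can't be tied to a user */
)

SELECT
    cleaned_events.user_id,
    COUNT(*) AS total_events
FROM cleaned_events
GROUP BY cleaned_events.user_id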

When I was rewriting long-running data models, I saw two functions being used a lot. First, I saw the TOP function being used with GROUP BY. This meant the code had to group all the values, order them, and then select the first value in each ordered group. This wastes a lot of computing power.

Instead, you can use the FIRST_VALUE() window function. This allows you to PARTITION your values instead of using a GROUP BY, and then order them within each partition. The window function will select the first value much more quickly than the TOP approach.

The following expression can be used to find a student’s highest test score:

FIRST_VALUE(test_score) OVER(PARTITION BY student_name ORDER BY test_score DESC)
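
In the context of a full query, that expression might look like the sketch below, assuming an illustrative test_scores table; SELECT DISTINCT collapses the output to one row per student:

SELECT DISTINCT
    student_name,
    FIRST_VALUE(test_score) OVER (
        PARTITION BY student_name
        ORDER BY test_score DESC
    ) AS highest_test_score
FROM test_scores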

I also see people using subqueries to help them calculate the sum or average of a value. As I mentioned earlier, you want to avoid subqueries because they make your code harder to read and debug. In this case, you can use an aggregate window function to replace a SUM or AVG calculated in a subquery. Simply specify the column you wish to group by after PARTITION BY and the function will calculate the aggregate for each group.

The following function can be used to calculate the average test score of each student:

AVG(test_score) OVER(PARTITION BY student_name)

Here, you don’t have to include ORDER BY, since order does not matter when finding the sum or average values.
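
Using the same illustrative test_scores table, the full query might look like this; each row keeps its own score alongside the student’s average:

SELECT
    student_name,
    test_score,
    AVG(test_score) OVER (PARTITION BY student_name) AS avg_test_score
FROM test_scores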

Write long-lasting and impactful data models

Well-written data models have the power to change how your business operates. They will allow your data team’s work to grow with the business rather than needing to be replaced as the business grows. SQL models written directly in platforms like Y42 are extensively documented through catalog and lineage features, making them easy to scale alongside the business. When you build your models with modularity, readability, and speed in mind, they become timeless and invaluable; they won’t have to be replaced every few months because they make it too difficult to build dashboards, create reports, or debug issues.

While writing your future data models with these points in mind is important, you also want to reassess the models you currently have in place. Do they fall short in any of these areas? Can you pull out pieces of code to create a modular data model that can be used in multiple places? Do you need to add comments to the code? These are all questions to ask yourself when you review what has been written in the past. Focusing on this now will help prevent future tech debt and keep your data ecosystem healthy.
