Learn how to set up your Snowflake database environments, create a dbt-specific role and assign it to the appropriate users, use table clustering, and choose the appropriate materialization for your models.

Why use dbt with Snowflake?

When I first started out as an analytics engineer, my data warehouse of choice was Snowflake, which I used alongside my favorite data transformation tool, dbt. Together, these tools were powerful. They allowed me to quickly query my data, automate my data models with strong data governance practices in place, and scale up any intensive loads.

You can’t go wrong with using Snowflake as your data warehouse and dbt as your data transformation tool. Both tools are intuitive and seamless to use, making it easy to provide high-quality data to the right people.

dbt allows you to specify databases, schemas, users, and roles directly in your project, making it easy to work directly with Snowflake. It also provides built-in configurations to specify clustering policies, materializations, and data masking policies as a model is built. All of these features together already create a solid data stack that you can control within one tool.

After using Snowflake and dbt together for almost two years, I learned a few tricks that help you get the most out of both tools. In this article, we’ll discuss how to set up your Snowflake database environments, create a dbt-specific role and assign it to the appropriate users, use table clustering, and choose the appropriate materialization for your models.

Defining a development and production database

If you follow software engineering best practices, then you’re probably familiar with the idea of setting up different environments. With analytics, you can argue if you need a staging environment, but you most definitely need to set up a development and production environment.

Development is where you first write and test code changes in your dbt models. Production is where they are scheduled to run daily, hourly, you name it, using a platform with orchestration abilities, like Y42.

In Snowflake, create two databases within your data warehouse: db_dev and db_prod. You can name these however you see fit, as long as it's clear which is which.
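
Creating them only takes two statements:

create database db_dev;
create database db_prod;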

dbt allows you to easily specify which database to use in each of your environments using the profiles.yml file. Here, you specify two targets: one for development (your local environment) and one for production (your scheduling tool).

Then, for each target, you specify the name of the database that you want the data to be sent to. Here you can also specify your data warehouse credentials and a default schema in case a schema is not defined.

madison:
    target: dev
    outputs:
        dev:
            type: snowflake
            account: [account id]
            # User/password auth
            user: madison
            password: [password]
            role: transformer
            database: db_dev
            warehouse: [warehouse name]
            schema: dbt_dev
        prod:
            type: snowflake
            account: [account id]
            # User/password auth
            user: y42
            password: [password]
            role: orchestrator
            database: db_prod
            warehouse: [warehouse name]
            schema: dbt_prod
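
With profiles.yml in place, dbt uses the default target (dev here) unless you tell it otherwise, so local runs write to your development database while your scheduler explicitly selects prod:

# local development run (uses the default dev target)
dbt run

# production run, e.g. triggered by your orchestration tool
dbt run --target prod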

You'll also want to specify which schemas dbt should write your data to. dbt lets you set the schema based on the file structure of your dbt project, which is ideal because your development and production schemas will then match, making testing a breeze.

In dbt, you specify your schemas in the dbt_project.yml file. When defining these, you want to mimic the structure of your project. Start with the project name you declared at the top of your dbt_project.yml; in my example, that's madison.

Next, you'll specify the same folder structure you use in the models directory. I always have a staging, intermediate, and marts folder, so I would write those out. Below each of these, I can then list any subdirectories, but you only need to do this when a subdirectory uses a different schema than its parent.

# ======
# Project Details
# ======
name: 'madison'
require-dbt-version: ">=1.6.0"
config-version: 2
profile: 'madison'

# ======
# File Path Configs
# ======
model-paths: ["models"]
test-paths: ["tests"]
macro-paths: ["macros"]
target-path: "target"
clean-targets:
    - "target"
    - "dbt_packages"

# ======
# Model Configs
# ======
models:
    madison:
        staging:
            # defaults if not declared more specifically
            +schema: 'staging'
        marts:
            marketing:
                +schema: 'marketing'

As you can see, in order to specify the schema name in Snowflake, you would write:

+schema: <schema name> 

In my example dbt_project.yml, all models in models/staging would write to the schema staging. All models in models/marts/marketing would write to the schema marketing.
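
One thing to double-check: out of the box, dbt's generate_schema_name macro prefixes a custom schema with the target schema, so a dev run would actually land in dbt_dev_staging rather than staging. If you want the plain schema names shown above in both environments, you can override the macro in your project. A minimal sketch, placed in macros/generate_schema_name.sql:

{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

With this override in place, the schema you configure is used as-is in every target.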

Now, when you run one of these models, in dev or prod, you will be able to find it in that schema within Snowflake, in the database corresponding to the target you used.

Set up a dbt Snowflake role and users

dbt allows you to specify the user and role you wish to use when running dbt with Snowflake directly within your profiles.yml. This allows you to maintain tight data governance, ensuring dbt only has access to the resources it needs in Snowflake, and nothing more.

I recommend setting up two different roles for dbt: one for working with dbt in development (transformer) and another for working with it in production (orchestrator). This way, you will never accidentally write to your production database while working with dbt locally.

You can do this in Snowflake by running the following commands:

grant all on database db_dev to role transformer;
grant all on database db_prod to role orchestrator;
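
These grants assume the two roles already exist; if they don't, create them first:

create role transformer;
create role orchestrator;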

Both of these roles also need to read from the database where your raw data is stored. If the database that contains this data is called raw, the grant commands giving the transformer role permission to read from it would look like this:

grant usage on database raw to role transformer; 
grant usage on future schemas in database raw to role transformer; 
grant select on future tables in database raw to role transformer;
grant select on future views in database raw to role transformer;
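
Two additions you may also need: the same grants on existing objects (the future grants above only cover objects created after the grant), and usage on a virtual warehouse so each role can actually run queries. Assuming the warehouse referenced in your profiles.yml is called transforming (swap in your own name):

grant usage on all schemas in database raw to role transformer;
grant select on all tables in database raw to role transformer;
grant select on all views in database raw to role transformer;

grant usage on warehouse transforming to role transformer;
grant usage on warehouse transforming to role orchestrator;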

Keep in mind that the exact permissions you give each role may look different depending on your company’s data governance policies.

In addition to roles, you also need to create users. I recommend creating a Snowflake user for each person who works with Snowflake and dbt directly. This will typically be your analytics engineers and data engineers.

You'll also want to create a user for every tool you are using. To use dbt in your local development environment, you can create a user with your own name. For production, I recommend naming the user after the tool that orchestrates your data models.

To do this, run the following commands in Snowflake:

create user madison password='abc123'
    must_change_password = TRUE;

create user y42 password='abc123'
    must_change_password = FALSE;

Here I am setting must_change_password to true or false depending on the type of user. For tools like Y42, I recommend generating a secure password and storing it securely (for example, in a password manager). For the people on your team, the most secure option is letting them create their own passwords.

You’ll also need to assign each user an appropriate role. Here, we will assign Madison the role transformer and Y42 the role orchestrator.

grant role transformer to user madison;
grant role orchestrator to user y42;
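
Optionally, you can also make these roles each user's default, so sessions that don't specify a role pick up the right one (dbt will still use whatever role you set in profiles.yml):

alter user madison set default_role = transformer;
alter user y42 set default_role = orchestrator;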

After defining your roles in Snowflake, you are ready to use them in your profiles.yml! We will use TRANSFORMER as the role and MADISON as the user for the dev target, and ORCHESTRATOR as the role and Y42 as the user for the prod target.

madison:
    target: dev
    outputs:
        dev:
            type: snowflake
            account: [account id]
            # User/password auth
            user: madison
            password: [password]
            role: transformer
            database: db_dev
            warehouse: [warehouse name]
            schema: dbt_dev
        prod:
            type: snowflake
            account: [account id]
            # User/password auth
            user: y42
            password: [password]
            role: orchestrator
            database: db_prod
            warehouse: [warehouse name]
            schema: dbt_prod

Now, when each of these targets is used, the corresponding Snowflake role and user will be used to access the data in the warehouse. This also lets you keep a close eye on the cost of running dbt in development, in production, and as a whole.
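
As a rough way to compare development and production usage, you can group Snowflake's query history by role. The query below is a sketch against the snowflake.account_usage.query_history view (it requires access to the SNOWFLAKE database, and total elapsed time is only a proxy for credit consumption):

select
    role_name,
    warehouse_name,
    count(*) as query_count,
    sum(total_elapsed_time) / 1000 as total_query_seconds
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -30, current_timestamp())
    and role_name in ('TRANSFORMER', 'ORCHESTRATOR')
group by 1, 2
order by total_query_seconds desc;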

Use table clustering on high-volume tables

Table clustering is a technique used to increase the performance of tables with high volumes of data. Snowflake is unique in that it can apply clustering to these tables automatically.

Instead of manually specifying this on every query or data model run within Snowflake, dbt lets you configure it on the model itself, so the data is optimized as the model is built.

You simply specify a special config called cluster_by, either within the model itself or in its YAML documentation. If you want to apply this to all models in a specific directory, you can even specify it within the dbt_project.yml file, as we did for the schemas (example below).

The in-model config for table clustering would look something like this:

{{ config(
    materialized='incremental',
    cluster_by=['session_id']
) }}
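
To apply clustering to a whole directory instead, the dbt_project.yml equivalent would look something like this (only worthwhile when the models in that directory share a clustering key):

models:
    madison:
        marts:
            +cluster_by: ['session_id']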

The field that you use in the cluster_by config will be the field that determines how your end table is sorted. Keep in mind that this can also include multiple fields rather than just one.

When this config is specified, dbt helps to take some of the load off of Snowflake’s automatic clustering functionality by ordering the data within the model execution itself. Because clustering is an expensive operation that should be used sparingly, dbt helps to lessen the load and make it more affordable.

Choose the correct dbt materialization

When using any data warehouse, Snowflake included, it’s important to be cognizant of how much you’re spending. Storage costs and computing costs are two things that you want to keep close tabs on, especially with a powerful warehouse like Snowflake.

Storage costs in Snowflake depend on how much data you are storing there. Because of this, it’s key that you understand dbt’s different materializations and how to use them to make the most of data warehouse storage.

Here are some rules I recommend following when writing your dbt models:

Always default to views

Because you must have a copy of your raw data somewhere, this will be stored in tables within your data warehouse. There is no way to get around paying for the storage costs of your raw data tables. However, you can be more strategic with the data that sits on top of those.

Views are transformations that sit on top of your raw data. They don't actually store any data; they are simply saved queries that read from the tables underneath whenever they are referenced. In dbt, it's a best practice to create all of your staging models as views, since these consist of basic castings and column renaming.

You can do this by specifying the model materialization under the models/staging directory in the dbt_project.yml file:

models:
    madison:
        staging:
        # defaults if not declared more specifically
            +schema: 'staging'
            +materialized: view

It's always best to define your models as views unless you have a performance issue. Views don't incur any storage costs and are always up to date with the latest data, because the transformations within the model run every time the model is referenced.

However, if a model contains complex transformations or a lot of data, it’s usually better to materialize it as a table.

Utilize tables when a model has complicated transformations

Tables are naturally more performant because they store the actual data rather than simply accessing it. However, this is also why they are more expensive.

It makes sense to use a table materialization for core models that use joins, window functions, or metric calculations, and that you access frequently. Using tables with these models allows you to quickly retrieve data to power dashboards and reports, avoiding any delays in performance.

All of my core dbt data models are materialized as some type of table due to the constant querying they require. When you need an important piece of data, you shouldn't have to wait! This is the whole point of using a tool like dbt and a data warehouse like Snowflake: they're fast.

Similarly to staging models, I recommend materializing your core models as tables within the dbt_project.yml file:

models:
    madison:
        staging:
        # defaults if not declared more specifically
            +schema: 'staging'
            +materialized: view
        core:
            +schema: 'core'
            +materialized: table

Now, by default, all core models will materialize as tables in Snowflake.

To save on Snowflake compute costs, use incremental models

It’s no secret that Snowflake can get really expensive, really fast. Using the correct materialization for your models can save not only storage costs, but compute costs as well.

If you have a huge table with complex transformations, there’s a good chance you will need high performance on that table. Frequently running these transformations on all of the data will quickly break the bank.

A best practice to save costs and reduce the computing power used by Snowflake is to take advantage of incremental models on large, load-intensive models. Incremental models allow you to run your transformations on new or updated data only, appending the results to the data that has already been transformed. This way, you aren’t running complex transformations on the same data over and over again. Instead, you are only running (and paying for) it once.

Because incremental models are used on a case-by-case basis, I recommend specifying the materialization for them directly within your YAML documentation. Wherever the model is documented, you can add a config similar to the one you add in dbt_project.yml.

For example, if you want to materialize a model web_sessions_derived as incremental, you could do so in the _core_models.yml file where it is documented:

models:
    - name: web_sessions_derived
      config:
          materialized: incremental
          unique_key: id

However, with incremental models, you also need to make sure you are adding the is_incremental() macro to the code of the model itself in order to tell dbt how to handle the loading process.

It's important that this macro wraps a WHERE clause that compares a timestamp column against the latest value of that field already loaded into the table.

In our web_sessions_derived example, we would compare the timestamp session_started_at with the most recent session_started_at value from the model's previous runs. In dbt incremental models, {{ this }} refers to the existing table built by previous runs of the current model.

{% if is_incremental() %}

    -- this filter will only be applied on an incremental run
    -- (uses >= so records that share the latest loaded timestamp aren't missed)
    where session_started_at >= (select max(session_started_at) from {{ this }})

{% endif %}

Now, when this model builds, it will only transform new data, saving compute resources and putting less load on the warehouse.
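
Putting the pieces together, here is a minimal sketch of what models/core/web_sessions_derived.sql might look like; the upstream staging model and column list are assumptions for illustration, and the materialization itself comes from _core_models.yml as shown above:

select
    id,
    user_id,              -- assumed columns for illustration
    session_started_at
from {{ ref('stg_web_sessions') }}   -- assumed upstream staging model

{% if is_incremental() %}
    -- on incremental runs, only process sessions at or after the latest one already loaded
    where session_started_at >= (select max(session_started_at) from {{ this }})
{% endif %}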

Conclusion

When using dbt and Snowflake together, your setup is key. You need to make sure you organize the data warehouse in a way that makes sense. It’s vital that you take advantage of users and roles so that you maintain good data governance practices. You must set up your models so that you optimize for cost savings.

Luckily, dbt allows you to do all of this through the profiles.yml and dbt_project.yml files. These files set you up for success when writing data to Snowflake, applying your configurations automatically as your data models build. Table clustering and the right materializations make it possible to save on both storage and compute.

While I am now a Redshift user, I'd be lying if I said I didn't miss using Snowflake, especially in combination with dbt. Redshift and dbt just don't mesh the way these two tools do. So, if you have the pleasure of working with them together on a daily basis, make the most of it! Follow these best practices and unlock the potential these two tools offer. There's a reason they are two of the most talked-about tools in the modern data stack.

If you enjoyed this tutorial but would prefer using a simpler UI to manage materializations, roles, etc., you should check out Y42. Read more about the tool in the Y42 docs section or schedule an onboarding call to get started for free and build powerful and efficient pipelines in an Integrated Data Development Environment (IDDE) right away.
