SQL is hands down the most important skill every analytics engineer should know. It’s foundational, yet often neglected as a top skill of a data practitioner. Because it’s nothing “new” or “fancy”, it gets put on the back burner. But, in order to write high-quality data models, you need to be able to write clean, concise SQL code.
SQL is at the heart of data modeling. While it’s not the only aspect of data modeling, it makes up a large portion of it. You can also write data models without knowing SQL (for instance, by using drag-and-drop tools that write the SQL for you under the hood), but SQL is still what most analytics engineers use to build their data models.
Writing clear, concise SQL code unlocks lots of benefits for your data models. For one, it can help your models run faster, using less computing power when your queries are written optimally. Second, it can help you piece together complex datasets and calculate key business metrics. Third, when written well, the code can scale with your business, decreasing your technical debt.
You may think knowing just the SQL basics will take you far, but you can know the basics and still write poor code. Analytics engineers need to constantly practice new functions and rewrite queries to better optimize them. Data models should always be optimized for scalability, run-time, and efficiency. In order to focus on these three things, you need to learn to write DRY (“don’t repeat yourself”) code. This means you must think ahead and write your SQL queries in a way that can be scaled. Chances are, you’ll hit a few dead ends before you land on an ideal solution.
Your SQL code also needs to be written with readability in mind. Again, this takes practice. You’ll need to subject yourself to code reviews by team members in order to learn to write code that everyone understands. What is easy for you to read might not make sense to others. Every time you write a piece of code, add comments and make changes so that others know exactly what the code is doing. Readability over complexity, always.
Lastly, as analytics engineers, we are expected to calculate core metrics and piece together complex datasets. The business depends on us to solve these problems and deliver datasets they can rely on. When I was first looking to become an analytics engineer, I practiced tons of real-world business problems on LeetCode. This helped me learn SQL functions like window functions that helped me solve the problems in the way I needed. If you’re used to writing basic joins and aggregates, practicing more complex business-related problems can stretch your use of other types of functions.
As an analytics engineer, there are two types of SQL functions I use: those I can rely on all the time and those I’ve found super helpful in solving complex problems. You’ll naturally end up using some functions more often than others, but it’s always that unique one that you’re happy you know when a specific use case calls for it. Let’s dive into my top five SQL functions.
Validation is a large part of data modeling. Whether you’re rewriting a pre-existing data model or writing one completely from scratch, you need to validate it after you’ve written it. This involves best practices like comparing the model to its base data tables and counting the number of rows with certain characteristics. The COUNT() function comes in handy when doing this. It allows you to group columns in your model and base tables and count their values, making for easy comparison.
I often use COUNT() to count the number of rows generated in a certain month or year. This ensures the model functions correctly across all moments in time and not just during an isolated period.
Let’s say I need to validate an order_details model. I would like to compare the order count for each year in the model to that same count in the base model. If that count is nearly identical, then I know my model is written correctly. If not, there is probably some logic that I need to change so that the results match.
My query would look something like this:
SELECT year(ordered_at) AS year,
COUNT(*) AS order_count
FROM order_details
GROUP BY year(ordered_at)
Depending on the column you need to validate, you could replace year(ordered_at) with another value as well.
The COUNT() function is also a great one to help you better understand a dataset. It can tell you how your data is distributed and which values are the most popular in a certain column. There’s a reason why aggregate functions like COUNT() are some of the first they teach you in a beginner’s course!
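To see how values are distributed, a minimal sketch (assuming a hypothetical orders table with a platform column) might look like this:

```sql
-- Count orders per platform to see which values are most common.
SELECT platform,
       COUNT(*) AS order_count
FROM orders
GROUP BY platform
ORDER BY order_count DESC
```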
Now, this one may surprise you. Many people say that learning window functions is not a necessity, but I beg to differ. They’re not used for everyday work, but they come in handy when you come across a complex problem. Often, the problem can’t be solved without a window function at all, or the alternative query is extremely slow to execute and hard to understand.
ROW_NUMBER() arranges values based on a specified partition and order. You specify what to partition, or group the values by, as well as the order to sort the values within those groups.
For example, if you are ranking students by their math scores but splitting them up by gender, the query would look something like this (assuming a students table with gender and score columns):

SELECT student_id, gender, score,
ROW_NUMBER() OVER(PARTITION BY gender ORDER BY score DESC) AS rank_by_gender
FROM students
We are splitting the students up by gender and then ordering their scores from high to low. You will see each rank_by_gender value twice, since the same sequence of ranks is generated within each gender.
I’ve used this function when piecing together web session sequences, determining a user’s most recent order, and discovering which warehouse an order shipped out of. It comes in handy when eliminating duplicate primary keys with slightly different values, which is best practice when joining datasets.
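As a sketch of that deduplication pattern (the raw_orders table and its columns are assumptions), you can keep only the most recent version of each primary key:

```sql
-- Keep one row per order_id: the most recently updated version.
SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY order_id ORDER BY updated_at DESC) AS row_num
    FROM raw_orders
) AS deduped
WHERE row_num = 1
```

Some warehouses, such as Snowflake, also support a QUALIFY clause that lets you filter on the window function directly, skipping the subquery entirely.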
You can also use the RANK() function, which serves a similar purpose of “ranking” values. However, I find that ROW_NUMBER() is more intuitive and easier to understand.
There are a lot of business metrics that require you to calculate time periods, sometimes in days, other times in weeks, or maybe even months.
DATEDIFF() calculates the time between two dates or timestamps in the unit that you specify. This saves analytics engineers from having to convert time periods into the units needed.
DATEDIFF() simply returns whatever unit you are looking for.
Some of the units you can use are as follows: year, quarter, month, week, day, hour, minute, and second.
I’ve personally found “quarter” to be really helpful when calculating quarterly metrics. DATEDIFF() will return the number of quarters between the two date values that you pass, making it super easy to measure growth over a time period the business will appreciate. I’ve also found DATEDIFF() to be helpful in calculating the duration of a user’s web session. This often involves piecing together different events on a website and calculating the time between the first and last events.
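The quarter case can be sketched like this (Snowflake-style syntax; the customers table and its columns are assumptions):

```sql
-- How many full quarters has each customer been active?
SELECT customer_id,
       DATEDIFF(quarter, first_order_date, CURRENT_DATE) AS quarters_active
FROM customers
```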
Here’s an example of how I would use it to calculate a user’s session duration:
SELECT user_id, device, session_started_at, session_ended_at,
DATEDIFF(second, session_started_at, session_ended_at) AS session_duration
FROM web_sessions
Now you can calculate the difference in seconds between the beginning and end of a user’s session without having to convert from one unit to another.
CAST() is a fairly simple SQL function, but it’s the one function I use nearly every day. It is a key function when it comes to writing your base models. Before using your base models in a data model that combines data from multiple sources, you want to standardize the dates and data types within them.
CAST() ensures all of your dates and timestamps are of the same type.
When I first started writing data models, I assumed all timestamp functions were the same. I quickly realized this wasn’t the case when records of web sessions were decreasing each day. It turned out that one of the models I had written was comparing two different types of timestamps, eliminating key sessions. Luckily, I discovered the issue and cast all my timestamp columns to be the same data type: timestamp_ntz, to be exact.
A base model using CAST() typically looks like this:
SELECT ad_id, ad_name, platform,
CAST(date_created AS date) AS date_created,
CAST(created_at AS timestamp_ntz) AS created_at
FROM raw_ads
And, if you like SQL shortcuts, the :: operator also casts a value to your specified data type: created_at::timestamp_ntz.
The CAST() function can also be used for other data types like the following: integer, float, varchar, and boolean.
While you don’t necessarily have to use the CAST() function on every column in your base model, be sure to apply it to columns often involved in comparisons or joins. These are the columns where it really matters. You don’t want incorrect results when joining or filtering because the data types compare in unexpected ways.
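As an illustration (table and column names are assumptions), casting both sides of a join key to one type keeps the comparison predictable:

```sql
-- Cast both join keys to varchar so the comparison behaves consistently.
SELECT o.order_id, c.customer_name
FROM orders AS o
JOIN customers AS c
  ON CAST(o.customer_id AS varchar) = CAST(c.customer_id AS varchar)
```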
SUM() is another classic aggregate function. It is one of the most used due to its practicality, especially in business. It’s common to use SUM() to calculate spending, revenue, and the number of activations. However, it is also helpful in the validation of data models for analytics engineers.
SUM() can be used to compare values between models and base tables. For example, if you created a data model for company revenue, you want to make sure it equates to the profits and losses that can be calculated from the base tables. If you calculate more money coming in from one of the base tables than the model shows, you probably need to double-check your logic.
SUM() comes in handy for comparing the actual values of numeric columns rather than just the row count. It is a different type of validation, but one that is just as helpful.
Let’s go back to our order_details model. I want to calculate the number of products sold across all my orders. I would find the sum of product quantity for orders placed within a certain month, then compare that number to the same query run against the orders base table. The query would look something like this:

SELECT SUM(product_quantity) AS products_sold
FROM order_details
WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31'
Now, even if the COUNT() of rows in each table matches, we can check that the specific quantities match as well.
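If you want both totals side by side, one sketch (assuming the order_details model and an orders base table, each with a product_quantity column) is to join the two aggregates:

```sql
-- Compare January product totals between the model and its base table.
SELECT m.total AS model_total, b.total AS base_total
FROM (SELECT SUM(product_quantity) AS total
      FROM order_details
      WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31') AS m
CROSS JOIN (SELECT SUM(product_quantity) AS total
            FROM orders
            WHERE order_date BETWEEN '2021-01-01' AND '2021-01-31') AS b
```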
Strong SQL is a foundational skill that every analytics engineer needs to learn in order to succeed. It is one of the things you need to constantly practice in order to improve your data modeling skills. Even though it may be viewed as a “basic” skill, it is still essential if you want to continually improve throughout your career. Nobody got better at a skill they didn’t practice. I highly recommend doing one or two SQL problems on LeetCode each day, helping you to optimize your queries and learn new ways to solve business problems.
Learning better SQL practices will only help to optimize your data models, making them faster and more readable. Improved data models will then seep into other aspects of your work as a data team, fostering a better data culture within your company.
One way you can train and improve your SQL skills is by using a tool like Y42, which allows you to build models using no-code, low-code, or SQL queries. This way, you can gradually scale from not writing a single line of code, to using basic SQL queries, to fully writing your models in SQL. You can mix and match and build your models with whichever function you’re most comfortable with, but you can also jump straight into SQL models and put all the functions I just shared with you into practice.
Madison is an analytics engineer with a passion for data, entrepreneurship, writing, education, and wellness. Her goal is to teach in a way that everyone can understand — whether you’re just starting out in your career or you’ve been working in engineering for 20 years. She is an avid writer on Medium and shares her thoughts on analytics engineering in her weekly newsletter.