You receive an urgent request from a stakeholder to create a report on the success of the organization’s recent marketing campaigns. You’ve never worked with these data sources before, so you have no idea where to find them. After a lot of searching, you track down the dataset’s owner, who points you in the right direction. However, you notice some of the data isn’t up to date. And, to top things off, the data is messy and needs to be cleaned! After waiting days for the owner to clean the dataset you need, you end up using a deprecated dataset in your report, resulting in inaccurate information.
Does this story sound all too familiar? Unfortunately, not knowing where data is located, who owns it, and whether it’s reliable is a problem that’s far too common. Even if you can find the data you need, there’s rarely associated documentation to give you the column definitions or metadata, like datatype, units, or time zone. Luckily, there’s a solution to this problem: data catalogs!
A data catalog is the inventory of all the data your company has available. It is automatically generated through integration with your data stack tools. A proper data catalog connects to all elements of your data stack, including the data warehouse, data models, and visualization tool.
Data practitioners and business users tend to think that data catalogs are only for raw data sources or complex data models, and that they include nothing more than a basic definition for each column in a dataset. However, data catalogs cover many more use cases. They increase confidence and transparency in your data, ensuring only the highest-quality data is used.
Data catalogs allow you to categorize your data based on its taxonomy, ensuring the right people are getting their hands on it. Catalogs also provide an owner and domain expert for every dataset. This way you know who to reach out to with any questions about the logic, definitions, or quality. Speaking of quality, data catalogs should also have a feature that lets you know which datasets are ready to use and which are broken. Lastly, they should specify the nitty-gritty of your data, such as specific definitions and column metadata.
Separating datasets by their domain helps to keep an organized inventory of data. Oftentimes, we get overwhelmed by looking at assets that don’t necessarily relate to our domain. By utilizing dataset categorization, different stakeholders can easily filter the data by their business function or application domain. This reduces the time spent searching for the appropriate dataset and reduces potential human errors.
Categorization, when combined with access control, can also be viewed as a form of data governance, because you are restricting the availability of the data based on domain. This means that those in finance can’t view datasets related to growth. In this way, categorization is a gatekeeper: it stops people from accessing information you may not want them to see. It also leaves less room for the data quality mistakes that can arise when a key feature of a dataset is changed.
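As a rough illustration of how categorization doubles as a gatekeeper, here is a minimal Python sketch. The dataset names, domains, and user model are all hypothetical — real catalog tools expose this through their own permission systems.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    domain: str  # business function, e.g. "finance" or "growth"
    owner: str


@dataclass
class User:
    name: str
    domains: set = field(default_factory=set)  # domains this user may access


# A tiny illustrative catalog with one dataset per domain.
CATALOG = [
    CatalogEntry("monthly_revenue", "finance", "dana"),
    CatalogEntry("signup_funnel", "growth", "sam"),
]


def visible_datasets(user: User) -> list:
    """Return only the datasets in domains the user is allowed to see."""
    return [entry.name for entry in CATALOG if entry.domain in user.domains]


# A finance analyst filtering the catalog sees only finance datasets.
analyst = User("finance_analyst", {"finance"})
print(visible_datasets(analyst))  # ['monthly_revenue']
```

The same filter that narrows search results for one team is what keeps another team’s datasets out of view entirely.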
In my experience, organizations tend to benefit from assigning two types of owners to their data, and this ownership is exposed in a data catalog. One owner is usually the stakeholder who has an understanding of why this dataset is being generated and used. They may work with an outside vendor to get this data or input it themselves into a Google Sheet. Typically, this is the person that requested this dataset in the first place.
However, each dataset should also have a domain-level expert assigned to it. This is someone who understands the more technical aspects of the dataset, such as how the data is ingested into the data warehouse or cleaned. Depending on the data catalog platform you choose to use, there should be a feature that allows you to assign each type of owner to your dataset.
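The two owner types can be captured as plain metadata attached to each dataset. The sketch below is a hypothetical Python representation — field names and email addresses are illustrative, and real catalog platforms store this in their own schema.

```python
# Hypothetical ownership metadata: each dataset records both a business
# owner (why the data exists) and a domain expert (how it is built).
DATASET_OWNERSHIP = {
    "marketing_spend": {
        "business_owner": "jordan@company.com",  # requested and uses the data
        "domain_expert": "priya@company.com",    # knows ingestion and cleaning
    },
}


def contacts_for(dataset: str) -> tuple:
    """Return (business_owner, domain_expert) for a dataset."""
    meta = DATASET_OWNERSHIP[dataset]
    return meta["business_owner"], meta["domain_expert"]


print(contacts_for("marketing_spend"))
```

Keeping both contacts next to the dataset means a question about business logic and a question about pipeline mechanics each have an obvious first stop.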
Having two types of owner is important for gaining the full context of your data while still ensuring it is high quality and ready for use. A data practitioner may not have a deep understanding of a column’s use within the business, while a business stakeholder may not understand how that column’s value is used in analysis. Good ownership balances domain knowledge of a particular business area with making sure the data stays actionable for everyone else.
For example, the product team may have a column in a dataset used for quality testing. The data practitioner may not understand what this column is used for when defining it, so they would need to reach out to the stakeholder. However, the business stakeholder may not understand that this value needs to be recorded as a number with two decimal places to ensure it can be used the way they intend in reporting. Defining owners helps business and data teams work together to achieve the best possible data-driven results.
In my opinion, this has to be the most powerful feature of a data catalog. I can’t tell you how many times I’ve come across various datasets being used in data models, wondering if they were still relevant to the business. This usually involves hunting down the person who originally created a spreadsheet, or someone who at least knows about the spreadsheet. Most of the time, nobody knows about it and it hasn’t been updated — yet it’s still being used in a key data model!
Data catalogs should have a feature that instantly tells you a dataset’s current status. This could be a simple red or green dot, or a button indicating that a dataset is “broken” or “ready for use”. In addition, it should let you know when the dataset was last updated.
Data quality encompasses a lot of different things, but the top things that come to mind are:
Is the transformation code powering this dataset working?
When was the last time the dataset was updated?
Is the owner of the dataset still with the company?
Is the domain expert still with the company?
Do the columns follow our style guide/naming conventions?
Has the dataset passed the quality tests we have in place?
The data catalog should provide answers to all of these questions. After all, the goal is to avoid spending days hunting down answers; they should be available to you right within the platform.
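The checklist above can be mirrored as automated checks. This is a minimal sketch assuming a simple dataset record — the field names, freshness threshold, and employee roster are all illustrative, not a real catalog API.

```python
from datetime import datetime, timedelta


def quality_report(dataset: dict, active_employees: set, now: datetime) -> dict:
    """Answer each quality question as a boolean, one per checklist item."""
    return {
        "pipeline_passing": dataset["last_run_status"] == "success",
        "fresh": now - dataset["last_updated"] <= timedelta(days=1),
        "owner_active": dataset["owner"] in active_employees,
        "expert_active": dataset["domain_expert"] in active_employees,
        "tests_passing": all(dataset["test_results"].values()),
    }


# Illustrative dataset record: every check passes in this example.
dataset = {
    "last_run_status": "success",
    "last_updated": datetime(2023, 5, 1, 8, 0),
    "owner": "jordan",
    "domain_expert": "priya",
    "test_results": {"not_null_id": True, "unique_id": True},
}
report = quality_report(
    dataset,
    active_employees={"jordan", "priya"},
    now=datetime(2023, 5, 1, 12, 0),
)
print(report)
```

A catalog’s red/green status indicator is essentially the conjunction of checks like these, surfaced next to the dataset instead of buried in a pipeline log.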
The metadata of the columns within a dataset is important, as it allows data analysts and analytics engineers to do their best work. Before utilizing a data catalog, I would always have questions for our engineers on the time zone of certain date columns, how a column was generated, and how it related to other columns in different datasets. There was always a ton of back and forth as we tried to track down the answers to these questions.
Utilizing a data catalog to keep track of these more technical questions decreases the amount of time spent going back and forth between engineers and analysts. If metadata is properly documented upfront, there’s no need to bother one another with questions about it. This also goes hand in hand with data quality, since these questions typically relate to producing high-quality data.
For example, if a company has many different datasets, some of which contain date columns in one time zone and others with date columns in another time zone, there needs to be an efficient way to distinguish them. Properly documenting this within a catalog allows analysts to apply the correct time zone conversions, ensuring reports are accurate.
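Here is a minimal example of the conversion that documented time zone metadata makes possible, using Python’s standard `zoneinfo` module. The column name and zones are illustrative assumptions.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Suppose the catalog documents that orders.created_at is stored in UTC,
# while the report should display times in US Eastern.
created_at_utc = datetime(2023, 5, 1, 14, 30, tzinfo=ZoneInfo("UTC"))
created_at_local = created_at_utc.astimezone(ZoneInfo("America/New_York"))

print(created_at_local.isoformat())  # 2023-05-01T10:30:00-04:00
```

Without the documented source zone, an analyst has to guess whether a timestamp is already local or still in UTC, and a wrong guess silently shifts every figure in the report by several hours.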
Data catalogs are key for bringing awareness and transparency to the data available within an organization. They help categorize data, assign it an owner, mark it with a quality score, and document important metadata. Without the components of a data catalog, organizations contend with bottlenecks and poor-quality data. Implementing a catalog will help create a data culture within your organization that is both transparent and empowering.
Using an end-to-end data platform like Y42 makes integrating your stack with a data catalog tool frictionless, because everything is already in one place. What’s more, tools like Y42 already include a data ownership overview by offering a domain expert column within the platform that serves this purpose.
Madison is an analytics engineer with a passion for data, entrepreneurship, writing, education, and wellness. Her goal is to teach in a way that everyone can understand — whether you’re just starting out in your career or you’ve been working in engineering for 20 years. She is an avid writer on Medium and shares her thoughts on analytics engineering in her weekly newsletter.