With the expansion of cloud technology and cheap storage over the past decade, many organizations have amassed far larger volumes of data than was previously imaginable. The pay-as-you-go model offered by the major cloud data warehouse providers (AWS, GCP, Azure) has reduced the need for up-front capital investment and infrastructure planning.
The good news is that this ultimately makes data science efforts more accessible for most organizations.
The bad news is that many Data Lakes are turning into Data Swamps.
What Is Data Modeling? And What Challenges Surround It?
It’s often difficult for Data Engineers to communicate the value of a well-modeled ecosystem to upper management, because all that stakeholders see are the BI tools and predictive models that get presented. However, poorly modeled data causes major setbacks for analytics teams from a data governance perspective: it slows down workflows, introduces repetitive tasks, and decreases reporting accuracy, among other negative side effects.
Defining “well-modeled” data is a topic of its own, but you can think of it in terms of the following characteristics of your Data Warehouse:
- A clear pattern exists for finding the data tables that relate to business entities.
- An intentional / known modeling technique is used, such as a dimensional model, entity relationship model, data vault, etc.
- Table and field naming conventions are consistent, well documented, and hold business value.
It should also be noted that data modeling is a holistic, multi-system effort. It starts in your OLTP (Online Transaction Processing) system, which is where data is initially recorded.
Ideally, your data should be normalized into third normal form (3NF) when collected through a source system. From there, it is ingested into an analytics environment, otherwise known as an OLAP (Online Analytical Processing) system, where an analytical modeling technique is applied. In the context of this article, the OLAP system is synonymous with a cloud data warehouse, but it could also be an independently hosted database such as SQL Server, MySQL, or PostgreSQL.
While data analysts and data scientists typically interact only with the OLAP system, an organization’s data modeling strategy needs to account for both OLTP and OLAP to be sustainable.
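To make the OLTP-to-OLAP handoff concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for both systems. The table and column names (customers, orders, order_fact) are hypothetical, chosen only for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# OLTP side: normalized (3NF) tables -- each fact is stored exactly once.
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme Corp', 'NJ');
INSERT INTO orders VALUES (100, 1, 250.0), (101, 1, 75.5);
""")

# OLAP side: a denormalized, analysis-friendly table built by a transformation step.
cur.execute("""
CREATE TABLE order_fact AS
SELECT o.order_id, o.amount, c.name AS customer_name, c.state
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
""")

print(cur.execute("SELECT COUNT(*), SUM(amount) FROM order_fact").fetchone())
# -> (2, 325.5)
```

The normalized tables stay clean for transactional writes, while analysts query the wide `order_fact` table without re-deriving the join every time.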
1.) Tribal Knowledge is Required to Understand Where to Find Data
For a new analyst to be successful when hired, they need a clear roadmap of what data is available in the Data Warehouse, where it is sourced from, and what its business context is. Teams with poorly modeled data often struggle to onboard new hires, then wonder why it takes them so long to answer basic business questions. Without proper mentorship, analytics teams can experience high churn rates because new members are not given the tools they need to succeed.
Data Analysts and Data Scientists should be focused on answering business problems, not wasting time hunting down where business entities live. The faster a team becomes familiar with the available data, the quicker dashboards and predictive models can be delivered, which ultimately boosts the team’s productivity.
If there are only a handful of analysts who know how to answer basic business questions, that is a problem. Working in such a siloed approach isn’t scalable and will only limit the number of problems that the team is able to solve.
2.) Different Analysts are Producing Different Results For The Same Metrics
If there is no single source of truth, it can be easy for different team members to calculate the same metric in different ways. For example, how is Revenue defined? And what tables are used to calculate this? There needs to be a clear path to define business logic, which all starts with an intentional data model.
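As an illustration (with made-up numbers and field names), here is how two reasonable-sounding definitions of Revenue can diverge on the same data:

```python
# Two analysts, two plausible "Revenue" definitions -- and two different answers.
orders = [
    {"amount": 100.0, "refunded": False, "tax": 8.0},
    {"amount": 50.0,  "refunded": True,  "tax": 4.0},
    {"amount": 200.0, "refunded": False, "tax": 16.0},
]

# Analyst A: gross revenue, refunds and tax included.
revenue_a = sum(o["amount"] for o in orders)

# Analyst B: net of refunds and tax.
revenue_b = sum(o["amount"] - o["tax"] for o in orders if not o["refunded"])

print(revenue_a, revenue_b)  # 350.0 276.0
```

Neither analyst is wrong in isolation; the problem is that the organization never agreed on one definition and materialized it in one place.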
I’ve worked in environments where three different tables represented the same business entity, each using different SQL logic to arrive at similar, but not identical, record outputs. Couple that scenario with a poorly managed reporting request queue, and you have two different analysts answering the same question with different results. This not only causes stakeholders to lose trust in the data, but also requires tedious and unnecessary reconciliation work across teams.
3.) Teams Need to Reuse Redundant Code Blocks for Business Logic
I’ve seen teams keep a Google Sheet of SQL CASE statements outlining specific business metrics. These were long and difficult to read through. While this attempts to offer consistency across teams, it violates the DRY (Don’t Repeat Yourself) principle within the organization.
For many teams with this type of issue, a transformation tool such as dbt allows Analytics Engineers to define business logic in one place and have analysts reference it in many places.
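The same principle can be sketched in plain Python: the business definition lives in exactly one function, and every report references it rather than re-implementing it. The metric and field names below are hypothetical:

```python
# Hypothetical shared metric definition, maintained in one place (DRY).
def is_qualified_page_view(event: dict) -> bool:
    """Business definition of a countable page view, assumed for illustration."""
    return event["duration_ms"] >= 500 and not event["is_bot"]

events = [
    {"duration_ms": 1200, "is_bot": False},
    {"duration_ms": 100,  "is_bot": False},
    {"duration_ms": 3000, "is_bot": True},
]

# Every report reuses the same definition instead of copy-pasting the logic.
page_views = sum(is_qualified_page_view(e) for e in events)
print(page_views)  # 1
```

If the definition of a page view changes, it changes in one function, and every downstream consumer picks it up automatically; dbt models give you the same property in SQL.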
Think about the following example: if you’re an ecommerce company with a complex way of calculating page views (which is okay), why would you distribute and duplicate that business logic across your BI tool? Not only is this risky, since the logic may not be copied and pasted the exact same way every time, but it also wastes compute, which is the largest expense with most cloud providers.
To solve for this, map out the common aggregations and business logic that need to take place, run a transformation job daily (or as frequently as needed) that writes the results to a table, and have your BI layer sit on top of that.
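A minimal sketch of this pattern, again using Python’s sqlite3 module as a stand-in warehouse (the table names are hypothetical): a scheduled job materializes the aggregation once, and the BI layer queries the small result table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Raw event data that the BI tool should NOT re-aggregate on every dashboard load.
cur.executescript("""
CREATE TABLE page_view_events (event_date TEXT, page TEXT);
INSERT INTO page_view_events VALUES
    ('2023-01-01', '/home'), ('2023-01-01', '/home'),
    ('2023-01-01', '/cart'), ('2023-01-02', '/home');
""")

# Scheduled transformation job: apply the business logic once and persist it.
cur.execute("""
CREATE TABLE daily_page_views AS
SELECT event_date, page, COUNT(*) AS views
FROM page_view_events
GROUP BY event_date, page
""")

# The BI layer now reads a small, pre-aggregated table.
rows = cur.execute(
    "SELECT page, views FROM daily_page_views WHERE event_date = '2023-01-01' ORDER BY page"
).fetchall()
print(rows)  # [('/cart', 1), ('/home', 2)]
```

The expensive scan over raw events happens once per run of the job, not once per dashboard refresh.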
4.) Your Data Warehouse Performs Poorly
As pointed out above, poorly modeled data introduces redundancy. It also creates unnecessary complexity, and excess compute consumption is a byproduct of both. All cloud data warehouses cap resources at a given pricing threshold; once that limit is reached, new queries may become extremely slow to execute, or in some cases not feasible at all.
Any Data Engineer will tell you that just purchasing additional resources is not a sustainable solution to this problem.
Long and complex queries not only take a long time to execute on their own, but also diminish the resources available in your environment. Consider a query that involves 20 joins. There are very few scenarios where this is ideal; it usually means the data needed to answer a business problem is not stored in an easily accessible format. That many joins can be computationally expensive, especially when the related tables are large or the ON clauses involve multiple columns. If you are implementing a dimensional model, your team might want to consider creating a new fact table in your database to cover these scenarios.
Resources are measured differently depending on which cloud provider you’re using, but they all follow the same concept of a dedicated pool of virtual CPUs. For example, BigQuery uses the concept of slots, which are effectively the units of compute available to execute a query. Organizations on on-demand pricing receive 2,000 slots to use at any given point in time. So if one highly complex query takes up more than the available slots, other queries sit in a queue before they can even begin executing.
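As a toy illustration only (real warehouse schedulers are far more sophisticated, and BigQuery’s slot accounting is dynamic), the queueing effect can be sketched in a few lines of Python:

```python
# Toy model: with a fixed pool of slots, an oversized query forces others to wait.
SLOT_POOL = 2000

# Hypothetical queries and the number of slots each one demands.
queries = [("dashboard_refresh", 400), ("complex_join_query", 1900), ("ad_hoc", 300)]

running, queued, available = [], [], SLOT_POOL
for name, slots_needed in queries:
    if slots_needed <= available:
        running.append(name)   # enough slots free: execute now
        available -= slots_needed
    else:
        queued.append(name)    # must wait for slots to free up

print(running, queued)  # ['dashboard_refresh', 'ad_hoc'] ['complex_join_query']
```

The point is not the scheduling algorithm, but that one poorly modeled, join-heavy query can starve everything behind it.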
5.) You Often Have to Hard Code Values in SQL
Hard coded values are often a tell-tale sign that there is required data missing in your Data Warehouse. In the context of a dimensional model, this usually means that a new dimension table needs to be created to source additional columns.
Zach Quinn wrote an article that outlines this concept really well, demonstrating how to eliminate long CASE statements with a lookup table. Putting this example in the context of a dimensional model: suppose your organization needs to do a lot of geospatial analysis. You have a customer_dimension table that provides the state abbreviation, but you want to display the full state name. You could write something like this:
```sql
SELECT
    customer_id
    , state AS state_abbreviation
    , CASE
        WHEN state = 'NJ' THEN 'New Jersey'
        WHEN state = 'NY' THEN 'New York'
        WHEN state = 'PA' THEN 'Pennsylvania'
      END AS state_full_name
FROM customer_dimension
```
But this type of CASE statement isn’t sustainable. To improve on this solution, we need to join a zip_code_dimension table to the customer_dimension table. You’ll see below that a zip_code_dimension also gives us even greater granularity in an analysis. The table might look something like this: