How to improve the code quality of your dbt models with unit tests and TDD | by Pablo Porto | May, 2023


The following model enriches a weight measurement with the latest height measurement recorded prior to the weight one.

Example of a dbt intermediate model

Let’s now create a dbt unit test to prove the transformation logic is correct.

Example of a dbt unit test for the an intermediate model

Looking at the previous test, we can see several of the dbt-unit-testing macros in use:

  • dbt_unit_testing.test: This macro allows us define the model under test and the name of the test. In our example we reference to int_weight_measurements_with_latest_height.
  • dbt_unit_testing.mock_ref: This macro allows us to mock references to other dbt models. In our example, we are mocking the weight (stg_gym_app__weight) and the height (stg_gym_app__height) staging data.
  • dbt_unit_testing.expect: And this macro allows us to assert on the result of the transformation. In the example, we assert that the weight measurement gets enriched with the latest height.

Let’s now run our model’s unit test. We can call the usual dbt test command:

dbt test

Opps, that commands runs the whole test suite including other dbt data quality tests. But we want to only run our unit tests. No problem, we can leverage dbt tags functionality to isolate our unit tests. In the example, we tagged our test with two tags:

{{ config(tags=['unit-test', 'unit-tests']) }}

The first one is a boilerplate tag required by the dbt-unit-testing library. The second one is the one we will use to execute our unit test.

dbt test --select tag:unit-tests

So far we have seen how we can write a unit test to verify the logic of a single model. After creating a few of these, our team started discussing the possibility of implementing new types of tests like we usually do for operational software like microservices.

“A component test limits the scope of the exercised software to a portion of the system under test, manipulating the system through internal code interfaces and using test doubles to isolate the code under test from other components.” — Toby Clemson

In a microservices context, a component is a service that exposes certain functionalities. If we apply the same concept to the data context, component tests for dbt apps can be implemented as tests that validate whether the dbt app provides the functionality that it promises to by mocking the data sources.

The usual test pyramid for an operational app

When implementing the component test, the scope of the test increases. We test our dbt app as a whole mocking only its sources.

The component test scope

This type of test ensures that the different models integrate correctly and the expected data transformation outcome is created. Let’s look at an example:

In the component test above, we are testing our output model body_mass_indexes. The model uses the enriched weight measurements to calculate the body mass of the user. We mock the sources directly (raw_weight and raw_height) with the dbt_unit_testing.mock_source macro. Finally, we assert the final transformation of the output model verifying that the body mass index (BMI) is calculated correctly.

We can also run this type of tests in isolation using the tag name we specified in the test configuration.

dbt test --select tag:component-test

Now that we have the ability to test our models in isolation, what if we start by writing the tests before we write any transformation logic?

Test-driven development or TDD is a software engineering practice that helps improve the design of the code by forcing the developers to write a test first and then write the minimum amount of code to make that test pass.

Our data team had experience in applying TDD in operational systems so we decided to give it a try.

Starting by defining the outcomes of a given transformation in a test felt quite natural. Oh, what is the BMI I would expect if I have this weight and this height as an input? Let’s write a test for that. After practicing TDD for a while, the team still continue to use this technique when adding new business logic into the transformations.

I just pictured a perfect scenario where you can just add the unit testing dbt package and start creating tests straight away. The truth is that the library is still under development and we found some gotchas that you should also probably be aware of:

  • The dbt-unit-testing macros break the principle of not allowing testing code to pollute production code. There is an easy hack to fix this. You can create a macro to patch the original ref() and source() and call the testing macros. You can see an example here.
  • We found that sometimes seems like changes in a test are not picked up. There is an option to disable caching but we haven’t try it.
  • When mocking sources you need to define the source in the dbt .yml file if not it doesn’t compile.
  • Sometimes the test error messages are quite cryptic. In this situation, we found ourselves looking at the compiled SQL code in the build folder.
  • Be also aware of other limitations listed in the library docs.

We have seen how we can add unit and component tests to our dbt projects to increase the code quality and therefore the maintainability of our transformation logic. We also saw how we can tag the different types of tests so that we can run them in isolation both locally and in the CI/CD pipeline. Finally, we also look at how we could also practice TDD

Hope this article helps you and your team start adopting unit tests and creating more maintainable and scalable dbt apps as your codebase scale to fulfill new data use cases.

If you are curious, you can check a fully functional example in this Github repo. I also prepared some katas in case you want to practice TDD and unit testing with a simple example.

Are you ready to give it a try?



Source link

Leave a Comment