Implement Behaviour Driven Development in data pipelines using Mage | by Xiaoxu Gao | Jul, 2023


What is Behaviour Driven Development (BDD)?

When building data pipelines for business, it’s highly likely that we will encounter complicated and tricky business logic. One example is to define customer segmentation based on a combination of age, income, and past purchases. The following example represents only a fraction of the complexity that business logic can entail. It can become progressively complicated as there are more attributes and granularity within each attribute. Think about one example in your daily job!

1. People between 19 and 60
AND with high past purchases are "premium".

2. People between 19 and 60
AND with high income are "premium".

3. People above 60
AND with high income
AND with high past purchases are "premium".

4. Others are "basic".
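To make the rules concrete, here is a minimal Python sketch of how they could be encoded. It reuses the function name get_user_segment mentioned later in this article, but the signature is an assumption: "high income" and "high past purchases" are passed in as pre-computed booleans, and the age bounds of 19 and 60 are treated as inclusive.

```python
def get_user_segment(age: int, high_income: bool, high_purchases: bool) -> str:
    """Return "premium" or "basic" following the four rules above.

    Assumptions (not from the article): the two "high" attributes arrive
    as booleans, and the 19-60 age band is inclusive on both ends.
    """
    if 19 <= age <= 60:
        # Rules 1 and 2: one strong signal is enough in this age band.
        if high_purchases or high_income:
            return "premium"
    elif age > 60:
        # Rule 3: above 60, both signals are required.
        if high_income and high_purchases:
            return "premium"
    # Rule 4: everyone else, including anyone under 19.
    return "basic"
```

Even in this small form, the rules are only readable to someone comfortable with code, which is the gap the rest of this article tries to close.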

So the question is where the business rules should be documented and how to keep the documentation and the code in sync. One common approach is to put comments alongside the code, or to strive for code that is self-explanatory and easy to understand. But there is still a risk of outdated comments, or of code that stakeholders find hard to comprehend.

Ultimately, what we are looking for is a “documentation-as-code” solution that benefits both engineers and business stakeholders, and this is exactly what BDD can provide. If you are familiar with the concept of a “data contract”, BDD can be seen as a form of data contract, but with a focus on the stakeholders rather than the data source. It is particularly beneficial for data pipelines with complicated business logic, and it helps prevent debates about “feature or bug”.

BDD is essentially a software development approach that emphasizes collaboration and communication between stakeholders and developers to ensure that software meets the desired business outcomes. The behavior is described in scenarios that illustrate the expected inputs and outcomes. Each scenario is written in a specific format of “Given-When-Then” where each step describes a specific condition or action.

Let’s see what the scenarios may look like for the customer segmentation example. Since the feature file is written in English, it can be well understood by business stakeholders and they can even contribute to it. It works like a contract between stakeholders and engineers, where engineers are responsible for accurately implementing the requirements, and stakeholders are expected to provide all the necessary information.

Having a clear contract between stakeholders and engineers helps categorize data issues correctly, distinguishing “software bugs” that result from implementation errors from “feature requests” that arise from missing requirements.

Feature file (Created by Author)

The next step is to generate test code from the feature file, and that is where the connection is made. The Pytest code acts as a bridge between the documentation and the implementation code. When the two are misaligned, the tests fail, highlighting the need to bring documentation and implementation back in sync.

Test code acts as a bridge (Created by Author)

Here is what the test code looks like. To keep the example short, I only implement the test code for the first scenario. The Given steps set up the initial context for the scenario, which in this case means reading the customer's age, income, and past purchases from the examples. The When step triggers the behavior being tested, which is the get_user_segment function. In the Then step, we compare the result from the When step with the expected output from the scenario example.
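The Given, When, and Then roles described above can also be sketched in plain Python without any BDD library. This is not the code from the figure: the scenario wording in the comment, the example rows, the helper names, and the body of get_user_segment are all assumptions made for illustration.

```python
# Scenario (hypothetical wording, paraphrasing rule 1):
#   Given a customer aged <age> with high past purchases
#   When the customer segment is computed
#   Then the segment is "premium"

def get_user_segment(age, high_income, high_purchases):
    """Assumed implementation of the four segmentation rules."""
    if 19 <= age <= 60 and (high_income or high_purchases):
        return "premium"
    if age > 60 and high_income and high_purchases:
        return "premium"
    return "basic"

# Hypothetical example rows for the first scenario.
EXAMPLES = [
    {"age": 19, "high_income": False, "high_purchases": True, "segment": "premium"},
    {"age": 45, "high_income": False, "high_purchases": True, "segment": "premium"},
    {"age": 60, "high_income": False, "high_purchases": True, "segment": "premium"},
]

def given_customer(example):
    """Given: build the scenario context from one example row."""
    return {key: example[key] for key in ("age", "high_income", "high_purchases")}

def when_segment_is_computed(customer):
    """When: trigger the behavior under test."""
    return get_user_segment(**customer)

def then_segment_is(actual, expected):
    """Then: compare the result with the expected outcome."""
    assert actual == expected, f"expected {expected!r}, got {actual!r}"

def test_first_scenario():
    # Pytest collects this function; each example row runs all three steps.
    for example in EXAMPLES:
        customer = given_customer(example)
        segment = when_segment_is_computed(customer)
        then_segment_is(segment, example["segment"])
```

A BDD plugin for Pytest would bind step functions like these to the sentences in the feature file, so the examples would live in the document that stakeholders read rather than in a Python list.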

Test code of the 1st scenario in the feature file (Created by Author)

Imagine that the age range in the first scenario changes and an example with an age of 62 is added, but the code is not updated. In such a case, the test would immediately fail, because the feature file and the code now express conflicting expectations.
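That mismatch can be reproduced with a small sketch. The example rows, the find_mismatches helper, and the body of get_user_segment are assumptions for illustration; the second row stands for the newly added example with an age of 62.

```python
def get_user_segment(age, high_income, high_purchases):
    """Assumed implementation, still using the old 19-60 age band."""
    if 19 <= age <= 60 and (high_income or high_purchases):
        return "premium"
    if age > 60 and high_income and high_purchases:
        return "premium"
    return "basic"

# (age, high_income, high_purchases, expected segment) for the first scenario.
EXAMPLES = [
    (30, False, True, "premium"),  # existing example
    (62, False, True, "premium"),  # new example, code not yet updated
]

def find_mismatches(examples):
    """Return (age, expected, actual) for every example the code contradicts."""
    mismatches = []
    for age, high_income, high_purchases, expected in examples:
        actual = get_user_segment(age, high_income, high_purchases)
        if actual != expected:
            mismatches.append((age, expected, actual))
    return mismatches

failures = find_mismatches(EXAMPLES)
# failures == [(62, "premium", "basic")]
```

The failing row points at the exact example where documentation and implementation diverge, which shows whether the code or the feature file needs to change.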


