How to Create Valuable Data Tests | by Xiaoxu Gao | Jul, 2023

Data Quality dimensions

Taking a consumer viewpoint of data quality is undoubtedly a valuable initial step, but it may not cover the full scope of testing. Extensive literature reviews have addressed this issue for us, offering a range of data quality dimensions that are relevant to most use cases. It's advisable to review the list with data consumers, collectively determine which dimensions apply, and create tests accordingly.

| Data quality dimensions |                  |                    |
| ----------------------- | ---------------- | ------------------ |
| Accuracy                | Format           | Comparability      |
| Reliability             | Interpretability | Conciseness        |
| Timeliness              | Content          | Freedom from bias  |
| Relevance               | Efficiency       | Informativeness    |
| Completeness            | Importance       | Level of detail    |
| Currency                | Sufficiency      | Quantitativeness   |
| Consistency             | Usableness       | Scope              |
| Flexibility             | Usefulness       | Understandability  |
| Precision               | Clarity          |                    |

You might find this list too long and wonder where to start. Data products, or any information system, can be observed and analyzed from two perspectives: the external view and the internal view.

External view

Dimensions of external view (Created by Author)

The external view is about the use of the data and its relation to the organization. The system is often treated as a "black box" whose function is to represent the real-world system. The dimensions that fall into the external view are highly business-driven. Sometimes, evaluating those dimensions is subjective, so it's not always easy to create automated tests for them. But let's check out a few well-known dimensions:

  • Relevancy: The extent to which data are applicable and helpful for the analysis. Consider a marketing campaign aimed at promoting a new product. All data attributes should directly contribute to the success of the campaign, such as customer demographic data and purchase data. Data like city weather or stock market prices are irrelevant in this case. Another example is the level of detail (granularity). If the business wants the marketing data at the day level, but it's delivered at the weekly level, then it's neither relevant nor useful.
  • Representation: The extent to which data is interpretable for data consumers and the data format is consistent and descriptive. The importance of the representation layer is often overlooked when assessing data quality. It includes the format of the data — being consistent and user-friendly — and the meaning of the data — being understandable. For instance, consider a scenario where data is expected to arrive in a CSV file with descriptive column names, and the values are expected to be in EUR rather than in cents.
  • Timeliness: The extent to which data is fresh for data consumers. For example, the business needs the sales transaction data with a maximum delay of 1 hour from the point of sale. It indicates that the data pipeline should be refreshed frequently.
  • Accuracy: The extent to which data is compliant with business rules. Data metrics are often associated with complicated business rules such as data mapping, rounding modes, etc. Automated tests on data logic are highly recommended and the more, the better.

Out of the four dimensions, timeliness and accuracy are the most straightforward to test. Timeliness can be tested by comparing a timestamp column with the current timestamp. Accuracy tests are feasible through custom queries that encode the business rules.

Internal view

Dimensions of internal view (Created by Author)

In contrast, the internal view is concerned with operational aspects that are independent of specific requirements. These dimensions are essential regardless of the use case at hand. Dimensions in the internal view are more technically driven, as opposed to the business-driven dimensions of the external view. This also means that data tests are less dependent on consumers and can be automated most of the time. Here are a few key perspectives:

  • Quality of data source: The quality of the data source significantly impacts the overall quality of the final data. The data contract is a great initiative to ensure source data quality. As data consumers of the source, we can employ a similar approach to monitor the source data as data stakeholders do when evaluating the data products.
  • Completeness: The extent to which information is retained in its entirety. As the complexity of the data pipeline increases, there is a higher likelihood of information loss occurring within the intermediate stages. Let’s consider a financial system that stores customer transaction data. The completeness test ensures that all transactions successfully traverse the entire lifecycle without being omitted or left out. For example, the final account balance should accurately mirror the real-world situation, capturing every transaction without any omissions.
  • Uniqueness: This dimension goes hand-in-hand with the completeness test. While completeness guarantees that nothing is lost, uniqueness ensures that no duplication occurs within the data.
  • Consistency: The extent to which data is consistent across internal systems from day to day. Discrepancies are a common data issue and often stem from data silos or inconsistent metric calculation methods. Another aspect of consistency shows up between days when data is expected to follow a steady growth pattern. Any deviation should raise a flag for further investigation.

It's worth noting that each dimension can be associated with one or more data tests. What's crucial is understanding which dimensions apply to which tables or metrics. Only then does "the more tests, the better" hold.

Thus far, we’ve discussed the dimensions of external views and internal views. In future data test designs, it’s important to consider both the external and internal perspectives. By asking the right questions to the right people, we can enhance efficiency and reduce miscommunication.
