DATA | MACHINE LEARNING | QA
In the dazzling world of machine learning (ML), it’s easy to get caught up in the thrill of devising sophisticated algorithms, captivating visualisations, and impressive predictive models.
Yet, much like the durability of a building depends not just on its visible structure but also on its hidden foundations, the effectiveness of machine learning systems hinges on an often-overlooked but crucial aspect: data quality.
Think of your ML training and inference pipelines as the journey of a steam train.
It’s critical to maintain the health of the train itself — the ML system — but what if the tracks are compromised?
If the quality of data feeding your system is not ensured upstream, it’s akin to a damaged rail track — your train is destined to derail, sooner or later, especially when operating at scale.
Therefore, it’s paramount to monitor data quality from the get-go, right at the source.
Like a train inspector examining the tracks ahead of a journey, we must scrutinise our data at its point of origin.
This can be achieved through a concept known as ‘Data Contracts’.
Imagine being invited to a potluck dinner, where each guest brings a dish.
Without any coordination, you could end up with a feast entirely composed of desserts!
Similarly, in the vast landscape of data, there must be an agreement (i.e., the Data Contract) between data producers and consumers to ensure that the data produced meets specific quality standards.
This contract is essentially a blueprint that captures metadata about the data. A non-exhaustive list of what it can include:
- Schema Definition: The structure of the data, such as field names and data types.
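
To make the schema-definition part of a contract concrete, here is a minimal sketch of how a producer or consumer might check a single record against an agreed schema. Everything in it is an assumption for illustration: the `FieldSpec` class, the `ORDERS_SCHEMA` contract, and the `validate_record` helper are hypothetical names, not part of any particular data contract tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in the (hypothetical) contract: its name, type, and whether it is required."""
    name: str
    dtype: type
    required: bool = True


# Hypothetical contract for an illustrative "orders" dataset.
ORDERS_SCHEMA = [
    FieldSpec("order_id", str),
    FieldSpec("amount", float),
    FieldSpec("coupon_code", str, required=False),
]


def validate_record(record: dict, schema: list[FieldSpec]) -> list[str]:
    """Return a list of human-readable contract violations (empty if the record conforms)."""
    violations = []
    for spec in schema:
        # A field that is absent or None counts as missing.
        if spec.name not in record or record[spec.name] is None:
            if spec.required:
                violations.append(f"missing required field: {spec.name}")
            continue
        value = record[spec.name]
        if not isinstance(value, spec.dtype):
            violations.append(
                f"{spec.name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
    # Fields the contract does not know about are flagged as well.
    known = {spec.name for spec in schema}
    for extra in sorted(set(record) - known):
        violations.append(f"unexpected field: {extra}")
    return violations
```

Calling `validate_record({"order_id": "A1", "amount": 9.5}, ORDERS_SCHEMA)` returns an empty list, while a record with a wrongly typed or missing field returns one message per violation. In practice this kind of check would sit at the source, before the data ever reaches the ML pipeline.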