Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring | by Paul Iusztin | Jun, 2023

Theoretical Concepts & Tools

Data Validation: Data validation refers to the process of ensuring data quality and integrity. What do I mean by that?

As you automatically gather data from different sources (in our case, an API), you need a way to continually validate that the data you just extracted follows a set of rules that your system expects.

For example, you expect that the energy consumption values are:

  • of type float,
  • not null,
  • ≥0.

While you developed the ML pipeline, the API returned only values that respected these rules. Data people call such a set of agreed rules a “data contract.”

But once your system has been running in production for 1 month, 1 year, 2 years, etc., you can never know what might change in data sources you don’t control.

Thus, you need a way to constantly check these characteristics before ingesting the data into the Feature Store.
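To make the three rules listed earlier concrete, here is a minimal sketch in plain Python. It is not the Great Expectations API. The helper name `validate_energy_consumption` and the list-of-values input are assumptions made only for this illustration, and treating NaN as a null is my own choice.

```python
import math
from typing import Any, List


def validate_energy_consumption(values: List[Any]) -> List[str]:
    """Return human-readable violations (an empty list if every rule passes).

    Hypothetical helper that checks the three rules of the data contract:
    the value is a float, it is not null, and it is >= 0.
    """
    errors = []
    for i, value in enumerate(values):
        if value is None:
            errors.append(f"row {i}: value is null")
        elif not isinstance(value, float):
            # Integers and strings are rejected too: the contract says float.
            errors.append(f"row {i}: expected float, got {type(value).__name__}")
        elif math.isnan(value):
            # Assumption: NaN is treated as a missing value.
            errors.append(f"row {i}: value is NaN")
        elif value < 0:
            errors.append(f"row {i}: negative value {value}")
    return errors


batch = [12.5, 0.0, None, -3.2, "7.1"]
violations = validate_energy_consumption(batch)
if violations:
    # Here the violations are only printed. A real pipeline would decide
    # whether to reject the batch or just log a warning.
    print("\n".join(violations))
```

Running a check like this before every insert is the whole idea: the batch reaches the Feature Store only when the list of violations is empty, or when you decided that a warning is enough.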

Note: To see how to extend this concept to unstructured data, such as images, you can check my Master Data Integrity to Clean Your Computer Vision Datasets article.

Great Expectations (aka GE): GE is a popular tool that lets you easily validate data and report the results. Hopsworks has GE support: you can add a GE validation suite to Hopsworks and choose how it should behave when new data is inserted and the validation step fails. Read more about GE + Hopsworks in [2].

Screenshot of GE data validation runs inside Hopsworks [Image by the Author].

Ground Truth Types: While your model is running in production, you can have access to your ground truth in 3 different scenarios:

  1. real-time: an ideal scenario where you can easily access your target. For example, when you recommend an ad and the consumer either clicks it or not.
  2. delayed: eventually, you will access the ground truths. But, unfortunately, they will arrive too late for you to react adequately.
  3. none: you can’t automatically collect any ground truth (GT). Usually, in these cases, you have to hire human annotators if you need any actuals.

Ground truth/targets/actuals types [Image by the Author].

In our case, we are somewhere between #1 and #2. The GT isn’t available precisely in real time, but it has a delay of only 1 hour.

Whether a delay of 1 hour is OK depends a lot on the business context, but let’s say that, in your case, it is okay.

Since we consider a delay of 1 hour acceptable for our use case, we are in luck: we have access to the GT in real-time(ish).

This means we can use metrics such as MAPE to monitor the model’s performance in real-time(ish).
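As a rough sketch of what that can look like, you can compute MAPE only over the timestamps for which the GT has already arrived. The function name `mape_on_available_gt` and the dict-based inputs are assumptions for this example. The actual pipeline may store predictions and observations differently.

```python
from typing import Dict, Optional


def mape_on_available_gt(
    predictions: Dict[str, float], observations: Dict[str, float]
) -> Optional[float]:
    """Mean absolute percentage error over timestamps present in both inputs.

    Returns None when no ground truth overlaps with the predictions yet.
    Observations equal to zero are skipped because the percentage error
    is undefined for them.
    """
    errors = []
    for timestamp, y_pred in predictions.items():
        y_true = observations.get(timestamp)
        if y_true is None or y_true == 0:
            continue
        errors.append(abs((y_true - y_pred) / y_true))
    if not errors:
        return None
    return sum(errors) / len(errors)


predictions = {"10:00": 110.0, "11:00": 90.0, "12:00": 105.0}
observations = {"10:00": 100.0, "11:00": 100.0}  # 12:00 has not arrived yet
score = mape_on_available_gt(predictions, observations)
print(score)  # a fraction: multiply by 100 to read it as a percentage
```

Note that the forecast for 12:00 is simply ignored until its observation shows up, which is exactly the real-time(ish) behavior described above.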

In scenarios 2 or 3, we would need to use data and concept drift as proxy metrics to get performance signals in time.
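To give an idea of what such a proxy metric could be, here is a minimal sketch of the Population Stability Index (PSI), one common way to quantify data drift on a single feature. PSI is my choice for illustration, not something this pipeline uses, and the function name, the equal-width binning, and the epsilon value are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float], current: Sequence[float], n_bins: int = 10
) -> float:
    """PSI between a reference sample and a current sample of one feature.

    Bins are equal-width over the reference range. Values outside that
    range fall into the first or last bin. A small epsilon avoids log(0).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference

    def bin_fractions(sample: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [c / len(sample) for c in counts]

    eps = 1e-6
    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum(
        (c - r) * math.log((c + eps) / (r + eps))
        for r, c in zip(ref_frac, cur_frac)
    )


training_values = [float(i) for i in range(100)]
shifted_values = [x + 50.0 for x in training_values]
print(population_stability_index(training_values, training_values))
print(population_stability_index(training_values, shifted_values))
```

The score is 0 when the two distributions match and grows as the current data moves away from the reference, so you can track it over time without needing any GT.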

Screenshot with the observations and predictions overlapped over time. As you can see, the GT isn’t available for the latest 24 hours of forecasts [Image by the Author].

ML Monitoring: ML monitoring is the process of ensuring that your production system works well over time. It also gives you a mechanism to proactively adapt your system, such as retraining your model in time or adapting it to new changes in the environment.

In our case, we will continually compute the MAPE metric. Thus, if the error suddenly spikes, you can raise an alarm to inform you or automatically trigger a hyperparameter tuning step to adapt the model configuration to the new environment.
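A spike check like this can be sketched in a few lines. Everything here is an assumption for illustration: the names `check_mape_spike` and `monitor`, the 24-value window, the 2x spike factor, and the `on_alarm` callback, which could just as well send a notification as kick off a tuning job.

```python
from typing import Callable, List, Sequence


def check_mape_spike(
    mape_history: Sequence[float],
    window: int = 24,
    spike_factor: float = 2.0,
) -> bool:
    """Return True if the latest MAPE exceeds spike_factor x the baseline.

    The baseline is the mean of up to `window` values that precede the
    latest one. Both defaults are illustrative, not tuned values.
    """
    if len(mape_history) < 2:
        return False  # not enough history to define a baseline
    *previous, latest = mape_history
    baseline_values = previous[-window:]
    baseline = sum(baseline_values) / len(baseline_values)
    return latest > spike_factor * baseline


def monitor(
    mape_history: Sequence[float], on_alarm: Callable[[float], None]
) -> None:
    """Call `on_alarm` with the latest MAPE when a spike is detected."""
    if check_mape_spike(mape_history):
        on_alarm(mape_history[-1])


alarms: List[float] = []
monitor([0.10, 0.11, 0.09, 0.10, 0.35], on_alarm=alarms.append)
monitor([0.10, 0.11, 0.09, 0.10, 0.12], on_alarm=alarms.append)
print(alarms)  # only the first history contains a spike
```

Comparing against a rolling baseline instead of a fixed threshold keeps the alarm meaningful even if the normal error level differs between time series.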

Screenshot with the mean MAPE metric between all the time series computed over time [Image by the Author].
