From Evaluation to Enlightenment: Delving into Out-of-Sample Predictions in Cross-Validation | by Ning Jia | Jun, 2023

Uncovering Insights and Overcoming Limitations through Out-of-Fold Predictions.

Understanding cross-validation and applying it in daily work is a must-have skill for every data scientist. While the primary purpose of cross-validation is to assess model performance and tune hyperparameters, it also produces outputs that should not be overlooked. By collecting and combining the predictions made on each held-out fold, we obtain model predictions for the entire training set, commonly known as out-of-sample or out-of-fold predictions.
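As a minimal sketch of how such predictions can be collected, the snippet below uses scikit-learn's `cross_val_predict`. The synthetic dataset, the random forest, and the 5-fold stratified split are assumptions chosen for illustration; they are not the setup used in the projects described in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for a real training set (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's probability comes from the fold model that did NOT train on that row.
oof_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
oof_pred = (oof_proba >= 0.5).astype(int)

print("Out-of-fold accuracy:", (oof_pred == y).mean())
```

The result is one prediction per training row, so `oof_proba` can be joined back to the original data and inspected sample by sample.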

It is crucial not to dismiss these predictions, as they hold a wealth of valuable information about the modelling approach and the dataset itself. By thoroughly exploring them, you may uncover answers to questions such as why the model is not working as expected, how to enhance feature engineering, and whether there are any inherent limitations within the data.

The general approach is straightforward: investigate the samples where the model is highly confident but wrong. In this post, I will show how these predictions helped me in three real-world projects.
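One way to express "highly confident but wrong" in code is to rank the misclassified samples by the gap between the label and the out-of-fold probability. The helper name `confident_mistakes`, the 0.5 decision threshold, and the toy arrays are hypothetical choices for this sketch.

```python
import numpy as np

def confident_mistakes(y_true, oof_proba, top_k=10):
    """Indices of misclassified samples, most confidently wrong first."""
    y_true = np.asarray(y_true)
    oof_proba = np.asarray(oof_proba, dtype=float)
    # Gap between the 0/1 label and the predicted probability of class 1.
    error = np.abs(y_true - oof_proba)
    # A gap above 0.5 means the prediction at threshold 0.5 disagrees with the label.
    wrong = error > 0.5
    order = np.argsort(-error, kind="stable")
    return [int(i) for i in order if wrong[i]][:top_k]

# Toy labels and out-of-fold probabilities (made up for illustration).
y_true = np.array([0, 1, 1, 0, 1])
oof_proba = np.array([0.10, 0.05, 0.80, 0.90, 0.40])

print(confident_mistakes(y_true, oof_proba))  # [1, 3, 4]
```

The samples at the top of this list are the ones worth inspecting by hand first.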

Finding data limitations

I worked on a predictive maintenance project where the goal was to predict vehicle failures in advance. One of the approaches I explored was training a binary classifier. It was a relatively simple and direct method.

After training with time series cross-validation, I examined the out-of-sample predictions. Specifically, I focused on the false positives and false negatives, the samples the model struggled to learn. These incorrect predictions are not always the model’s fault. It’s possible that some samples conflict with each other and confuse the model.
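A sketch of collecting out-of-sample predictions under time series cross-validation follows. It assumes scikit-learn's `TimeSeriesSplit`, a logistic regression, and synthetic data whose rows are taken to be in time order; none of these come from the project itself. The loop is written by hand because `cross_val_predict` requires the test folds to partition the data, and with `TimeSeriesSplit` the earliest block is only ever used for training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data; rows are assumed to be sorted by time (assumption).
X, y = make_classification(n_samples=600, n_features=8, random_state=1)

tscv = TimeSeriesSplit(n_splits=5)
oos_proba = np.full(len(y), np.nan)  # NaN marks rows that are never scored

for train_idx, test_idx in tscv.split(X):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])          # train on the past only
    oos_proba[test_idx] = clf.predict_proba(X[test_idx])[:, 1]

scored = ~np.isnan(oos_proba)
idx_scored = np.flatnonzero(scored)
oos_pred = (oos_proba[scored] >= 0.5).astype(int)

# Original row indices of the mistakes, ready for manual inspection.
false_neg = idx_scored[(y[scored] == 1) & (oos_pred == 0)]
false_pos = idx_scored[(y[scored] == 0) & (oos_pred == 1)]
print(len(false_neg), "false negatives,", len(false_pos), "false positives")
```

Because every test block lies after its training block, each prediction is made without access to future data.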

I found several false negatives: samples labelled as failures that the model rarely scored as failures. This observation piqued my curiosity. Upon further investigation, I discovered many true negative samples that were nearly identical to them.

Figure 1 below visualizes one false negative next to one true negative. Without going into details, the idea is to run a nearest-neighbours search in the raw data space, using Euclidean or Mahalanobis distance. The samples closest to those false negatives turned out to be all true negatives. In other words, these failure instances are surrounded by many good instances.

Figure 1. Comparison of one false negative and one true negative. (Image by the Author)
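The nearest-neighbours check described above can be sketched as follows. The toy data (200 good samples with a single failure placed in their midst), the choice of 10 neighbours, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy data (assumption): 200 good samples and one failure sitting among them.
X_good = rng.normal(size=(200, 4))
x_failure = np.zeros((1, 4))
X = np.vstack([X_good, x_failure])
y = np.array([0] * 200 + [1])
fn_index = 200  # row of the false negative under investigation

# Ask for 11 neighbours because the query point itself is returned as well.
nn = NearestNeighbors(n_neighbors=11, metric="euclidean").fit(X)
_, idx = nn.kneighbors(X[[fn_index]])

# Drop the query row, keep its 10 closest other samples.
neighbours = idx[0][idx[0] != fn_index][:10]

print("neighbour labels:", y[neighbours])
print("share of good neighbours:", (y[neighbours] == 0).mean())
```

If nearly all neighbours of a false negative carry the opposite label, the features alone cannot separate it from them. To use Mahalanobis distance instead, one option is to whiten the features first and keep the Euclidean metric.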

We now face a typical dataset limitation: confusing samples. Either the labels are wrong, or we need more information (more dimensions) to separate them. There is a possible third way: how about transforming the entire space into a new one where the confusing samples can be distinguished easily? It won’t work here. First, the confusion exists in the raw input data. It is as if, in an image classification dataset, one image were labelled dog and an almost identical one were labelled cat. Second, this way of thinking is model-centric and generally increases model complexity.

After I raised these cases with the client, they confirmed that the labels were correct. However, they also admitted that some vehicles that appeared to be functioning well could fail unexpectedly without any preceding symptoms, which is very hard to forecast. The false negative samples I found were perfect examples of these unexpected failures.

By conducting this analysis of the out-of-sample predictions from cross-validations, I not only gained a deeper understanding of the problem and the data but also provided the clients with tangible examples that showcased the limitations of the dataset. This served as valuable insight for both myself and the clients.

Inspiring feature engineering

In this project, the client wanted to use the vehicle’s on-road data to classify certain events, such as lane changes by the vehicle itself or acceleration and lane changes by the preceding vehicles. The data is mainly time series data collected from different sonar sensors. Some critical information is the relative speed of surrounding objects and the distances (in x and y directions) from the own vehicle to the surrounding vehicles and lanes. There are also camera recordings, which the annotators use to label the events.

When classifying the event of the vehicle ahead changing lanes, I encountered a couple of interesting instances where the model predicted that the event was happening but the ground truth disagreed. In data science terms, they were false positives with very high predicted probabilities.

To provide the client with a visual representation of model predictions, I presented them with short animations, as depicted in Figure 2. The model would mistakenly label the ahead vehicle ‘changing lane’ around 19:59 to 20:02.

Figure 2. Animation of event detections. (Image by the Author)

To solve this mystery, I watched the video associated with these instances. It turned out the roads were curved at those moments! Had the lanes been straight, the model would have been correct. The model made wrong predictions because it had never learned that lanes could be curved.

The data didn’t contain any information on the distance of surrounding vehicles to the lanes. Therefore, the model was trained to infer their position relative to the lanes from two other quantities: the distances of the surrounding vehicles to the own vehicle and the distance of the own vehicle to the lanes. That inference only holds when the lanes are straight, so to fix these cases the model must know the lane curvature. After talking to the client, I located the curvature information in the dataset and built explicit features measuring the distances between the surrounding vehicles and the lanes using geometry formulas. Model performance improved, and the model no longer makes these false positives.
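The post does not give the geometry formulas, so the following is only a hypothetical illustration of what a curvature-aware feature could look like. It assumes a coordinate frame centred on the own vehicle (x forward, y lateral), a constant lane curvature in 1/m, and the common small-angle approximation in which the lane centre at longitudinal distance x is shifted laterally by curvature * x^2 / 2. The function and parameter names are invented for this sketch.

```python
def lateral_offset_to_lane(x_obj, y_obj, curvature, lane_centre_y0=0.0):
    """Lateral distance (m) from an object to the centre of the own lane.

    x_obj, y_obj    -- object position in the own-vehicle frame (m)
    curvature       -- lane curvature in 1/m (0.0 means a straight lane)
    lane_centre_y0  -- lateral position of the lane centre at x = 0 (m)
    All names and the parabolic approximation are assumptions.
    """
    # Parabolic approximation of a circular arc, valid for gentle curves.
    lane_centre_y = lane_centre_y0 + 0.5 * curvature * x_obj ** 2
    return y_obj - lane_centre_y

# A vehicle 60 m ahead and 3.6 m to the side.
straight = lateral_offset_to_lane(60.0, 3.6, curvature=0.0)
curved = lateral_offset_to_lane(60.0, 3.6, curvature=0.002)  # 500 m radius

print(straight)  # a full lane width away if the road is assumed straight
print(curved)    # close to zero: the vehicle is still in the own lane
```

With the raw distances alone, both cases look like a vehicle one lane over. A feature that accounts for curvature makes the difference explicit to the model.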

Correcting label errors

In the third example, we aimed to predict specific machine test results (pass or fail), which can be framed as a binary classification problem.

I developed a classifier with very high performance, suggesting that the dataset contains enough relevant information to predict the target. To improve the model and understand the dataset better, let’s focus on the out-of-sample predictions from cross-validation where the model makes mistakes. The false positives and false negatives are gold mines worth exploring.

Figure 3. A Confusion Matrix. (Image by the Author)

Figure 3 is a confusion matrix with a relatively high threshold. The three false positives imply the model will label them failures, but the ground truth labels them good. We may improve feature engineering to fix them like in the above example, or ask this question: what if the given labels are wrong and the model is actually correct? People make mistakes. Just like values from other columns could be outliers or missing, the target column itself could also be noisy and prone to inaccuracies.
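As a sketch of this step, the snippet below builds a confusion matrix from out-of-fold probabilities at a high threshold and lists the false positives as candidates for a label review. The toy labels, the probabilities, and the 0.9 threshold are made up for illustration; following the convention above, class 1 stands for a failure.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (1 = fail, 0 = good) and out-of-fold probabilities (assumption).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
oof_proba = np.array([0.02, 0.10, 0.97, 0.30, 0.99, 0.85, 0.40, 0.05, 0.95, 0.92])

threshold = 0.9  # a relatively high threshold: only confident "fail" calls count
y_pred = (oof_proba >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=4 FP=2 FN=2 TP=2

# Samples labelled good that the model confidently calls failures.
fp_idx = np.flatnonzero((y_true == 0) & (y_pred == 1))
print("candidates for label review:", fp_idx.tolist())  # [2, 9]
```

A high threshold keeps the list short, so each remaining false positive can be checked by hand or discussed with whoever produced the labels.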

Because the data space was sparse, the nearest-neighbours approach could not easily provide evidence that the labels of these three samples were wrong. I then discussed with the client how the data had been labelled. We agreed that some of the criteria used to determine the test results were flawed and that some samples’ labels were potentially wrong or unknown. After the cleaning, the labels of these three samples were corrected, and the model performance improved.

We cannot always blame the data quality. But remember, these two things are equally important for your data science jobs: improving the model and fixing the data. Don’t spend all your energy on modelling and assume all the data provided is error-free. Instead, dedicating attention to both aspects is crucial. Out-of-sample predictions from cross-validation are a powerful tool for finding problems in the data.

For more information, lists label errors from popular benchmarking datasets.


Cross-validation serves multiple purposes beyond just providing a score. Apart from the numerical evaluation, it offers the opportunity to extract valuable insights from out-of-fold predictions. By closely examining the successful predictions, we can better understand the model’s strengths and identify the most influential features. Similarly, analyzing the unsuccessful predictions sheds light on the limitations of both the data and the model, inspiring ideas for potential improvements.

I hope this tool proves invaluable in enhancing your data science skills.

