Finding data slices in unstructured data | by Stefan Suwelack | Aug, 2023

A short introduction to data-slicing methods including hands-on examples on the CIFAR-100 dataset.

Data slices on CIFAR100. Source: created by the author.

Data slices are semantically meaningful subsets of the data, where the model performs anomalously. When dealing with an unstructured data problem (e.g. images, text), finding these slices is an important part of every data scientist’s job. In practice this task involves a lot of individual experience and manual work. In this post, we present some methods and tools to make finding data slices more systematic and efficient. We discuss current challenges and demonstrate some hands-on example workflows based on open-source tooling.

There is an interactive demo based on the CIFAR100 dataset available.

Debugging, testing and monitoring artificial intelligence (AI) systems is hard. Most efforts in the software 2.0 development process is spent on curating high-quality data sets.

An important strategy for developing robust machine learning (ML) algorithms is to identify so called data slices. Data slices are semantically meaningful subsets where the model performs anomalously. Identifying and tracking these data segments is at the heart of every data-centric AI development process. It is also a core aspect for deploying safe AI solutions in domains such as healthcare and automated driver assistance systems.

Traditionally, finding data slices has been an integral part of a data scientist’s work. In practice, finding data slices heavily relies on the individual experience and domain knowledge of the data scientist. In the wake of the data-centric AI movement, there is a lot of current work and tooling that seek to make this process more systematic.

In this article, we give an overview over the current state of data slice finding on unstructured data. We specifically demonstrate some hands-on example workflows based on open-source tooling.

Data scientists use simple manual slice finding techniques all the time. The most famous example is probably the confusion matrix, a debugging method for classification problems. In practice, the slice finding process relies on a combination of pre-computed heuristics, the individual experience of the data scientist and a lot of interactive data exploration.

A classical data slice can be described by a conjunction of predicates on tabular features or metadata. In a people dataset this might be persons in a certain age range who are male and above 1.85m tall. In an engine condition monitoring dataset, a data slice might consist of data points in a certain RPM, operating hour, and torque range.

In the case of unstructured data, the semantic data slice definition can be more implicit: It can be a human understandable description such as “driving scenarios in light rain on a curvy road with heavy traffic in the mountains”.

Identifying data slices on unstructured dataset can be done in two different ways:

  1. Metadata can be extracted from the unstructured data either with classical signal processing algorithms (e.g. dark images, low SNR audio), or pre-trained deep neural networks for auto-tagging. Slice finding can then be done on this metadata.
  2. Latent representations in the embedding space can be used to group data clusters. These clusters can then be inspected to identify relevant data slices directly.
Workflow to identify data slices on unstructured data. Source: created by the author.

Automated slice finding techniques always seek to balance the support of the slice (should be large) with the severity of the model performance anomaly (should also be large).

Slice finding methods on tabular data share a lot of similarities with decision trees: In the context of ML model analysis, both techniques can be used to formulate rules that describe where model errors exist. However, there is one important difference: The slice finding problem allows for overlapping slices. This makes the problem computationally hard because it is more difficult to prune the search space.

Especially within the last decade, the machine learning community did benefit tremendously from benchmark datasets: Starting with ImageNet, such datasets and competitions have been a big success factor for deep learning algorithms on unstructured data problems. In this context, the quality of a new algorithm is typically judged based on very few quantitative metrics such as F1-score or mean average precision.

With more and more ML models being deployed into production, it has become apparent that real-world datasets are very different from their benchmark peers: Real data is typically very noisy and imbalanced, but also rich in metadata information. For some use cases, cleaning and annotating these datasets can be prohibitively expensive.

Many teams have found that iterating the training dataset and monitoring drift in production is necessary to build and maintain safe AI systems.

Finding data slices is a core part of this iteration process. Only by knowing where the model fails, it becomes possible to improve the system performance: By collecting more data, by correcting false labels, by selecting the best features or by simply restricting the operation domain of the system.

A critical aspect of slice finding is its computational complexity. We can illustrate this with a small example: Consider n binary features with one-hot encoding (can be obtained by binning or recoding, for example). Then the search space of all possible feature combination is O(2^n). This exponential nature means that heuristics are typically used for pruning. Consequently, automated slice finding not only takes quite long (depending on the number of features), but the output will not be an optimal stable solution, but some heuristics.

During the AI development process, poor model performance often stems from different root causes. Given the inherent stochastic nature of ML models, this can easily lead to spurious findings that have to be manually inspected and verified. Thus, even if a slice finding technique can produce a theoretically optimal result, it’s results must be manually inspected and verified. Building tools that allow cross-functional teams to this efficiently is a bottleneck for many ML teams.

We already stated that it is typically desirable to find slices with a large support, but also a distinct gap in model performance from the dataset baseline. Often, the relationships between different data slices are hierarchical in nature. Handling these hierarchies both during the automated slice finding process and during the interactive review phase is quite challenging.

Automated slice finding methods are most effective on metadata-rich problems. This is often the case for real-world problems. In contrast, Benchmark datasets are always quite sparse in metadata. Two primary reasons for this are data protection and anonymization requirements. With the lack of suitable example datasets, it is very difficult both to develop and to demonstrate effective slice finding workflows.

We (unfortunately) must deal with this challenge in the following example section.

The CIFAR-100 dataset is an established computer vision benchmark. We use it for this tutorial as its small size makes it easy to handle and keeps computational requirements low. The results are also easy to understand as they don’t require special domain knowledge.

Unfortunately, CIFAR-100 is already perfectly balanced, highly curated and lacks meaningful metadata. The results of the slice finding workflows we produce in this section are thus not as meaningful as in a real-world setting. However, the presented workflows should be sufficient to understand how to quickly use them on your real-world data.

In a preparation step we compute image metadata with the Cleanvision library. More information on this enrichment can be found in our data-centric AI playbook.

We also define some important variables for our data slice analysis: The features to be analyzed as well as the names of the label and prediction columns:

Most slicing techniques only work on binned features. As the SliceLine and WisePizza libraries do not provide binning functionality themselves, we perform this as a pre-processing step:

The Sliceline algorithm was proposed by Sagadeeva et al- in 2021. It is intended to work with large tabular datasets that contain many features. It leverages a novel pruning technique based on sparse linear algebra techniques and allows to find data slices quickly even on a single machine.

In this tutorial, we use the SliceLine implementation from the DataDome team. It runs very stable, but currently only supports Python versions <=3.9.

Most parameters of the SliceLine algorithm are very straight forward: The minimal support of the slice (min_sup), the maximum number of predicates to define a slice (max_l) and the maximum number of slices to be returned (k). The parameter alpha assigns a weight to the importance of the slice error and essential controls the trade-off between the size and the error drop-off of the slice.

We call the SliceLine library to get the 20 most interesting slices:

To interactively explore the slices, we enrich the description of each data slice:

We start Spotlight to explore the data slices interactively. You can directly experience the results in the Huggingface space.

Source link

Leave a Comment