
Beyond model accuracy: How model blind spot discovery helps to build better models

Building high-quality and trustworthy AI models is one of the biggest challenges machine learning engineers face today. While aggregate performance metrics are useful, an AI model that excels in the test environment is not guaranteed to succeed in the real world. So how can you be certain that your models will continue to perform as expected when deployed to production?

Discovered model blind spots across multiple dimensions, none of which is included in the original dataset. Try our interactive demo to learn how these can be found automatically.

The process of systematically identifying and fixing critical model errors can be overwhelming, especially when dealing with a constant stream of large and intricate datasets. "Real datasets always have lots of data biases that confuse models. It is painstakingly difficult to find and fix these issues!" and "It takes us weeks to get to the root cause of systematic model failures." are illustrative quotes from machine learning practitioners who have experienced the pivotal importance of data and model quality firsthand.

As an example, consider a YOLOv5 model trained to detect waste. Using LatticeFlow, we discovered blind spots such as: (i) precision degrading to 40% for dark-colored bottles, (ii) precision degrading to 43% for occluded or broken bottles, and (iii) precision degrading to 38% for smaller bottles of non-standard type.

So, how do you know what is causing your AI models to fail? And how do you fix these issues to improve model performance? Let us illustrate how LatticeFlow’s AI platform can detect model blind spots through a real-world case study for waste detection.

Case Study: Model Blind Spots in Waste Recycling

In waste recycling, our task is to detect and classify different types of waste, such as plastic bottles, cans, glass, detergent, and paper, as it moves along a conveyor belt. The ability to build a precise AI model is at the core of materials recovery facilities, where collected recyclable waste is sorted and processed.

Examples of ground-truth objects and their labels that the AI model should detect from the Waste Recycling Plant Dataset. Note that the data is post-processed to highlight individual objects.

As can be seen from the examples above, the task of detecting waste in real-world environments is highly non-trivial. Similar objects vary greatly due to deformations, squashing, shrinking, occlusions, surface defects, textures (e.g., different brands), dirt, or lighting conditions. Some objects are transparent while others are not. Moreover, the set of objects (e.g., bags, cans, glasses) and their intra-class variety are inherently dynamic: new product designs are released all the time, and we would like the AI model to generalize well to them.

So, if not aggregate metrics, what is the best way to assess the quality of an AI model? A typical approach is to compute aggregate performance statistics and then explore individual model mistakes, as illustrated below.
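For readers who prefer code, here is a minimal sketch of what this typical approach boils down to: match each predicted box to the ground truth by IoU and report a single dataset-wide precision number. The plain-list data layout is an illustrative assumption, not the format of the Waste Recycling Plant Dataset or of LatticeFlow's platform.

```python
# A minimal sketch of the "aggregate metrics" approach: match each predicted
# box to a ground-truth box by IoU and report one dataset-wide precision.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def aggregate_precision(pred_boxes, gt_boxes, iou_thr=0.5):
    """Fraction of predicted boxes that match some ground-truth box."""
    true_positives = sum(
        1 for pred in pred_boxes
        if any(iou(pred, gt) >= iou_thr for gt in gt_boxes)
    )
    return true_positives / max(len(pred_boxes), 1)
```

A single number like this summarizes the whole dataset, which is exactly why it can hide systematic failures in specific subsets of the data.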

A subset of bottles for which the model makes a mistake; it does not detect the objects.

However, by looking at these mistakes, can you tell in which situations the model already works well and where it is likely to be wrong? Unfortunately, in most cases the answer is "no". Engineers who work extensively with the data and model typically build an intuition over time about why and where such failures happen, and use this expert knowledge to improve the model. But finding systematic errors manually takes significant time, and the process remains cumbersome and error-prone.

Try for Yourself: Model Blind Spot Discovery

What if we could take the non-transferable intuition of the few expert engineers and instead make it structured, quantifiable, transparent, and repeatable across the whole team?

This is exactly where model blind spot analysis empowers teams: by converting intuition into formal hypotheses such as "how does X impact the model performance?", where X can be customized for your task. Furthermore, with LatticeFlow, blind spots can also be discovered automatically for you, by inferring relevant attributes (i.e., the 'X') directly from the data.
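As a rough sketch (in Python, with a hypothetical per-sample record layout), such a hypothesis check can be as simple as grouping samples by an inferred attribute and comparing each group's precision against the overall number. The field names below are placeholders, not LatticeFlow's actual schema.

```python
# A minimal sketch of turning "how does attribute X impact model performance?"
# into a measurable check: group samples by the attribute and compute precision
# per group. `sample["color"]` and `sample["correct"]` are hypothetical fields.

from collections import defaultdict

def precision_by_attribute(samples, attribute):
    """Per-attribute-value precision, e.g. attribute='color'."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[attribute]].append(sample["correct"])
    return {value: sum(flags) / len(flags) for value, flags in groups.items()}

# Usage (illustrative values only): a large gap between a group and the overall
# precision is a candidate blind spot worth inspecting.
# precision_by_attribute(samples, "color")
```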

Analyzing Model Failures at Scale

To analyze model failures, we first run the model on all the dataset samples. As the datasets are large, let us visualize each sample as a circle, colored either gray or red.

Correct model predictions are denoted as gray circles.


Model mistakes are denoted as red circles.
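As an illustrative sketch of how each circle's color might be decided, a sample can be flagged as a mistake whenever one of its ground-truth objects has no sufficiently overlapping prediction. The `iou` helper from the earlier sketch is reused, and the per-sample dictionary layout is again an assumption.

```python
# A minimal sketch of flagging each sample before it is drawn as a gray or red
# circle: a sample counts as a mistake if any ground-truth object has no
# matching prediction. `iou` is the helper defined in the sketch above.

def flag_samples(samples, iou_thr=0.5):
    for sample in samples:
        missed = [
            gt for gt in sample["gt_boxes"]
            if not any(iou(gt, pred) >= iou_thr for pred in sample["pred_boxes"])
        ]
        sample["correct"] = len(missed) == 0  # gray circle if True, red if False
    return samples
```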


Color Model Blind Spots

As a first step, we would like to explore the model's performance with respect to the color of the waste. Because this attribute is not available as part of the dataset, we first need to automatically infer it from the raw images.

Extracting colors based on the bounding box alone is insufficient, as most of its area is often background. Instead, we need to reason about the object itself.
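Below is a minimal sketch of this idea, assuming a foreground mask is available for each detection (for example from a segmentation model). The coarse brightness bins are purely illustrative and not the attributes LatticeFlow actually infers.

```python
# A minimal sketch of inferring an object's color from its own pixels rather
# than the whole bounding box, which is often dominated by conveyor-belt
# background. Assumes a boolean foreground mask for the box crop.

import numpy as np

def dominant_color(image, mask):
    """image: HxWx3 uint8 RGB crop; mask: HxW boolean array, True on object."""
    pixels = image[mask]            # only object pixels, not background
    if len(pixels) == 0:
        return "unknown"
    brightness = pixels.mean()      # 0 (dark) .. 255 (bright), crude proxy
    if brightness < 80:
        return "dark"
    if brightness < 170:
        return "medium"
    return "light"
```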


Shape Model Blind Spots

Having found our first blind spot, we turn our attention to variations in object shape and occlusions. As with color, these attributes are not available as part of the dataset, so we first need to automatically infer them from the raw images.
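As a hedged illustration of what such inferred attributes could look like, the sketch below derives two cheap per-object proxies: the bounding-box aspect ratio (squashed or lying objects) and the mask-to-box fill ratio, where unusually low values often hint at broken or occluded objects. The thresholds are assumptions, not the platform's actual rules.

```python
# A minimal sketch of two cheap shape-related attributes inferred per object.

import numpy as np

def shape_attributes(box, mask):
    """box: (x1, y1, x2, y2); mask: HxW boolean array for the box crop."""
    width, height = box[2] - box[0], box[3] - box[1]
    aspect_ratio = width / max(height, 1)
    fill_ratio = float(np.count_nonzero(mask)) / mask.size
    return {
        "deformed": aspect_ratio > 1.5,          # lying flat / squashed
        "occluded_or_broken": fill_ratio < 0.35,  # object fills little of its box
    }
```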


Size & Type Model Blind Spots

Next, we analyze object size and intra-class variety. This is especially interesting since, due to labeling costs, the available label categories are often very coarse-grained.

After the analysis is performed, notice how the visualization changes from analyzing a single targeted hypothesis to testing the hypothesis across the whole attribute domain.
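As a final sketch, here is one way to bin objects by relative size and then sweep the check over every (size, subtype) combination rather than one hand-picked hypothesis. The bin edges and the fine-grained `subtype` attribute are assumed for illustration; the original labels are much coarser (e.g., just "bottle").

```python
# A minimal sketch of binning objects by relative size and sweeping the
# analysis over the whole (size, subtype) attribute domain.

from collections import defaultdict

def size_bin(box, image_width, image_height):
    """Relative bounding-box area, binned into coarse size categories."""
    area_fraction = ((box[2] - box[0]) * (box[3] - box[1])) / (image_width * image_height)
    if area_fraction < 0.01:
        return "small"
    if area_fraction < 0.05:
        return "medium"
    return "large"

def sweep_size_and_type(samples):
    """Precision for every (size, subtype) pair, worst groups first."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["size"], s["subtype"])].append(s["correct"])
    per_group = {key: sum(flags) / len(flags) for key, flags in groups.items()}
    return sorted(per_group.items(), key=lambda kv: kv[1])
```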


Credits

LatticeFlow Team // November 2023


Having trouble finding blind spots?

Let's talk!