Is your model good or great?

November 30, 2023

Aggregate Performance

Building high-quality, trustworthy AI models, with small and large datasets alike, is a challenging task at the core of what the best machine learning teams tackle every day. Yet, for lack of suitable tooling, teams all too often resort to making decisions based only on aggregate metrics such as accuracy, AUC or mAP. While useful, such aggregate metrics give no insight into systemic model failures: which types of input the model fails on more often, for what reason, or how to fix them.

Machine learning engineers are all too familiar with model training curves like the one visualized here. After the initial improvement, the model performance reaches 92% accuracy and begins to stagnate.

At this point, the engineers start pulling tricks based on their experience, including hyperparameter optimization, tuning data augmentations, custom loss functions and more.

Or, if nothing seems to help, they select the best model based on aggregate accuracy. As long as the improvement also translates to the test set, we have built a better model than we had before 🎉.

Despite having a model with better accuracy, two fundamental questions arise:

Question 1: Is the model truly better as long as the aggregate accuracy is higher?

Question 2: How can we systematically improve the model further without getting stuck?
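To make Question 1 concrete, here is a minimal sketch of comparing two models sample by sample instead of by a single number. The function name `compare_models` and the toy correctness flags are hypothetical and chosen only for illustration; this is not a description of any particular tool.

```python
from typing import Dict, Sequence


def compare_models(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Dict[str, object]:
    """Compare two models evaluated on the same samples.

    Each input holds one flag per sample: True if that model's
    prediction for the sample was correct, False otherwise.
    """
    if not correct_a or len(correct_a) != len(correct_b):
        raise ValueError("both models need flags for the same, non-empty set of samples")
    pairs = list(zip(correct_a, correct_b))
    return {
        "accuracy_a": sum(correct_a) / len(correct_a),
        "accuracy_b": sum(correct_b) / len(correct_b),
        # Samples the old model got wrong and the new model gets right.
        "fixed": [i for i, (a, b) in enumerate(pairs) if not a and b],
        # Samples the old model got right and the new model now gets wrong.
        "regressed": [i for i, (a, b) in enumerate(pairs) if a and not b],
    }


# Toy data (made up): ten samples, old model vs. new model.
old_model = [True, True, True, True, True, True, True, False, False, False]
new_model = [True, True, False, True, True, True, True, True, True, False]

report = compare_models(old_model, new_model)
print(report)
```

In this toy example the new model has the higher aggregate accuracy, yet it also breaks a sample the old model handled correctly. Whether that trade is acceptable depends on which samples regressed, which the aggregate number alone cannot tell us.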

Beyond Aggregate Performance

To gain insight into both of these questions, we start by understanding where the aggregate metrics come from: the individual samples in the dataset.

Here, we visualize each sample as a circle and color correct model predictions as grey while coloring incorrect model predictions as red.

As you can see, the aggregate model accuracy is in fact made up of all the individual samples. To understand when and why the model fails, we need to dive deep into the individual model errors.
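As a minimal sketch of this idea, the snippet below derives the aggregate accuracy from per-sample results and lists the individual errors for review, most confident mistakes first. The function name `list_failures` and the toy labels, predictions and confidence scores are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Failure = Tuple[int, int, int, float]  # (sample index, true label, predicted label, confidence)


def list_failures(
    y_true: Sequence[int], y_pred: Sequence[int], confidence: Sequence[float]
) -> List[Failure]:
    """Return the misclassified samples, most confident mistakes first."""
    if not (len(y_true) == len(y_pred) == len(confidence)):
        raise ValueError("inputs must have one entry per sample")
    failures = [
        (i, t, p, c)
        for i, (t, p, c) in enumerate(zip(y_true, y_pred, confidence))
        if t != p
    ]
    return sorted(failures, key=lambda f: f[3], reverse=True)


# Toy data (made up): ten samples with the model's confidence in its prediction.
y_true = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 0, 0, 2, 1, 1, 0, 2, 1]
confidence = [0.99, 0.91, 0.62, 0.88, 0.95, 0.97, 0.80, 0.93, 0.85, 0.90]

failures = list_failures(y_true, y_pred, confidence)
# The aggregate metric is just a summary of the per-sample outcomes.
accuracy = (len(y_true) - len(failures)) / len(y_true)

print(accuracy)
print(failures)
```

Sorting by confidence is one possible design choice: errors the model is very sure about are often the most informative ones to inspect first.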

Systematically Assessing Model Failures at Scale

However, assessing every failure individually is typically infeasible, as there are simply too many. Instead, we need a systematic approach that helps us identify groups of samples that fail for the same reason.
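One simple way to sketch such grouping, assuming each sample comes with a metadata attribute (here a hypothetical `condition` tag such as "day", "night" or "fog"), is to compute the error rate per group and rank the groups. Real failure modes are rarely captured by a single tag, so this is an illustration of the principle rather than a full method.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def error_rate_by_group(
    groups: Sequence[Hashable], correct: Sequence[bool]
) -> List[Tuple[Hashable, float, int]]:
    """Return (group, error rate, sample count) tuples, worst group first."""
    if len(groups) != len(correct):
        raise ValueError("need one group tag and one correctness flag per sample")
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, ok in zip(groups, correct):
        totals[group] += 1
        if not ok:
            errors[group] += 1
    ranked = [(g, errors[g] / totals[g], totals[g]) for g in totals]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Toy data (made up): a hypothetical "condition" tag per sample.
condition = ["day", "day", "day", "day", "day", "day", "night", "night", "fog", "fog"]
correct = [True, True, True, True, True, False, False, False, True, False]

ranked = error_rate_by_group(condition, correct)
for group, rate, count in ranked:
    print(f"{group}: error rate {rate:.2f} over {count} samples")
```

The sample count is reported alongside the error rate because a high error rate on a handful of samples is weaker evidence than the same rate on a large group.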

For example, we can identify different causes of model failures, such as:

Spurious Correlations where the model incorrectly relies on an undesirable correlation between inputs and labels.

Incorrect Labels where annotators assigned wrong, missing or inconsistent labels.

Rare Samples where the model has not yet seen enough diversity to learn the right concept.

Ambiguous Samples where predicting the correct label is ambiguous even for domain experts.

and more.

Such a fine-grained analysis not only helps us better understand the reasons behind the model failures but, more importantly, gives us guidance on how to improve the model further.

For example, we can perform targeted data collection to specifically collect (or augment existing) samples that are rare. Or, we can improve the existing data by adding a data quality layer to the AI pipeline, designed to flag and fix incorrect labels before they are used in training and validation.
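As a rough sketch of what a data quality check could look like, the snippet below flags samples where the model confidently predicts a class other than the assigned label. This is a simple, generic heuristic and an assumption on our part; the function name `flag_suspect_labels`, the `min_conf` threshold and the toy probabilities are all hypothetical.

```python
from typing import List, Sequence, Tuple

Suspect = Tuple[int, int, int, float]  # (sample index, given label, predicted label, confidence)


def flag_suspect_labels(
    probs: Sequence[Sequence[float]], labels: Sequence[int], min_conf: float = 0.9
) -> List[Suspect]:
    """Flag samples whose given label disagrees with a confident model prediction."""
    if len(probs) != len(labels):
        raise ValueError("need one probability vector and one label per sample")
    flagged = []
    for i, (p, given) in enumerate(zip(probs, labels)):
        # Index of the class with the highest predicted probability.
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted != given and p[predicted] >= min_conf:
            flagged.append((i, given, predicted, p[predicted]))
    return flagged


# Toy data (made up): predicted class probabilities for a two-class problem.
probs = [
    [0.95, 0.05],
    [0.10, 0.90],
    [0.97, 0.03],
    [0.55, 0.45],
    [0.02, 0.98],
]
labels = [0, 1, 1, 1, 0]

suspects = flag_suspect_labels(probs, labels)
print(suspects)
```

A flagged sample is only a candidate for review: the label may be wrong, or the model may be confidently mistaken. In practice the probabilities should come from a model that was not trained on the samples being checked (for example, held-out or cross-validated predictions), and flagged samples should go to a human reviewer rather than being relabeled automatically.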

Try our interactive demo to find out how you can detect blind spots and help improve your model performance.

Try LatticeFlow on your custom models

We know how painstaking it is to find and fix model issues. Whether you use ResNet, YOLOv8 or DeepLabv3, we have you covered. Get to solutions for your data and model issues faster, avoid pitfalls, and stop making the same mistake over and over again.

LatticeFlow AG. Copyright © 2023