Is your model good or great?
November 30, 2023
Building high-quality, trustworthy AI models, with both small and large datasets, is a challenging task at the core of what the best machine learning teams tackle every day. Yet, due to a lack of suitable tooling, teams all too often resort to making decisions based only on aggregate metrics such as accuracy, AUC, or mAP. While useful, such aggregate metrics provide no insight into systematic model failures: which types of inputs the model fails on most often, for what reasons, or how to improve it.
Machine learning engineers are all too familiar with model training curves like the one visualized here. After the initial improvement, the model performance reaches 92% accuracy and begins to stagnate.
At this point, the engineers start pulling tricks based on their experience, including hyperparameter optimization, tuned data augmentations, custom loss functions, and more.
...and if nothing seems to help, they select the best model based on its aggregate accuracy. As long as the improvement also translates to the test set, we have built a better model than we had before 🎉.
Despite having a model with better accuracy, two fundamental questions arise:
Question 1: Is the model truly better just because its aggregate accuracy is higher?
Question 2: How can we systematically improve the model further without getting stuck?
Beyond Aggregate Performance
To answer both of these questions, we start by understanding where the aggregate metrics come from -- they come from the individual samples in the dataset.
Here, we visualize each sample as a circle, coloring correct model predictions grey and incorrect model predictions red.
As you can see, the aggregate model accuracy is in fact made up of all the individual samples. To understand when and why the model fails, we need to dive deep into the individual model errors.
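This decomposition can be made concrete in a few lines. The sketch below uses hypothetical predictions and labels (any classifier's outputs would do): the aggregate accuracy is just the mean of per-sample correctness, so all the interesting information lives in the individual errors.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions for illustration.
labels = np.array([0, 1, 1, 0, 1, 0, 1, 1])
preds = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# The aggregate accuracy is just the mean over per-sample correctness...
correct = preds == labels
accuracy = correct.mean()
print(f"accuracy: {accuracy:.2f}")  # -> accuracy: 0.75

# ...so to understand the model, we look at the individual errors instead.
error_indices = np.flatnonzero(~correct)
print("misclassified samples:", error_indices.tolist())  # -> [2, 5]
```

Once the failing samples are isolated, they become the unit of analysis for everything that follows.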
Systematically Assessing Model Failures at Scale
However, inspecting every failure individually is typically infeasible, as there are simply too many. Instead, we need a systematic approach that helps us identify groups of samples that fail for the same reason.
For example, ...
...we can identify different causes for the model failures such as:
Spurious Correlations, where the model incorrectly relies on undesirable correlations between inputs and labels.
Incorrect Labels, where annotators assigned wrong, missing, or inconsistent labels.
Rare Samples, where the model has not yet seen enough diversity to learn the right concept.
Ambiguous Samples, where the correct label is ambiguous even for domain experts.
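One simple way to surface such groups is to cluster the embeddings of the misclassified samples, so that failures sharing a cause can be inspected together rather than one at a time. The sketch below is a minimal, assumption-laden illustration: the embeddings are synthetic (two made-up failure modes), and the clustering is a bare-bones k-means rather than a production-grade method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D embeddings of misclassified samples: two synthetic
# failure modes (e.g. "dark images" and "occluded objects") for illustration.
failures = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2)),  # failure mode A
    rng.normal(loc=(3.0, 3.0), scale=0.1, size=(20, 2)),  # failure mode B
])

# Minimal k-means sketch: initialize one centroid per assumed mode, then
# alternate between assigning samples and recomputing centroids.
k = 2
centroids = failures[[0, -1]]  # deterministic init: one point from each mode
for _ in range(10):
    dists = np.linalg.norm(failures[:, None, :] - centroids[None, :, :], axis=-1)
    assign = dists.argmin(axis=1)
    centroids = np.stack([failures[assign == j].mean(axis=0) for j in range(k)])

for j in range(k):
    print(f"failure group {j}: {np.sum(assign == j)} samples")
```

Each resulting group can then be reviewed as a whole and matched against the failure causes listed above.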
Such a fine-grained analysis not only helps us better understand the reasons behind the model failures but, more importantly, gives us guidance on how to improve the model further.
For example, we can perform targeted data collection to specifically collect (or augment) rare samples. Or, we can improve the existing data by adding a data quality layer to the AI pipeline, designed to flag and fix incorrect labels before they are used in training and validation.
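A very simple version of such a data quality check is to flag samples where the model confidently disagrees with the given label. The sketch below uses hypothetical softmax outputs and a made-up confidence threshold; real data-quality layers use more robust approaches (e.g. confident learning with cross-validated scores), but the idea is the same.

```python
import numpy as np

# Hypothetical model confidences (softmax outputs) and assigned labels.
probs = np.array([
    [0.95, 0.05],  # confidently class 0, labeled 0 -> fine
    [0.10, 0.90],  # confidently class 1, labeled 0 -> suspicious label
    [0.55, 0.45],  # uncertain                      -> not flagged
    [0.02, 0.98],  # confidently class 1, labeled 1 -> fine
])
labels = np.array([0, 0, 0, 1])

# Flag samples where the model disagrees with the label and is confident.
threshold = 0.8  # illustrative value, not a recommendation
pred = probs.argmax(axis=1)
confidence = probs.max(axis=1)
flagged = np.flatnonzero((pred != labels) & (confidence >= threshold))
print("samples to re-review:", flagged.tolist())  # -> [1]
```

Flagged samples would then go back to annotators for review before the next training run.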
Try LatticeFlow on your custom models
We know that it is painstakingly difficult to find and fix model issues. Whether it is resnet, yolov8, or deeplabv3, we have you covered. Get to solutions for your data and model issues faster, avoid pitfalls, and stop making the same mistakes over and over again.