Model assessment beyond samples in your dataset

Pavol Bielik
March 24, 2021

Despite impressive progress over the last decade, current AI models can perform poorly and unpredictably when deployed in the real world. In this blog series, we present the latest advances in assessing models, identifying their failure modes, and gaining insights into building quality datasets and models. This first post motivates the need for assessing models beyond samples in the test dataset.

Deep neural networks have become the core building block for developing artificial intelligence (AI) systems for disease prediction, preventive maintenance, fraud detection, physical security, and other domains. Yet, they can be brittle when deployed in the real world and have been shown to “cheat” by exploiting the particular dataset and metrics they were trained to optimize [1, 2, 3, 4]. As a concrete example, consider a commercially available object detection model that is trained to predict different types of animals and is presented with the following two pictures [2]:

A computer vision model has learned to detect cows by associating them with a particular background and context, in this case, grass and mountains [2]. As a result, the model fails to identify cows in a different environment (water).

While both examples depict a cow, the model makes a correct prediction only for the left image. Further, the model's failure to recognize cows in the water is not restricted to the single image on the right but reveals a common failure mode. One of the reasons behind this failure is that the model learns to condition its predictions by exploiting easy-to-discover features correlated with the correct labels, such as the presence of grass and mountains. This is in contrast to the harder task of learning generalizable features that would describe the true cause of what “makes” a cow, independently of the background or other correlated features. 

Just as importantly, the model can learn to “cheat” in many different ways by relying on subtle features that are difficult for humans to distinguish. Examples include a strong reliance on texture [5], domain-specific artifacts such as surgical ink markings in skin lesion diagnosis [6], or a hospital-specific token included as part of the image [7].

Examples of common computer vision tasks, problems discovered in existing models, and the issues causing these problems [3, 7, 9, 10].

Similar issues arise in virtually all domains, including natural language processing. For example, a recent analysis of three commercial systems [8] reveals significant failures even on inputs with common mistakes, such as a single misspelled letter. Here, both Google Cloud’s Natural Language and Amazon’s Comprehend fail in more than 10% of the cases, while Microsoft’s Text Analytics is slightly better but still has a failure rate of 5.6%.

Failure rates of three deployed systems (Microsoft’s Text Analytics, Google Cloud’s Natural Language, and Amazon’s Comprehend) to common variations of the input sentences such as typos or changing locations [8]. 
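
To make the notion of a failure rate under typos concrete, here is a minimal sketch in Python. It is not the procedure used in [8]: the typo is modeled as swapping two adjacent letters (an assumption), and `toy_sentiment` is a hypothetical stand-in for a real sentiment model. A prediction counts as a failure when it changes after the typo is introduced.

```python
import random


def swap_adjacent_chars(text, rng):
    """Return text with one pair of adjacent, distinct letters swapped."""
    positions = [
        i for i in range(len(text) - 1)
        if text[i].isalpha() and text[i + 1].isalpha() and text[i] != text[i + 1]
    ]
    if not positions:
        return text  # nothing to swap, return the input unchanged
    i = rng.choice(positions)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def failure_rate(predict, sentences, perturb, seed=0):
    """Fraction of sentences whose prediction changes after perturbation."""
    rng = random.Random(seed)
    failures = sum(predict(perturb(s, rng)) != predict(s) for s in sentences)
    return failures / len(sentences)


def toy_sentiment(text):
    """Hypothetical keyword-based model, used only for illustration."""
    return "positive" if "great" in text.lower() else "negative"


sentences = [
    "The flight was great",
    "Great crew and great food",
    "The seat was broken",
]
rate = failure_rate(toy_sentiment, sentences, swap_adjacent_chars)
print(f"failure rate under a single typo: {rate:.1%}")
```

The same `failure_rate` helper can be reused with other perturbation functions, for example one that replaces a location name, as long as the perturbation is not expected to change the correct label.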

Therefore, the critical question is: How can we gain confidence and trust in the state-of-the-art deep learning models deployed in the real world? In other words, what arguments would an ML engineer, or a user, make to quantify and understand the model's performance: where the model is expected to work, where it is not, and why? For example, let us consider a deep learning model trained to detect traffic signs, which is presented with the following two images:

A critical question that both users (for transparency) and ML developers (for confidence) should be able to answer: is the model supposed to work, and will it work, when deployed?

Given our earlier example with the cows, it is natural to ask whether the model exploits the difference in the environments. In this case, we would expect stop signs to be more common in an urban environment than in a forest, and therefore the model might not detect the stop sign in the forest. Similarly, to gain confidence, we would like to assess whether the model works correctly under different lighting conditions, camera angles, weather conditions, and so forth.

Unfortunately, the current practice of evaluating the average-case model performance on a fixed test dataset cannot answer such questions. It is easy to see why evaluating only average model performance is limiting: knowing that the model has 90% or 95% accuracy does not help us answer any of the questions above. Instead, in many cases, we are interested in the worst-case model performance (and the corresponding samples) rather than the average model performance over the test set.
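
The gap between average and worst-case performance can be illustrated with a small Python sketch. The slice names and counts below are made up for illustration (they loosely follow the stop-sign example, with far more urban than forest images) and are not measurements from any real model.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical evaluation results: 90 urban images and 10 forest images.
records = (
    [("urban", True)] * 88 + [("urban", False)] * 2
    + [("forest", True)] * 4 + [("forest", False)] * 6
)

average = sum(ok for _, ok in records) / len(records)
per_slice = accuracy_by_slice(records)
worst_slice = min(per_slice, key=per_slice.get)

print(f"average accuracy: {average:.2f}")
print(f"worst slice: {worst_slice} ({per_slice[worst_slice]:.2f})")
```

With these made-up numbers, the average accuracy looks healthy because the rare forest images barely influence it, while the per-slice view exposes the slice where the model fails most of the time.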

Moreover, because the dataset is fixed, the problem is typically underspecified, especially when training state-of-the-art machine learning models: the dataset itself is not enough to differentiate between models that generalize well and models that do not. To illustrate this point with a concrete example, the plot below shows multiple models that achieve similar average test accuracy (± 0.11 top-1 variance) when trained on the same dataset [11], yet their generalization to different environmental conditions such as brightness, blur, and pixelation varies by up to 60%.

Illustration of the dataset underspecification problem and why simply obtaining larger datasets is not enough. Here, the plot shows multiple models that achieve similar test accuracy on ImageNet (± 0.11 top-1 variance), yet their generalization to different environmental conditions such as brightness, blur, and pixelation varies by up to 60% [11].

In other words, we can train many models that are indistinguishable when evaluated on the original test dataset but differ substantially in how well they generalize. As a consequence, scaling up the existing datasets alone is not going to help us gain confidence in deploying the models in the real world.
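
A toy Python sketch can show how two models may be indistinguishable on the original test set yet behave differently under a changed condition. Everything here is an illustrative assumption rather than the setup of [11]: the 4x4 synthetic images, the brightness shift of 0.3, and the two hand-written "models", one comparing the two image halves and one relying on the absolute brightness of a single pixel as a shortcut.

```python
def make_image(label, size=4):
    """Toy grayscale image: label 1 means the left half is the brighter one."""
    left, right = (0.8, 0.3) if label == 1 else (0.3, 0.8)
    half = size // 2
    return [[left] * half + [right] * half for _ in range(size)]


def brighten(image, delta=0.3):
    """Shift every pixel by delta and clip the result to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in image]


def relational_model(image):
    """Predicts 1 if the left half is brighter than the right half."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[half:]) for row in image)
    return 1 if left > right else 0


def shortcut_model(image):
    """Predicts 1 if the top-left pixel exceeds a fixed brightness threshold."""
    return 1 if image[0][0] > 0.5 else 0


def accuracy(model, images, labels):
    hits = sum(model(img) == y for img, y in zip(images, labels))
    return hits / len(labels)


labels = [0, 1] * 10
images = [make_image(y) for y in labels]
conditions = {"clean": lambda img: img, "brighter": brighten}

# Accuracy of each model under each condition.
report = {
    model.__name__: {
        name: accuracy(model, [transform(img) for img in images], labels)
        for name, transform in conditions.items()
    }
    for model in (relational_model, shortcut_model)
}
for model_name, scores in report.items():
    print(model_name, scores)
```

In this toy setup both models are perfect on the clean images, but only the model that compares the two halves keeps its accuracy once the images are brightened; the shortcut model drops to chance level. Real robustness assessments would use trained models and a broader set of conditions, such as blur and pixelation.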

What can we do about this? 

To cope with the underspecification challenge and tackle the questions above, we need to systematically assess model quality, explicitly control for various biases and opportunities to “cheat,” and incorporate ways to define and assess the model behavior when deployed in the wild. Clearly, we cannot improve a model, or even pick out the better, more generalizable model among several candidates, if we cannot “see” these improvements using suitable tests and metrics. The example above also illustrates a concrete approach to doing this: by looking beyond the samples in the dataset, we can indeed select a model that improves accuracy by 60% on pixelated images, 10% on images with varying contrast, and 3% on blurred images. This shows the need to improve how models are evaluated and motivates why it is important to look beyond the fixed dataset and beyond average accuracy as the only criterion for determining model quality.

In the following blog posts, we will dive deeper into specific areas for model assessment and illustrate how this can help us build and deploy better models. Concretely, we will show how to:

  • Define common specifications that capture expected model behavior and check them in practice (e.g., for object detection models);
  • Define advanced and task-specific robustness specifications and demonstrate their importance;
  • Consider different ways of combining specifications and how this affects the model’s quality assessment;
  • Identify failure modes, i.e., data slices where the model performs poorly, based on the robustness assessment results.

To receive updates about our new blog posts, you can follow us on LinkedIn and Twitter.


LatticeFlow is a company founded by leading researchers and professors from ETH Zurich. The company is building the world’s first product that enables AI companies to deliver trustworthy AI models, solving a fundamental roadblock to AI’s wide adoption. To learn more, visit https://latticeflow.ai.


[1] CLIP: Connecting Text and Images, Radford et al., [online]
[2] Recognition in Terra Incognita, Beery et al., ECCV'18
[3] Shortcut Learning in Deep Neural Networks, Geirhos et al., Nature Machine Intelligence'20
[4] Invariant Risk Minimization, Arjovsky et al., ArXiv'19
[5] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, Geirhos et al., ICLR'19
[6] Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition, Winkler et al., JAMA Dermatol'19
[7] Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study, PLoS Medicine'18
[8] Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, Ribeiro et al., ACL'20
[9] Label-Consistent Backdoor Attacks, Turner et al., ArXiv'19
[10] Do neural nets dream of electric sheep? [online]
[11] Underspecification Presents Challenges for Credibility in Modern Machine Learning, D'Amour et al., ArXiv'20