Understanding and assessing the quality of machine learning systems

Gregory Bonaert
April 15, 2021

In the first post of this series, we discussed the underspecification challenge of state-of-the-art AI models and how they often learn to "cheat" by exploiting the particular dataset and metrics they were trained to optimize. In particular, we saw how the standard approach of training on a fixed dataset can lead to many models that are indistinguishable on the test set, yet differ substantially in how they generalize. In this part, we discuss one way to improve model quality and gain trust -- by specifying the expected model behavior (beyond standard accuracy) and by automatically assessing whether the model indeed satisfies it.

A better way to assess model performance. To improve upon the standard practice of only using average model accuracy, we perform the following two steps:

  1. Specify classes of new inputs beyond those in the original dataset for which the model is supposed to work well.
  2. Define the expected model behavior on these new inputs, which implicitly defines how the model’s performance should be evaluated.

As a concrete example, using the above two steps we can express statements such as: 

(1) slight changes to the image brightness should not affect whether (2) existing objects of type “stop sign” are detected. 

As we will see next, this formulation has a number of advantages that are generally useful for building better models. However, in terms of building trust and confidence, the two key benefits are that: (i) we are able to explicitly define and assess the expected model behavior and keep track of it across the entire model lifecycle, and (ii) the specification is executable and can be evaluated once the model is deployed, on a per-sample basis. That is, we can evaluate the properties on unseen examples, check whether they hold, and potentially prevent the model from making an unsafe prediction. This is in contrast to evaluating the model using average accuracy, which is computed in expectation over the whole dataset and cannot be used to assess individual samples.
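The stop-sign statement above can be made executable. Here is a minimal per-sample sketch, assuming a hypothetical `detect(image)` function that returns the set of detected object labels and an `adjust_brightness(image, delta)` transformation (both names are illustrative, not part of any real API):

```python
def brightness_invariance_spec(image, detect, adjust_brightness, deltas=(-0.2, 0.2)):
    """Check that small brightness changes do not make detected stop signs disappear."""
    # Stop signs found in the original image serve as the reference set.
    reference = {label for label in detect(image) if label == "stop sign"}
    for delta in deltas:
        perturbed = detect(adjust_brightness(image, delta))
        if not reference <= perturbed:  # a reference stop sign was lost
            return False
    return True
```

Because the check runs on a single input, it can be evaluated at deployment time for each incoming sample, unlike a dataset-wide accuracy figure.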

Figure 1: While traditional assessment only computes global metrics and is done only during training & validation, more advanced model assessment through LatticeFlow covers the entire model lifecycle (training, validation and deployment) and allows for advanced correctness specifications.

In what follows, we briefly describe how to specify new classes of inputs (1) and then focus on defining the desired model behavior (2), with object detection as an example application.

Defining the input specification

In the first step, we specify environmental conditions under which the model is expected to be invariant, i.e., changes to the inputs that should not affect the model’s decisions. For example, a person detector must correctly identify people regardless of their clothing, camera angle, and so forth. More generally, we can also define expected behavior other than invariance, such as how the model’s decision should change when a sentence is negated or a turn-left road sign is flipped.

The key idea is that we can specify such changes using image transformations. For instance, we can define a transformation that takes an image as input and recolors a person’s clothing to generate a new test image:

Figure 2: A transformation takes the original test input (left image), and recolors the T-shirt of the right person from yellow to light-blue (right image)

The most common image transformations which should not affect the model behavior include geometric transformations, color transformations and noise transformations (Figure 3). These transformations are general and can be applied directly to a broad range of computer vision tasks, and are often already included as part of training-time data augmentation. Beyond these common transformations, there are also many task-specific transformations that have to be defined for the task at hand, such as the recoloring of the clothes illustrated in Figure 2. We explore these common and advanced transformations, as well as how to use them for model evaluation, in the next blog post.

Figure 3: Common transformations which should not affect model correctness include geometric changes, image (color) transformations, and the presence of noise.
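To make the three transformation families concrete, here is a minimal sketch using NumPy, assuming images are float arrays in [0, 1] with shape H x W x C (an illustration only, not LatticeFlow's implementation):

```python
import numpy as np

def adjust_brightness(img, delta):
    """Color transformation: shift all pixel intensities by `delta`, clipped to [0, 1]."""
    return np.clip(img + delta, 0.0, 1.0)

def add_gaussian_noise(img, sigma, seed=0):
    """Noise transformation: add zero-mean Gaussian pixel noise with std `sigma`."""
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def horizontal_flip(img):
    """Geometric transformation: mirror the image along its vertical axis."""
    return img[:, ::-1, :]
```

Each function maps a valid image to another valid image, so transformations can also be composed to generate richer test inputs.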
How should the model be evaluated?

To assess the model quality, it is essential to define how its quality should be measured. While evaluating quality may appear simple at first glance, it can quickly become complex if the application is non-trivial and a fine-grained evaluation of the model is desired. We focus on object detection to illustrate this complexity and describe the evaluation challenges.

Challenge 1. Model quality has many different aspects. To obtain a complete picture of the model quality, many aspects of the model predictions must be evaluated, including:

  • How well does the model detect known objects (e.g., “person”)?
  • How precise is the model at predicting the correct object class (e.g., can a “car” be misclassified as “fire truck”)?
  • How often does the model detect non-existing objects?

As an example, using the LatticeFlow platform, we evaluate the SSD MobileNet v2 320x320 model from the TensorFlow Model Zoo, checking whether the properties above are respected for two classes of inputs: (1) images with slightly different brightness and (2) images which have been blurred. Using the model predictions for the original image as reference, we observe that the model’s performance varies substantially within the input specification, especially for object disappearance, where roughly ¼ to ⅓ of the objects stop being detected (on average).

Table 1: Performance of the SSD MobileNet v2 model when evaluated against a number of interesting properties and input specifications.
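The object-disappearance measurement itself can be sketched as follows: using the predictions on the original image as reference, count how many reference objects no longer have a matching detection in the transformed image. The `match_fn` predicate (e.g. an IoU test) is left abstract here; all names are illustrative:

```python
def disappearance_rate(reference_boxes, transformed_boxes, match_fn):
    """Fraction of reference objects with no matching detection after the transformation."""
    if not reference_boxes:
        return 0.0
    # An object "disappears" if no detection in the transformed image matches it.
    missing = sum(1 for ref in reference_boxes
                  if not any(match_fn(ref, pred) for pred in transformed_boxes))
    return missing / len(reference_boxes)
```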

More advanced model quality properties can be defined, such as:

  • How precise are the predicted bounding box coordinates?
  • Are there objects that are detected multiple times?
  • Are close-by objects incorrectly merged (e.g., in a crowd, is only one of two nearby people detected)?

Challenge 2. Practical considerations. On a more practical note, implementing these quality evaluations is nontrivial. Among others, we note two particular difficulties that need to be handled in order to perform representative model assessment:

Handling geometric transformations. When an image is rotated, its bounding boxes rotate too, yet must remain axis-aligned; the resulting enclosing boxes can grow very large, especially for boxes with a very uneven aspect ratio, which makes evaluation less precise. Additionally, transformations such as translations can shift objects to the edge of the image, leaving them only partially visible (or entirely outside the image), which forces the user to define a specification for partially included objects, a non-typical task.
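The box-growth effect can be seen with a small sketch that rotates a box's four corners and takes the enclosing axis-aligned box:

```python
import math

def rotate_box(box, angle_deg, center=(0.0, 0.0)):
    """Return the axis-aligned box enclosing `box` = (x1, y1, x2, y2) after rotation."""
    x1, y1, x2, y2 = box
    cx, cy = center
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # Rotate each corner around `center`, then take the enclosing AABB.
    corners = [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]
    rotated = [(cx + (x - cx) * cos_a - (y - cy) * sin_a,
                cy + (x - cx) * sin_a + (y - cy) * cos_a) for x, y in corners]
    xs, ys = zip(*rotated)
    return (min(xs), min(ys), max(xs), max(ys))
```

For example, a 100x10 box rotated by 45° yields an enclosing box of roughly 78x78 pixels: about six times the original area, most of which does not contain the object.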

Matching bounding boxes. To evaluate model predictions, we need to measure how well the predicted bounding boxes match the true bounding boxes. Naive approaches, such as matching boxes by proximity alone, lead to misleading evaluations; more advanced matching algorithms must be used instead. Additionally, the evaluation must handle the cases where the model predicts too many or too few bounding boxes.
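As an illustration, a simple IoU-based greedy matcher might look as follows; production systems often use optimal assignment (e.g. the Hungarian algorithm) instead:

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    def area(r): return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_boxes(predicted, ground_truth, threshold=0.5):
    """Greedily pair boxes by descending IoU; unmatched boxes on either side count as errors."""
    pairs = sorted(((iou(p, g), i, j)
                    for i, p in enumerate(predicted)
                    for j, g in enumerate(ground_truth)), reverse=True)
    used_p, used_g, matches = set(), set(), []
    for score, i, j in pairs:
        if score < threshold:
            break
        if i not in used_p and j not in used_g:
            used_p.add(i); used_g.add(j); matches.append((i, j))
    return matches
```

Predictions left unmatched correspond to spurious detections, and ground-truth boxes left unmatched correspond to missed objects.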

Challenge 3. Model quality can vary highly for different object categories. Even for a single quality aspect, measuring it globally for an image is often too simplistic and not detailed enough. For example, the performance of object detection models changes substantially depending on the object size (Table 2). Considering only a single global performance measure, such as the standard average-precision (AP) metric, would fail to capture this distinction.

Table 2: Performance of the top 6 object detection models on the Microsoft COCO benchmark. Higher average precision (AP) indicates better performance. Model performance is worse for small objects (APsmall) than for medium-size objects (APmedium) and large objects (APlarge).

Therefore, in addition to a global evaluation, it is useful to evaluate the model quality for different categories of objects (small / large, common / rare, in the center / near the image edge, by class, etc.). This is useful both to detect category-dependent model brittleness and to ensure the model is high quality for critically important categories, such as pedestrians in self-driving.
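As a sketch of such size-based slicing, the following groups per-object detection outcomes into the small / medium / large buckets used by the COCO benchmark (area thresholds of 32² and 96² pixels); the function names and the hit-list representation are illustrative:

```python
def evaluate_by_size(boxes, hits):
    """Group per-object detection outcomes (1 = detected, 0 = missed) by object size."""
    def size_category(box):
        x1, y1, x2, y2 = box
        area = (x2 - x1) * (y2 - y1)
        if area < 32 ** 2:
            return "small"
        if area < 96 ** 2:
            return "medium"
        return "large"
    buckets = {}
    for box, hit in zip(boxes, hits):
        buckets.setdefault(size_category(box), []).append(hit)
    # Report the detection rate per size category.
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}
```

The same slicing pattern applies to any other categorization, e.g. by class label or by distance to the image edge.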

Summary: Benefits of fine-grained evaluation

In this blog post, we have discussed how one can perform a better model assessment by defining properties and specifications under which the model is expected to perform well. Making this step explicit and part of the ML evaluation pipeline brings a number of benefits:   

Deeply understand the model performance. By evaluating a model in detail (instead of relying on a single global metric) and more systematically assessing its performance, we gain confidence that the model will perform well during deployment and we increase the model’s trustworthiness. 

Focus improvement efforts on model weaknesses. Given a detailed model evaluation, improvement efforts can be targeted at the model’s weak points. Focusing on these weaknesses speeds up model improvement, makes better use of engineering time, and reduces data collection and labelling costs.

Select the model that has the best quality trade-offs. Fine-grained evaluation exposes the trade-offs between models, allowing us to pick the most appropriate model for the project’s use case and priorities. This informed choice cannot be made from a single global evaluation, highlighting again the value of going beyond the standard practice of computing a single global quality metric for the model.

Interested in assessing quality of your models? Contact us to gain early access to our platform.

What’s Next? In the following blog posts, we will dive deeper into model assessment by exploring the following topics:

  • Define advanced and task-specific robustness specifications and demonstrate their importance;
  • Consider different ways of combining specifications and how this affects the model’s quality assessment;
  • Identify failure modes, i.e., data slices where the model performs poorly, based on the robustness assessment results.

To receive updates about our new blog posts, you can follow us on LinkedIn and Twitter.

LatticeFlow is a company founded by leading researchers and professors from ETH Zurich. The company is building the world’s first product that enables AI companies to deliver trustworthy AI models, solving a fundamental roadblock to AI’s wide adoption. To learn more, visit https://latticeflow.ai.