πŸ¦‘ Ambiguous Images With Human Judgments for Robust Visual Event Classification

Why ambiguous images?

In our day-to-day lives, we encounter a wide range of visual input, much of which is ambiguous. For example, consider the three images shown below: three frames randomly selected from a YouTube video of a birthday party.

If asked whether these images depict a birthday party, most humans could confidently give an answer. However, given just one of these images, a person might need to give a more detailed answer to effectively convey their judgment. One option is to provide a quantitative "certainty score" that an image belongs to a specific class:

The median reported human confidence that each image belongs to a video of a birthday party is 6%, 81%, and 100%, respectively.
Vision models do not typically have this capacity. Models are trained on tasks that assume a human could achieve perfect performance, and so these models are not robust to ambiguous visual input.

Ignoring ambiguity in data can lead to consequences in downstream applications. For example, autonomous agents that collaborate with humans in tasks like manufacturing must accurately assess ambiguous event-based data (such as visual input showing what its human partner is doing) to make behavioral decisions and ensure user safety. This requires an ability to produce reliable outputs under perceptual data-driven uncertainty.

The SQUID-E Dataset

To address ambiguous images in computer vision, we propose a novel dataset construction method for uncertainty-aware image recognition. We use this framework to build SQUID-E (the Scenes with Quantitative Uncertainty Information Dataset for Events), a collection of noisy images extracted from videos.

Dataset Statistics

  • 12,000 ambiguous images
  • 20 distinct event types
  • 2,000 online videos scraped
  • 1,800 human-labeled images
  • 10,800 human judgments
  • 6 human-labeled event types

Dataset Construction

β‘  VIDEO COLLECTION

The still images found in typical corpora are often intentionally selected to maximize saliency. To produce a dataset of noisier, more ambiguous visual data, we instead extracted images from videos. First, we scraped YouTube for videos that depict target event types.
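As a rough illustration of this collection step, the sketch below uses the yt-dlp Python package to search YouTube for an event type and download a handful of top results. The queries, result counts, and output paths are placeholders for illustration, not the exact procedure used to build SQUID-E.

# Illustrative video collection with yt-dlp; queries and limits are placeholders,
# not the exact configuration used to build SQUID-E.
from yt_dlp import YoutubeDL

EVENT_QUERIES = ["birthday party", "wedding ceremony", "graduation"]  # example event types
VIDEOS_PER_EVENT = 5  # placeholder count for illustration

def download_candidate_videos(query: str, n: int, out_dir: str = "videos") -> None:
    """Search YouTube for the query and download the top n results as mp4 files."""
    opts = {
        "format": "mp4",
        "outtmpl": f"{out_dir}/{query.replace(' ', '_')}/%(id)s.%(ext)s",
        "noplaylist": True,
    }
    with YoutubeDL(opts) as ydl:
        # yt-dlp's "ytsearchN:" prefix fetches the top N search results for a query.
        ydl.download([f"ytsearch{n}:{query}"])

for event in EVENT_QUERIES:
    download_candidate_videos(event, VIDEOS_PER_EVENT)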

⬇

β‘‘ AMBIGUOUS FRAME EXTRACTION

Six frames were extracted from each video using a combination of frame sampling and clustering to produce a collection of visually diverse images from each video.
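The exact extraction procedure is detailed in the paper; below is a minimal sketch of what such a step could look like, assuming uniformly sampled frames are clustered on simple color-histogram features with k-means and the frame nearest each cluster centroid is kept. OpenCV and scikit-learn are used here purely for illustration.

# Illustrative frame extraction: uniformly sample frames, cluster them on coarse
# color-histogram features, and keep one representative frame per cluster.
# The features and clustering choices are assumptions, not the paper's exact recipe.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_diverse_frames(video_path: str, n_frames: int = 6, n_samples: int = 60):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames, feats = [], []
    for idx in np.linspace(0, total - 1, n_samples, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        frames.append(frame)
        # Coarse HSV color histogram as a cheap visual-diversity feature.
        hist = cv2.calcHist([cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)],
                            [0, 1], None, [16, 16], [0, 180, 0, 256])
        feats.append(cv2.normalize(hist, hist).flatten())
    cap.release()

    feats = np.array(feats)
    km = KMeans(n_clusters=n_frames, n_init=10, random_state=0).fit(feats)
    # Keep the sampled frame closest to each cluster centroid.
    chosen = []
    for c in range(n_frames):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(feats[members] - km.cluster_centers_[c], axis=1)
        chosen.append(frames[members[np.argmin(dists)]])
    return chosen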

⬇

β‘’ UNCERTAINTY JUDGMENT LABELING

Annotations were collected for a set of six event types using Amazon Mechanical Turk. Annotators were shown an event prompt and six images from the dataset, and were asked to rate their confidence that each image belonged to a video depicting the prompted event type using sliders ranging from 0% to 100%.

Example annotator confidence judgments for three images: 4% / 14% / 27%; 20% / 50% / 68%; and 97% / 100% / 100%.

Human Uncertainty Judgments

The Spearman correlation between human uncertainty judgments collected for SQUID-E is 0.673, and the correlation between an annotator's original scores and their re-ratings of the same images months later is 0.788.
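For reference, both of these quantities are plain Spearman rank correlations over paired score vectors. The sketch below shows the shape of that computation with scipy; the score arrays are toy placeholders, not actual SQUID-E annotations.

# Sketch of the agreement computations reported above, using scipy.
# The score arrays below are toy placeholders, not actual SQUID-E annotations.
import numpy as np
from scipy.stats import spearmanr

# Confidence scores (0-100) from two annotators over the same set of images.
annotator_a = np.array([4, 14, 27, 20, 50, 68, 97, 100, 100])
annotator_b = np.array([10, 20, 30, 35, 45, 80, 90, 95, 100])

# Inter-annotator agreement: Spearman rank correlation between the two score vectors.
rho, _ = spearmanr(annotator_a, annotator_b)
print(f"inter-annotator Spearman rho: {rho:.3f}")

# Intra-annotator consistency: the same annotator re-rating the same images later.
annotator_a_later = np.array([6, 10, 30, 25, 55, 60, 95, 100, 98])
rho_self, _ = spearmanr(annotator_a, annotator_a_later)
print(f"re-rating Spearman rho: {rho_self:.3f}")

# Per-image consensus labels can be taken as the median over annotators.
median_confidence = np.median(np.stack([annotator_a, annotator_b]), axis=0)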

Possible reasons for inter-annotator variance include:



Visual Attention

A person’s visual attention can affect their perceptual input and uncertainty calculations. We hypothesize that this phenomenon affected the human judgments in our task, since the images in SQUID-E can often be classified as multiple event types depending on where an annotator’s visual attention is focused.



Background Knowledge

Many images require an annotator to hold specific knowledge to classify them accurately, and so people may annotate these images differently depending on their personal knowledge bases. Necessary background knowledge is often cultural, or otherwise related to current events or history.



Quantification Strategies

The way humans estimate probability is inherently imperfect: studies detail heuristics and psychological biases that influence human judgments independently of external factors. This type of internal factor, divorced from visual input and knowledge bases, may also contribute to discrepancies between annotator scores.

Practical Applications

Here, we show a few examples of what SQUID-E can be used for in the context of computer vision. More details regarding experiments can be found in the paper.

Training on SQUID-E for Robust Event Classification

We found that, compared to training on standard "high-certainty" data (with and without data augmentation), training on SQUID-E can result in models that are more robust to other ambiguous images at test time. However, the model trained on SQUID-E underperformed on high-certainty images, indicating a tradeoff between high- and low-certainty performance.
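The full training setup is described in the paper; as one illustration of how the certainty annotations could be exploited, the sketch below treats median human confidence as a soft target for a binary event classifier in PyTorch. The backbone, loss, and hyperparameters are assumptions made for this example.

# Illustrative training step that uses human confidence scores as soft targets.
# The architecture, loss, and data pipeline are assumptions, not the paper's setup.
import torch
import torch.nn as nn
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Backbone with a single logit per image ("does this depict the prompted event?").
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# BCE with soft targets in [0, 1]: human confidences (0-100%) rescaled to 0-1.
criterion = nn.BCEWithLogitsLoss()

def train_step(images: torch.Tensor, human_confidence: torch.Tensor) -> float:
    """images: (B, 3, H, W); human_confidence: (B,) median scores in [0, 100]."""
    model.train()
    targets = (human_confidence / 100.0).unsqueeze(1).to(device)
    logits = model(images.to(device))
    loss = criterion(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()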

Evaluating Verb Prediction Models

Evaluation on SQUID-E allows for assessing how models handle images with varying levels of certainty. For a more detailed analysis, we binned the accuracy scores of SoTA situation recognition models by human certainty judgment score.

Accuracy by human certainty judgment bin:

Model      0-20%   20-40%   40-60%   60-80%   80-100%   Avg.
JSL         .00     .07      .17      .22      .52      .35
GSRTR       .02     .09      .22      .25      .59      .41
CoFormer    .02     .13      .22      .23      .58      .40
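A minimal sketch of the binning behind this table, assuming per-image model correctness and the corresponding median human certainty scores are available as arrays (the variable names are illustrative):

# Bin per-image model accuracy by median human certainty score, as in the table above.
import numpy as np

def accuracy_by_certainty_bin(correct: np.ndarray, certainty: np.ndarray,
                              bin_edges=(0, 20, 40, 60, 80, 100)) -> dict:
    """correct: boolean per-image correctness; certainty: human scores in [0, 100]."""
    results = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Make the final bin right-inclusive so 100% scores are counted.
        if hi == bin_edges[-1]:
            mask = (certainty >= lo) & (certainty <= hi)
        else:
            mask = (certainty >= lo) & (certainty < hi)
        results[f"{lo}-{hi}%"] = float(correct[mask].mean()) if mask.any() else float("nan")
    results["Avg."] = float(correct.mean())
    return results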


Evaluating Uncertainty Quantification Methods

SQUID-E can also be used to directly evaluate uncertainty quantification approaches by comparing model confidence scores to human uncertainty judgments. Our experimental results indicate that these direct confidence comparisons align with traditional calibration assessment metrics.
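As a generic sketch of such a comparison (not the paper's exact protocol), model confidences can be correlated directly against human judgments and set alongside a standard calibration metric such as expected calibration error:

# Compare model confidences against human uncertainty judgments (direct correlation)
# and against a standard calibration metric (a simple expected calibration error).
# This is a generic sketch, not the paper's exact evaluation protocol.
import numpy as np
from scipy.stats import spearmanr

def human_alignment(model_conf: np.ndarray, human_conf: np.ndarray) -> float:
    """Spearman correlation between model confidences (0-1) and human scores (0-100)."""
    rho, _ = spearmanr(model_conf, human_conf / 100.0)
    return rho

def expected_calibration_error(model_conf: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE: bin-weighted gap between accuracy and mean confidence per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (model_conf > lo) & (model_conf <= hi)
        if lo == 0.0:  # include confidences of exactly 0 in the first bin
            mask |= model_conf == 0.0
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - model_conf[mask].mean())
    return float(ece)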

Additional Info

Future Work

The illustrated benefits of a dataset like SQUID-E motivate the creation of larger-scale ambiguous image datasets. They also prompt future work such as training models to learn individual annotators' uncertainty scoring functions and developing human-centric model calibration methods that use human judgments.

Citation & Limitations

If you find this work useful in your research, please consider citing the paper:

@inproceedings{sanders2022ambiguous,
     title={Ambiguous Images With Human Judgments for Robust Visual Event Classification},
     author={Kate Sanders and Reno Kriz and Anqi Liu and Benjamin Van Durme},
     booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
     year={2022},
     url={https://openreview.net/forum?id=6Hl7XoPNAVX}
}

We encourage researchers to review the limitations and ethical considerations regarding this dataset discussed in Section 6 of the paper.