Why ambiguous images?
If asked whether these images depict a birthday party, most humans could confidently give an answer. However, a human given just one of these images might need to provide a more detailed response to effectively convey their judgment. One option is to provide a quantitative "certainty score" that an image belongs to a specific classification:
Ignoring ambiguity in data can lead to consequences in downstream applications. For example, autonomous agents that collaborate with humans in tasks like manufacturing must accurately assess ambiguous event-based data (such as visual input showing what their human partners are doing) to make behavioral decisions and ensure user safety. This requires the ability to produce reliable outputs under uncertainty driven by perceptual data.
The SQUID-E Dataset
Dataset Statistics
- 12,000 ambiguous images
- 20 distinct event types
- 2,000 online videos scraped
- 1,800 human-labeled images
- 10,800 human judgments
- 6 human-labeled event types
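To make these numbers concrete, below is a minimal sketch of one way a single SQUID-E example could be represented in code. The dataclass layout, field names, and file paths are illustrative assumptions, not the dataset's actual release format.

```python
# Hypothetical representation of a single SQUID-E example. Field names and
# layout are assumptions for illustration; consult the dataset release for
# the real format.
from dataclasses import dataclass, field
from typing import List
import math


@dataclass
class SquidEExample:
    image_path: str                      # extracted video frame on disk
    event_type: str                      # one of the 20 event types
    source_video_url: str                # video the frame was scraped from
    certainty_scores: List[float] = field(default_factory=list)
    # Human certainty judgments in [0, 1]. Only the ~1,800 human-labeled
    # images carry these (roughly 6 judgments each, 10,800 in total).

    def mean_certainty(self) -> float:
        """Average human certainty, or NaN if the image is unlabeled."""
        if not self.certainty_scores:
            return math.nan
        return sum(self.certainty_scores) / len(self.certainty_scores)


# Example usage with made-up values.
example = SquidEExample(
    image_path="frames/birthday_0007.jpg",
    event_type="birthday party",
    source_video_url="https://example.com/some-video",  # placeholder URL
    certainty_scores=[0.14, 0.27, 0.50, 0.68, 1.00, 1.00],
)
print(f"mean certainty: {example.mean_certainty():.2f}")  # -> 0.60
```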
Dataset Construction
[Figure: dataset construction pipeline, including an uncertainty judgment labeling step in which annotators assign certainty scores to images (e.g., 14%, 27%, 50%, 68%, 100%, 100%).]
Human Uncertainty Judgments
Possible reasons for inter-annotator variance (quantified per image in the sketch after this list) include:
Visual Attention
A person's visual attention can affect their perceptual input and uncertainty calculations. We hypothesize that this phenomenon affected the human judgments in our task, since the images in SQUID-E can often be classified as multiple event types depending on where an annotator's visual attention is focused.
Background Knowledge
Many images require an annotator to hold specific knowledge to classify them accurately, and so people may annotate these images differently depending on their personal knowledge bases. Necessary background knowledge is often cultural, or otherwise related to current events or history.
Quantification Strategies
The way humans estimate probabilities is inherently imperfect: studies document heuristics and psychological biases that influence human judgments independently of external factors. Internal factors of this kind, divorced from visual input and knowledge bases, may also contribute to discrepancies between annotators' scores.
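To make the inter-annotator variance discussed above concrete, the sketch below computes simple per-image spread statistics over certainty judgments. The mapping from image IDs to score lists is an assumed layout for illustration, not SQUID-E's actual file format.

```python
# Sketch: quantifying inter-annotator variance over hypothetical certainty
# judgments. The data layout is assumed for illustration.
from statistics import mean, pstdev

# image_id -> list of human certainty scores in [0, 1]
judgments = {
    "img_0001": [0.10, 0.15, 0.20, 0.10, 0.25, 0.05],  # low disagreement
    "img_0002": [0.10, 0.90, 0.40, 0.75, 0.20, 0.60],  # high disagreement
}

for image_id, scores in judgments.items():
    mu = mean(scores)
    sigma = pstdev(scores)           # population standard deviation
    spread = max(scores) - min(scores)
    print(f"{image_id}: mean={mu:.2f} std={sigma:.2f} range={spread:.2f}")

# Images with a large std or range are the ones where factors like visual
# attention, background knowledge, and quantification strategy are most
# likely to have pulled annotators apart.
```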
Practical Applications
Training on SQUID-E for Robust Event Classification
We found that, compared to training on standard "high-certainty" data (with and without data augmentation), training on SQUID-E can result in models that are more robust to other ambiguous images at test time. However, the model trained on SQUID-E underperformed on high-certainty images, indicating a tradeoff between high- and low-certainty performance.
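As a rough picture of this comparison, here is a schematic fine-tuning setup: swap the training directory between SQUID-E frames and high-certainty images, then test on held-out ambiguous images. The directory names, hyperparameters, ResNet-18 backbone, and use of hard event labels are assumptions for illustration, not the paper's exact training configuration.

```python
# Schematic fine-tuning setup: swap TRAIN_DIR between ambiguous (SQUID-E) and
# high-certainty training images to compare robustness on an ambiguous test set.
# Paths, hyperparameters, and hard-label training are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

TRAIN_DIR = "data/squid_e/train"      # or "data/high_certainty/train"
TEST_DIR = "data/squid_e/test"        # ambiguous images held out for testing
NUM_EVENT_TYPES = 20

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_loader = DataLoader(datasets.ImageFolder(TRAIN_DIR, tfm), batch_size=32, shuffle=True)
test_loader = DataLoader(datasets.ImageFolder(TEST_DIR, tfm), batch_size=32)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_EVENT_TYPES)
model = model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                # small budget for illustration
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Evaluate on the ambiguous test images.
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"ambiguous-image accuracy: {correct / total:.3f}")
```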
Evaluating Verb Prediction Models

Evaluation on SQUID-E allows for assessment of how models handle images with varying levels of certainty. For a detailed analysis, we binned the verb prediction accuracy of state-of-the-art situation recognition models by human certainty judgment score:

| Model | 0-20% | 20-40% | 40-60% | 60-80% | 80-100% | Avg. |
|---|---|---|---|---|---|---|
| JSL | .00 | .07 | .17 | .22 | .52 | .35 |
| GSRTR | .02 | .09 | .22 | .25 | .59 | .41 |
| CoFormer | .02 | .13 | .22 | .23 | .58 | .40 |
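The binning behind this table is easy to reproduce once per-image predictions and human certainty scores are available. The sketch below assumes three parallel lists (predicted verbs, gold verbs, and mean certainty scores in [0, 1]); these inputs and the toy values are hypothetical placeholders, not files shipped with the dataset.

```python
# Sketch: binning model accuracy by human certainty score, as in the table above.
# The input lists are hypothetical placeholders.
from collections import defaultdict

BINS = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]

def binned_accuracy(predictions, gold_labels, certainties):
    """Return per-bin accuracy keyed by (low, high) certainty ranges."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for pred, gold, c in zip(predictions, gold_labels, certainties):
        for low, high in BINS:
            # The upper edge is inclusive only for the final bin so c = 1.0 is counted.
            if low <= c < high or (high == 1.0 and c == 1.0):
                counts[(low, high)] += 1
                hits[(low, high)] += int(pred == gold)
                break
    return {b: hits[b] / counts[b] for b in BINS if counts[b] > 0}

# Toy usage:
preds = ["celebrating", "running", "cooking", "celebrating"]
gold = ["celebrating", "celebrating", "cooking", "celebrating"]
certs = [0.15, 0.35, 0.70, 0.95]
for (low, high), acc in binned_accuracy(preds, gold, certs).items():
    print(f"{int(low * 100)}-{int(high * 100)}%: {acc:.2f}")
```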
Evaluating Uncertainty Quantification Methods
SQUID-E can also be used to directly evaluate uncertainty quantification approaches by comparing model confidence scores to human uncertainty judgments. Experimental results indicate that direct confidence comparisons align with traditional calibration assessment metrics.
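One simple way to run such a comparison is to correlate per-image model confidence with mean human certainty, alongside a conventional calibration metric. The sketch below uses Spearman correlation from scipy and a basic expected calibration error (ECE); the input arrays are hypothetical, and this is an illustrative protocol rather than the paper's evaluation code.

```python
# Sketch: comparing model confidence to human certainty judgments, plus a
# simple expected calibration error (ECE). Inputs are hypothetical arrays.
import numpy as np
from scipy.stats import spearmanr

def expected_calibration_error(confidences, correct, n_bins=10):
    """Basic ECE: bin-weighted gap between mean confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        mask = (confidences >= low) & (confidences < high)
        if high == 1.0:
            mask |= confidences == 1.0
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical per-image values.
model_confidence = [0.92, 0.55, 0.31, 0.87, 0.12, 0.66]
human_certainty = [0.95, 0.60, 0.25, 0.70, 0.20, 0.50]   # mean human scores
model_correct = [1, 1, 0, 1, 0, 1]                       # 1 if prediction matched gold

rho, pval = spearmanr(model_confidence, human_certainty)
print(f"confidence vs. human certainty (Spearman): rho={rho:.2f}, p={pval:.3f}")
print(f"ECE: {expected_calibration_error(model_confidence, model_correct):.3f}")
```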
Additional Info

Future Work
The illustrated benefits of a dataset like SQUID-E motivate the creation of larger-scale ambiguous image datasets. This work also prompts future directions such as training models to learn individual annotators' uncertainty scoring functions and developing human-centric model calibration methods built on human judgments.

Citation & Limitations
If you find this work useful in your research, please consider citing the paper:

@inproceedings{
sanders2022ambiguous,
title={Ambiguous Images With Human Judgments for Robust Visual Event Classification},
author={Kate Sanders and Reno Kriz and Anqi Liu and Benjamin Van Durme},
booktitle={Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2022},
url={https://openreview.net/forum?id=6Hl7XoPNAVX}
}