Kate Sanders

ksande25@jhu.edu

Google Scholar

LinkedIn

CV

Site last updated August 2025

I am a final-year Ph.D. student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Professor Benjamin Van Durme. During my Ph.D., I have been researching transparent and reliable reasoning, multimodal understanding, and uncertainty.

I spent this past summer as an intern at AWS, designing reward functions for training reasoning models, where I was lucky to be mentored by Nathaniel Weir. The previous summer, I co-organized and facilitated the 10-week SCALE 2024 Summer Research Workshop at the HLTCOE.

Before starting my Ph.D., I received my B.A. in Computer Science from UC Berkeley, where I conducted AI and robotics research in the AUTOLab, advised by Professor Ken Goldberg.

Research areas

Please visit my Google Scholar profile for a full list of publications.

Benchmarks for complex visual event understanding

I have spent the last few years working with collaborators to develop ways of thinking about and formulating events in visual data. This work has inspired a collection of benchmarks designed to evaluate agents' abilities to recognize and explain such events across different languages and cultures. The effort began as a small video retrieval task [1] and was then extended to a full event extraction benchmark [2]. In that paper we emphasize the notion of "partially-defined events": we argue that it is critical to model the epistemic and aleatoric uncertainty associated with identifying, in visual content, events that are more commonly described through natural language. More recent extensions of this line of work include a massive video retrieval dataset built on these initial benchmarks that better mirrors datasets developed by the information retrieval community [3], and a report generation benchmark that highlights the difficulty of piecing together multiple videos that only jointly describe some partially-defined event [4].


  1. Sanders, K.*, Etter, D.*, Kriz, R.*, Van Durme, B. MultiVENT: Multilingual Videos of Events with Aligned Natural Text. NeurIPS 2023 D&B.
  2. Sanders, K.*, Kriz, R.*, Etter, D.*, Recknor, H., Martin, A., Carpenter, C., Lin, J., Van Durme, B. Grounding Partially-Defined Events in Multimodal Data. EMNLP 2024 Findings.
  3. Kriz, R.*, Sanders, K.*, Etter, D., Murray, M., Carpenter, C., Recknor, H., Blasco, J., Martin, A., Yang, E., Van Durme, B. MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval. CVPR 2025.
  4. Martin, A., Kriz, R., Walden, W., Sanders, K., Recknor, H., Yang, E., Ferraro, F., Van Durme, B. WikiVideo: Article Generation from Multiple Videos. 2025 arXiv preprint.
  5. Sanders, K., Van Durme, B. A Survey of Video Datasets for Grounded Event Understanding. CVPR 2024 Workshops.

Interpretable and factually accurate reasoning

I am very lucky to have been able to spend a lot of time discussing notions of factuality and transparency with my labmates. I collaborated on extending ideas from informal logic to assess the quality of compositional entailment in neuro-symbolic reasoning systems [1], and extended these reasoning systems to transparently verify content in the video-language domain [2]. I built on this framework to incorporate uncertainty modeling (described further under "Benchmarks for complex visual event understanding" above) and to generalize to modalities beyond vision and text [3]. I also worked with labmates to develop a subclaim selection component that improves the trustworthiness of factuality metrics like FActScore [4], and to evaluate LLMs' abilities to verify claims in scientific reports [5]. A minimal sketch of the subclaim-based scoring idea appears after the publication list below.

More publications on trustworthy reasoning are coming soon!


  1. Weir, N., Sanders, K., Weller, O., Sharma, S., Jiang, D., Jiang, Z., ..., Van Durme, B. Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic. EMNLP 2024.
  2. Sanders, K., Weir, N., Van Durme, B. TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning. EMNLP 2024.
  3. Sanders, K., Van Durme, B. Bonsai: Interpretable Tree-Adaptive Grounded Reasoning. 2025 arXiv preprint.
  4. Jiang, Z., Zhang, J., Weir, N., Ebner, S., Wanner, M., Sanders, K., Khashabi, D., Liu, A., Van Durme, B. Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification. ACL 2025 Findings.
  5. Ou, J.*, Walden, W.*, Sanders, K., Jiang, Z., Sun, K., Cheng, J., ..., Van Durme, B. CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? 2025 arXiv preprint.
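
For intuition, here is a minimal sketch of the subclaim-based factual precision idea behind metrics like FActScore: decompose a generation into atomic subclaims, keep only the informative ones, verify each against a knowledge source, and report the supported fraction. The decompose, select, and is_supported components below are illustrative Python stubs, not the actual procedures used in Core or FActScore.

from typing import Callable, List

def factual_precision(
    generation: str,
    decompose: Callable[[str], List[str]],     # text -> atomic subclaims
    select: Callable[[List[str]], List[str]],  # keep informative, non-redundant subclaims
    is_supported: Callable[[str], bool],       # verify one subclaim against a knowledge source
) -> float:
    """Fraction of selected atomic subclaims supported by the knowledge source."""
    subclaims = select(decompose(generation))
    if not subclaims:
        return 0.0
    return sum(is_supported(c) for c in subclaims) / len(subclaims)

# Toy usage with stub components (real systems use an LLM decomposer
# and a retrieval-backed verifier):
decompose = lambda text: [s.strip() for s in text.split(".") if s.strip()]
select = lambda claims: list(dict.fromkeys(claims))  # drop exact duplicates
knowledge = {"Paris is the capital of France"}
is_supported = lambda claim: claim in knowledge
text = "Paris is the capital of France. Paris has 40 million residents."
print(factual_precision(text, decompose, select, is_supported))  # 0.5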

Video retrieval for real-world queries

I spent a summer at SCALE 2024 hacking on the MultiVENT video retrieval dataset to find new ways to perform video retrieval for more realistic, real-world queries about visual events. In addition to the MultiVENT 2.0 benchmark (which served as a shared task for our ACL 2025 workshop on multimodal retrieval and generation), we produced several methods papers for tackling these problems. Video-ColBERT [1] draws inspiration from late interaction retrieval methods, combining token-wise interaction on static frame features with temporally contextualized video features for improved retrieval. MMMORRF [2] takes a different approach, balancing the contributions of vision and audio by building distinct data processing pipelines for each modality and fusing them through modality-aware weighted reciprocal rank fusion; a minimal sketch of this fusion idea appears after the publication list below. FORTIFY [3] addresses the challenging problem of modeling OCR content for retrieval by leveraging generative models to rewrite and synthesize these noisy, multilingual fragments into documents.


  1. Reddy, A., Martin, A., Yang, E., Yates, A., Sanders, K., Murray, K., Kriz, R., M de Melo, C., Van Durme, B., Chellappa, R. Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval. CVPR 2025.
  2. Samuel, S., DeGenaro, D., Guallar-Blasco, J., Sanders, K., Eisape, O., ..., Kriz, R. MMMORRF: Multimodal Multilingual MOdularized Reciprocal Rank Fusion. SIGIR 2025 Demo.
  3. DeGenaro, D., Yang, E., Etter, D., Carpenter, C., Sanders, K., Martin, A., Murray, K., Kriz, R. FORTIFY: Generative Model Fine-tuning with ORPO for ReTrieval Expansion of InFormal NoisY Text. ACL 2025 Workshops.
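
To give a flavor of the fusion step, here is a minimal Python sketch of weighted reciprocal rank fusion across modality-specific rankers. The weights, the constant k, and the toy rankings are illustrative assumptions; MMMORRF's actual modality-aware weighting is described in the paper.

from collections import defaultdict
from typing import Dict, List

def weighted_rrf(
    rankings: Dict[str, List[str]],  # modality name -> ranked list of video IDs
    weights: Dict[str, float],       # modality name -> fusion weight
    k: int = 60,                     # standard RRF smoothing constant
) -> List[str]:
    """Fuse per-modality rankings into one ranking via weighted reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for modality, ranked_ids in rankings.items():
        w = weights.get(modality, 1.0)
        for rank, vid in enumerate(ranked_ids, start=1):
            scores[vid] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse a vision-based ranking with an audio/OCR-based ranking
# for one query (video IDs are hypothetical).
rankings = {"vision": ["vid_3", "vid_1", "vid_7"], "audio": ["vid_1", "vid_7", "vid_3"]}
weights = {"vision": 0.6, "audio": 0.4}
print(weighted_rrf(rankings, weights))  # fused ranking, highest combined score first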

Miscellaneous

I recently worked on a fun collaboration with MIT CSAIL's Computer-Aided Programming Group, where we explored how well LLMs and reasoning models can learn low-level language syntax via in-context learning [1]. I also have experience developing benchmarks for web agents [2] and for assessing the calibration of vision classification models by soliciting human judgments [3]. I published a handful of robotics papers as an undergrad; my two favorites concern identifying failure modes for bin-picking systems and suction grippers [4, 5].


  1. Gupta, K., Sanders, K., Solar-Lezama, A. Randomly Sampled Language Reasoning Problems Reveal Limits of LLMs. ICLR 2025 Workshops.
  2. Xu, K., Kordi, Y., Nayak, T., Asija, A., Wang, Y., Sanders, K., Byerly, A., Zhang, J., Van Durme, B., Khashabi, D. Tur[k]ingBench: A Challenge Benchmark for Web Agents. NAACL 2025.
  3. Sanders, K., Kriz, R., Liu, A., Van Durme, B. Ambiguous Images With Human Judgments for Robust Visual Event Classification. NeurIPS 2022 D&B.
  4. Sanders, K., Danielczuk, M., Mahler, J., Tanwani, A., Goldberg, K. Non-Markov Policies to Reduce Sequential Failures in Robot Bin Picking. CASE 2020.
  5. Huh, T. M., Sanders, K., Danielczuk, M., Li, M., Chen, Y., Goldberg, K., Stuart, H. S. A Multi-Chamber Smart Suction Cup for Adaptive Gripping and Haptic Exploration. IROS 2021.

News

Aug 2025 Co-organized the first workshop on multimodal retrieval and generation at ACL 2025.
July 2025 CORE and FORTIFY presented @ ACL 2025.
July 2025 MMMORRF presented @ SIGIR 2025.
June 2025 MultiVENT 2.0 and Video-ColBERT presented @ CVPR 2025.
May 2025 Interning at AWS to work on reasoning model reward functions.
April 2025 Randomly Sampled Language Reasoning Problems presented @ ICLR 2025 VerifAI workshop.
April 2025 Tur[k]ingBench presented @ NAACL 2025.
April 2025 Bonsai now on arXiv.