Are we going MAD? Benchmarking Multi-Agent Debate between Language Models for Medical Q&A

Dries Smit | Paul Duckworth | Nathan Grinsztajn | Kale-ab Tessera¹ | Tom Barrett | Arnu Pretorius

¹ University of Edinburgh



Accurately and truthfully answering medical questions remains a long-standing challenge for generative agents. Recently, many single- and multi-agent strategies have been proposed to improve the performance of large language models (LLMs) on various tasks. Our contribution is a comprehensive benchmark suite, including open-source implementations, of multi-agent debate strategies for medical Q&A. We report insights on using the different strategies and highlight the trade-offs between cost, time, and accuracy. Building on these insights, we propose novel strategies that outperform previously published LLM debate strategies on three medical Q&A datasets.