MIRA: A Novel Framework for Fusing Modalities in Medical RAG

*Equal contribution

1Mohamed Bin Zayed University of Artificial Intelligence, UAE
2Aalto University, Finland
📌 This paper is accepted at ACM Multimedia 2025

Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge.

Retrieval-Augmented Generation (RAG) improves factual accuracy by integrating external sources, yet it introduces two major challenges:

  • Insufficient retrieval can overlook critical information, while excessive retrieval may introduce irrelevant or misleading content, undermining output quality.
  • Even when models initially produce accurate answers, over-reliance on retrieved data can later lead to factual inconsistencies.

To overcome these limitations, we propose the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework—designed to optimize factual consistency in MLLMs.

MIRA is built upon two key components:

  1. A calibrated Rethinking and Rearrangement module that dynamically adjusts the quantity of retrieved contexts to manage factual risk.
  2. A medical RAG framework that combines image embeddings and a domain-specific medical knowledge base, with a query-rewriting module to support efficient multimodal reasoning.

This dual mechanism empowers the model to integrate both its inherent knowledge and external medical references effectively.
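As a rough illustration of how the query-rewriting module in (2) could fold visual context into a textual query, consider the following minimal sketch. The function name, template, and tag format are assumptions for exposition, not MIRA's actual design.

```python
# Hypothetical sketch of a query-rewriting step for multimodal retrieval.
# The "[image: ...]" tag template is an illustrative assumption.

def rewrite_query(question: str, image_caption: str) -> str:
    """Merge a description of the visual input into the textual query so
    that a single text retriever can serve both modalities."""
    return f"[image: {image_caption}] {question}"

print(rewrite_query("Is there a fracture?", "frontal chest x-ray"))
# -> [image: frontal chest x-ray] Is there a fracture?
```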

Our evaluation on public medical visual question answering and report generation benchmarks shows that MIRA substantially improves factual accuracy and overall performance, achieving new state-of-the-art results.

🔥Highlights

Key contributions of MIRA:
  1. We introduce MIRA, the first retrieval-augmented generation (RAG) framework that seamlessly integrates structured multimodal retrieval with adaptive reasoning, outperforming existing methods in clinical decision-making accuracy and efficiency.

  2. Unlike static retrieval paradigms, our Context-Rethink module employs an iterative “rethink-rearrange” cycle for dynamic k-selection, ensuring precision in evidence selection. This is further enhanced by Chain-of-Thought (CoT) reasoning to maintain high factual consistency in medical VQA.

  3. Our architecture introduces a novel dual-pathway retrieval mechanism featuring specialized vision and language encoders for fine-grained image-text alignment. Additionally, a curated citation module improves interpretability, raising the standard for transparency in medical AI.

  4. MIRA achieves state-of-the-art (SOTA) performance on MedVQA by significantly reducing factual errors through online search augmentation and adaptive CoT-based verification. It also delivers 9× faster inference compared to 72B parameter models, enabling real-time, high-precision medical reasoning.
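The "rethink-rearrange" cycle in contribution 2 can be sketched as a simple loop that ranks candidates and then decides how many to keep. Everything below (the scoring field, the threshold, `k_max`) is an illustrative assumption, not MIRA's released implementation.

```python
# Hypothetical sketch of dynamic k-selection via a "rethink-rearrange" cycle.
# Candidate scores, the threshold, and k_max are illustrative assumptions.

def rethink_rearrange(candidates, k_max=8, threshold=0.5):
    """Grow the retrieved context set greedily, stopping once the next
    candidate's relevance drops below a threshold. Stopping early limits
    the factual risk introduced by noisy or irrelevant retrievals."""
    # Rearrange: rank candidates by relevance score, best first.
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    selected = []
    for cand in ranked[:k_max]:
        # Rethink: only keep evidence while it stays relevant enough.
        if cand["score"] < threshold:
            break
        selected.append(cand)
    return selected

docs = [{"text": "A", "score": 0.9},
        {"text": "B", "score": 0.7},
        {"text": "C", "score": 0.3}]
print(len(rethink_rearrange(docs)))  # keeps the two high-scoring contexts
```

The key design point this mirrors is that k is chosen per query rather than fixed: an easy question may need one context, a harder one several.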

MIRA: Architecture

MIRA consists of four key components designed for end-to-end training and efficient inference: (1) a dual-pathway retrieval system with dedicated vision and language encoders, (2) a context-rethink module for dynamic retrieval control, (3) a multimodal fusion module that aligns retrieved knowledge with image features, and (4) a decoder-only large language model (LLM) enhanced with Chain-of-Thought (CoT) reasoning, along with an optional citation module for post-hoc interpretability.
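The four stages above can be sketched end to end. This is a deliberately toy, runnable illustration: every component (hash-based "image embedding", keyword-overlap retrieval, template generation) is a stand-in assumption for exposition, not MIRA's actual encoders or LLM.

```python
# Illustrative, runnable sketch of the four-stage pipeline. All components
# are trivial stand-ins, assumed for exposition only.

def encode_image(image_id):
    # (1a) Vision pathway stand-in: a fake embedding keyed by image id.
    return hash(image_id) % 100

def encode_text(text):
    # (1b) Language pathway stand-in: bag of lowercase tokens.
    return set(text.lower().replace("?", "").split())

def retrieve(txt_emb, knowledge_base, k=4):
    # (1c) Retrieval reduced here to keyword overlap against the KB.
    scored = [(len(txt_emb & set(doc.lower().split())), doc)
              for doc in knowledge_base]
    return [d for s, d in sorted(scored, reverse=True) if s > 0][:k]

def rethink(contexts, max_k=2):
    # (2) Context-rethink stand-in: cap the evidence set to manage
    # factual risk (MIRA selects k dynamically; a fixed cap is used here).
    return contexts[:max_k]

def generate(question, img_emb, contexts):
    # (3)+(4) Fusion and CoT generation, reduced to a template answer
    # that reports its evidence count (post-hoc citation stand-in).
    return f"Answer to '{question}' using {len(contexts)} reference(s)."

kb = ["pneumonia shows lung opacity", "fractures appear on bone x-rays"]
q = "Does the x-ray show pneumonia?"
ctx = rethink(retrieve(encode_text(q), kb))
print(generate(q, encode_image("cxr_001"), ctx))
```

The point of the sketch is the data flow, not the components: retrieval feeds the rethink step, whose pruned contexts condition generation.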

Overview of the MIRA (Multimodal Intelligent Retrieval and Augmentation) pipeline. The system integrates image and text-based retrieval to enhance the generation process.

MIRA: Quantitative Results

Performance comparison of multimodal report generation between MIRA and other specialized frameworks, measured on a 1000-sample split from MIMIC-CXR

Correctness analysis on the PMC-VQA question set

MIRA: Qualitative Results

Visualization of the attention distribution across all slices of the input sequences, emphasizing the model's focus on critical tokens. This highlights how the model learns to attend selectively to important parts of the input, guiding the generation process toward more accurate and contextually relevant responses.

BibTeX

@misc{mira,
      title={MIRA: A Novel Framework for Fusing Modalities in Medical RAG}, 
      author={Jinhong Wang and Tajamul Ashraf and Zongyan Han and Jorma Laaksonen and Rao Mohammad Anwer},
      year={2025},
      eprint={2507.07902},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.07902}, 
}