MIRA: A Novel Framework for Fusing Modalities in Medical RAG

*Equal contribution

1Mohamed Bin Zayed University of Artificial Intelligence, UAE
2Aalto University, Finland
📌 This paper is accepted at ACM Multimedia 2025

Multimodal Large Language Models (MLLMs) have significantly advanced AI-assisted medical diagnosis, but they often generate factually inconsistent responses that deviate from established medical knowledge.

Retrieval-Augmented Generation (RAG) improves factual accuracy by integrating external sources, yet it introduces two major challenges:

  • Insufficient retrieval can overlook critical information, while excessive retrieval may introduce irrelevant or misleading content, undermining output quality.
  • Even when models initially produce accurate answers, over-reliance on retrieved data can later lead to factual inconsistencies.

To overcome these limitations, we propose the Multimodal Intelligent Retrieval and Augmentation (MIRA) framework—designed to optimize factual consistency in MLLMs.

MIRA is built upon two key components:

  1. A calibrated Rethinking and Rearrangement module that dynamically adjusts the quantity of retrieved contexts to manage factual risk.
  2. A medical RAG framework that combines image embeddings and a domain-specific medical knowledge base, with a query-rewriting module to support efficient multimodal reasoning.

This dual mechanism empowers the model to integrate both its inherent knowledge and external medical references effectively.
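As a rough illustration of how the query-rewriting module in (2) could fold visual context into a textual query, consider the following minimal sketch. The function name, template, and tag format are assumptions for exposition, not MIRA's actual design.

```python
# Hypothetical sketch of a query-rewriting step for multimodal retrieval.
# The "[image: ...]" tag template is an illustrative assumption.

def rewrite_query(question: str, image_caption: str) -> str:
    """Merge a description of the visual input into the textual query so
    that a single text retriever can serve both modalities."""
    return f"[image: {image_caption}] {question}"

print(rewrite_query("Is there a fracture?", "frontal chest x-ray"))
# -> [image: frontal chest x-ray] Is there a fracture?
```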

Our evaluation on public medical visual question answering and report generation benchmarks shows that MIRA substantially improves factual accuracy and overall performance, achieving new state-of-the-art results.

🔥Highlights

Key contributions of MIRA:
  1. We introduce MIRA, the first retrieval-augmented generation (RAG) framework that seamlessly integrates structured multimodal retrieval with adaptive reasoning, outperforming existing methods in clinical decision-making accuracy and efficiency.

  2. Unlike static retrieval paradigms, our Context-Rethink module employs an iterative “rethink-rearrange” cycle for dynamic k-selection, ensuring precision in evidence selection. This is further enhanced by Chain-of-Thought (CoT) reasoning to maintain high factual consistency in medical VQA.

  3. Our architecture introduces a novel dual-pathway retrieval mechanism featuring specialized vision and language encoders for fine-grained image-text alignment. Additionally, a curated citation module improves interpretability, raising the standard for transparency in medical AI.

  4. MIRA achieves state-of-the-art (SOTA) performance on MedVQA by significantly reducing factual errors through online search augmentation and adaptive CoT-based verification. It also delivers 9× faster inference compared to 72B parameter models, enabling real-time, high-precision medical reasoning.
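The "rethink-rearrange" cycle in contribution 2 can be sketched as a simple loop that ranks candidates and then decides how many to keep. Everything below (the scoring field, the threshold, `k_max`) is an illustrative assumption, not MIRA's released implementation.

```python
# Hypothetical sketch of dynamic k-selection via a "rethink-rearrange" cycle.
# Candidate scores, the threshold, and k_max are illustrative assumptions.

def rethink_rearrange(candidates, k_max=8, threshold=0.5):
    """Grow the retrieved context set greedily, stopping once the next
    candidate's relevance drops below a threshold. Stopping early limits
    the factual risk introduced by noisy or irrelevant retrievals."""
    # Rearrange: rank candidates by relevance score, best first.
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    selected = []
    for cand in ranked[:k_max]:
        # Rethink: only keep evidence while it stays relevant enough.
        if cand["score"] < threshold:
            break
        selected.append(cand)
    return selected

docs = [{"text": "A", "score": 0.9},
        {"text": "B", "score": 0.7},
        {"text": "C", "score": 0.3}]
print(len(rethink_rearrange(docs)))  # keeps the two high-scoring contexts
```

The key design point this mirrors is that k is chosen per query rather than fixed: an easy question may need one context, a harder one several.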

MIRA: Architecture

MIRA consists of four key components designed for end-to-end training and efficient inference: (1) a dual-pathway retrieval system with dedicated vision and language encoders, (2) a context-rethink module for dynamic retrieval control, (3) a multimodal fusion module that aligns retrieved knowledge with image features, and (4) a decoder-only large language model (LLM) enhanced with Chain-of-Thought (CoT) reasoning, along with an optional citation module for post-hoc interpretability.
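The four stages above can be sketched end to end. This is a deliberately toy, runnable illustration: every component (hash-based "image embedding", keyword-overlap retrieval, template generation) is a stand-in assumption for exposition, not MIRA's actual encoders or LLM.

```python
# Illustrative, runnable sketch of the four-stage pipeline. All components
# are trivial stand-ins, assumed for exposition only.

def encode_image(image_id):
    # (1a) Vision pathway stand-in: a fake embedding keyed by image id.
    return hash(image_id) % 100

def encode_text(text):
    # (1b) Language pathway stand-in: bag of lowercase tokens.
    return set(text.lower().replace("?", "").split())

def retrieve(txt_emb, knowledge_base, k=4):
    # (1c) Retrieval reduced here to keyword overlap against the KB.
    scored = [(len(txt_emb & set(doc.lower().split())), doc)
              for doc in knowledge_base]
    return [d for s, d in sorted(scored, reverse=True) if s > 0][:k]

def rethink(contexts, max_k=2):
    # (2) Context-rethink stand-in: cap the evidence set to manage
    # factual risk (MIRA selects k dynamically; a fixed cap is used here).
    return contexts[:max_k]

def generate(question, img_emb, contexts):
    # (3)+(4) Fusion and CoT generation, reduced to a template answer
    # that reports its evidence count (post-hoc citation stand-in).
    return f"Answer to '{question}' using {len(contexts)} reference(s)."

kb = ["pneumonia shows lung opacity", "fractures appear on bone x-rays"]
q = "Does the x-ray show pneumonia?"
ctx = rethink(retrieve(encode_text(q), kb))
print(generate(q, encode_image("cxr_001"), ctx))
```

The point of the sketch is the data flow, not the components: retrieval feeds the rethink step, whose pruned contexts condition generation.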

Overview of the MIRA (Multimodal Intelligent Retrieval and Augmentation) pipeline. The system integrates image and text-based retrieval to enhance the generation process.

MIRA: Quantitative Results

Performance comparison of multimodal report generation between MIRA and other specialized frameworks, measured on a 1000-sample split from MIMIC-CXR

Correctness analysis on the PMC-VQA question set

MIRA: Qualitative Results

Visualization of the attention distribution across all slices of the input sequences, emphasizing the model's focus on critical tokens. This highlights how the model learns to attend selectively to important parts of the input, guiding the generation process toward more accurate and contextually relevant responses.

BibTeX

@misc{mira,
      title={MIRA: A Novel Framework for Fusing Modalities in Medical RAG}, 
      author={Jinhong Wang and Tajamul Ashraf and Zongyan Han and Jorma Laaksonen and Rao Mohammad Anwer},
      year={2025},
      eprint={2507.07902},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.07902}, 
}