Multi-Modal RAG Framework Integrating Text, Image, And Audio Intelligence

7 May

Authors: Mrs. C. Radha, Sanjay V, Santhosh P, Rejitha R

Abstract: The volume of unstructured data produced across modalities such as text, images, audio, and video exceeds what text-only retrieval-augmented generation (RAG) models can handle. In this work, we propose a unified multi-modal RAG framework (MM-RAG) that combines multimodal processing, multimodal representation, cross-modal retrieval, and multimodal document generation within a single model. The framework employs BridgeTower as a unified text-image encoder, Whisper for automatic speech recognition, FAISS for vector similarity search, and large vision-language models (LVLMs) for answer generation. In experiments on three enterprise use cases, MM-RAG achieves a cross-modal retrieval accuracy of 94.2%, a 67.3% reduction in hallucination compared with unimodal baselines, and a multi-modal document throughput of 1,200 pages per hour. Our comparative study shows that MM-RAG significantly outperforms unimodal systems on tasks that demand multimodal integration, especially visual product search (recall@10 = 92.8%) and audio/video content retrieval (accuracy = 88.5%).
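To make the recall@k retrieval metric quoted in the abstract concrete, the following is a minimal sketch of cross-modal retrieval in a shared embedding space. It is an illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors are made-up stand-ins for encoder outputs (for example from BridgeTower), the brute-force cosine search stands in for a FAISS index, and the names `cosine`, `top_k`, and `recall_at_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, index, k):
    """Return the ids of the k indexed items most similar to the query.

    Brute-force search; a vector library such as FAISS would replace this
    step at scale (assumption, not shown here).
    """
    ranked = sorted(index, key=lambda item_id: cosine(query, index[item_id]),
                    reverse=True)
    return ranked[:k]


def recall_at_k(queries, index, relevant, k):
    """Fraction of queries whose single relevant item is in the top k."""
    hits = sum(1 for qid, qvec in queries.items()
               if relevant[qid] in top_k(qvec, index, k))
    return hits / len(queries)


# Toy embeddings (invented for illustration): images and an audio
# transcript mapped into one shared space.
index = {
    "img_shoe": [0.9, 0.1, 0.0],
    "img_lamp": [0.1, 0.9, 0.1],
    "audio_call": [0.0, 0.2, 0.9],
}
# Text queries embedded into the same space.
queries = {
    "q_shoe": [0.8, 0.2, 0.1],
    "q_lamp": [0.2, 0.8, 0.0],
    "q_call": [0.6, 0.1, 0.5],  # deliberately ambiguous query
}
relevant = {"q_shoe": "img_shoe", "q_lamp": "img_lamp", "q_call": "audio_call"}

for k in (1, 2):
    print(f"recall@{k} = {recall_at_k(queries, index, relevant, k):.3f}")
```

In this toy setup the ambiguous query ranks its relevant item second, so recall rises as k grows. This is why a metric such as recall@10 is reported with its cutoff.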

DOI: https://doi.org/10.5281/zenodo.20071343