Medical Visual Question Answering

21 May

Authors: Latheesha P, Meenakshi P, Nikitha D, Uma Maheswari T, Dr. Sundara Rajulu Navaneetha Krishnan

Abstract: A transformer-based multimodal framework for Medical Visual Question Answering (Med-VQA) brings together computer vision and natural language processing to answer text-based questions about medical images automatically. Previous approaches commonly struggle with limited interpretability of outputs, small annotated datasets, and limited domain-specific reasoning. In this paper, we propose a transformer-based Med-VQA framework that is optimized on the Med-VQA-RAD dataset and whose structure is based on Salesforce-VQA-Base. Our approach fuses multimodal features, both textual and visual, to improve the reliability and accuracy of responses. We assess performance using standard metrics such as accuracy and BLEU, together with medical-focused evaluation measures, and observe gains over baseline models. The results indicate that the proposed architecture improves diagnostic question-answering capability and offers understandable information to clinicians. This work paves the way for future research on scalable and explainable Med-VQA systems and advances the development of AI-assisted tools for clinical decision-making.
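
As an illustration of the accuracy and BLEU metrics named in the abstract, the following minimal sketch computes exact-match accuracy and a unigram BLEU score for short answers. The abstract does not specify the answer normalization, the BLEU n-gram order, or the tooling used, so the lowercase whitespace tokenization, the choice of BLEU-1, and the function names `normalize`, `exact_match_accuracy`, and `bleu1` are all assumptions made here for illustration, not the authors' evaluation code.

```python
import math
from collections import Counter


def normalize(answer: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed normalization)."""
    return answer.lower().strip().split()


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose normalized tokens equal the reference's."""
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(references)


def bleu1(prediction: str, reference: str) -> float:
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each predicted token's count by its count in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in Counter(pred).items())
    precision = clipped / len(pred)
    # Penalize predictions shorter than the reference.
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return precision * brevity_penalty
```

Exact match suits closed-ended questions (for example yes/no answers), while an overlap score such as BLEU gives partial credit on open-ended answers; a full evaluation would more likely rely on an established implementation than on this sketch.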