Multi-Modal AI Evaluator: A Unified Framework for Evaluating Instruction-Following across Text, Image, and Audio Modalities
Authors: Guhaa Priyan GP, Methun R, Prasanna SN, Jeevak A
Abstract: Artificial Intelligence has made substantial progress in processing individual content types such as text, images, and audio. Real-world tasks, however, often require interpreting and reasoning over these modalities simultaneously. This paper presents the Multi-Modal AI Evaluator, a framework for assessing how effectively AI models follow instructions that span multiple input types. The system provides a web interface built with Streamlit through which users supply text, images, and audio; it integrates the Google Speech-to-Text API for transcription, GPT-2 for language generation, and a human-in-the-loop feedback mechanism for a comprehensive assessment of model responses. The evaluator produces real-time scores based on accuracy, contextual relevance, and user feedback. This work is a step toward standardized evaluation of multi-modal AI systems and toward artificial intelligence that better meets human needs.
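To make the pipeline described above concrete, the listing below is a minimal sketch of how the named components could be wired together: a Streamlit interface collecting text, image, and audio inputs, Google Speech-to-Text for transcription, GPT-2 for generation, and slider widgets for human feedback. It is an illustration under stated assumptions, not the authors' implementation; the function names, widget labels, and composite score are hypothetical, and the Speech-to-Text configuration is a placeholder that must match the uploaded audio format.

# Minimal illustrative sketch (not the paper's implementation), assuming
# `pip install streamlit transformers torch google-cloud-speech` and
# Google Cloud credentials configured for the Speech-to-Text client.
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_generator():
    # GPT-2 via the Hugging Face text-generation pipeline.
    return pipeline("text-generation", model="gpt2")

def transcribe(audio_bytes: bytes) -> str:
    # Google Speech-to-Text; this config is a placeholder and must match
    # the uploaded file's encoding and sample rate.
    from google.cloud import speech
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(language_code="en-US")
    audio = speech.RecognitionAudio(content=audio_bytes)
    response = client.recognize(config=config, audio=audio)
    return " ".join(r.alternatives[0].transcript for r in response.results)

st.title("Multi-Modal AI Evaluator (sketch)")
instruction = st.text_area("Instruction")
image_file = st.file_uploader("Image (optional)", type=["png", "jpg"])
audio_file = st.file_uploader("Audio (optional)", type=["wav"])

if st.button("Generate response") and instruction:
    prompt = instruction
    if audio_file is not None:
        prompt += "\nAudio transcript: " + transcribe(audio_file.read())
    result = load_generator()(prompt, max_new_tokens=80)
    st.session_state["response"] = result[0]["generated_text"]

if "response" in st.session_state:
    if image_file is not None:
        # GPT-2 is text-only; the image is shown to the human rater.
        st.image(image_file)
    st.write(st.session_state["response"])
    # Human-in-the-loop feedback on the abstract's scoring criteria.
    accuracy = st.slider("Accuracy", 1, 5, 3)
    relevance = st.slider("Contextual relevance", 1, 5, 3)
    st.write(f"Composite score: {(accuracy + relevance) / 2:.1f}")

Storing the generated response in st.session_state keeps it visible across Streamlit reruns, so moving a feedback slider does not discard the output being rated.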