Authors: Mrs. Vibhavari Jawale, Mrs. Deepali Hajare, Arhant Sahuji, Tanay Shinde, Ananya Vaishnav, Ritesh Kadam
Abstract: Rapid advances in computer vision and natural language processing have opened the way to new forms of intelligent video surveillance. Traditional closed-circuit television (CCTV) and motion-based monitoring systems have limited ability to interpret context, which often results in false alarms and requires extensive human intervention. To address this gap, recent research explores integrating vision-language models (VLMs) with sentiment analysis for context-aware surveillance. This review focuses on emerging methodologies in which image captioning models such as Salesforce BLIP describe real-time video frames in natural language, followed by sentiment-driven analysis that assesses the nature of the detected activity. Combining visual understanding, language-based context generation, and sentiment inference enables systems to differentiate between benign and suspicious behavior, thereby reducing false positives and providing actionable insights. Key applications include public safety in smart cities, security in high-risk environments such as airports and banks, and monitoring of sensitive areas such as hospitals and military zones. The core contribution of this review is an evaluation of how VLM-based context awareness augments conventional object detection pipelines, shifting surveillance toward more explainable, human-like alerting mechanisms. We also discuss computational challenges, accuracy limitations, and privacy concerns, and highlight the societal implications of deploying such systems, including alignment with Sustainable Development Goals (SDGs) concerned with fostering safe cities and reducing crime. Future directions include multimodal fusion, real-time optimization, and ethical frameworks for responsible deployment.
DOI:
International Journal of Science, Engineering and Technology
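
The caption-then-sentiment pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the implementation of any reviewed system. It assumes that sentiment is scored on the caption text, which the abstract implies but does not state outright. The word lists, the `Alert` type, the function names, and the threshold are all hypothetical. The captioner is passed in as a plain callable so that a VLM can be substituted without changing the rest of the code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical toy lexicon, for illustration only. A deployed system would
# use a trained sentiment model rather than fixed word lists.
NEGATIVE_WORDS = {"fighting", "weapon", "gun", "knife", "attacking", "stealing"}
POSITIVE_WORDS = {"smiling", "walking", "sitting", "talking", "playing", "shopping"}


@dataclass
class Alert:
    frame_index: int  # position of the frame in the input sequence
    caption: str      # natural-language description of the frame
    score: float      # sentiment score in [-1, 1]; lower is more negative


def lexicon_sentiment(caption: str) -> float:
    """Score a caption as (positive hits - negative hits) / total hits.

    Returns 0.0 when no lexicon word appears in the caption.
    """
    words = [w.strip(".,!?") for w in caption.lower().split()]
    neg = sum(w in NEGATIVE_WORDS for w in words)
    pos = sum(w in POSITIVE_WORDS for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def review_frames(
    frames: Iterable[object],
    captioner: Callable[[object], str],
    sentiment: Callable[[str], float] = lexicon_sentiment,
    threshold: float = -0.5,
) -> List[Alert]:
    """Caption each frame, score the caption, and collect frames whose
    score is at or below the (assumed) alert threshold."""
    alerts = []
    for index, frame in enumerate(frames):
        caption = captioner(frame)
        score = sentiment(caption)
        if score <= threshold:
            alerts.append(Alert(index, caption, score))
    return alerts
```

In practice, `captioner` might wrap an image captioning model such as a BLIP checkpoint loaded through the Hugging Face `transformers` library, and `sentiment` a trained text classifier. Because each alert carries its caption, the operator can see why a frame was flagged, which is the explainability benefit the abstract describes.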