Enhanced Video Captioning Using a Hybrid Vision-Swin Transformer Technique with Semantic Feature Augmentation and Improved Optimization by the Eurasian Oystercatcher Algorithm


Authors: Dr. R. Muthuram, Associate Professor; S. Sanjay


Abstract: As an important step in multimedia processing, video captioning requires generating natural language descriptions of video content, integrating state-of-the-art approaches from computer vision and NLP to turn raw visual information into useful text. It is a complex task, as capturing the temporal progression and the structured relationships between objects, actions, and events in a video is challenging. In this paper, we propose a novel hybrid transformer model that integrates a ViT- and Swin Transformer-based classifier for video captioning. The MSVD dataset is used in this work. Caption preprocessing is applied first, comprising spelling correction, tokenization, part-of-speech tagging, and stop-word removal. TF-IDF, N-gram, and semantic web-based feature extraction techniques are then applied to build a richer representation of the textual data. The hybrid transformer model extracts visual features and generates captions, followed by hyperparameter optimization using the Eurasian Oystercatcher Optimiser (EOO). Generated captions are scored against ground-truth references using metrics such as BLEU, METEOR, and CIDEr.
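To make the caption preprocessing and textual feature-extraction steps concrete, the sketch below uses NLTK for tokenization, part-of-speech tagging, and stop-word removal, and scikit-learn for TF-IDF over unigrams and bigrams (the N-gram features). The library choices and function names are illustrative assumptions, not the authors' implementation; spelling correction and the semantic web-based features are omitted for brevity.

```python
# Illustrative sketch of caption preprocessing and TF-IDF / N-gram
# feature extraction (not the authors' code; spelling correction and
# semantic web-based features are omitted).
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Resource names may vary slightly across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess_caption(caption: str) -> list[str]:
    """Tokenize, POS-tag, and remove stop words from one caption."""
    tokens = word_tokenize(caption.lower())  # tokenization
    tagged = nltk.pos_tag(tokens)            # part-of-speech tagging
    # keep alphabetic content words only (drop stop words and punctuation)
    return [tok for tok, _tag in tagged
            if tok.isalpha() and tok not in STOP_WORDS]

captions = [
    "a man is riding a bicycle down the street",
    "a woman is cutting vegetables in the kitchen",
]
cleaned = [" ".join(preprocess_caption(c)) for c in captions]

# TF-IDF over unigrams and bigrams gives the richer textual representation.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(cleaned)
print(features.shape)  # (num_captions, vocabulary_size)
```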
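The following is a minimal sketch of how ViT and Swin Transformer backbones might be combined to extract per-frame visual features, assuming the timm library and pretrained ImageNet weights; the specific model names and the fusion by concatenation are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative fusion of ViT and Swin Transformer frame features
# (assumes the `timm` library; dummy frames stand in for a real video).
import torch
import timm

vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
swin = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0)
vit.eval()
swin.eval()

frames = torch.randn(8, 3, 224, 224)  # 8 sampled video frames (dummy data)

with torch.no_grad():
    vit_feats = vit(frames)    # (8, 768) global ViT embeddings
    swin_feats = swin(frames)  # (8, 1024) hierarchical Swin embeddings

# Simple fusion by concatenation; in a full captioning model the fused
# frame features would feed a language decoder that emits the caption.
fused = torch.cat([vit_feats, swin_feats], dim=-1)  # (8, 1792)
print(fused.shape)
```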
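For the evaluation step, the sketch below scores a generated caption against a ground-truth reference with BLEU and METEOR via NLTK. CIDEr is not available in NLTK and typically comes from a separate package such as pycocoevalcap, so it is omitted here; the example sentences are dummy data.

```python
# Illustrative caption scoring with BLEU and METEOR (NLTK).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # required by METEOR

reference = "a man is riding a bicycle down the street".split()
candidate = "a man rides a bike on the street".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
bleu = sentence_bleu([reference], candidate, smoothing_function=smooth)
meteor = meteor_score([reference], candidate)  # NLTK >= 3.6 expects token lists
print(f"BLEU-4: {bleu:.3f}  METEOR: {meteor:.3f}")
```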

DOI: http://doi.org/