Authors: Sakshi Keshri, Nitin Namdev
Abstract: In video content analysis, differentiating violence from non-violence remains a formidable challenge, primarily because video streams are dynamic, high-dimensional, and temporally complex. Traditional architectures such as ResNet50 and MobileNetV2, though powerful for static image classification, struggle to capture the temporal dynamics inherent in sequential video data. To address this limitation, we introduce a hybrid deep learning framework that unites the spatial abstraction power of InceptionV3 with the sequential learning capacity of Long Short-Term Memory (LSTM) networks. The pipeline begins with robust preprocessing (noise suppression, feature extraction, and data refinement), ensuring the model is trained on high-quality representations. By fusing InceptionV3's rich spatial embeddings with the LSTM's temporal modeling, the system achieves a training accuracy of 99.86% and a validation accuracy of 92.48%, significantly surpassing the baseline models. Beyond these results, the hybrid approach underscores the potential of deep learning to advance intelligent video surveillance and enable fine-grained, real-time behavioral analysis in safety-critical environments.
International Journal of Science, Engineering and Technology
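The hybrid pipeline the abstract describes (per-frame CNN embeddings fed into an LSTM, whose final state drives a binary violence/non-violence classifier) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the InceptionV3 extractor is replaced by a random stand-in for the per-frame features, and all shapes, weights, and names (16 frames, 2048-dim embeddings, 64 hidden units) are assumptions chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from frame embedding x and previous state (h, c)."""
    H = h.shape[0]
    z = W @ x + U @ h + b                      # stacked gate pre-activations, shape (4*H,)
    i, f, o = (sigmoid(z[k * H:(k + 1) * H]) for k in range(3))  # input/forget/output gates
    g = np.tanh(z[3 * H:])                     # candidate cell update
    c = f * c + i * g                          # new cell state
    h = o * np.tanh(c)                         # new hidden state
    return h, c

# Hypothetical shapes: 16 frames per clip, 2048-dim embeddings (InceptionV3's
# pooled feature size), 64 LSTM hidden units.
T, D, H = 16, 2048, 64
frames = rng.standard_normal((T, D))           # stand-in for per-frame CNN features
W = rng.standard_normal((4 * H, D)) * 0.01     # input-to-gate weights
U = rng.standard_normal((4 * H, H)) * 0.01     # hidden-to-gate weights
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in frames:                               # temporal modeling across the clip
    h, c = lstm_step(x, h, c, W, U, b)

# Final hidden state -> violence/non-violence probability via a linear head.
w_out = rng.standard_normal(H) * 0.01
score = sigmoid(w_out @ h)
print(float(score))
```

In the full system, `frames` would come from running each video frame through a pretrained InceptionV3 with its classification head removed, and `W`, `U`, `b`, and `w_out` would be learned end-to-end rather than drawn at random.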