Beyond Pixels: Multimodal Detection Of Interface-Consistent Chat Screenshot Manipulations

10 May

Authors: Vivek, Yashvardhan Pannu, Anirudh Thakur

Abstract: With chat screenshots increasingly admitted as evidentiary material in legal proceedings, journalistic investigations, and corporate disputes, the need for reliable authentication tools has become urgent. Existing image forensics frameworks, optimised for natural photographs or for synthetic face-swap detection, fail to address a structurally distinct attack category: interface-consistent manipulations, in which the visual grammar of a messaging platform is preserved while semantic content is falsified. This paper presents BeyondPixels, a three-channel multimodal detection framework combining Error Level Analysis (ELA), a domain-adapted EfficientNet-B3 convolutional neural network, and an OCR-driven semantic validator. The channels address complementary manipulation signatures: pixel-level compression artefacts, structural image disruptions, and logical inconsistencies in message text and timestamps. A weighted fusion engine converts the per-channel scores into a single normalised authenticity score. Evaluation on a purpose-built corpus of 3,000 synthetic screenshots, spanning five manipulation categories, two interface themes (WhatsApp and Telegram), and two languages (English and Hindi), yields 90.3% accuracy, an 89.5% F1-score, and an AUC of 0.951. Ablation confirms that every channel contributes independently and that the OCR module is decisive for text-only attack categories that image-level methods cannot detect.
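The abstract describes a weighted fusion engine that maps three per-channel scores onto one normalised authenticity score. A minimal sketch of that idea is below; the channel names, weights, and the "1 minus weighted manipulation evidence" convention are illustrative assumptions, not the paper's actual parameters.

```python
def fuse_scores(channel_scores: dict, weights: dict) -> float:
    """Combine per-channel manipulation scores (each in [0, 1], higher =
    stronger manipulation evidence) into one normalised authenticity
    score in [0, 1], where 1.0 means 'likely authentic'.

    Channel keys and weight values are hypothetical placeholders.
    """
    total = sum(weights.values())
    # Weighted average of manipulation evidence across the three channels.
    manipulation = sum(weights[c] * channel_scores[c] for c in weights) / total
    return 1.0 - manipulation

# Hypothetical channel outputs: ELA, CNN, and OCR semantic validator.
scores = {"ela": 0.2, "cnn": 0.1, "ocr": 0.0}
weights = {"ela": 0.3, "cnn": 0.5, "ocr": 0.2}   # assumed, not from the paper
authenticity = fuse_scores(scores, weights)       # 1 - 0.11 = 0.89
```

Normalising by the weight sum keeps the output in [0, 1] regardless of how the three weights are scaled, which is one simple way to realise the "single normalised authenticity score" the abstract mentions.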

DOI: https://doi.org/10.5281/zenodo.20102569