Authors: Siddhi Gupta, Ms. Dimpy Parashar, Dr. Yatu Rani
Abstract: The rapid expansion of digital messaging platforms has brought a surge in spam and scam communications, particularly in linguistically diverse regions such as India, where users frequently communicate in Hinglish, a blend of Hindi and English. Traditional spam detection systems are typically designed for monolingual text and handle such code-mixed patterns poorly. This paper presents a hybrid, context-aware spam and scam detection framework that combines rule-based keyword filtering with ensemble machine learning. The system performs multi-stage text preprocessing, applies TF-IDF feature extraction, and trains three individual classifiers (Logistic Regression, Naive Bayes, and Support Vector Machine), which are fused through soft-voting ensemble learning. Evaluated on a curated dataset of 5,200 Hinglish and English messages, the proposed Voting Classifier achieves an accuracy of 94.7%, precision of 93.8%, recall of 94.5%, and F1-score of 94.1%, surpassing all individual baseline models. The results indicate that ensemble fusion combined with domain-specific rule-based filtering improves detection coverage for code-mixed, real-world spam and scam messages.
International Journal of Science, Engineering and Technology
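
The fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The keyword list, the per-classifier probabilities, the 0.5 threshold, and the choice to let a rule hit override the ensemble are all assumptions made for illustration, since the abstract does not specify them. The last function checks that the reported F1-score is consistent with the reported precision and recall.

```python
from typing import Dict, List

# Hypothetical keyword list; the paper's actual rule set is not given in the abstract.
SCAM_KEYWORDS = {"lottery", "otp", "inaam", "jeeta"}


def rule_flag(message: str) -> bool:
    """Return True if any (hypothetical) scam keyword appears as a token."""
    tokens = (tok.strip(".,!?") for tok in message.lower().split())
    return any(tok in SCAM_KEYWORDS for tok in tokens)


def soft_vote(probas: List[Dict[str, float]]) -> Dict[str, float]:
    """Unweighted soft voting: average each class probability across classifiers."""
    n = len(probas)
    return {label: sum(p[label] for p in probas) / n for label in probas[0]}


def classify(message: str, probas: List[Dict[str, float]], threshold: float = 0.5) -> str:
    """Assumed combination: a rule hit marks spam outright, else the ensemble decides."""
    if rule_flag(message):
        return "spam"
    return "spam" if soft_vote(probas)["spam"] >= threshold else "ham"


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up probabilities standing in for Logistic Regression, Naive Bayes, and SVM outputs.
example_probas = [
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.70, "ham": 0.30},
    {"spam": 0.55, "ham": 0.45},
]
averaged = soft_vote(example_probas)
label = classify("Kal milte hain", example_probas)
reported_f1 = f1_score(0.938, 0.945)
```

In this example one classifier leans towards "ham", yet the averaged spam probability is about 0.55, so the ensemble labels the message as spam. With the reported precision of 93.8% and recall of 94.5%, `f1_score` gives roughly 0.941, which matches the reported 94.1%.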