An Efficient Email Spam Detection Framework Using TF-IDF Vectorization and Comparative Machine Learning Classifiers

22 May

Authors: Gandu Eshwar, Chirra Ram Gopal Rao, Thandu Venkat Sai

Abstract: Electronic mail is one of the most widely used forms of digital communication, yet it is increasingly compromised by unsolicited bulk messages, commonly referred to as spam. Spam email not only consumes bandwidth and storage but also exposes users to phishing, malware, and identity-theft risks. Conventional rule-based and blacklist-driven approaches struggle to keep pace with the rapidly evolving obfuscation strategies adopted by spammers. This paper presents an efficient and scalable email spam detection framework that combines Term Frequency–Inverse Document Frequency (TF-IDF) vectorization with a set of supervised machine learning classifiers, including Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), K-Nearest Neighbors (KNN), Random Forest (RF), Extra Trees Classifier (ETC), and gradient-boosted ensembles. Experiments on a publicly available labeled corpus of 5,572 messages show that the proposed TF-IDF + Linear SVM pipeline attains 99.9% accuracy on the training data and 98.2% accuracy on unseen test data. Ensemble strategies based on soft voting and stacking achieve precision values as high as 1.0, meaning that no legitimate message in the evaluated test partition was flagged as spam. These findings suggest that the proposed framework is a lightweight, interpretable, and deployment-ready solution for real-world spam filtering systems.
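
For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch of TF-IDF weighting and of the precision metric, written with the Python standard library only. It is not the implementation used in this paper: the whitespace tokenizer, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, the L2 normalization, the function names tfidf_vectors and precision, and the three toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Assumed tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def precision(y_true, y_pred, positive="spam"):
    """Precision = TP / (TP + FP); equals 1.0 when there are no false positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical toy messages, not drawn from the evaluated corpus.
docs = [
    "win a free prize now",
    "are we meeting for lunch",
    "free prize claim now",
]
vectors = tfidf_vectors(docs)
```

In this sketch, a term that occurs in fewer messages (such as "win") receives a larger weight than one shared across messages (such as "free"), which is the property that makes TF-IDF features useful to a downstream classifier. The precision helper makes the abstract's claim concrete: a value of 1.0 means every message predicted as spam was in fact spam.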