Authors: Associate Professor A.C. Sawant, Suyog Kute, Ayush Kalbhor, Janmajay Dethe, Sonali Kshirsagar
Abstract: The increasing complexity of malware-based network attacks poses a serious challenge to modern communication systems. Traditional signature-based intrusion detection systems struggle to detect novel or evolving attack patterns. This paper presents a comprehensive machine learning framework for classifying malware network traffic using flow-based features. The study uses a dataset of 60,938 instances spanning six classes: normal traffic and five malware types (worm, rootkit, buffer overflow, ipsweep, and sqlattack). The dataset exhibits extreme class imbalance, with normal traffic comprising 99.43% of samples. To handle this, we use stratified sampling alongside label encoding, one-hot encoding, and feature standardization, and evaluate performance using accuracy, precision, recall, F1-score, confusion matrix, ROC analysis, and training time. Seven supervised classifiers from different learning paradigms (Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, Gradient Boosting, XGBoost, and LightGBM) are developed and compared. Experimental results show that ensemble methods markedly outperform single classifiers in detecting minority attack classes. XGBoost achieves the highest overall performance: 99.95% accuracy, 99.96% precision, 99.95% recall, and 99.95% F1-score, with a training time of 3.70 seconds. LightGBM provides the best trade-off between accuracy (99.79%) and speed (3.25 seconds), while Gradient Boosting requires substantially longer training (82.99 seconds). The findings confirm that ensemble learning is highly effective for malware traffic classification under severe class imbalance.
International Journal of Science, Engineering and Technology
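The abstract reports accuracy together with precision, recall, and F1-score on a dataset where 99.43% of samples are normal traffic. The following minimal sketch illustrates why accuracy alone says little at that level of imbalance. It is not the authors' code. It assumes a simplified two-class view (all five malware types collapsed into a single "attack" label, since per-class counts are not given), approximates the class counts from the reported totals, and uses hypothetical helper names (`per_class_metrics`, `accuracy`, `macro_average`).

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_average(metrics, index):
    """Unweighted mean over classes of one metric (0=precision, 1=recall, 2=F1)."""
    return sum(m[index] for m in metrics.values()) / len(metrics)


# Assumption: counts approximated from the reported 60,938 instances and 99.43% share.
TOTAL = 60_938
N_NORMAL = round(TOTAL * 0.9943)
N_ATTACK = TOTAL - N_NORMAL

y_true = ["normal"] * N_NORMAL + ["attack"] * N_ATTACK
y_pred = ["normal"] * TOTAL  # trivial baseline: always predict the majority class

baseline_accuracy = accuracy(y_true, y_pred)
baseline_metrics = per_class_metrics(y_true, y_pred)
macro_recall = macro_average(baseline_metrics, 1)
macro_f1 = macro_average(baseline_metrics, 2)

print(f"attack samples: {N_ATTACK}")
print(f"accuracy:       {baseline_accuracy:.4f}")
print(f"macro recall:   {macro_recall:.4f}")
print(f"macro F1:       {macro_f1:.4f}")
```

Under these assumptions, a baseline that never flags an attack still reaches roughly 99.43% accuracy, while its macro-averaged recall is only 0.5 and its macro F1 is just under 0.5. This is why per-class precision, recall, and F1 are the more informative figures when comparing classifiers on minority attack classes.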