Development of Ensemble Model for Advanced Metamorphic Malware Detection

16 Sep

Authors: Ali Shuaibu Babaa, Aminu Abdullah, Farouk Lawan Gambo

Abstract: This thesis presents the development of an ensemble machine learning model for the detection of advanced metamorphic Windows Portable Executable (PE) malware, which poses significant challenges to traditional detection because it constantly rewrites its own code to evade signatures. The research employs a static analysis approach, extracting diverse feature sets including opcode n-grams, assembly instruction patterns, PE structural attributes, and textual strings. Feature dimensionality was reduced using Principal Component Analysis (PCA), while XGBoost-based ranking was applied for feature selection. Four heterogeneous classifiers, Decision Tree (DT), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Neural Network (NN), were trained and combined using bagging, boosting, and stacking ensemble techniques to enhance accuracy and resilience. The study utilized two categories of datasets: (i) large-scale real-world malware and benign PE files obtained from public repositories, and (ii) synthetically generated metamorphic malware created with the Next Generation Virus Creation Kit (NGVCK) to validate robustness against mutation. Data preprocessing included cleaning, normalization, and class balancing using oversampling techniques. Model evaluation was conducted using k-fold cross-validation and a separate hold-out validation set, with metrics including Accuracy, Precision, Recall, and F1-score. Experimental results demonstrated that individual classifiers achieved outstanding detection performance: Decision Tree (CV Accuracy 0.9976; Val Accuracy 1.0000; Precision/Recall/F1 all 1.0000), SVM (CV Accuracy 0.9941; Val all metrics 1.0000), Neural Network (CV Accuracy 0.9965; Val all metrics 1.0000), and KNN (CV Accuracy 0.9906; Val Accuracy 0.9921; Val F1 ≈ 0.9582).
The ensemble configurations consistently outperformed the individual classifiers and the Ahmed Ali (2020) SVM baseline, delivering superior adaptability to mutated malware and reduced false positives while maintaining computational feasibility. This work contributes a robust and practical ensemble framework that leverages multi-modal static features and classifier diversity to achieve high-accuracy metamorphic malware detection. The study recommends extending future research to include dynamic behavioral features, reproducibility artifacts, and deployment-oriented evaluations to further strengthen real-world applicability.
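
The stacking variant of the pipeline described in the abstract could look roughly like the following minimal sketch, written with scikit-learn. It is illustrative only and is not the authors' implementation: synthetic data from make_classification stands in for the extracted static features, the XGBoost-based feature ranking and the oversampling step are omitted to keep the example dependency-free, and the function names, hyperparameters, number of PCA components, and the logistic-regression meta-learner are all assumptions that the abstract does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_stacking_pipeline(n_components=10, seed=0):
    """Normalize, reduce with PCA, then stack DT, SVM, KNN, and NN (hypothetical settings)."""
    base_learners = [
        ("dt", DecisionTreeClassifier(random_state=seed)),
        ("svm", SVC(random_state=seed)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)),
    ]
    # Assumption: logistic regression as meta-learner; the abstract does not name one.
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(),
        cv=3,
    )
    return make_pipeline(StandardScaler(), PCA(n_components=n_components), stack)


def evaluate(seed=0):
    """Run k-fold CV on the training split, then score a separate hold-out split."""
    # Synthetic stand-in for opcode, PE-structure, and string features.
    X, y = make_classification(
        n_samples=400, n_features=30, n_informative=10, random_state=seed
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = build_stacking_pipeline(seed=seed)
    cv_scores = cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy")
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    return {
        "cv_accuracy": float(cv_scores.mean()),
        "accuracy": float(accuracy_score(y_val, pred)),
        "precision": float(precision_score(y_val, pred)),
        "recall": float(recall_score(y_val, pred)),
        "f1": float(f1_score(y_val, pred)),
    }


if __name__ == "__main__":
    for name, value in evaluate().items():
        print(f"{name}: {value:.4f}")
```

Placing the scaler and PCA inside the pipeline means they are refit on each training fold during cross-validation, which avoids leaking validation data into the preprocessing. The scores produced on synthetic data say nothing about the results reported above.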

DOI: https://doi.org/10.5281/zenodo.17131177