Authors: Seema Ahirwar, Rupali Chaure
Abstract: This paper presents a comparative evaluation of seven supervised machine learning (ML) classifiers (Logistic Regression, Naive Bayes, k-Nearest Neighbors, Decision Tree, Random Forest, XGBoost, and SVM) for early prediction of Type 2 diabetes mellitus (T2DM). Using the Pima Indians Diabetes Database (PIDD, n=768) and the Frankfurt Hospital Diabetes Dataset (FHDD, n=2000), we apply standardized preprocessing with SMOTE-based class imbalance correction and stratified 10-fold cross-validation[1]. Models are evaluated on Accuracy, Precision, Recall, F1-Score, ROC-AUC, and Matthews correlation coefficient (MCC). Results show that XGBoost consistently outperforms all other classifiers (AUC: 0.901 on PIDD, 0.951 on FHDD), while Logistic Regression retains interpretability advantages for clinical deployment. Feature importance analysis identifies fasting plasma glucose, BMI, and HbA1c as the top predictors, in line with clinical guidelines.
International Journal of Science, Engineering and Technology
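
The threshold-based metrics named in the abstract (Accuracy, Precision, Recall, F1-Score, MCC) can all be derived from the four confusion-matrix counts. The sketch below is illustrative only and is not the authors' code: the function name `binary_metrics` and the example counts are hypothetical, and ROC-AUC is omitted because it requires predicted scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Undefined ratios (zero denominator) are
    reported as 0.0, which is a convention chosen for this sketch.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration; not results from the paper.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Reporting MCC alongside accuracy is useful on datasets such as these, where the diabetic class is the minority: a classifier can reach high accuracy by favoring the majority class while its MCC remains low.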