Authors- Lithika Jadav Malothu
Abstract- Toxicity prediction is a pivotal step in drug discovery, crucial for minimizing late-stage failures due to adverse effects of chemical compounds. This study investigates the performance of multiple machine learning models, including LightGBM, Random Forest, and XGBoost, in combination with different molecular featurizers such as Extended Connectivity Fingerprints (ECFP), ChemBERTa, and Graph Convolutional Networks (GCN). Using the Tox21 dataset as a benchmark, we demonstrate that the Random Forest model paired with ChemBERTa achieves superior predictive accuracy and interpretability for toxicity prediction tasks. Additionally, our analysis identifies key substructures, such as aromatic rings and halogenated groups, that significantly influence toxicity predictions, emphasizing the role of appropriate model-featurizer combinations in enhancing prediction accuracy and interpretability. This work contributes to the growing body of research in cheminformatics by providing insights into the selection of optimal model-featurizer pairs, which can potentially guide safer and more efficient drug development processes.