Predicting Overtraining and Injury Risk in Marathon Athletes Using Machine Learning

18 May

Authors: Vikas Kumar, Dr. Akhilesh Das Gupta Institute of Professional Studies

Abstract: I built a machine learning model that estimates a runner's injury risk and points out what they may be doing wrong in training. The idea came from personal experience: after just my first week of running, I was dealing with shin pain, tight hamstrings and calves, thigh soreness, and sore knees. To understand why, I watched YouTube videos and read blogs. I learned about running zones and why beginners who start too fast get injured early, how skipping warm-ups puts a cold body at serious risk of injury, and how under-eating or skipping recovery days makes everything worse. In short, injury does not come from one cause; it comes from getting several things wrong at the same time. Because collecting real athlete data from scratch would have taken years, I used a synthetically generated dataset of 1,240 weekly training records simulating 180 athletes, built around features such as the Acute:Chronic Workload Ratio (ACWR), weekly mileage change, and recovery patterns. I compared four classification models (Logistic Regression, Decision Tree, Random Forest, and XGBoost) to see which worked best for this prediction task. XGBoost performed best, with 88.7% accuracy and a ROC-AUC of 0.93 on the held-out test set. The model is deployed as a web app: the user enters details such as age, height, weekly mileage, mileage change percentage, recovery days, and active days, and gets back an injury risk percentage along with suggestions for staying injury-free.
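
To make two of the features named in the abstract concrete, here is a minimal sketch of how ACWR and weekly mileage change could be computed from a runner's weekly mileage history. It is an illustration under stated assumptions, not the code used in this work: the function names `acwr` and `mileage_change_pct` are hypothetical, the acute load is taken as the most recent week, and the chronic load is taken as the mean of the last four weeks including the current one (a common convention, but the abstract does not say which definition was used).

```python
def acwr(weekly_km, chronic_weeks=4):
    """Acute:Chronic Workload Ratio for the most recent week.

    Assumption: acute load = latest week's mileage; chronic load = mean of
    the last `chronic_weeks` weeks, including the latest week.
    """
    if len(weekly_km) < chronic_weeks:
        raise ValueError("need at least `chronic_weeks` weeks of history")
    acute = weekly_km[-1]
    chronic = sum(weekly_km[-chronic_weeks:]) / chronic_weeks
    if chronic == 0:
        return 0.0  # no recent training at all; ratio treated as 0 here
    return acute / chronic


def mileage_change_pct(weekly_km):
    """Percentage change in mileage from the previous week to the latest."""
    if len(weekly_km) < 2:
        raise ValueError("need at least two weeks of history")
    previous, current = weekly_km[-2], weekly_km[-1]
    if previous == 0:
        return 0.0  # change is undefined from a zero week; treated as 0 here
    return (current - previous) / previous * 100.0


# Hypothetical runner who jumps from 24 km to 36 km in one week.
history = [20, 22, 24, 36]
print(f"ACWR: {acwr(history):.2f}")
print(f"Mileage change: {mileage_change_pct(history):.1f}%")
```

For this made-up history the latest week is well above the four-week average, which is the kind of sudden load spike these features are meant to capture.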

DOI: https://doi.org/10.5281/zenodo.20266328