COVID-19 Data Analysis and Forecasting Using Machine Learning and Time Series Models

11 May

Authors: Assistant Professor S. P. Gunjal, Tanaya Balasaheb Sandbhor, Nisha Sanjay Tekade, Sadichha Balshiram Talekar, Tejaswini Rajendra Pawar

Abstract: The COVID-19 pandemic has caused an unprecedented global health crisis, making accurate forecasting of case counts critically important for government planning and resource allocation. This paper presents a data analysis and multi-model forecasting study on the Kaggle COVID-19 day-wise dataset spanning January 22 to July 27, 2020. We apply two statistical time series models (ARIMA and Holt-Winters Exponential Smoothing) alongside machine learning approaches including Ridge Regression, Lasso, ElasticNet, and Random Forest Regressor. Data preprocessing includes Augmented Dickey-Fuller (ADF) stationarity testing, first-order differencing, 7-day rolling smoothing, and lag feature engineering. Models are evaluated using MAE, RMSE, and R² with an 80/20 temporal train-test split and 5-fold time-series cross-validation. Random Forest achieved the best performance, with RMSE = 13,227 and R² = 0.9612. A 30-day future forecast with 95% confidence intervals is generated using the best-fit ARIMA(2,1,2) model. On this dataset, the results indicate that ensemble machine learning methods outperform classical statistical models for COVID-19 case prediction.
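The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the lag count of 3, the synthetic input series, and the naive persistence baseline are all assumptions made for illustration. It uses only the Python standard library, so the ADF test is omitted (in practice it would typically come from `statsmodels.tsa.stattools.adfuller`), as are the actual models and the 5-fold time-series cross-validation.

```python
import math


def difference(series):
    """First-order differencing: y[t] - y[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def rolling_mean(series, window=7):
    """Trailing rolling mean; the first (window - 1) points are dropped."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def make_lag_features(series, n_lags=3):
    """Build (X, y): each row of X holds the n_lags values preceding y."""
    X = [series[i - n_lags:i] for i in range(n_lags, len(series))]
    y = series[n_lags:]
    return X, y


def temporal_split(X, y, train_frac=0.8):
    """Split without shuffling, so the test set lies strictly after training."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical cumulative case counts (synthetic, not the Kaggle data).
cases = [t * t for t in range(30)]
daily = difference(cases)                    # first-order differencing
smooth = rolling_mean(daily, window=7)       # 7-day rolling smoothing
X, y = make_lag_features(smooth, n_lags=3)   # lag feature engineering
X_train, X_test, y_train, y_test = temporal_split(X, y, train_frac=0.8)

# Naive persistence baseline (an assumption): predict the most recent lag.
y_pred = [row[-1] for row in X_test]
print("MAE:", mae(y_test, y_pred))
print("RMSE:", rmse(y_test, y_pred))
print("R2:", r2(y_test, y_pred))
```

The split is done by position rather than at random because shuffling a time series would let the model train on observations that come after the ones it is tested on.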

DOI: https://doi.org/10.5281/zenodo.20117590