Data-Driven Crop Yield Forecasting Using Machine Learning and Global Agricultural Datasets

12 Apr

Authors: Ambuj Kumar Misra

Abstract: Accurate prediction of crop yields is fundamental to sustainable food systems, agricultural policy, and climate adaptation planning. This paper presents a comprehensive, data-driven framework for crop yield forecasting that integrates satellite remote sensing, climate reanalysis data, soil property databases, and agronomic records sourced from global platforms including FAOSTAT, NASA MODIS, ERA5, and SoilGrids. Six machine learning architectures — Linear Regression, Decision Tree, Random Forest, Gradient Boosting (XGBoost), Long Short-Term Memory (LSTM), and a hybrid CNN-LSTM network — were systematically evaluated across three globally significant staple crops (wheat, rice, and maize) spanning 123 countries and the period 2000–2022. The hybrid CNN-LSTM model achieved the highest predictive accuracy, recording an R² of 0.94 and an RMSE of 1.63 t/ha on held-out test data. Feature importance analysis identified precipitation, mean growing-season temperature, and the Normalized Difference Vegetation Index (NDVI) as the dominant predictors of yield variability. The study also demonstrates that integrating satellite-derived phenological metrics with climate variables substantially improves forecast skill relative to climate-only baselines. These findings establish a scalable, transferable methodology for near-real-time yield monitoring applicable to food security assessment and early warning systems.
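The two accuracy figures quoted in the abstract, R² and RMSE, can be illustrated with a minimal sketch. The function names and the sample yields below are hypothetical and chosen only for illustration; they are not the study's code or data, and the paper does not state which implementation was used to compute its metrics.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the inputs (e.g. t/ha)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when observed values are constant")
    return 1.0 - ss_res / ss_tot


# Made-up yields in t/ha, for illustration only (not data from the study).
observed_yield = [3.0, 4.0, 5.0, 6.0]
predicted_yield = [2.5, 4.5, 5.0, 6.0]

print(f"RMSE = {rmse(observed_yield, predicted_yield):.3f} t/ha")
print(f"R^2  = {r_squared(observed_yield, predicted_yield):.3f}")
```

RMSE is expressed in the units of the target variable, so it reads directly as a typical error in t/ha, while R² is unitless and reports the share of yield variance the model explains on the evaluated data.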

DOI: http://doi.org/10.5281/zenodo.830