Authors: Bikash Shrestha
Abstract: Predictive maintenance (PdM) has emerged as a cornerstone of high availability and reliability in modern cloud infrastructure. As cloud environments grow in complexity, traditional reactive and preventive maintenance strategies often fall short, leading either to costly unplanned downtime or to wasteful over-servicing of resources. This review explores the use of Machine Learning (ML) algorithms, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Random Forests, to predict hardware failures, network anomalies, and software degradation. By analyzing real-time telemetry data, including CPU thermals, disk I/O latency, and power consumption, ML models can identify pre-failure patterns with high precision. The article discusses the architectural transition toward "AIOps," the challenges of data heterogeneity in multi-cloud environments, and the future role of Edge-Cloud collaboration. Ultimately, combining ML with cloud monitoring transforms maintenance from a cost center into a strategic advantage, helping providers meet service level objectives as stringent as 99.999% availability.
DOI: https://doi.org/10.5281/zenodo.19417741
International Journal of Science, Engineering and Technology