Authors: Amit Verma
Abstract: As cloud computing and Network Function Virtualization (NFV) become the backbone of modern digital infrastructure, ensuring the reliability of virtualized environments is paramount. Traditional rule-based fault detection systems often struggle with the dynamic, high-dimensional, and opaque nature of virtual machines (VMs) and containers. This article explores the paradigm shift toward Machine Learning (ML)-driven fault detection, analyzing how supervised, unsupervised, and deep learning models identify anomalies in system logs, performance metrics, and network traffic. We examine the architecture of these systems, the critical role of feature engineering in capturing temporal and structural dependencies, and the transition toward proactive self-healing environments. By reviewing current methodologies and performance benchmarks, this article highlights the trade-offs between detection latency and computational overhead. Finally, we discuss persistent challenges such as data sparsity, model interpretability, and the emerging integration of Large Language Models (LLMs) and Digital Twins in the fault diagnosis lifecycle for 2026 and beyond.
International Journal of Science, Engineering and Technology