ARCH: An LLM-Driven Autonomous Self-Healing Framework For Cloud Operations Using Retrieval-Augmented Generation

24 Mar

Authors: Ripusoodan Sharma, Dr. Kirti Jain

 

Abstract: The growing complexity of cloud-native environments has made fault detection and recovery a critical challenge for modern IT operations. Traditional AIOps solutions rely heavily on static rules or data-driven models, which often adapt poorly to dynamic and unpredictable scenarios. To address these limitations, this paper proposes ARCH (Autonomous Reasoning and Contextual Healing), a self-healing framework that integrates Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) for autonomous cloud operations. The framework follows a layered architecture of perception, cognition, knowledge, and action components, enabling continuous monitoring, contextual reasoning, and automated remediation. Using reasoning strategies such as chain-of-thought and action-oriented decision-making, the system analyzes telemetry data and executes corrective actions with minimal human intervention [5], [7]. RAG additionally lets the system draw on historical incident data, improving decision accuracy and contextual awareness [4]. The framework is evaluated on Mean Time to Repair (MTTR), Autonomous Success Rate (ASR), and overall system efficiency. Experimental results show that the proposed approach achieves up to an 82% reduction in MTTR relative to conventional methods and an autonomous success rate of 89.5%. These findings highlight the potential of LLM-driven architectures for scalable, intelligent self-healing cloud systems.
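
The two headline metrics can be illustrated with a minimal sketch. The abstract does not give their formulas, so the definitions below are assumptions: MTTR is taken as the mean of (resolution time minus detection time), ASR as the fraction of incidents resolved without human intervention, and the reduction as relative to a baseline MTTR. The `Incident` type, the function names, and all sample values are hypothetical and are not the paper's data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    detected_at: float   # seconds since some reference time
    resolved_at: float   # seconds since the same reference time
    autonomous: bool     # True if resolved without human intervention


def mttr(incidents):
    """Assumed definition: mean time from detection to resolution."""
    return mean(i.resolved_at - i.detected_at for i in incidents)


def asr(incidents):
    """Assumed definition: share of incidents resolved autonomously."""
    return sum(i.autonomous for i in incidents) / len(incidents)


def mttr_reduction(baseline_mttr, new_mttr):
    """Relative MTTR reduction versus a baseline (0.75 means 75% lower)."""
    return (baseline_mttr - new_mttr) / baseline_mttr


# Made-up sample data, for illustration only.
baseline = [Incident(0, 100, False), Incident(0, 300, False)]
healed = [
    Incident(0, 40, True),
    Incident(0, 60, True),
    Incident(0, 80, False),
    Incident(0, 20, True),
]

reduction = mttr_reduction(mttr(baseline), mttr(healed))
success_rate = asr(healed)
```

Under these assumed definitions, a reported figure such as an 82% MTTR reduction would correspond to `mttr_reduction(...)` returning 0.82 for the evaluated incident set.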

DOI: https://doi.org/10.5281/zenodo.19204589