Eternal Voice: An Adaptive Multi-Modal Framework For Personalized Emotional Memory Preservation

4 Jun

Authors: Suyash Arvind Lothe, Bhavya Ketan Doshi, Dr. Jasbir Kaur, Sandhya Thakkar, Suraj Kanal

Abstract: Preserving human identity through generative AI poses distinct challenges in maintaining vocal fidelity, emotional nuance, and ethical integrity. While the individual technologies for text-to-speech (TTS) and large language models (LLMs) are mature, their integration into a cohesive, user-centric framework for digital legacy remains underexplored. This paper presents Eternal Voice, an adaptive multi-modal framework that tightly couples speaker-adaptive voice synthesis with an emotion-aware conversational agent. The key contribution is a novel integration pipeline that combines a fine-tuned Llama-3-8B model with emotion vector conditioning and a Tacotron 2 synthesizer with a WaveNet vocoder, optimized for low-resource speaker adaptation (SV2TTS). We provide a comprehensive technical evaluation, including an ablation study that quantifies the impact of emotion injection on response naturalness. Experimental results on the LibriTTS and ESD datasets show a Mean Opinion Score (MOS) of 4.6 ± 0.2 for voice similarity, and dialogue coherence improves by 18% over standard LLM baselines. The paper also includes a rigorous discussion of failure modes, deepfake countermeasures, and the limitations of current emotional AI in handling ambiguous human affect.

DOI: http://doi.org/10.5281/zenodo.20541580