Authors: Sia Rawat
Abstract: Recent advances in audio-driven talking-head generation have enabled the creation of lifelike videos directly from speech input. Despite this progress, most current workflows assume desktop-centric usage and require manual recording, configuration, and rendering. Everyday users, by contrast, increasingly rely on wearables, especially smartwatches, to capture spontaneous speech. This paper proposes W2V (Wrist-to-Video), a novel pipeline in which a smartwatch records speech in naturalistic settings and, upon pairing with a PC or MacBook, automatically triggers speech-to-video synthesis. The pipeline integrates lightweight on-device preprocessing, speech recognition and structuring, and video generation through diffusion-based, NeRF-based, and 3D-aware talking-head models. Emotional expression and co-speech gestures are supported via adapter and diffusion modules, and outputs are finalized with captioning and branding for practical deployment. Privacy, fairness, and computational efficiency are built into the design to support both accessibility and ethical safeguards. This work surveys the relevant literature, proposes a detailed technical architecture, and outlines evaluation, limitations, and future scope. The result is a frictionless, user-centered system that transforms casual speech into professional video content, lowering the entry barrier for education, telehealth, enterprise, and creative applications.
DOI: https://doi.org/10.5281/zenodo.17131002
International Journal of Science, Engineering and Technology