Sign-to-Speech: Generating Natural Language Audio from Skeletal Sign Language Input Using Transformer Models | International Journal of Science, Engineering and Technology

Authors: Dr. Pankaj Malik, Rahul Singh, Daksh Khandelwal, Shruti Bajpai, Aayush Tiwari

Abstract: This research presents Sign-to-Speech, a novel two-stage framework that translates skeletal sign language gestures into natural spoken language using Transformer-based models. The system first captures 3D skeletal joint data with pose estimation tools such as MediaPipe, then translates the resulting joint sequences into text sentences with a custom Pose2Text Transformer. These sentences are then converted into speech by a Tacotron2-based Text-to-Speech (TTS) synthesizer. To evaluate the proposed method, we used the RWTH-PHOENIX-Weather 2014T dataset and a custom Indian Sign Language (ISL-TTS) dataset containing synchronized skeletal, textual, and audio samples. Our Pose2Text model achieved a BLEU score of 41.7, a METEOR score of 0.47, and a Word Error Rate (WER) of 23.8%, outperforming conventional CNN-LSTM baselines. The final speech output was assessed using the Mean Opinion Score (MOS); our model received an average MOS of 4.1 out of 5, indicating high naturalness and intelligibility. These results demonstrate the feasibility of end-to-end skeletal sign-to-speech translation, which could enable smoother communication for hearing-impaired individuals and lay the foundation for real-time assistive technologies.
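
For readers unfamiliar with the Word Error Rate reported above, the following is a minimal Python sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference sentence and a model output, divided by the number of reference words. The abstract does not state how the authors computed WER, so the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: whitespace tokenization is an assumption,
    not the authors' documented evaluation setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, not taken from either dataset.
reference = "rain is expected in the north tomorrow"
hypothesis = "rain expected in the north today"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In this invented example the hypothesis drops one word and substitutes another, giving two errors over seven reference words. A corpus-level figure such as the 23.8% above is normally obtained by summing errors and reference lengths over all test sentences rather than averaging per-sentence rates, although the abstract does not say which convention was used.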

DOI: https://doi.org/10.5281/zenodo.16269016