LipNet: Deep Learning for Visual Speech Recognition
Authors: Atharva Mahesh Khandagale, Atharva Sunil Bagave, Aditya Santosh Pande, Rishabh Hitesh Jain, Professor Kalaavathi B
Abstract: LipNet is a deep learning system that reimagines the approach to visual speech recognition. It makes predictions over whole sequences of words, mapping a video of lip movements directly to text in a single process. Unlike previous techniques that rely on hand-crafted feature extraction and processing in separate stages, LipNet is trained end-to-end. It uses spatiotemporal convolutional neural networks to capture visual features and recurrent neural networks with LSTM units to model the temporal sequence. A salient feature of LipNet is its use of the Connectionist Temporal Classification (CTC) loss, which enables it to learn directly from unsegmented data. Evaluated on challenging datasets such as GRID, LipNet sets a new standard in automated lip-reading accuracy. Its streamlined design and strong performance open up exciting possibilities in areas such as accessibility, silent communication, and security, making it a major step forward in this field.
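The pipeline the abstract describes can be illustrated with a minimal sketch, assuming PyTorch: spatiotemporal 3D convolutions over video frames, an LSTM over the resulting feature sequence, and the CTC loss over unsegmented character targets. All layer sizes, the vocabulary, and the class name are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class LipReaderSketch(nn.Module):
    """Hypothetical toy model: spatiotemporal CNN -> LSTM -> per-frame character log-probs."""
    def __init__(self, vocab_size=28):  # e.g. 26 letters + space + CTC blank (assumed)
        super().__init__()
        # 3D convolution captures motion across time as well as spatial lip features
        self.stcnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep the time axis
        )
        self.lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, vocab_size)

    def forward(self, video):  # video: (batch, channels, time, height, width)
        feats = self.stcnn(video)
        feats = feats.mean(dim=(3, 4))           # global spatial pooling -> (B, C, T)
        feats = feats.permute(0, 2, 1)           # -> (B, T, C) for the LSTM
        out, _ = self.lstm(feats)
        return self.fc(out).log_softmax(dim=-1)  # per-frame log-probs over characters

# CTC allows training on unsegmented data: no per-frame alignment is required.
model = LipReaderSketch()
video = torch.randn(2, 3, 16, 32, 64)            # 2 clips, 16 frames, 32x64 mouth crops
log_probs = model(video)                         # (2, 16, 28)
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 28, (2, 6))           # two 6-character transcripts
loss = ctc(log_probs.permute(1, 0, 2),           # CTCLoss expects (T, B, V)
           targets,
           input_lengths=torch.full((2,), 16, dtype=torch.long),
           target_lengths=torch.full((2,), 6, dtype=torch.long))
```

Because CTC marginalizes over all alignments between the 16-frame output and the 6-character target, the loss is defined without any hand-labeled frame-to-character segmentation, which is the property the abstract highlights.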