Meta claims its AI improves speech recognition quality by reading lips
VentureBeat January 7, 2022
People perceive speech both by listening to it and by watching the lip movements of speakers. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly — or entirely — on audio. And they require a substantial amount of data to train, typically tens of thousands of hours of recordings.
To investigate whether visuals — specifically footage of mouth movements — can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best...