VentureBeat January 7, 2022
Kyle Wiggers

People perceive speech both by listening to it and by watching speakers' lip movements. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly, or entirely, on audio, and they require a substantial amount of data to train, typically in the tens of thousands of hours of recordings.

To investigate whether visuals — specifically footage of mouth movement — can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best...
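
The excerpt describes AV-HuBERT only at a high level, so the snippet below is a minimal, illustrative sketch of the general idea it hints at: per-frame audio features and lip-video features are projected to a shared width, fused, and contextualized by a Transformer that predicts a discrete "hidden unit" for each frame. All names, layer sizes, and the fusion-by-concatenation choice are assumptions made for illustration, not Meta's published AV-HuBERT architecture.

```python
import torch
import torch.nn as nn

class AudioVisualEncoder(nn.Module):
    """Toy audio-visual encoder: fuses per-frame audio and lip-video
    features, then contextualizes them with a Transformer that scores
    discrete per-frame "hidden units". Sizes and fusion strategy are
    illustrative assumptions, not Meta's actual AV-HuBERT model."""

    def __init__(self, audio_dim=80, video_dim=512, d_model=256,
                 n_layers=4, n_heads=4, n_clusters=100):
        super().__init__()
        # Project each modality to a shared width, then fuse by concatenation.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Score a discrete "hidden unit" (cluster id) for every frame.
        self.unit_head = nn.Linear(d_model, n_clusters)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, time, audio_dim), e.g. log-mel frames
        # video_feats: (batch, time, video_dim), e.g. lip-crop embeddings
        fused = self.fuse(torch.cat(
            [self.audio_proj(audio_feats), self.video_proj(video_feats)],
            dim=-1))
        context = self.encoder(fused)
        return self.unit_head(context)  # (batch, time, n_clusters)

if __name__ == "__main__":
    model = AudioVisualEncoder()
    audio = torch.randn(2, 50, 80)   # 2 clips, 50 frames of audio features
    video = torch.randn(2, 50, 512)  # matching frames of lip-crop features
    print(model(audio, video).shape)  # torch.Size([2, 50, 100])
```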

Topics: AI (Artificial Intelligence), Technology, Voice Assistant