VentureBeat, January 7, 2022
Kyle Wiggers

People perceive speech both by listening to it and by watching the lip movements of speakers. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly or entirely on audio, and they require a substantial amount of data to train, typically tens of thousands of hours of recordings.

To investigate whether visuals — specifically footage of mouth movement — can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best...
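For intuition, the sketch below shows one way an audio stream and a lip-video stream could be projected into a shared space, fused, and modeled jointly over time. This is not Meta's released AV-HuBERT code; the module names, feature dimensions, additive fusion, and transformer settings are assumptions chosen purely for illustration.

```python
# Illustrative sketch of audio-visual feature fusion for speech modeling.
# NOT Meta's AV-HuBERT implementation; dimensions and fusion scheme are assumed.
import torch
import torch.nn as nn


class AudioVisualEncoder(nn.Module):
    def __init__(self, audio_dim=104, video_dim=512, hidden_dim=768):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        # A transformer encoder models the fused sequence over time.
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, audio_feats, video_feats):
        # audio_feats: (batch, time, audio_dim), e.g. filterbank frames
        # video_feats: (batch, time, video_dim), e.g. lip-region embeddings
        fused = self.audio_proj(audio_feats) + self.video_proj(video_feats)
        return self.encoder(fused)  # (batch, time, hidden_dim)


# Example usage with random tensors standing in for real features.
model = AudioVisualEncoder()
audio = torch.randn(2, 100, 104)   # 100 frames of audio features
video = torch.randn(2, 100, 512)   # 100 frames of lip-crop embeddings
out = model(audio, video)
print(out.shape)  # torch.Size([2, 100, 768])
```

The point of the fused representation is that when the audio is noisy or missing, the visual stream can still carry useful information about what was said, and vice versa.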
