VentureBeat January 7, 2022
Kyle Wiggers

People perceive speech both by listening to it and by watching speakers' lip movements. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly — or entirely — on audio. And they require a substantial amount of data to train, typically in the tens of thousands of hours of recordings.

To investigate whether visuals — specifically footage of mouth movement — can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best...
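As a rough illustration of the idea only (a sketch, not Meta's actual AV-HuBERT code), a toy PyTorch model might fuse per-frame audio features with features from mouth-region video crops before a shared transformer encoder that predicts discrete "hidden units"; every module name, dimension, and the fusion-by-concatenation choice below is an assumption.

    # Illustrative sketch only: not Meta's AV-HuBERT implementation.
    # Shows the general idea of fusing an audio stream with a lip-movement
    # video stream before a shared transformer encoder.
    import torch
    import torch.nn as nn

    class ToyAudioVisualEncoder(nn.Module):
        def __init__(self, audio_dim=80, lip_dim=512, hidden=256, num_units=100):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, hidden)   # e.g. log-mel filterbank frames
            self.video_proj = nn.Linear(lip_dim, hidden)     # e.g. features of mouth-region crops
            layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.cluster_head = nn.Linear(2 * hidden, num_units)  # predicts discrete target units

        def forward(self, audio_frames, lip_frames):
            # Both streams are assumed to be aligned to the same frame rate.
            fused = torch.cat([self.audio_proj(audio_frames),
                               self.video_proj(lip_frames)], dim=-1)
            return self.cluster_head(self.encoder(fused))

    model = ToyAudioVisualEncoder()
    logits = model(torch.randn(1, 50, 80), torch.randn(1, 50, 512))
    print(logits.shape)  # torch.Size([1, 50, 100]) -> (batch, frames, units)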
