VentureBeat October 17, 2024
Zyphra Technologies, the company working on a multimodal agent system combining advanced research in next-gen state-space model architectures, long-term memory, and reinforcement learning, just released Zyda-2, an open pretraining dataset comprising 5 trillion tokens.
While Zyda-2 is five times larger than its predecessor and covers a vast range of topics, what truly sets it apart is its unique composition. Unlike many open datasets available on Hugging Face, Zyda-2 has been distilled to retain the strengths of the top existing datasets while eliminating their weaknesses.
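To make the idea of "distilling" a pretraining dataset concrete, here is a minimal, illustrative sketch of two steps common in such pipelines: a crude quality filter and exact-match deduplication. This is not Zyphra's actual pipeline (the article does not detail it); the function name, heuristics, and sample corpus below are all hypothetical.

```python
import hashlib

def filter_and_dedup(docs, min_tokens=3):
    """Keep documents above a minimum token count, dropping exact duplicates.

    Mirrors, in miniature, two common pretraining-data curation steps:
    quality filtering (here, a trivial length heuristic) and
    deduplication (here, exact-match via content hashing).
    Real pipelines use far more sophisticated filters and fuzzy dedup.
    """
    seen = set()
    kept = []
    for doc in docs:
        if len(doc.split()) < min_tokens:
            continue  # quality filter: too short to be useful
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # dedup: an exact copy was already kept
        seen.add(digest)
        kept.append(doc)
    return kept

# Hypothetical toy corpus demonstrating both filters
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",  # exact duplicate
    "ok",                                            # below length threshold
    "state-space models scale well to long contexts",
]
print(filter_and_dedup(corpus))
```

At the multi-trillion-token scale of Zyda-2, steps like these are what allow a merged dataset to keep the strongest documents from each source while discarding overlap and low-quality text.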
This gives organizations a way to train language models that achieve high accuracy for a given parameter budget, even when running on edge and consumer devices. The company trained its Zamba2 small language model using this...