Feb 9, 2025

Introducing Ultravox v0.5: Taking the Lead in Speech Understanding

Last November, we introduced Ultravox v0.4.1, an open-weight speech language model designed for real-time Voice AI. While it was a strong performer among open models, proprietary models still held an edge in key benchmarks.

With Ultravox v0.5, we’ve closed that gap. This latest release brings significant improvements in speech understanding, multilingual support, and real-world adaptability. It now outperforms OpenAI’s GPT-4o Realtime and Google’s Gemini 1.5 Flash on key benchmarks for speech understanding while also maintaining the flexibility and transparency of an open-weight model. [1]

As always, the weights are available on Hugging Face.

Leading in Speech Understanding

Our goal with Ultravox has always been to give models the ability to understand natural, real-world audio without harming general reasoning or instruction-following capabilities. We primarily measure our progress on two key benchmarks:

  • CoVoST-2 – Measures the accuracy of speech-to-text translation across multiple languages, used as a proxy for general speech understanding capabilities

  • Big Bench Audio – Evaluates general reasoning capabilities based on speech input

Across both evaluations, Ultravox v0.5 shows clear improvements over proprietary models:

Speech Translation Performance (CoVoST-2 BLEU Score) [2]

Speech-Based Reasoning Performance (Big Bench Audio Score) [3]

These improvements make Ultravox v0.5 more capable of handling complex, real-world voice interactions, with fewer misunderstandings and stronger reasoning capabilities — leading to more reliable AI-driven voice assistants.

How v0.5 Improves on v0.4.1

In addition to outperforming proprietary models, Ultravox v0.5 introduces major improvements over its previous version:

  • 60% improvement in transcription accuracy, with lower word error rates (WER) across 82 evaluation sets from LibriSpeech, CommonVoice, and Fleurs.

  • 18% improvement in speech-based web question answering, particularly in handling named entities and fine-grained speech details.

  • Expanded language support from 15 to 42 languages, making it significantly more accessible for global applications.
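For readers less familiar with the metric behind the transcription numbers, here is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The function names and sample sentences below are made up for illustration, and reading an "improvement" as a relative reduction in WER is an assumption on our part here, not a description of the actual evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which WER dropped, relative to the old value (assumed definition)."""
    if old_wer <= 0:
        raise ValueError("old_wer must be positive")
    return (old_wer - new_wer) / old_wer


# Hypothetical transcripts, not taken from any evaluation set.
reference = "please book a table for two at nine tomorrow evening"
older_output = "please book the table for too at night tomorrow evening"
newer_output = "please book a table for two at night tomorrow evening"

old_wer = word_error_rate(reference, older_output)  # 3 wrong words out of 10
new_wer = word_error_rate(reference, newer_output)  # 1 wrong word out of 10
print(f"old WER={old_wer:.2f}, new WER={new_wer:.2f}, "
      f"relative reduction={relative_wer_reduction(old_wer, new_wer):.0%}")
```

Published WER figures are usually computed after text normalization (casing, punctuation, number formatting), which this sketch leaves out.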

Real-World Applications

While benchmark scores are useful, the real value comes from how these improvements translate into practical applications. Ultravox v0.5 makes a difference in several key areas:

  • Live multilingual conversations – Businesses can process real-time speech in 42 languages without needing to know the user’s preferred language at the start of the conversation. This makes it a great choice for companies that appeal to a global audience.

  • AI-powered customer support – More accurate recognition of names, commands, and intent leads to better AI-driven interactions. Ultravox can be further fine-tuned for industry-specific applications to achieve even stronger performance.

Thousands of customers have already built on Ultravox to handle hundreds of thousands of real-world interactions across customer support, inbound and outbound call handling, and AI voice assistants.

Expanded Language Support

Ultravox v0.5 significantly broadens its multilingual capabilities, increasing supported languages from 15 to 42. [4] Unlike traditional models that require pre-selecting a language for accurate recognition, Ultravox v0.5 can seamlessly switch between languages in real time. The currently supported languages are:

Arabic, Belarusian, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hindi, Hungarian, Italian, Japanese, Latvian, Lithuanian, Macedonian, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.

Moving Beyond Cascaded Systems

Ultravox works by adapting strong text-based foundation models (e.g., Meta’s Llama 3.3 and DeepSeek’s V3 and R1) and training them to natively understand human speech while maintaining their general language reasoning capabilities. This approach ensures that improvements in speech processing do not degrade the model’s ability to follow instructions or perform text-based reasoning tasks.

Unlike traditional cascaded systems (ASR → LLM → TTS), Ultravox v0.5 processes speech directly, enabling stronger contextual modeling while eliminating cumulative errors introduced by multi-stage processing. Early results indicate that Ultravox v0.5 outperforms cascaded approaches in difficult real-world conditions, such as noisy environments and low-quality microphone inputs. We’ll be publishing a more detailed report on this soon.

Start building on Ultravox

Ultravox v0.5 is available today through Ultravox Realtime, our managed service for building and scaling real-time Voice AI applications.

  • Scales to thousands of concurrent calls

  • Industry-low pricing: $0.05 per minute

  • 30 free minutes to get started

Try it now: https://demo.ultravox.ai

Start building: https://app.ultravox.ai

Looking Ahead

We remain focused on our goal of building real-time, human-level voice AI that can handle the complexity of natural communication. Ultravox v0.5 is a major step forward, but we’re already exploring new architectures and techniques that will define the next generation of voice AI in both understanding and generation.

Ultravox Realtime continues to evolve, and we’re seeing strong adoption from companies around the world. We look forward to sharing more updates soon.

Join Us

Developers: Start building real-time voice AI today with Ultravox Realtime. Check out our docs, example code, and active community on Discord.

We're Hiring: Interested in building the future of Voice AI? See our open positions.

Footnotes

1. Ultravox v0.5 was benchmarked against OpenAI GPT-4o Realtime (2024-12-17) and Google Gemini 1.5 Flash (002). Model weights are available on Hugging Face.

2. CoVoST-2 BLEU scores are reported as the average across the following selected language pairs on the test set for audio samples under 30 seconds: English to Arabic, Catalan, and German; and to English from Spanish, Russian, Swedish, Turkish, and Chinese. Evaluated using Fixie AI’s evals.

3. Big Bench Audio is a benchmark released by Artificial Analysis, designed to evaluate the reasoning capabilities of audio-language models. Evaluated using Fixie AI’s evals.

4. Training data availability varies across languages, which may affect performance. We welcome feedback and contributions of training data to further enhance Ultravox.