GitHub / Microsoft · 2026-04-05

Microsoft Open-Sources VibeVoice: A Full Voice Stack With 60-Min ASR and 90-Min TTS in One Release

Microsoft has released VibeVoice, an open-source family of voice AI models covering both speech recognition and text-to-speech at lengths previously reserved for enterprise APIs. The ASR model transcribes up to 60 minutes of audio in a single pass with speaker diarization; the TTS model generates up to 90 minutes of multi-speaker expressive speech. A lightweight 0.5B-parameter streaming variant targets ~300ms latency.


Microsoft dropped VibeVoice this week — a family of open-source voice AI models that does something no single open-source release has done cleanly before: ship both ends of the voice pipeline (listen and speak) at enterprise quality under an MIT license.

**What's in the package:** VibeVoice-ASR handles up to 60 continuous minutes of audio per pass, producing structured transcriptions with speaker identification, timestamps, and support for 50+ languages. VibeVoice-TTS generates up to 90 minutes of expressive speech with up to 4 distinct speakers. VibeVoice-Realtime is a compact 0.5B streaming model targeting ~300ms latency for real-time voice applications.

**The technical differentiator** is the use of continuous speech tokenizers operating at an ultra-low 7.5 Hz frame rate — an architecture choice that makes long-form audio tractable without the quality degradation typical of chunked approaches. The project has already been integrated into Hugging Face Transformers v5.3.0.
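Back-of-envelope arithmetic shows why the frame rate matters. The 7.5 Hz figure is from the release; the 50 Hz comparison point is an assumption here, chosen as a typical frame rate for neural audio codecs:

```python
def frames(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames produced for a clip of the given length."""
    return int(minutes * 60 * frame_rate_hz)

# One hour at VibeVoice's 7.5 Hz continuous tokenizer
vibevoice_frames = frames(60, 7.5)   # 27,000 frames

# The same hour at an assumed 50 Hz codec frame rate
baseline_frames = frames(60, 50.0)   # 180,000 frames

print(vibevoice_frames, baseline_frames)
```

A sequence of 27,000 frames sits comfortably inside modern transformer context windows, which is what makes a single-pass hour of audio plausible without chunking.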

**The competitive context:** ElevenLabs, Deepgram, and AssemblyAI have been building proprietary moats around voice quality and long-form processing. VibeVoice is a direct signal that Microsoft considers these moats temporary. For the open-source ecosystem, this arrives shortly after Mistral's Voxtral and OmniVoice's 600-language TTS release, accelerating what is becoming a crowded race to own the open-source voice stack.

**The caveats:** Microsoft's own documentation marks VibeVoice as "intended for research and development purposes only," citing bias, deepfake risk, and misuse concerns. The project doesn't yet include watermarking or misuse-detection tooling. The community is already deploying it in production apps regardless.

Panel Takes

The Builder

Developer Perspective

Hugging Face Transformers integration on day one plus an MIT license means this lands directly into every serious ML team's toolkit this week. The long-form context lengths alone solve a problem that's been forcing developers to stitch together chunked solutions with brittle overlap logic.
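The "brittle overlap logic" in question looks roughly like this: a hypothetical sketch of the chunk-window bookkeeping that long-context ASR makes unnecessary (the 30s/5s window sizes are illustrative, not taken from any particular tool):

```python
def chunk_windows(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Compute (start, end) windows covering the audio, each overlapping the
    previous one by overlap_s so a merge step can re-align words at the seams."""
    windows, start = [], 0.0
    step = chunk_s - overlap_s
    while start < total_s:
        windows.append((start, min(start + chunk_s, total_s)))
        start += step
    return windows

# One hour of audio: 144 overlapping windows to transcribe and stitch,
# each seam a chance to duplicate or drop words at the boundary.
wins = chunk_windows(3600.0)
print(len(wins))
```

Every seam also splits speaker turns and timestamps, which is why diarization over stitched chunks degrades in ways a single 60-minute pass avoids.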

The Skeptic

Reality Check

The deepfake risk is real and currently unmitigated — high-quality open-source TTS at this fidelity will be misused, and Microsoft's "research only" label provides zero technical friction. The absence of any audio watermarking or provenance tooling in the release is a meaningful gap that the research-only framing doesn't excuse.

The Futurist

Big Picture

This is the open-source voice equivalent of Stable Diffusion's image moment — the point where the capability transitions from cloud-controlled to universally accessible. The next 18 months will see voice AI embedded in every local app, every browser plugin, every terminal tool. VibeVoice just opened that door.