Together AI Blog · Research · 2026-04-01

Together AI's Aurora Turns Speculative Decoding Into a Self-Improving System

Together AI released Aurora, an open-source reinforcement learning framework that makes speculative decoding continuously adaptive to live inference traffic. Instead of relying on a static, offline-trained draft model, Aurora's draft model learns from real production requests, delivering a 1.25× additional speedup on top of already-optimized static speculators.


Speculative decoding has been one of the more impactful inference optimizations of the last two years: instead of generating one token at a time with a large model, a small draft model guesses ahead and the large model verifies in parallel, speeding up generation significantly. The problem has always been that draft models are trained offline and drift from real production traffic over time. Together AI's Aurora changes this by making the draft model continuously learn from live inference requests.
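The propose-and-verify loop can be sketched in a few lines. The toy `draft_next`/`target_next` functions below are illustrative stand-ins, not Aurora's implementation; real systems sample from model distributions and verify them probabilistically rather than comparing greedy tokens.

```python
# Minimal sketch of one greedy speculative decoding step (illustrative only).
# `draft_next` and `target_next` stand in for the small and large models.

def draft_next(ctx):
    # Toy draft model: guesses the next token as last token + 1.
    return ctx[-1] + 1

def target_next(ctx):
    # Toy target model: agrees with the draft except every 4th position.
    nxt = ctx[-1] + 1
    return nxt if len(ctx) % 4 else nxt + 1

def speculative_step(ctx, k=4):
    """Draft k tokens ahead, then verify them against the target."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for i in range(k):
        tok = target_next(accepted)       # done in one parallel pass in practice
        if tok == proposal[len(ctx) + i]:
            accepted.append(tok)          # draft guess verified: keep it
        else:
            accepted.append(tok)          # first mismatch: take the target's token
            break
    return accepted

print(speculative_step([0]))  # → [0, 1, 2, 3, 5]
```

With this toy target, three draft tokens are accepted and the fourth is replaced by the target's correction, so one verification round yields four new tokens instead of one.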

Aurora operates through a serve-to-train flywheel with two decoupled components: an inference server running speculative decoding with a target and draft model, and a training server that asynchronously processes the results, performs gradient updates, and periodically pushes new draft weights without interrupting service. Accepted tokens provide positive rewards; rejected proposals provide counterfactual supervision. The draft model acts as a reinforcement learning policy with the target model as its environment, a clean formulation with real theoretical grounding.
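The reward assignment described above might look roughly like the following sketch. `SpecResult`, `reward_and_labels`, and the ±1 reward values are hypothetical illustrations of the accepted-positive / rejected-counterfactual scheme, not Aurora's actual API.

```python
# Hypothetical sketch of turning verification outcomes into training signal:
# verified draft tokens earn a positive reward; the first rejected proposal
# earns a penalty plus the target's correction as a counterfactual label.
from dataclasses import dataclass

@dataclass
class SpecResult:
    proposed: list    # tokens the draft model proposed
    accepted: int     # how many of them the target verified
    correction: int   # target's token at the first mismatch

def reward_and_labels(result):
    rewards, labels = [], []
    for i, tok in enumerate(result.proposed):
        if i < result.accepted:
            rewards.append(1.0)              # verified: positive reward
            labels.append(tok)
        else:
            rewards.append(-1.0)             # rejected: negative reward
            labels.append(result.correction)  # counterfactual supervision
            break                             # tokens past the mismatch are off-policy
    return rewards, labels
```

For a batch where 2 of 4 proposed tokens were accepted, this yields rewards `[1.0, 1.0, -1.0]` and labels ending in the target's correction; the training server would consume a stream of such records asynchronously.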

The practical results are impressive: a 1.25× additional speedup over a well-trained static speculator, with real-world tests on MiniMax M2.5 showing a 1.63× speedup and Qwen3-Coder-Next hitting 1.92× at batch size 1. Perhaps more important, the paper demonstrates that "online training from scratch can outperform a carefully pretrained static baseline", meaning you don't even need to spend resources pretraining a good draft model if Aurora has enough production traffic to learn from.

The framework is open source (github.com/togethercomputer/aurora), algorithm-agnostic by design, and comes with a Tree Attention mechanism that processes complex speculative branching in a single batched pass. For any organization running inference at scale, Aurora represents a significant infrastructure improvement that's now free to use and study. The accompanying arXiv paper (2602.06932) provides the full theoretical treatment.
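To make the Tree Attention idea concrete, here is a minimal sketch of how an ancestor-only attention mask over a draft tree could be built, which is what lets multiple speculative branches be verified in one batched pass. The parent-array representation and `tree_attention_mask` helper are assumptions for illustration, not Aurora's kernel.

```python
# Illustrative sketch: each speculated token attends only to its ancestors
# in the draft tree, so several branches share one batched forward pass.
# `parent[i]` is the parent index of tree node i (-1 for the root).
# This is an assumed representation, not Aurora's actual implementation.

def tree_attention_mask(parent):
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:           # walk from node i up to the root
            mask[i][j] = True    # node i may attend to ancestor j (and itself)
            j = parent[j]
    return mask

# Two branches sharing a root: 0 -> 1 -> 2 and 0 -> 3
mask = tree_attention_mask([-1, 0, 1, 0])
```

Node 2 sees nodes 0, 1, and itself but not node 3 on the sibling branch, which is exactly the causal structure a flat attention mask cannot express.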

Panel Takes

The Builder


Developer Perspective

Training the draft model on your actual production traffic rather than generic text is obviously the right approach; it's surprising it took this long to see a well-executed open-source version. The 1.92× speedup on Qwen3-Coder-Next at batch size 1 is the number I'll be quoting to justify integrating this.

The Skeptic


Reality Check

The headline speedup numbers are at batch size 1, while production workloads typically run much higher batch sizes where the gains shrink. The RL training loop also adds operational complexity that could cause subtle regressions if the reward signal drifts. Real-world adoption will be the true test.

The Futurist


Big Picture

Inference optimization that automatically improves with usage is a self-reinforcing competitive moat. Aurora makes Together AI's inference quality compound over time in a way that static optimization pipelines fundamentally can't. This is the infrastructure story of 2026.