Together AI Blog · Research · 2026-04-01

Together AI's Aurora Turns Speculative Decoding Into a Self-Improving System

Together AI released Aurora, an open-source reinforcement learning framework that makes speculative decoding continuously adaptive to live inference traffic. Instead of relying on a static, offline-trained draft model, Aurora's draft model learns from real production requests, delivering a 1.25× additional speedup on top of already-optimized static speculators.


Speculative decoding has been one of the more impactful inference optimizations of the last two years: instead of generating one token at a time with a large model, a small draft model guesses ahead and the large model verifies in parallel, speeding up generation significantly. The problem has always been that draft models are trained offline and drift from real production traffic over time. Together AI's Aurora changes this by making the draft model continuously learn from live inference requests.
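The propose-and-verify loop can be sketched in a few lines. The toy `draft_next`/`target_next` functions below are illustrative stand-ins, not Aurora's implementation; real systems sample from model distributions and verify them probabilistically rather than comparing greedy tokens.

```python
# Minimal sketch of one greedy speculative decoding step (illustrative only).
# `draft_next` and `target_next` stand in for the small and large models.

def draft_next(ctx):
    # Toy draft model: guesses the next token as last token + 1.
    return ctx[-1] + 1

def target_next(ctx):
    # Toy target model: agrees with the draft except every 4th position.
    nxt = ctx[-1] + 1
    return nxt if len(ctx) % 4 else nxt + 1

def speculative_step(ctx, k=4):
    """Draft k tokens ahead, then verify them against the target."""
    proposal = list(ctx)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    accepted = list(ctx)
    for i in range(k):
        tok = target_next(accepted)       # done in one parallel pass in practice
        if tok == proposal[len(ctx) + i]:
            accepted.append(tok)          # draft guess verified: keep it
        else:
            accepted.append(tok)          # first mismatch: take the target's token
            break
    return accepted

print(speculative_step([0]))  # → [0, 1, 2, 3, 5]
```

With this toy target, three draft tokens are accepted and the fourth is replaced by the target's correction, so one verification round yields four new tokens instead of one.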

Aurora operates through a serve-to-train flywheel with two decoupled components: an inference server running speculative decoding with a target and draft model, and a training server that asynchronously processes the results, performs gradient updates, and periodically pushes new draft weights without interrupting service. Accepted tokens provide positive rewards; rejected proposals provide counterfactual supervision. The draft model acts as a reinforcement learning policy with the target model as its environment, a clean formulation with real theoretical grounding.
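The reward assignment described above might look roughly like the following sketch. `SpecResult`, `reward_and_labels`, and the ±1 reward values are hypothetical illustrations of the accepted-positive / rejected-counterfactual scheme, not Aurora's actual API.

```python
# Hypothetical sketch of turning verification outcomes into training signal:
# verified draft tokens earn a positive reward; the first rejected proposal
# earns a penalty plus the target's correction as a counterfactual label.
from dataclasses import dataclass

@dataclass
class SpecResult:
    proposed: list    # tokens the draft model proposed
    accepted: int     # how many of them the target verified
    correction: int   # target's token at the first mismatch

def reward_and_labels(result):
    rewards, labels = [], []
    for i, tok in enumerate(result.proposed):
        if i < result.accepted:
            rewards.append(1.0)              # verified: positive reward
            labels.append(tok)
        else:
            rewards.append(-1.0)             # rejected: negative reward
            labels.append(result.correction)  # counterfactual supervision
            break                             # tokens past the mismatch are off-policy
    return rewards, labels
```

For a batch where 2 of 4 proposed tokens were accepted, this yields rewards `[1.0, 1.0, -1.0]` and labels ending in the target's correction; the training server would consume a stream of such records asynchronously.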

The practical results are impressive: a 1.25× additional speedup over a well-trained static speculator, with real-world tests on MiniMax M2.5 showing a 1.63× speedup and Qwen3-Coder-Next hitting 1.92× at batch size 1. Perhaps more important, the paper demonstrates that "online training from scratch can outperform a carefully pretrained static baseline", meaning you don't even need to spend resources pretraining a good draft model if Aurora has enough production traffic to learn from.

The framework is open source (github.com/togethercomputer/aurora), algorithm-agnostic by design, and comes with a Tree Attention mechanism that processes complex speculative branching in a single batched pass. For any organization running inference at scale, Aurora represents a significant infrastructure improvement that's now free to use and study. The accompanying arXiv paper (2602.06932) provides the full theoretical treatment.
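To make the Tree Attention idea concrete, here is a minimal sketch of how an ancestor-only attention mask over a draft tree could be built, which is what lets multiple speculative branches be verified in one batched pass. The parent-array representation and `tree_attention_mask` helper are assumptions for illustration, not Aurora's kernel.

```python
# Illustrative sketch: each speculated token attends only to its ancestors
# in the draft tree, so several branches share one batched forward pass.
# `parent[i]` is the parent index of tree node i (-1 for the root).
# This is an assumed representation, not Aurora's actual implementation.

def tree_attention_mask(parent):
    n = len(parent)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:           # walk from node i up to the root
            mask[i][j] = True    # node i may attend to ancestor j (and itself)
            j = parent[j]
    return mask

# Two branches sharing a root: 0 -> 1 -> 2 and 0 -> 3
mask = tree_attention_mask([-1, 0, 1, 0])
```

Node 2 sees nodes 0, 1, and itself but not node 3 on the sibling branch, which is exactly the causal structure a flat attention mask cannot express.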

Panel Takes

The Builder


Developer Perspective

Training the draft model on your actual production traffic rather than generic text is obviously the right approach; it's surprising it took this long to see a well-executed open-source version. The 1.92× speedup on Qwen3-Coder-Next at batch size 1 is the number I'll be quoting to justify integrating this.

The Skeptic


Reality Check

The headline speedup numbers are at batch size 1, while production workloads typically run much higher batch sizes where the gains shrink. The RL training loop also adds operational complexity that could cause subtle regressions if the reward signal drifts. Real-world adoption will be the true test.

The Futurist


Big Picture

Inference optimization that automatically improves with usage is a self-reinforcing competitive moat. Aurora makes Together AI's inference quality compound over time in a way that static optimization pipelines fundamentally can't. This is the infrastructure story of 2026.