LLMs Can Teach Themselves to Code Better With No Teacher, No RL, No Verifier
A new paper from researchers at Anthropic and Google shows that simply sampling your own model's outputs and fine-tuning on them boosts code generation pass@1 from 42% to 55% on hard benchmarks — no labels, no reward model, no execution needed.
Original source: "Embarrassingly Simple Self-Distillation Improves Code Generation," a paper published April 1, 2026 by Ruixiang Zhang, Richard He Bai, and colleagues at Anthropic and Google. It presents a technique called Simple Self-Distillation (SSD) that is exactly what the name suggests: sample solutions from a model at a particular temperature and truncation setting, then fine-tune on those raw, unverified outputs with standard cross-entropy loss.
No human labels. No reference solutions. No teacher model. No reward model. No verifier. No execution environment. No reinforcement learning. Just the model's own outputs fed back in.
The results are striking: SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrated on harder problems. The technique generalizes across Qwen and Llama model families at 4B, 8B, and 30B scale — including both instruct and thinking variants. The paper's theoretical analysis traces the gains to a "precision-exploration conflict" in LLM decoding: SSD reshapes token distributions context-dependently, suppressing distractor tails where precision matters while preserving diversity where exploration matters.
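The "precision-exploration conflict" lives in two decoding knobs: temperature, which sharpens or flattens the whole distribution, and truncation, which cuts off the low-probability tail. A minimal sketch of how the two reshape a token distribution (using generic nucleus/top-p truncation as the truncation scheme; the paper's exact settings are not reproduced here):

```python
import math

def reshape_distribution(logits, temperature=1.0, top_p=1.0):
    """Apply temperature scaling then nucleus (top-p) truncation to a
    token distribution -- the two decoding knobs SSD depends on."""
    # Temperature scaling: <1 sharpens (precision), >1 flattens (exploration).
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus truncation: keep the smallest set of top tokens whose
    # cumulative mass reaches top_p, zero the "distractor tail",
    # then renormalize what remains.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    truncated = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(truncated)
    return [p / z for p in truncated]
```

The paper's claim is that SSD fine-tuning achieves this reshaping context-dependently — suppressing the tail only where precision matters — which a single global (temperature, top_p) pair cannot do.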
The practical implication for builders: if you have a domain-specific coding dataset and a capable base model, a cheap post-training step (generate 50–100 solutions per problem, keep syntactically valid ones, fine-tune) may close a meaningful gap without requiring RL infrastructure. The paper trended on Hacker News with 298 points and sparked debate about whether this is genuinely "embarrassingly simple" or whether the hyperparameter sensitivity (temperature, truncation) makes it trickier in practice.
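The three-step recipe above can be sketched as a short data-building loop. Everything here is an assumption-labeled illustration, not the paper's code: `generate` stands in for whatever sampling API wraps your model, and `ast.parse` serves as the cheap syntax filter (for Python targets). Note that nothing is executed or verified, matching SSD's no-verifier setup:

```python
import ast

def build_ssd_dataset(problems, generate, n_samples=64, temperature=0.8):
    """Sketch of the SSD recipe: sample many candidate solutions per
    problem from the model itself, keep only syntactically valid ones,
    and emit (prompt, completion) pairs for plain cross-entropy
    fine-tuning. `generate(prompt, temperature=...)` is a hypothetical
    callable wrapping your model's sampling API."""
    dataset = []
    for prompt in problems:
        for _ in range(n_samples):
            candidate = generate(prompt, temperature=temperature)
            try:
                # Cheap syntax check only -- not a correctness test,
                # no execution environment, no reward model.
                ast.parse(candidate)
            except SyntaxError:
                continue
            dataset.append({"prompt": prompt, "completion": candidate})
    return dataset
```

The resulting list of prompt/completion pairs would then feed a standard supervised fine-tuning run; the temperature (and any truncation setting inside `generate`) is the hyperparameter the HN debate flagged as non-obvious.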
Panel Takes
“The practical recipe is immediately actionable: generate your own dataset from your own model, fine-tune, measure. The lack of any external infrastructure requirement makes this a legitimate week-one experiment for any team with a domain-specific coding task.”
“The "embarrassingly simple" framing undersells the tuning required. The temperature and truncation choices that produce good self-distillation data are non-obvious, the paper was done by researchers with significant compute and expertise, and the benchmarks are standard coding tasks — not the messy real-world code most people actually write.”
“A model improving by observing its own outputs at scale, without external supervision, is the mechanism that makes capability gains self-sustaining. SSD is a small example of a much larger dynamic: the bottleneck on model improvement is shifting from human labels to compute.”