Meta Ships Llama 4: Open-Weight Multimodal MoE With 10M Context, First to Match Frontier Closed Models
Meta released Llama 4 Scout and Maverick, the first open-weight models to combine native multimodal understanding, a Mixture-of-Experts architecture, and a context window reaching 10M tokens. Maverick benchmarks competitively with GPT-4o and Gemini 2.0 Flash while activating only 17B parameters per token.
Meta's Llama 4 release marks a genuine architectural departure from prior Llama generations. Both Scout and Maverick are natively multimodal: text and image tokens are fused early in a single backbone rather than handled through a bolted-on adapter. Both also use a Mixture-of-Experts design that keeps the active parameter count at 17B while scaling total capacity dramatically (Scout totals roughly 109B parameters, Maverick roughly 400B).
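To make the "low active, high total" idea concrete, here is a minimal sketch of a top-1-routed MoE feed-forward block in PyTorch. It illustrates the general technique only; the layer sizes, expert count, and the always-on shared expert are illustrative assumptions, not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Toy top-1-routed MoE block: each token activates one routed expert
    plus an always-on shared expert, so per-token compute stays close to a
    small dense layer while total parameters grow with the expert count.
    Sizes are illustrative, not Llama 4's real dimensions."""

    def __init__(self, d_model: int = 1024, d_ff: int = 4096, n_experts: int = 16):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The shared expert runs for every token regardless of routing.
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n_tokens, d_model)
        weight, idx = self.router(x).softmax(dim=-1).max(dim=-1)  # top-1 expert per token
        out = self.shared(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():  # only the routed tokens pay for this expert's compute
                out[mask] = out[mask] + weight[mask, None] * expert(x[mask])
        return out

# Quick shape check: 8 tokens, each touching the shared block plus one expert.
x = torch.randn(8, 1024)
print(MoEFeedForward()(x).shape)  # torch.Size([8, 1024])
```

Per token, the cost is one shared block plus one routed expert, while total parameters grow linearly with the expert count; that is the trade Scout and Maverick exploit.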
Scout's 10M token context window is the largest yet shipped in an open-weight model, and it isn't just a headline number: 10M tokens can absorb an entire codebase, a legal document set, or a corpus of scientific papers in a single inference call. For agentic pipelines that previously needed chunking strategies or retrieval augmentation just to handle large inputs, Scout removes that architectural constraint entirely.
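As a rough way to check whether a given codebase actually fits in a single 10M token call, the sketch below estimates its size with a crude 4-characters-per-token heuristic. The heuristic, file extensions, and path are placeholder assumptions; a real check would use the Llama tokenizer.

```python
from pathlib import Path

CONTEXT_BUDGET = 10_000_000  # Scout's advertised context window, in tokens
CHARS_PER_TOKEN = 4          # rough rule of thumb, not the Llama tokenizer

def estimate_repo_tokens(root: str, exts=(".py", ".md", ".ts", ".go")) -> int:
    """Crude token estimate for all matching source files under `root`."""
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // CHARS_PER_TOKEN

if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")  # current directory as an example
    print(f"~{tokens:,} tokens; fits in one Scout call: {tokens < CONTEXT_BUDGET}")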
Maverick goes the other direction: 128 experts, a 1M token context window, and benchmark performance that Meta claims matches or exceeds GPT-4o on broad multimodal evaluations and trades blows with DeepSeek v3 on coding and reasoning. Independent evaluators have noted that the benchmarks are competitive but not uniformly dominant over the strongest closed models. Still, the efficiency argument is real: Maverick gets those results at a fraction of the per-token compute cost of a fully dense frontier model.
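The arithmetic behind that efficiency claim is simple: decode compute per generated token scales roughly with twice the active parameter count, so a back-of-the-envelope comparison against a hypothetical dense model of Maverick's total size looks like this.

```python
# Back-of-the-envelope decode cost: roughly 2 FLOPs per parameter per generated token.
ACTIVE_PARAMS = 17e9   # parameters Maverick activates for each token
TOTAL_PARAMS = 400e9   # Maverick's approximate total parameter count

flops_moe = 2 * ACTIVE_PARAMS    # what the MoE actually pays per token
flops_dense = 2 * TOTAL_PARAMS   # what an equally large dense model would pay

print(f"per-token decode FLOPs, MoE:   {flops_moe:.2e}")
print(f"per-token decode FLOPs, dense: {flops_dense:.2e}")
print(f"ratio: ~{flops_dense / flops_moe:.0f}x less compute per token")
```

That is a better-than-20-fold per-token gap before any serving optimizations, which is the core of the MoE economics argument.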
The release lands on Hugging Face and llama.com simultaneously, with same-day availability on major cloud platforms. Meta's Llama license still restricts certain commercial uses, notably requiring a separate license for services above roughly 700 million monthly active users, a limit the open-source community continues to flag. The rest of the Llama 4 herd, including the still-training "Behemoth" frontier model Meta previewed alongside this release, is expected to be detailed further at LlamaCon on April 29.
For the developer community, the practical takeaway is straightforward: open-weight multimodal models are now competitive enough to build production systems on, self-hosting a MoE model is economically viable, and a 10M token context window eliminates an entire class of RAG workarounds. Llama 4 is the first release where "just use a local model" is a genuinely reasonable answer for a broad range of multimodal agentic tasks.
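For readers who want the "just use a local model" path made concrete, here is a minimal sketch of querying a self-hosted Llama 4 through an OpenAI-compatible endpoint such as one served by vLLM. The model identifier, URL, and port are assumptions for illustration and depend on your deployment.

```python
# Minimal sketch: chat with a self-hosted Llama 4 behind an OpenAI-compatible
# server (e.g. vLLM). Model ID, base URL, and port are deployment-specific.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",  # assumed Hugging Face model ID
    messages=[
        {"role": "user", "content": "Summarize the attached design doc in five bullets."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```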