Mercury 2: The fastest reasoning LLM, powered by diffusion - Breakthrough Speed at 1,000+ Tokens Per Second
Inception Labs has launched Mercury 2, the first diffusion-based reasoning language model to exceed 1,000 tokens per second, roughly 5× faster than leading speed-optimized models such as Claude Haiku 4.5 and GPT-5 Mini. Unlike traditional autoregressive models that generate text one token at a time, Mercury 2 uses parallel refinement through diffusion, producing multiple tokens simultaneously while maintaining competitive quality on reasoning benchmarks.
Overview
Mercury 2 represents a fundamental architectural shift in how large language models generate text. Instead of predicting one token at a time, the diffusion-based approach starts with a rough sketch of the full output and iteratively refines it in parallel. This enables dramatically faster inference: 1,009 tokens per second on NVIDIA Blackwell GPUs with just 1.7 seconds of end-to-end latency, compared to 23.4 seconds for Claude Haiku 4.5 with reasoning enabled.
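The parallel-refinement idea can be illustrated with a toy sketch in plain Python: start from a fully masked sequence and commit a batch of positions per denoising step, rather than one token per step as an autoregressive loop would. The vocabulary, masking schedule, and random "model" below are invented for illustration only and do not reflect Mercury 2's actual implementation.

```python
import random

MASK = "<mask>"  # placeholder for an undecided position

def denoise_step(tokens, vocab, rng, frac=0.5):
    """Commit a batch of masked positions in parallel.
    A real diffusion LM would score every position with a neural network;
    here we just sample from a toy vocabulary to show the control flow."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    k = max(1, int(len(masked) * frac))
    for i in rng.sample(masked, k):
        tokens[i] = rng.choice(vocab)

def generate(length, vocab, max_steps=10, seed=0):
    """Refine a fully masked sequence until every position is decided."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens and steps < max_steps:
        denoise_step(tokens, vocab, rng)
        steps += 1
    return tokens, steps

out, steps = generate(16, ["the", "cat", "sat", "on", "a", "mat"])
print(steps, out)  # finishes in far fewer steps than the 16 a token-by-token loop needs
```

Because each step fills many positions at once, the number of refinement steps grows much more slowly than the output length, which is the intuition behind the throughput numbers above.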
The model delivers this speed while performing on par with Claude Haiku 4.5 and GPT-5 Mini across quality benchmarks, scoring 91.1 on AIME 2025, 73.6 on GPQA, and 71.3 on IFBench. Mercury 2 is production-ready and available through an OpenAI-compatible API at significantly lower cost: $0.25 per million input tokens and $0.75 per million output tokens, roughly a quarter of the price of comparable models.
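At those listed prices, per-request cost is simple arithmetic. The helper below is a hypothetical sketch (the function name and example token counts are mine); the only facts it uses are the $0.25 and $0.75 per-million figures above.

```python
# Listed prices: $0.25 per million input tokens, $0.75 per million output tokens.
INPUT_USD_PER_MTOK = 0.25
OUTPUT_USD_PER_MTOK = 0.75

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at Mercury 2's listed prices."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

# Example: a 2,000-token prompt with a 500-token completion.
print(f"${estimate_cost(2_000, 500):.6f}")  # $0.000875
```

Even a million such requests would cost under $1,000 at these rates, which is the pricing argument for high-volume agent and batch workloads.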
Top Recommended Resources
1. Introducing Mercury 2 – Inception
- Direct from the source—the most reliable information about the model's design philosophy
- Comprehensive overview of the diffusion-based architecture and how it differs from traditional approaches
- Official benchmarks and performance claims with context about practical applications
2. Mercury: Ultra-Fast Language Models Based on Diffusion
- Detailed technical explanation of how diffusion models are applied to language generation
- Comprehensive benchmark results showing Mercury Coder Mini achieving 1,109 tokens/sec on NVIDIA H100 GPUs
- Academic credibility with contributions from multiple researchers and cross-validation through real-world testing on Copilot Arena
3. Mercury 2 - Intelligence, Performance & Price Analysis
- Mercury 2 ranks #1 out of 133 models for speed at 1,196.2 tokens per second
- Independent analysis noting the model is "notably fast, however very verbose"
- Interactive comparisons against 26+ competing models with detailed breakdowns across coding, reasoning, and knowledge domains
4. Inception launches Mercury 2, the first diffusion-based language reasoning model
- Clear explanation of how Mercury 2 "refines multiple text blocks simultaneously instead of going through a text word for word"
- Competitive pricing analysis showing Mercury 2 is roughly four times cheaper than Claude Haiku 4.5
- Context about industry exploration of alternatives to the dominant Transformer architecture
5. Mercury 2: The fastest reasoning LLM (Hacker News Discussion)
- Identification of promising use cases: faster coding agents, voice AI systems, agentic iteration loops, and batch processing
- Critical evaluation from developers noting concerns about quality ceilings compared to frontier models
- Discussion of infrastructure opportunities for running diffusion models on specialized hardware like Cerebras chips
Summary
Mercury 2 represents a significant architectural innovation in language model design, demonstrating that diffusion-based approaches can deliver dramatically faster inference speeds while maintaining competitive quality. For developers building latency-sensitive applications like coding assistants, voice AI, or agent systems, Mercury 2's 5× speed advantage and lower costs make it a compelling option. Start with the official Inception Labs announcement to understand the core capabilities, then dive into the arXiv paper for technical depth. Use Artificial Analysis for objective benchmarking data, and explore the Hacker News discussion for real-world developer perspectives on practical applications and limitations.