Mercury 2: The fastest reasoning LLM, powered by diffusion - Breakthrough Speed at 1,000+ Tokens Per Second

February 25, 2026
Inception Labs has launched Mercury 2, the first diffusion-based reasoning language model to achieve over 1,000 tokens per second, roughly 5× faster than leading speed-optimized models such as Claude Haiku 4.5 and GPT-5 Mini. Unlike traditional autoregressive models that generate text one token at a time, Mercury 2 uses parallel refinement through diffusion, producing multiple tokens simultaneously while maintaining competitive quality on reasoning benchmarks.

Overview

Mercury 2 represents a fundamental architectural shift in how large language models generate text. Instead of predicting one token at a time, this diffusion-based approach starts with a rough sketch of the full output and iteratively refines it through parallel processing. This breakthrough enables dramatically faster inference speeds—achieving 1,009 tokens per second on Nvidia Blackwell GPUs with just 1.7 seconds end-to-end latency, compared to 23.4 seconds for Claude Haiku 4.5 with reasoning enabled.
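The speed difference between the two decoding regimes comes down to step count. The toy sketch below illustrates the idea only: the "model" is a stand-in oracle that can finalize several draft positions per refinement round, not Mercury's actual denoising procedure, and `per_step` is a hypothetical parallelism budget.

```python
# Toy contrast: autoregressive decoding vs. diffusion-style parallel refinement.
# Conceptual sketch only -- NOT Mercury's actual algorithm. The "model" here is an
# oracle that simply knows the target tokens; the point is the step-count difference.

MASK = "_"

def diffusion_decode(target_tokens, per_step=4):
    """Start from an all-masked draft; finalize up to `per_step` positions per round."""
    draft = [MASK] * len(target_tokens)
    steps = 0
    while MASK in draft:
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        for i in masked[:per_step]:      # several positions refined "in parallel"
            draft[i] = target_tokens[i]
        steps += 1
    return draft, steps

def autoregressive_decode(target_tokens):
    """Generate one token per step, strictly left to right."""
    return list(target_tokens), len(target_tokens)

tokens = "the quick brown fox jumps over the lazy dog".split()
d_out, d_steps = diffusion_decode(tokens, per_step=4)
a_out, a_steps = autoregressive_decode(tokens)
assert d_out == a_out == tokens
print(f"autoregressive: {a_steps} steps, diffusion-style: {d_steps} steps")
# -> autoregressive: 9 steps, diffusion-style: 3 steps
```

With a real diffusion model each round is a full forward pass that refines the whole draft at once, so fewer rounds translate directly into lower end-to-end latency.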

The model delivers this speed while maintaining performance on par with Claude Haiku 4.5 and GPT-5 Mini across quality benchmarks, scoring 91.1 on AIME 2025, 73.6 on GPQA, and 71.3 on IFBench. Mercury 2 is production-ready and available through an OpenAI-compatible API at significantly lower costs: $0.25 per million input tokens and $0.75 per million output tokens, roughly four times cheaper than comparable models.
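At those rates, per-request cost is straightforward arithmetic. The helper below uses only the prices quoted above; the example token counts are illustrative, not from any published workload.

```python
# Per-request cost at Mercury 2's listed rates:
# $0.25 per million input tokens, $0.75 per million output tokens.
PRICE_IN_PER_M = 0.25
PRICE_OUT_PER_M = 0.75

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one request at the listed pricing."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1_000_000

# Hypothetical request: a 2,000-token prompt with a 500-token completion.
cost = estimate_cost(2_000, 500)
print(f"${cost:.6f}")  # -> $0.000875
```

Because the API is OpenAI-compatible, existing client code should only need its base URL pointed at Inception's endpoint; check the official Inception documentation for the exact URL and model name.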

Top Recommended Resources

1. Introducing Mercury 2 – Inception

2. Mercury: Ultra-Fast Language Models Based on Diffusion

3. Mercury 2 - Intelligence, Performance & Price Analysis

4. Inception launches Mercury 2, the first diffusion-based language reasoning model

5. Mercury 2: The fastest reasoning LLM (Hacker News Discussion)

Summary

Mercury 2 represents a significant architectural innovation in language model design, demonstrating that diffusion-based approaches can deliver dramatically faster inference speeds while maintaining competitive quality. For developers building latency-sensitive applications like coding assistants, voice AI, or agent systems, Mercury 2's 5× speed advantage and lower costs make it a compelling option. Start with the official Inception Labs announcement to understand the core capabilities, then dive into the arXiv paper for technical depth. Use Artificial Analysis for objective benchmarking data, and explore the Hacker News discussion for real-world developer perspectives on practical applications and limitations.