Large language models may appear to respond instantly, but behind every answer is a highly repetitive process. Most modern LLMs generate text one token at a time. A token might be a word, part of a word, a punctuation mark or a short piece of code. The model predicts one token, adds it to the context, then predicts the next. This continues until the model signals that the response is complete or a length limit is reached.
That process works well, but it creates a major bottleneck for AI services that need to handle many users at once. The longer the output, the more the per-token delay adds up. This matters most for coding assistants, reasoning tools and AI agents that need to generate long responses while still feeling responsive.
NVIDIA is now highlighting a technique called DFlash that could help address that limitation by changing how the "drafting" stage of AI generation works.
Why Token-by-Token Generation Is a Problem
Autoregressive models are built around a simple rule: generate the next token based on everything that came before it.
The approach produces high-quality results, but it is inherently sequential. The GPU cannot simply generate a full paragraph in one step because each new token depends on the previous one.
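The loop is easy to picture in code. The following Python sketch is a toy, not a real model: toy_next_token is a hypothetical stand-in for a full forward pass, and the token IDs are arbitrary integers. What it shows is that producing N new tokens takes N dependent steps.

```python
from typing import Callable, List, Tuple


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for one model forward pass.

    The result depends on every token seen so far, which is the property
    that forces real autoregressive decoding to run step by step.
    """
    return (sum(context) * 31 + len(context)) % 50


def generate(
    prompt: List[int],
    max_new_tokens: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time and count the forward passes used."""
    context = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        token = next_token(context)  # needs the full context so far
        forward_passes += 1
        context.append(token)        # the new token feeds the next step
    return context[len(prompt):], forward_passes


if __name__ == "__main__":
    tokens, passes = generate([1, 2, 3], max_new_tokens=8)
    print(f"{len(tokens)} new tokens took {passes} sequential forward passes")
```

No step in the loop can begin before the previous one has appended its token, which is the dependency the rest of this article is about.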
This means that even powerful AI hardware can sit partly idle during generation: each step yields only one token per request, and the next step cannot start until the current one finishes. In latency-sensitive environments, that limits how many users can be served without slowing down each individual response.
The challenge becomes even greater as AI shifts from short chat prompts to multi-step workflows. Coding agents, research systems and other task-oriented tools may need to produce large amounts of text, reason through problems and interact with multiple tools before reaching a result.
How Speculative Decoding Helps
One existing solution is speculative decoding.
Instead of asking the largest model to generate every token by itself, a smaller and faster model tries to predict several future tokens first. The larger target model then checks those predictions in parallel.
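When the draft is accurate, the target model can accept multiple tokens in a single verification step. When a drafted token is wrong, the target model discards it along with everything drafted after it and continues from its own prediction. Fewer full generation cycles are needed overall, which can speed up the response.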
However, traditional speculative decoding still has a limitation. The smaller draft model usually generates its predictions one token at a time as well.
That means drafting time grows with the number of proposed tokens. The result is still faster than relying entirely on the largest model, but it does not fully remove the sequential bottleneck.
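A toy version of the draft-and-verify loop makes the mechanics concrete. This sketch assumes greedy decoding, and target_next and draft_next are hypothetical stand-ins for the large and small models; the drafter is rigged to disagree with the target whenever the context length is a multiple of four. In a real system the target scores all drafted positions in one parallel forward pass, which the sketch counts as a single verification step.

```python
from typing import List, Tuple


def target_next(context: List[int]) -> int:
    """Hypothetical stand-in for the large target model (greedy choice)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[int]) -> int:
    """Hypothetical small drafter: agrees with the target except when the
    context length is a multiple of 4, where it is deliberately wrong."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def speculative_generate(
    prompt: List[int], max_new_tokens: int, k: int
) -> Tuple[List[int], int, int]:
    """Return (new tokens, target verification steps, drafter steps)."""
    context = list(prompt)
    target_passes = 0
    draft_passes = 0
    while len(context) - len(prompt) < max_new_tokens:
        # 1. Draft k tokens, one after another (the sequential part).
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(context + draft))
            draft_passes += 1

        # 2. Verify. A real target model checks all drafted positions in
        #    one parallel pass; here that is counted as one step.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target_next(context + accepted)
            if i < k and draft[i] == expected:
                accepted.append(draft[i])  # draft token confirmed
            else:
                accepted.append(expected)  # correction, or bonus token
                break
        context.extend(accepted)
    new_tokens = context[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, target_passes, draft_passes


if __name__ == "__main__":
    tokens, t_passes, d_passes = speculative_generate([1, 2, 3], 16, k=4)
    print(f"{len(tokens)} tokens, {t_passes} target steps, {d_passes} drafter steps")
```

Because every accepted token equals what the target would have chosen, the output matches plain greedy decoding with the target alone; only the number of target steps changes. The drafter, however, still runs k times per round.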
DFlash Takes a Different Approach
DFlash replaces the usual autoregressive drafter with a lightweight block-diffusion model.
Rather than predicting one future token after another, the DFlash drafter predicts a whole block of masked future tokens in a single forward pass. In simpler terms, it attempts to fill in several upcoming pieces of text at the same time.
The larger target model still performs the final verification, so every token in the finished answer is one the target model has approved. This is important because it means DFlash is designed to accelerate generation without changing the output quality expected from the main model.
The technique shifts more of the work into parallel processing, which is where modern GPUs are most effective.
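A rough cost model shows why a one-pass drafter changes the picture. The timings below are made-up placeholders, not measurements, and the model makes two simplifying assumptions: it ignores acceptance rates entirely, and it treats the cost of a block pass as independent of block size. It only compares how drafting time per round scales with the block size k.

```python
def sequential_round_ms(k: int, draft_step_ms: float, verify_ms: float) -> float:
    """One round with an autoregressive drafter: k draft steps, then one verification."""
    return k * draft_step_ms + verify_ms


def block_round_ms(block_pass_ms: float, verify_ms: float) -> float:
    """One round with a block drafter: one pass for the whole block, then one verification."""
    return block_pass_ms + verify_ms


if __name__ == "__main__":
    # Hypothetical timings chosen only for illustration.
    DRAFT_STEP_MS = 1.0
    BLOCK_PASS_MS = 2.0
    VERIFY_MS = 10.0
    for k in (2, 4, 8, 16):
        seq = sequential_round_ms(k, DRAFT_STEP_MS, VERIFY_MS)
        blk = block_round_ms(BLOCK_PASS_MS, VERIFY_MS)
        print(f"k={k:2d}  sequential drafter: {seq:5.1f} ms  block drafter: {blk:5.1f} ms")
```

In this simplified model the sequential drafter's round time climbs with every extra proposed token, while the block drafter's stays flat, so larger blocks become cheaper to try.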
Why This Matters for NVIDIA Hardware
Modern AI systems are not always limited by raw processing power alone. During text generation, memory movement and token-by-token execution can become the real bottlenecks.
DFlash is designed to expose more parallel work to the GPU during the decoding stage. That allows the hardware to spend less time waiting for sequential token generation and more time handling multiple operations together.
NVIDIA tested the approach on Blackwell-based systems using the gpt-oss-120b model and TensorRT-LLM. In its reported benchmarks, DFlash delivered more than 15 times the throughput of standard autoregressive decoding at high-interactivity targets, while also outperforming EAGLE-3 speculative decoding in the same test environment.
At lower concurrency, the company also reported that DFlash could more than double responsiveness for certain workloads.
These results are promising, although they should be viewed in context. Performance can vary depending on the model, hardware, framework, workload, batch size and latency target.
Better Performance for Coding, Reasoning and AI Agents
The potential benefits are especially relevant for workloads that need both speed and long-form output.
Interactive coding tools need to generate code quickly without making developers wait. Reasoning systems may need to produce lengthy internal responses before giving users an answer. AI agents can perform multi-step tasks that involve planning, tool calls, analysis and repeated generation.
In all of these cases, a small delay per token can become a much larger delay by the end of the task.
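The arithmetic is simple but worth seeing once. The numbers in this sketch are hypothetical, chosen only to show the scale: an agent that generates a few hundred tokens at each of several steps multiplies any per-token delay by thousands.

```python
def total_delay_s(extra_ms_per_token: int, tokens_per_step: int, steps: int) -> float:
    """Total added wait, in seconds, when every token costs a little extra."""
    total_tokens = tokens_per_step * steps
    return extra_ms_per_token * total_tokens / 1000


if __name__ == "__main__":
    # Hypothetical agent run: 10 steps of 800 generated tokens each,
    # with 5 ms of extra latency on every token.
    delay = total_delay_s(extra_ms_per_token=5, tokens_per_step=800, steps=10)
    print(f"Added wait over the whole task: {delay:.0f} s")
```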
DFlash aims to improve that trade-off. By generating more candidate tokens in parallel, it may allow AI systems to maintain a smoother experience for each user while also serving more users at the same time.
NVIDIA's tests showed DFlash outperforming EAGLE-3 across coding, retrieval-augmented generation, reasoning, writing, multilingual and summarisation workloads.
The Technology Behind the Speed-Up
DFlash relies on three main ideas working together.
The first is block-diffusion drafting, where several possible future tokens are predicted at once instead of sequentially.
The second is target hidden-state conditioning. The DFlash drafter receives context features from the larger target model, giving it a stronger understanding of what the main model is likely to generate next.
The third is key-value injection, which passes those target-model context features deeper into the drafter's processing layers. This is intended to improve the quality of its predictions and increase the number of tokens accepted during verification.
The result is a smaller model that does not need to fully reason through every token from scratch. Instead, it uses signals from the larger model to make faster, more informed draft predictions.
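The link between draft quality and speed can be estimated with a standard simplification from the speculative decoding literature: assume each drafted token is accepted independently with probability alpha. Under that assumption, a round that drafts k tokens yields (1 - alpha^(k+1)) / (1 - alpha) tokens on average, counting the one token the target always contributes. The function name and the sample rates below are illustrative, and real acceptance is not independent from one position to the next.

```python
def expected_tokens_per_round(acceptance_rate: float, k: int) -> float:
    """Average tokens produced per verification step, assuming each of the
    k drafted tokens is accepted independently with the given probability."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance_rate == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative acceptance rates, not measured values.
    for rate in (0.5, 0.7, 0.9):
        avg = expected_tokens_per_round(rate, k=8)
        print(f"acceptance {rate:.1f}: about {avg:.2f} tokens per verification step")
```

Under this model, improving the drafter's acceptance rate pays off more than linearly at high rates, which is why conditioning the drafter on the target model's own features is attractive.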
Support for Popular AI Inference Frameworks
One reason DFlash is attracting attention is that it is moving beyond research demonstrations.
The project has released model checkpoints for several model families, including Qwen, Kimi, Llama, Gemma and gpt-oss. It is also being supported through widely used inference tools such as vLLM, SGLang and TensorRT-LLM.
For teams already using these platforms, adoption may be relatively straightforward. In some cases, switching from an existing speculative decoding approach to DFlash can be handled mainly through configuration changes and a compatible draft-model checkpoint.
NVIDIA reported up to 5.8 times higher throughput for Gemma 4 31B using vLLM on a Blackwell Ultra GPU, while Qwen3 8B on SGLang showed gains of up to 5.1 times over standard autoregressive decoding in the tested benchmarks.
A Sign of Where AI Inference Is Headed
The rise of DFlash reflects a broader shift in AI development.
For a long time, the focus was mainly on training larger and more capable models. Now, attention is increasingly turning toward how efficiently those models can be deployed in the real world.
A model is only useful if people can access it at a reasonable speed and cost. Faster inference can mean lower serving costs, better responsiveness and the ability to support more users with the same infrastructure.
DFlash does not replace autoregressive models. Instead, it uses a diffusion-based drafter to make the existing generation process more efficient while leaving final verification to the original target model.
That balance may prove useful as AI services continue to grow in scale and complexity.
Final Thoughts
DFlash is an interesting example of how AI systems can be made faster without simply relying on bigger hardware.
By replacing sequential drafting with parallel block-diffusion predictions, the technique gives GPUs more useful work to do during one of the slowest parts of LLM generation. The target model still verifies the result, helping preserve response quality while improving speed.
For developers building coding assistants, AI agents, enterprise chat systems or high-volume reasoning tools, this could become an important optimisation path. The reported gains will not apply equally to every setup, but the core idea is clear: the future of AI performance may depend just as much on smarter inference techniques as it does on larger models.

