LEMON BLOG

NVIDIA’s DFlash Could Make Large Language Models Respond Much Faster

Thursday, 25 June 2026

News

Large language models may appear to respond instantly, but behind every answer is a highly repetitive process. Most modern LLMs generate text one token at a time. A token might be a word, part of a word, a punctuation mark or a short piece of code. The model predicts one token, adds it to the context, then predicts the next. This continues until the response is complete.

That process works well, but it creates a major bottleneck for AI services that need to handle many users at once. The larger and more complex the task becomes, the more the delay can be felt. This is especially important for coding assistants, reasoning tools and AI agents that need to generate long responses while still feeling responsive.

NVIDIA is now highlighting a technique called DFlash that could help address that limitation by changing how the "drafting" stage of AI generation works.

Why Token-by-Token Generation Is a Problem

Autoregressive models are built around a simple rule: generate the next token based on everything that came before it.

The approach produces high-quality results, but it is inherently sequential. The GPU cannot simply generate a full paragraph in one step because each new token depends on all of the tokens that came before it, including the one the model has only just produced.
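To make the sequential dependency concrete, here is a minimal sketch of greedy token-by-token generation. The `toy_model` function is a hypothetical stand-in for a real LLM (it simply counts upward and then emits an end-of-sequence token), so only the shape of the loop matters.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[int]], int], prompt: List[int],
             max_new_tokens: int, eos: int) -> List[int]:
    """Greedy autoregressive decoding: one model call per generated token."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # needs every token produced so far
        context.append(token)        # the new token becomes part of the input
        if token == eos:
            break
    return context[len(prompt):]


# Hypothetical toy "model": counts upward, then emits 0 as end-of-sequence.
def toy_model(context: List[int]) -> int:
    last = context[-1]
    return 0 if last >= 5 else last + 1


print(generate(toy_model, prompt=[1], max_new_tokens=10, eos=0))
```

Each pass through the loop has to finish before the next one can start, which is the bottleneck the rest of this article is about.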

This means that even powerful AI hardware can sit partly idle during generation, because the next step cannot begin until the current one has produced its token. In latency-sensitive environments, that limits how many users can be served without slowing down each individual response.

The challenge becomes even greater as AI shifts from short chat prompts to multi-step workflows. Coding agents, research systems and other task-oriented tools may need to produce large amounts of text, reason through problems and interact with multiple tools before reaching a result.

How Speculative Decoding Helps

One existing solution is speculative decoding.

Instead of asking the largest model to generate every token by itself, a smaller and faster model tries to predict several future tokens first. The larger target model then checks those predictions in parallel.

When the draft is accurate, the target model can accept multiple tokens in a single verification step. This reduces the number of full generation cycles needed and can speed up the response.
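The draft-and-verify cycle can be sketched as follows. This is a simplified greedy version, assuming both models always pick their single most likely token. The two toy models are hypothetical, and the verification loop stands in for what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # context -> next token (greedy)


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """One draft-and-verify cycle; returns the tokens to append to context."""
    # 1. The small model proposes k tokens.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal. A real system scores all positions
    #    in a single parallel pass; the loop here is only for readability.
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(proposed[i])

    # 3. Every proposal matched, so the target's next token comes for free.
    accepted.append(target(context + proposed))
    return accepted


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong right after a 3.
def toy_target(context: List[int]) -> int:
    return (context[-1] + 1) % 10


def toy_draft(context: List[int]) -> int:
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


print(speculative_step(toy_target, toy_draft, [1], k=4))
```

In this greedy setting the accepted tokens are exactly what the target would have generated on its own. The draft only affects how many tokens each cycle yields.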

However, traditional speculative decoding still has a limitation. The smaller draft model usually generates its predictions one token at a time as well.

That means the drafting process can become slower as the number of proposed tokens increases. It is faster than relying entirely on the largest model, but it does not fully remove the sequential bottleneck.
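A rough cost model shows why longer drafts stop paying off when the drafter is sequential. The sketch below assumes each drafted token is accepted independently with the same probability `p`, which is a simplification, and the millisecond figures are made-up illustrative values rather than measurements.

```python
def expected_tokens_per_cycle(p: float, k: int) -> float:
    """Expected tokens emitted per cycle with k drafted tokens.

    Assumes independent per-token acceptance with probability p < 1 and
    counts the extra token the target supplies at the end of each cycle.
    """
    return (1 - p ** (k + 1)) / (1 - p)


def cycle_ms(k: int, draft_ms_per_token: float, verify_ms: float) -> float:
    """Time for one cycle when the drafter produces its k tokens one by one."""
    return k * draft_ms_per_token + verify_ms


def tokens_per_ms(p: float, k: int,
                  draft_ms_per_token: float, verify_ms: float) -> float:
    return expected_tokens_per_cycle(p, k) / cycle_ms(k, draft_ms_per_token, verify_ms)


# Illustrative numbers only: 80% acceptance, 2 ms per drafted token, 10 ms verify.
for k in (2, 4, 8, 16):
    print(k, round(tokens_per_ms(0.8, k, 2.0, 10.0), 4))
```

With these assumed numbers, throughput improves from k=2 to k=4 and then declines, because drafting time keeps growing linearly while the expected number of accepted tokens levels off.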

DFlash Takes a Different Approach

DFlash replaces the usual autoregressive drafter with a lightweight block-diffusion model.

Rather than predicting one future token after another, the DFlash drafter predicts a whole block of masked future tokens in a single forward pass. In simpler terms, it attempts to fill in several upcoming pieces of text at the same time.

The larger target model still performs the final verification, so it remains responsible for the actual answer. This is important because it means DFlash is designed to accelerate generation without changing the final output quality expected from the main model.
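The structural change can be sketched in a few lines: the drafter is called once per cycle and returns the whole block, while the target keeps the final say. This is a greedy toy illustration, not DFlash's actual implementation. `toy_block_drafter` is a hypothetical stand-in that deliberately gets its last guess wrong.

```python
from typing import Callable, List

Target = Callable[[List[int]], int]                   # context -> next token
BlockDrafter = Callable[[List[int], int], List[int]]  # (context, k) -> k tokens


def block_speculative_step(target: Target, drafter: BlockDrafter,
                           context: List[int], k: int) -> List[int]:
    """One cycle with a block drafter: a single draft call, then verification."""
    proposed = drafter(context, k)  # the whole block comes from one call

    accepted: List[int] = []
    for token in proposed:
        # A real system verifies all positions in one parallel target pass.
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # the target's token wins
            return accepted
        accepted.append(token)

    accepted.append(target(context + accepted))  # bonus token on full acceptance
    return accepted


# Hypothetical toy models: the target counts upward modulo 10.
def toy_target(context: List[int]) -> int:
    return (context[-1] + 1) % 10


# The toy drafter fills every position at once, with a wrong final guess.
def toy_block_drafter(context: List[int], k: int) -> List[int]:
    guesses = [(context[-1] + 1 + i) % 10 for i in range(k)]
    guesses[-1] = (guesses[-1] + 1) % 10
    return guesses


print(block_speculative_step(toy_target, toy_block_drafter, [1], k=4))
```

The wrong guess at the end of the block never reaches the output, because verification replaces it with the target's own token.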

The technique shifts more of the work into parallel processing, which is where modern GPUs are most effective.

Why This Matters for NVIDIA Hardware

Modern AI systems are not always limited by raw processing power alone. During text generation, memory movement and token-by-token execution can become the real bottlenecks.

DFlash is designed to expose more parallel work to the GPU during the decoding stage. That allows the hardware to spend less time waiting for sequential token generation and more time handling multiple operations together.

NVIDIA tested the approach on Blackwell-based systems using the gpt-oss-120b model and TensorRT-LLM. In its reported benchmarks, DFlash delivered more than 15 times the throughput of standard autoregressive decoding at high-interactivity targets, while also outperforming EAGLE-3 speculative decoding in the same test environment.

At lower concurrency, the company also reported that DFlash could more than double responsiveness for certain workloads.

These results are promising, although they should be viewed in context. Performance can vary depending on the model, hardware, framework, workload, batch size and latency target.

Better Performance for Coding, Reasoning and AI Agents

The potential benefits are especially relevant for workloads that need both speed and long-form output.

Interactive coding tools need to generate code quickly without making developers wait. Reasoning systems may need to produce lengthy internal responses before giving users an answer. AI agents can perform multi-step tasks that involve planning, tool calls, analysis and repeated generation.

In all of these cases, a small delay per token can become a much larger delay by the end of the task.
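A quick back-of-the-envelope calculation shows how that adds up. The token count and per-token delay below are made-up illustrative values, not figures from NVIDIA's benchmarks.

```python
def added_delay_seconds(tokens: int, extra_ms_per_token: float) -> float:
    """Total extra wait caused by a fixed per-token delay."""
    return tokens * extra_ms_per_token / 1000


# Assumed example: a 4,000-token agent run with 5 ms of extra delay per token.
print(added_delay_seconds(tokens=4000, extra_ms_per_token=5))
```

Under those assumptions, a delay too small to notice on a single token adds 20 seconds to the whole task.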

DFlash aims to improve that trade-off. By generating more candidate tokens in parallel, it may allow AI systems to maintain a smoother experience for each user while also serving more users at the same time.

NVIDIA's tests showed DFlash outperforming EAGLE-3 across coding, retrieval-augmented generation, reasoning, writing, multilingual and summarisation workloads.

The Technology Behind the Speed-Up

DFlash relies on three main ideas working together.

The first is block-diffusion drafting, where several possible future tokens are predicted at once instead of sequentially.

The second is target hidden-state conditioning. The DFlash drafter receives context features from the larger target model, giving it a stronger understanding of what the main model is likely to generate next.

The third is key-value injection, which passes those target-model context features deeper into the drafter's processing layers. This is intended to improve the quality of its predictions and increase the number of tokens accepted during verification.

The result is a smaller model that does not need to fully reason through every token from scratch. Instead, it uses signals from the larger model to make faster, more informed draft predictions.
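The second and third ideas can be pictured with a toy attention layer in which the drafter's block positions attend over target-model features as well as over each other. This is only a conceptual sketch under heavy assumptions: a single attention head, no learned projections, and every block position seeing every other one. It is not DFlash's actual architecture, and all names in it are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Vector], keys: List[Vector],
           values: List[Vector]) -> List[Vector]:
    """Plain scaled dot-product attention for one head."""
    scale = 1 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) * scale for key in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def drafter_layer(block_states: List[Vector],
                  target_features: List[Vector]) -> List[Vector]:
    """Toy drafter layer: target features are added to the keys and values.

    Assumption: a real layer would first pass both inputs through learned
    projections; they are used as-is here to keep the sketch short.
    """
    keys = target_features + block_states
    values = target_features + block_states
    return attend(block_states, keys, values)


# One masked position (all zeros) reading two target feature vectors.
print(drafter_layer([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Even an uninformative masked position ends up with an output shaped by the target's features, which is the intuition behind conditioning the drafter on the larger model.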

Support for Popular AI Inference Frameworks

One reason DFlash is attracting attention is that it is moving beyond research demonstrations.

The project has released model checkpoints for several model families, including Qwen, Kimi, Llama, Gemma and gpt-oss. It is also being supported through widely used inference tools such as vLLM, SGLang and TensorRT-LLM.

For teams already using these platforms, adoption may be relatively straightforward. In some cases, switching from an existing speculative decoding approach to DFlash can be handled mainly through configuration changes and a compatible draft-model checkpoint.

NVIDIA reported up to 5.8 times higher throughput for Gemma 4 31B using vLLM on a Blackwell Ultra GPU, while Qwen3 8B on SGLang showed gains of up to 5.1 times over standard autoregressive decoding in the tested benchmarks.

A Sign of Where AI Inference Is Headed

The rise of DFlash reflects a broader shift in AI development.

For a long time, the focus was mainly on training larger and more capable models. Now, attention is increasingly turning toward how efficiently those models can be deployed in the real world.

A model is only useful if people can access it at a reasonable speed and cost. Faster inference means lower serving costs, better responsiveness and the ability to support more users with the same infrastructure.

DFlash does not replace autoregressive models. Instead, it uses a diffusion-based system to make the existing generation process more efficient while leaving final verification to the original target model.

That balance may prove useful as AI services continue to grow in scale and complexity.

Final Thoughts

DFlash is an interesting example of how AI systems can be made faster without simply relying on bigger hardware.

By replacing sequential drafting with parallel block-diffusion predictions, the technique gives GPUs more useful work to do during one of the slowest parts of LLM generation. The target model still verifies the result, helping preserve response quality while improving speed.

For developers building coding assistants, AI agents, enterprise chat systems or high-volume reasoning tools, this could become an important optimisation path. The reported gains will not apply equally to every setup, but the core idea is clear: the future of AI performance may depend just as much on smarter inference techniques as it does on larger models.

About the author

Mr LemonGuy

Creator of Lemon Web Solutions, Mr. LemonGuy explores the front lines of technology—from cybersecurity to AI-driven development. Part developer, part digital architect, he focuses on delivering high-impact industry news and open-source projects that bridge the gap between emerging tech and real-world application.
