Speculative Speculative Decoding

or, Queue Management by Any Other Name

Tim Dingman

Jun 02, 2026

Paper · Repo

Originally presented as a live talk on May 20, 2026

Background

To understand this paper, we have to understand how a model on a GPU turns your prompt into predicted tokens.

Let’s start with what we already know from our daily use of LLMs: you pass in your prompt all at once, you wait a bit, and then you start getting back tokens one by one.

Already we can start to relate to this diagram. The part where you pass in your prompt is called prefill. During prefill, you are filling up the working memory of the model, what we call the KV cache. The diagram here kinda breaks up “KV” and “cache”, but you can see that the prompt turns into KV vectors, and then those get cached, hence “KV cache”.

So now we have all the input in our working memory. That can take a long time depending on the hardware you’re using and how big your input is, but importantly, prefill happens for all input tokens in parallel. On my GPU at home, prefill runs at several hundred tokens per second.

So once prefill is done, the model is ready to start making predictions. And it makes those predictions one at a time, as the diagram shows: first “jumps”, then “over”, etc etc. This stage is called decode, because we’re decoding the mathematical representations that the model works with into words that humans work with.

Because decode runs one token at a time, it is much slower than prefill. On my GPU, decode runs at something like 30 tokens per second.

Note that when we decode a token, it gets cached too - that’s why the dotted line at the top spans prefill and decode. So the KV cache, that working memory, grows every time the model outputs a new token.
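
To make the two phases concrete, here is a toy sketch. It assumes a made-up `toy_kv` function standing in for the real key/value computation, and a canned list of output tokens instead of a real model, so only the shape of the loop and the growth of the cache are meaningful.

```python
def toy_kv(token):
    """Hypothetical stand-in for computing one token's key/value vectors."""
    return (len(token), sum(map(ord, token)))


def prefill(prompt_tokens):
    # A real model handles every prompt token in one parallel pass;
    # the list comprehension plays that role here.
    return [toy_kv(t) for t in prompt_tokens]


def decode(kv_cache, canned_output):
    # One token per step. Each new token's KV entry is appended,
    # so the cache grows by one on every step.
    cache_sizes = []
    for token in canned_output:
        kv_cache.append(toy_kv(token))
        cache_sizes.append(len(kv_cache))
    return cache_sizes


cache = prefill(["The", "quick", "brown", "fox"])
sizes = decode(cache, ["jumps", "over"])
print(sizes)  # [5, 6]
```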

Now let’s look a level deeper, at the hardware. On a GPU, you have your chip and your memory, aka your VRAM. When you start up your model server, the server reads the model from your hard drive and puts it into your VRAM, right next to your chip.

When you actually use your model, like by sending it a prompt and getting back a response, your model server takes one layer of your model at a time from VRAM and sends it to your chip for computation. So if I’m at the very first attention layer, it’s gonna take my input and the matrices that actually make up the first attention layer and send ‘em to the chip for multiplication and so on. Then it takes that first attention layer back from the chip, along with the KV cache created from the computation, and it’s gonna send in the first feed-forward layer for computation.

That process of loading and computing and unloading happens over and over again until you’re at the final layer, where the chip can finally produce the predicted token. Then you gotta do the whole routine over again to predict the next token.

So this shuttling of weights to and from the chip is typically what slows you down - the constraint is your memory bandwidth, not the speed of your chip. If you can somehow compute multiple tokens at once in decode, like you do for prefill, then you can avoid repeating that shuttling. Sure, it costs you a tiny bit of extra time on the chip to deal with more tokens, but that’s peanuts compared to the time you saved in transit.
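
Here is a back-of-the-envelope sketch of why bandwidth dominates. It assumes decode speed is set purely by how long it takes to stream all the weights past the chip once per pass. The model size and bandwidth are hypothetical round numbers, and the estimate ignores KV cache reads and compute time, so treat it as intuition rather than a benchmark.

```python
def decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s, tokens_per_pass=1):
    """Estimate decode speed if streaming the weights is the only cost."""
    seconds_per_pass = weight_bytes / bandwidth_bytes_per_s
    return tokens_per_pass / seconds_per_pass


weights = 16e9     # hypothetical: 8B parameters at 2 bytes each
bandwidth = 500e9  # hypothetical: 500 GB/s of memory bandwidth

one_token = decode_tokens_per_second(weights, bandwidth)
four_tokens = decode_tokens_per_second(weights, bandwidth, tokens_per_pass=4)
print(round(one_token, 2), round(four_tokens, 2))  # 31.25 125.0
```

Under these assumptions, every extra token handled in the same pass is nearly free, because the weights only make the trip once.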

The question is, how do you predict multiple tokens? If predicting a token requires understanding all the tokens before it, how can you predict more than one?

The answer is in the name of the technique: “speculative decoding”. Instead of making a brand new prediction every time, you speculate about what the next few tokens will be - you take a guess beforehand and then check.

Speculative decoding works for the same reason prefill is faster than decode: inputs get processed in parallel. As we said before, once you load the weights onto the chip, it’s quick to do one or two or three or four calculations. As long as you have a good way to guess tokens, the model can check them all in parallel.

Of course, if the first token fails, then the other ones you guessed after it will likely be wrong and you’ll have to throw them away. But if your method for guessing tokens is cheap enough, and you’re not wrong too often, it can work out.

One common method is to have a version of the model itself make the guesses. Specifically, a much smaller version, ten or a hundred times smaller in fact, so it’s much faster and also can fit on the same GPU. This “draft model” as it’s called is not nearly as smart as the target model that is actually producing tokens, but it’s often smart enough. After all, most tokens are not incredibly complex or subtle; language is chock full of common and supporting words, and a lot of sentences are pretty mundane, meant to support the occasional novel or surprising sentence. It’s even more true for code, which demands predictable structure in a way natural language doesn’t.

As the graphic shows, the draft model quickly produces a few tokens, which all go into the target model in parallel rather than in serial. And thinking back to the last slide about where the bottleneck is, because you now have multiple tokens ready for computation, you save all that shuttling from VRAM to the chip on the second and third and fourth tokens.

In the example here, we have indeed generated four tokens, but the third one gets rejected, and the target model’s predicted token takes its place. The fourth draft token doesn’t get checked at all, because it depends on the third draft token being correct, which it wasn’t.

Let’s zoom in on this example a bit more. We have our four speculated tokens from the draft model, and we’re going to verify them with the target model.

Specifically, we’re going to check if the probability of the speculated token for the target model is at least as high as the probability of that token from the draft model. Like in our case, the target model thought “Brown” was 93% likely, and the draft model thought it was 92% likely, and since the target model is smarter, we take the increased probability as a sign that the draft model was pointing us in the right direction. Similar story for “Fox”.

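But for “Hopped”, the draft model was more confident in that token than the target model was. That’s a bad sign. In the standard scheme, a token like that survives only with probability equal to the target’s probability divided by the draft’s, so the bigger the gap, the more likely a rejection. In our example, the draft model’s choice gets rejected.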

Incidentally, when the target model rejects the third token, it substitutes its own - in this case it’s the word “jumped”. That extra token you get from the target model when it rejects the token from the draft model is called a “bonus token”, because you get it “for free” in the process of verification. If you’re really lucky and all your speculated tokens get approved, you get a bonus token after that, directly from the target model. Like if all four tokens had been right in this example, we would have gotten a fifth token as well, with virtually no extra effort.
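
The verification step can be sketched in a few lines. This follows the standard speculative sampling rule (always accept when the target is at least as confident as the draft, otherwise accept with probability target/draft, and on rejection draw the replacement from the leftover probability mass). The distributions are made up for illustration, apart from the 93% and 92% for "brown", and the target's 0% for "hopped" is chosen so the outcome is deterministic.

```python
import random


def sample(weights, rng):
    """Draw one token in proportion to its weight (weights need not sum to 1)."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify(draft_tokens, draft_probs, target_probs, rng):
    """Return (accepted_tokens, bonus_token).

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs has one extra entry, for the position after the last draft token.
    """
    accepted = []
    for i, token in enumerate(draft_tokens):
        q = draft_probs[i][token]            # draft's probability (> 0, it was drafted)
        p = target_probs[i].get(token, 0.0)  # target's probability
        # Certain accept when p >= q, otherwise accept with probability p / q.
        if rng.random() < min(1.0, p / q):
            accepted.append(token)
            continue
        # Rejected: the replacement comes from the leftover mass max(0, p - q).
        residual = {t: max(0.0, pt - draft_probs[i].get(t, 0.0))
                    for t, pt in target_probs[i].items()}
        return accepted, sample(residual, rng)
    # Everything accepted: the extra token comes straight from the target.
    return accepted, sample(target_probs[len(draft_tokens)], rng)


draft_tokens = ["brown", "fox", "hopped", "over"]
draft_probs = [{"brown": 0.92, "red": 0.08}, {"fox": 0.90, "dog": 0.10},
               {"hopped": 0.80, "jumped": 0.20}, {"over": 0.70, "on": 0.30}]
target_probs = [{"brown": 0.93, "red": 0.07}, {"fox": 0.95, "dog": 0.05},
                {"hopped": 0.0, "jumped": 1.0}, {"over": 0.90, "on": 0.10},
                {"the": 1.0}]

accepted, bonus = verify(draft_tokens, draft_probs, target_probs, random.Random(0))
print(accepted, bonus)  # ['brown', 'fox'] jumped
```

Note that the fourth draft token is never looked at, since the function returns as soon as the third is rejected.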

Now as you might imagine, the draft model is going to be better at predicting some tokens than others. Like on a hard reasoning problem, the acceptance rate will be quite low, maybe like 25%. But on a more structured and straightforward task, like using a web search tool, it could be near 100%. So speculative decoding isn’t a complete across-the-board speedup, but for a lot of mundane LLM uses it’s helpful. It’s really an empirical question depending on your use cases, your hardware, what model you’re using, stuff like that. Like me personally, on my hardware and for my Hermes Agent, speculative decoding 4 tokens at a time has been helpful and fits on my GPUs.
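
To get a feel for what acceptance rate means for throughput, here is a small sketch. It assumes every speculated token is accepted independently with the same rate, which is a simplification since real acceptance varies from token to token. Under that assumption, the expected number of tokens per verification pass (accepted tokens plus the bonus token) is a short geometric sum.

```python
def expected_tokens_per_pass(acceptance_rate, draft_length):
    """Expected tokens gained per target pass, counting the bonus token.

    Assumes each draft token is accepted independently with the same
    probability, and that everything after the first rejection is discarded.
    """
    return sum(acceptance_rate ** i for i in range(draft_length + 1))


hard = expected_tokens_per_pass(0.25, 4)  # about 1.33
easy = expected_tokens_per_pass(1.0, 4)   # 5.0
print(round(hard, 2), easy)
```

So at a 25% acceptance rate, speculating 4 tokens buys only about a third of an extra token per pass, while at near 100% it approaches 5 tokens per pass.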

I should add that there are other forms of speculative decoding, or of trying to predict multiple tokens in one go anyway. We’ll briefly see one called EAGLE-3 in the paper for example.

Another example, which we’re looking at here, is literally called “multi-token prediction” or MTP. The difference with MTP is that it has to be part of a model’s training from the get-go, it’s not an external enhancement.

You can see it right there along the top, at the boxes labeled “Cross-Entropy Loss”. The loss is the single number that tells you how well your model is doing. In a normal model, your pretraining loss is based on how likely you predicted the actual next token in the training data would be. That’s the first box along the top, it has an arrow pointing to L_Main - that’s the symbol for loss.

But here in MTP, there are multiple losses! As you continue along the top, you’ll see L_MTP^1 and L_MTP^2 - the losses for predicting the first and second of the multiple tokens. So now your loss is from the normal token, the first MTP token, and the second MTP token. And if you do training right and minimize the overall loss, you can get pretty good at predicting multiple tokens.
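
As a rough sketch of how several losses roll up into one number: each loss is the cross-entropy for one prediction, and the total is the main loss plus a weighted contribution from the extra ones. The weighting here (a hypothetical `mtp_weight` applied to the average of the extra losses) is an assumption for illustration, and actual training recipes may combine the terms differently.

```python
import math


def cross_entropy(prob_of_correct_token):
    """Loss for one prediction: small when the right token got high probability."""
    return -math.log(prob_of_correct_token)


def mtp_total_loss(p_main, p_mtp, mtp_weight=0.3):
    """Main next-token loss plus a weighted average of the extra MTP losses.

    p_main: probability assigned to the true next token.
    p_mtp:  probabilities assigned to the true tokens further ahead.
    mtp_weight: hypothetical knob, not taken from any specific recipe.
    """
    main = cross_entropy(p_main)
    extras = [cross_entropy(p) for p in p_mtp]
    return main + mtp_weight * sum(extras) / len(extras)


total = mtp_total_loss(p_main=0.5, p_mtp=[0.25, 0.125])
print(round(total, 3))  # 1.213
```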

As with so many things, this flavor of MTP comes from the DeepSeek crew, building on earlier multi-token prediction research, and all their models since V3 have had it. More recently, Gemma 4 and Qwen3.5 got it, which has been great for local LLM users.

Lastly, one thing I want to remind folks of is what LLMs actually predict. It’s not just one token, it’s actually a distribution of tokens, each with its own likelihood.

Technically the model predicts a likelihood for every single token in its vocabulary, which is usually in the high tens or low hundreds of thousands. Of course nearly all of them will receive virtually 0% odds, so in practice you end up with something more like the distribution here, where a small number sum to almost 100% probability. The shape of the distribution, and how far it extends out, can vary a lot and will be important later on. For now though, just remember that the model usually has a few thoughts on what could or should come next.
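
A minimal sketch of where that distribution comes from: the model emits one raw score (a logit) per vocabulary token, and softmax turns the scores into probabilities. The five-token vocabulary and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_mass(probs, k):
    """How much probability the k most likely tokens account for."""
    return sum(sorted(probs.values(), reverse=True)[:k])


logits = {"jumps": 5.0, "leaps": 3.0, "hops": 2.0, "sits": -1.0, "banana": -6.0}
probs = softmax(logits)
mass = top_k_mass(probs, 3)  # well over 99% sits on the top three tokens
print(round(mass, 3))
```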

The Paper

So now that we’re equipped with the knowledge of speculative decoding, and other techniques like it, we can finally discuss speculative speculative decoding.

For reference, we have SD on the left. Our draft model speculates a few tokens in blue, our target verifies the first one but rejects the second and third, and in the process produces a bonus token, in yellow. A line of connected dots means a sequence of tokens, so we see the end result is two new tokens. Then the cycle begins again, with the next speculated sequence in red.

You’ll notice the draft and target turns happen in serial - one waits while the other works. Surely we could make use of the draft model while the target model is working, right? Like the draft model is fast, it could produce lots of tokens in the time it takes the target model to verify.

There are two problems though. One is that the target model is using up all the memory bandwidth while it’s working. So at least on the same GPU, there is no way for the draft model to work in parallel. That’s fixable with another GPU of course.

The second, bigger problem is that the draft model depends on having all the prior tokens available. If the target model isn’t done verifying, then we don’t know all the prior tokens yet! We know the prior speculated tokens, but of course not all the speculated tokens will be right. Even worse, the bonus token could in theory be any token in the vocabulary. What are we supposed to do?

Well as you probably guessed, there is a solution, shown on the right. Let me walk through it.

First off, you’ll see there are now two parallel paths, separated by a little dotted line down the middle. That’s showing the target model on one GPU and the draft model on another GPU. So they can work simultaneously.

Now the sequence of events is this:

The target produces a token, shown in green
The draft model returns a few speculated tokens
The target model starts verifying. Relatively speaking, that’s gonna take a while
On the draft side, we start preparing contingencies, shown as this kinda branching thing with tones of red and yellow. Specifically, at each token position, we are guessing what bonus tokens the target will produce in case the next speculated token gets rejected. So even for the first token, the green one from the target, we could get the first speculated token rejected. So now we have to be prepared for whatever bonus token in yellow the target model will produce
Since we have our bonus tokens predicted, we might as well speculate on what comes after those. Again, that’s the chains in red. So as long as the actual bonus token is one we anticipated and speculated on, we’re ready to immediately return that new speculated sequence
In this example, the target model verified only the first speculated token, then it produced a bonus token, and that bonus token was one of the three we anticipated. So the draft model immediately returned a new speculated sequence, in that medium shade of red. It’s a little hard to see, but in the tree thing they bolded the line showing the sequence of events
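
The sequence of events above boils down to a lookup table that the draft side fills in while the target is busy. Here is a minimal sketch, assuming the contingencies are keyed by how many speculated tokens got accepted plus which bonus token came back. The `toy_guess_bonus` and `toy_draft_continue` functions are hypothetical stand-ins for the draft model, and the real system's data structures may look quite different.

```python
def build_contingency_cache(draft_tokens, guess_bonus, draft_continue, guesses_per_position):
    """Precompute next speculations keyed by (number accepted, bonus token)."""
    cache = {}
    # n_accepted runs from 0 (first draft token rejected) to all accepted.
    for n_accepted in range(len(draft_tokens) + 1):
        prefix = draft_tokens[:n_accepted]
        for bonus in guess_bonus(prefix, guesses_per_position):
            cache[(n_accepted, bonus)] = draft_continue(prefix + [bonus])
    return cache


def next_speculation(cache, draft_tokens, n_accepted, bonus, draft_continue):
    """Return (speculated_tokens, was_cache_hit) once verification reports back."""
    key = (n_accepted, bonus)
    if key in cache:
        return cache[key], True  # ready immediately, no waiting on the draft
    # Miss: the draft has to start over from the verified prefix.
    return draft_continue(draft_tokens[:n_accepted] + [bonus]), False


# Toy stand-ins for the draft model (hypothetical, just to make the sketch run).
def toy_guess_bonus(prefix, n):
    return ["jumped", "leaped", "ran"][:n]


def toy_draft_continue(prefix):
    return [f"after-{prefix[-1]}-{i}" for i in range(1, 4)]


draft = ["brown", "fox", "hopped", "over"]
cache = build_contingency_cache(draft, toy_guess_bonus, toy_draft_continue, 3)

# Target accepted one token, then produced a bonus token we anticipated.
hit_tokens, hit = next_speculation(cache, draft, 1, "jumped", toy_draft_continue)
# Same position, but a bonus token we did not anticipate.
miss_tokens, miss_hit = next_speculation(cache, draft, 1, "sat", toy_draft_continue)
print(hit, miss_hit)  # True False
```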

That’s basically it! You now understand the core concept of SSD. Now we look at optimization and performance.

So the big catch on this method is that you have to guess the bonus token, or at least guess it enough of the time that your extra effort isn’t wasted.

On its face, that might seem tough, since the bonus token can in theory be any token in the vocabulary. But of course in practice, it’s not a random choice, and the whole idea of a draft model is it knows how to make reasonable guesses at what the target model will say.

As we know, LLMs actually produce a distribution of tokens, so we have a straightforward way to guess bonus tokens from the start: pick the most likely token in the distribution as the one to speculate, and save a few others for your guesses at the bonus token, in case the speculated token was wrong.
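
That naive approach fits in a couple of lines. This is a sketch with a made-up draft distribution, and `split_candidates` is a hypothetical helper name.

```python
def split_candidates(probs, n_bonus_guesses):
    """Top token becomes the speculation; the runners-up become bonus guesses."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[0], ranked[1:1 + n_bonus_guesses]


draft_distribution = {"hopped": 0.50, "jumped": 0.30, "leaped": 0.15, "ran": 0.05}
speculated, bonus_guesses = split_candidates(draft_distribution, 2)
print(speculated, bonus_guesses)  # hopped ['jumped', 'leaped']
```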

On the surface that seems fine. But because of how verification works, it’s actually a big problem if your other guesses are overconfident, which they tend to be for draft models. Intuitively, since draft models are much smaller than target models, draft model distributions tend to be narrower, putting a lot of probability on the few tokens they think could be right. By contrast, the smarter target models have wider distributions, accounting for genuinely different thoughts but also things like richer vocab.

So one key to making this work is to fiddle with the raw distributions of the draft model to spread probabilities out more, take out that overconfidence and just hedge a bit. That makes your bonus tokens a better safety net.

The other optimization they work on is how much speculating to do, and where exactly to do it. Like if you had all the time in the world, you would pick tons of bonus tokens at every position in the speculation. But in reality we only have the time it takes for the target model to verify, so the draft model needs to make the best use of it. How exactly they figure out the optimal number of bonus tokens per position is too far in the weeds for this post, but you can see in the graphic that different positions have different numbers.
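
To be clear, what follows is not the paper's allocation method, just a simple greedy heuristic to show the flavor of the tradeoff. It assumes we know how likely verification is to stop at each position and how likely our first, second, third guess at a bonus token is to be right (and that those guess odds are the same at every position). All the numbers are made up.

```python
import heapq


def allocate_guesses(outcome_probs, guess_hit_probs, budget):
    """Spend a fixed budget of bonus-token guesses where they help most.

    outcome_probs[i]:   chance that verification stops at position i.
    guess_hit_probs[j]: chance our (j+1)-th best guess is the bonus token,
                        in descending order.
    Returns how many guesses each position receives.
    """
    counts = [0] * len(outcome_probs)
    # Max-heap (via negated gains) of the next guess's payoff at each position.
    heap = [(-w * guess_hit_probs[0], i) for i, w in enumerate(outcome_probs)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break  # every position already has all available guesses
        _, i = heapq.heappop(heap)
        counts[i] += 1
        if counts[i] < len(guess_hit_probs):
            next_gain = outcome_probs[i] * guess_hit_probs[counts[i]]
            heapq.heappush(heap, (-next_gain, i))
    return counts


allocation = allocate_guesses(
    outcome_probs=[0.5, 0.3, 0.2],
    guess_hit_probs=[0.6, 0.2, 0.1],
    budget=4,
)
print(allocation)  # [2, 1, 1]
```

The likeliest stopping position gets the extra guess, which matches the intuition that different positions deserve different numbers of contingencies.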

Okay, now let’s look at performance. Our target model here is Llama 3.1 70B, and the draft model is Llama 3.2 1B.

Here they compare three methods of decoding, of producing tokens:

Autoregressive, which is the baseline of just one token at a time
Speculative decoding, which runs draft and target models in serial
Speculative speculative decoding, which runs draft and target models in parallel

As you’d expect, SSD wins out, at 4x the speed of AR, and almost twice the speed of SD when using vLLM, which is a very popular LLM server program. SGLang is a relative newcomer and apparently does better with SD than vLLM does, but still, SSD is about a third better.

It’s important to note that the efficacy of speculative decoding, and other techniques like MTP, varies by domain and difficulty.

Like a very easy programming problem is gonna be super predictable: it has lots of structured language, the problem itself is easy, and programming is a very common use of LLMs, so the draft model’s guesses are gonna be pretty good.

By contrast, SD will probably not work that well on high-minded creative writing. The draft model is gonna have bad guesses, and there’s just not nearly as much of that stuff in the training data in the first place.

So to that end, it’s important to measure speed gains on different benchmarks. Here we have one code, two chat, and one math benchmark, respectively. As we would expect, the programming and math benchmarks show higher speed gains.

On a related note, temperature also impacts SSD performance. It’s the same intuition, that less predictable text means a lower hit rate for SSD. Higher temperature flattens out the probability distribution and makes guessing harder. That’s also why people conflate high temp with creativity, because it makes unlikely tokens relatively more likely, thus creating surprises.
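
Here is a small sketch of that flattening. Raising each probability to the power 1/T and renormalizing is equivalent to dividing the logits by T before softmax. The three-token distribution is made up for illustration.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution: T > 1 flattens it, T < 1 sharpens it."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


probs = {"jumped": 0.80, "leaped": 0.15, "ran": 0.05}
hot = apply_temperature(probs, 2.0)

# The top token drops from 0.80 to roughly 0.59, so a single guess
# at the next token is right less often.
print(round(hot["jumped"], 2))
```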

My Takeaways

There is simultaneously a compute shortage and compute overhang
- Lots of efforts to squeeze more performance out of existing hardware
This is a “yes and” approach
- Innovation will continue on the other constraints (e.g. memory bandwidth)
ML infra is still in its infancy
- People have only been serving LLMs at scale for a few years
- Innovations in architecture (e.g. MoE) and inference (e.g. reasoning/TTC scaling) will present different avenues for optimization
I am waiting for AI to discover something like this
- Anyone working on RSI will hit this stuff at some point
- Great test of creativity in the wild
