Embarrassingly Simple Self-Distillation Improves Code Generation

or, How to Train Your Distribution

Tim Dingman

Jun 01, 2026

Paper · Repo

Originally presented as a live talk on May 27, 2026

Background

So to really understand this paper, we need to know a lot about how LLMs predict the next token.

We spend a lot of time talking about the three main parts of the transformer:

Embeddings, which turn tokens into vectors
Attention, which forms a holistic understanding of the input
Feed-forward, which processes or “thinks about” that holistic understanding

What we typically neglect though is what happens at the end of the transformer, after all N transformer blocks. What is that last step that turns the output of the last feed-forward layer into a token? The way it’s shown here, it makes it seem like the last feed-forward layer just outputs tokens.

In fact, that’s not the case. After all your transformer blocks, you need some way to turn that final transformer output into whatever it is your model is supposed to output. In the case of an LLM, that’s going to be language.

So we attach and train a language modeling head, or LM head for short. The end result of the LM head is not a single token, but a probability for all tokens in the model’s vocabulary, which usually contains around 100k tokens.

Now the vast majority of them are going to be zero or near-zero probability. But depending on the context, you could have kind of a long tail of possible outcomes. Of course in other cases it’s going to be pretty certain, like if the sentence says “The capital of France is”, the token “Paris” is gonna get 99.9% probability or something.
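To make that concrete, here is a minimal sketch of the very last step: the LM head emits one raw score (a logit) per vocabulary token, and a softmax turns those scores into probabilities. The five-token vocabulary and the logit values are made up for illustration; a real model has on the order of 100k entries.

```python
import math

def softmax(logits):
    """Turn raw LM-head scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for the context "The capital of France is"
vocab = ["Paris", "Lyon", "the", "a", "France"]
logits = [9.0, 2.0, 1.0, 0.5, 0.0]
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.4f}")
```

With these invented numbers "Paris" ends up with almost all of the probability and everything else sits in a thin tail.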

So that probability distribution is what the LLM produces. But how do we get from a probability distribution to a single, selected token?

There are some simple ways you could come up with, like taking the most likely one every time, or picking randomly based on their probabilities. And people do both.
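Those two simple strategies look roughly like this in code. This is a sketch with a made-up four-token distribution; the function names are mine, not from any particular library.

```python
import random

def greedy_pick(probs):
    """Always return the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sample_pick(probs, rng):
    """Return an index drawn at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]   # made-up distribution
rng = random.Random(0)           # seeded so the run is repeatable

greedy = greedy_pick(probs)
draws = [sample_pick(probs, rng) for _ in range(1000)]
```

Greedy picking returns the same index every time, while the sampled draws land on each token roughly in proportion to its probability.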

But there’s a lot more to it than that. And how exactly you sample from this distribution can be a surprisingly big deal. So we need to take a closer look at the mechanics.

So broadly, the knobs to twiddle for sampling from your LLM’s probability distribution are called “inference parameters”. Some other settings are inference parameters too, like the maximum number of tokens per response, but a lot of them are about shaping or sampling from the distribution at each position, each time you need a new token.

The most familiar one is temperature. If you’ve heard of it, you probably heard what it impacts, like creativity or predictability. That’s true, but it will serve us better to understand the math directly.

Temperature is a way to fiddle with the distribution before you sample it. When temperature is 1, you get the unaltered distribution, shown in the middle here. 1 is actually considered a high temperature, which might be surprising given the mathematical impact is to do nothing - like you’d naively expect that to be the baseline or neutral temperature.

The lower the temperature goes, the more uneven things get. So the most likely token gets more likely, and everything else gets less likely, sometimes dramatically so. In the extreme case, T = 0, the most likely token from the original distribution is the only token in the new distribution, so you always pick it. A lot of evals run at T = 0 because it makes results reproducible. Like if you always pick the most likely token, then given the same input you should always get the same sequence of tokens as outputs.

On the flip side, raising T past 1 flattens things out. In the extreme case where T approaches infinity, all tokens are equally likely. Practically speaking you of course would not want that, but even a temp like 1.5 shown here is considered pretty high.
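Here is the math as a small sketch. Temperature is usually applied by dividing the logits by T before the softmax, which is the same as raising each probability to the power 1/T and renormalizing. The distribution is made up, and treating T = 0 as "pick the top token" is a convention I am assuming, since you cannot literally divide by zero.

```python
def apply_temperature(probs, temperature):
    """Rescale a distribution: p_i ** (1 / T), then renormalize."""
    if temperature <= 0:
        # Assumed convention: T = 0 is the greedy limit, all mass on the top token.
        top = max(range(len(probs)), key=lambda i: probs[i])
        return [1.0 if i == top else 0.0 for i in range(len(probs))]
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.15, 0.05]        # made-up distribution
cold = apply_temperature(probs, 0.5)  # sharper: the top token gains mass
same = apply_temperature(probs, 1.0)  # unchanged
hot = apply_temperature(probs, 1.5)   # flatter: the tail gains mass
```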

You can see where the conflation of temperature and creativity comes from - if creativity means making unusual or unexpected choices, raising the temperature means choices that started unlikely are now relatively more likely.

So once you have your temperature set, there are a few other ways you can alter your distribution.

One of the most common ways is to keep just the top few token choices, drop the rest, and renormalize, which redistributes the dropped probability amongst the kept tokens in proportion to what they already had. That’s called “top-k”. In this example, k=2, so we kept the top two and dropped the rest and gave their probability mass to the two we kept.
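A minimal top-k sketch, using a made-up distribution. Note that in the usual implementation the kept tokens are renormalized, so they keep their relative proportions.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]      # made-up distribution
filtered = top_k_filter(probs, 2)   # roughly [0.625, 0.375, 0.0, 0.0]
```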

Another common choice is top-p, where you set a cumulative probability threshold and keep only the most likely tokens needed to reach it. So let’s say I picked p=0.95, meaning I start from the most likely token and keep adding tokens to my retained set until I cross 95% total probability. As soon as I cross that threshold, I drop the remaining tokens and again redistribute the probability mass.
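And a matching top-p sketch. The distribution and the threshold of 0.9 are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # threshold crossed: everything after this is dropped
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]       # made-up distribution
filtered = top_p_filter(probs, 0.9)  # keeps three tokens, drops the last
```

Unlike top-k, the number of surviving tokens depends on the shape of the distribution: a peaked distribution may keep one token, a flat one may keep many.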

You can do both by the way. Like you take your original distribution, do top-k so you only keep the top 10 or whatever, redistribute probability mass, and then do top-p. Depending on the shape of your distribution, doing one, the other, or both can make a big difference.

Now once you have your final distribution, after your rebalancing with temperature and truncating with top-k and top-p, you may want to characterize it, to describe it in a few metrics.

You’re likely familiar with the basic stats properties like the mean, the median, the standard deviation and so on. Maybe you even know fancier terms like skewness or kurtosis.

But for LLMs, the key metric for these next token probability distributions is the entropy.

The term originally comes from physics, specifically thermodynamics, where it measures disorder or randomness. From that framing it seems bad, and if you already have some baggage about the term like I once did, you’ll have to put it down for the LLM context.

The term gains a positive valence in information theory, where Claude Shannon adapted it. Here, entropy correlates with the amount of information a message carries, relative to a certain context. So for example if you have a weighted coin that always lands heads, then you’re never going to be surprised when I tell you the coin came up heads. But if I have a fair coin, then you’ll always be relatively surprised, since a priori you have no reason to believe heads vs tails. So in information theory, higher entropy is better, since it means the information is more valuable.

Let’s take that information theory understanding and apply it to LLMs. If my text so far is “The coin flip landed”, most of the probability is going to be split evenly between “heads” and “tails”, with a few other small terms like “on” or “near”. If we ignore those and just give 50% odds to “heads” and “tails”, then we get .5 for p(x). If you use base 2 for the log, which is customary, you get H = 1. The unit when using base 2 is bits, so that’s 1 bit of entropy.

If we work out the math for my example in the previous charts, we see more entropy, about 1.44 bits, since it contains more possibilities, but there is a clear favorite. If all four had the same odds, H would increase to 2 bits. If we had eight options all with the same odds, H would increase to 3 bits. Entropy is rising because we’re adding information about what could realistically happen.

Conversely, if one token had 97% odds and the rest just had 1% each, H would be about 0.24 bits, way lower than before, because the distribution would barely mean anything - you’re almost guaranteed to get the same token every time.
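If you want to check those numbers yourself, Shannon entropy in bits is H = -sum of p(x) * log2 p(x) over all outcomes x. A short sketch follows; the 97/1/1/1 case assumes four tokens in total.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

coin = entropy_bits([0.5, 0.5])                  # 1 bit
four_way = entropy_bits([0.25] * 4)              # 2 bits
eight_way = entropy_bits([0.125] * 8)            # 3 bits
peaked = entropy_bits([0.97, 0.01, 0.01, 0.01])  # about 0.24 bits
```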

So we can measure the entropy at each position in our response. Some positions will have low entropy, where it was really predictable and basically guaranteed what token we would get at that position. Other positions will have high entropy, where it seems the model had a genuine choice to make and considered many viable options.

Entropy is important not only for measuring, but for training. In fact, for pretraining and for SFT, entropy (specifically cross-entropy) is how you score your model!

Let me break it down. As we saw before, your model produces a distribution for the next token - what it thinks are the realistic tokens, for the given context, based on all the training data it has seen so far. In the formula above, we’re going to call the model’s distribution q, and we’re going to call a candidate token at the current position x.

In addition to our model’s distribution, which we can measure and look at, there is also the true distribution, the likelihood of every possible next token given basically all the information in the world: all language ever spoken, the complete physical state of the person speaking it, the time of day, the weather, and so on and so on. On here, we call that p.

Of course that true distribution doesn’t concretely exist, in the sense that it can never be found and calculated. It’s more like a Platonic ideal.

But what we can do is treat any piece of training data as a sample from that true distribution. Like if I write an essay, that’s some real text from some combination of factors from the true distribution. So when we train on that, the model gets one particular view of the true distribution. And if we get lots of different samples, from all sorts of people and places and circumstances, we can get more and more samples from the true distribution. We’ll never fully build it, but we can keep getting closer with more data.

So when we model the true distribution, what we’re really doing is stringing all these individual discrete samples together into one smooth curve, which hopefully is close to the true distribution.

The mismatch between the samples of the true distribution p and our model q is the loss, specifically the cross-entropy. And the loss has two parts: the entropy of the true distribution p, which reflects genuine uncertainty about what comes next; and the distance between p and q, usually measured as KL divergence. Those parts together form the loss. We have no influence on p of course, so the loss never reaches zero, but we can keep trying to shrink the gap between p and q.
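That two-part split is the standard identity H(p, q) = H(p) + KL(p || q). Here is a small numeric check, with made-up stand-ins for p and q since the real p is never available.

```python
import math

def entropy(p):
    """H(p): the part of the loss we cannot train away."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum(p * log2(q)): the loss when data comes from p and the model is q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q): the extra bits paid for modelling p with q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5, 0.0]   # stand-in "true" distribution (made up)
q = [0.6, 0.3, 0.1]   # stand-in model distribution (made up)

loss = cross_entropy(p, q)
floor = entropy(p)        # what is left even with a perfect model
gap = kl_divergence(p, q) # what training can shrink
```

In actual training each example is a single observed token, so the per-token loss is just -log q(observed token); averaged over lots of data, that estimates the cross-entropy above.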

The Paper

This is the entire method for the paper. Hopefully some of the math looks a bit familiar, but I’ll break it down into English.

First, they take a model and use it to produce one response per prompt, with certain values for temperature, top-k, and top-p.

Then, they use those prompt-response pairs to do SFT.

Then they test the model, using potentially different values for temperature, top-k, and top-p.

That’s it! They don’t check the responses at all, there is no QA. It’s just raw synthetic data from the same model they train, with certain temperature and top-k and top-p.
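As a rough sketch of the data-building step described above: the `generate` callable, the dictionary layout, and the decoding values are all placeholders I made up so the example runs, not the paper's code or settings.

```python
def build_ssd_dataset(prompts, generate, temperature, top_k, top_p):
    """Sample one response per prompt and keep every pair, with no filtering."""
    dataset = []
    for prompt in prompts:
        response = generate(prompt, temperature=temperature, top_k=top_k, top_p=top_p)
        dataset.append({"prompt": prompt, "response": response})  # no correctness check
    return dataset

def fake_generate(prompt, temperature, top_k, top_p):
    """Stand-in for a real model call, only here so the sketch runs end to end."""
    return f"<sampled at T={temperature}, k={top_k}, p={top_p}> answer to: {prompt}"

pairs = build_ssd_dataset(
    ["Reverse a string", "Sort a list"],
    fake_generate,
    temperature=1.5, top_k=10, top_p=0.9,  # placeholder values, not the paper's
)
# Next steps (not shown): run ordinary SFT on `pairs` with the same model,
# then evaluate with its own, possibly different, decoding settings.
```

The thing to notice is what is missing: no test execution, no reward, no judge between sampling and training.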

Incredibly, this works. We’re going to come back to this table of results once we understand more about why their method works, but for now I wanted to flash the results.

So the core insight of the paper in my view is this: some parts of a response are very predictable, while others are very uncertain.

The most illustrative domain for their insight is code. A lot of code is pretty standard, pretty formulaic, because code has lots of hard and fast rules. Like in Python here, when you define a function, you have to write “def” and then a function name with parentheses around the arguments and then a colon. If it doesn’t look like that, you’re going to throw an error. There are lots of rules in natural language too, but almost none of them are hard and fast; natural language is famously flexible, it has artistic license and can even contain mistakes like misspellings without destroying the meaning.

Of course, since code is just a way to solve problems and one problem can have many solutions, some code is quite flexible. As the authors show here, there are many different ways to sort a list of things, and which one to pick can be a subtle matter, or may be only a matter of taste depending on the circumstances.

The authors coin two terms for the ends of this spectrum: “fork” for the uncertain cases, and “lock” for the certain ones.

As the charts illustrate, forks and locks have different needs. At a fork, the model needs to explore, to have lots of roughly equal next token probabilities so that it can learn better and so that minor changes to upstream context can tip it towards a different path easily. So there we would like a higher temperature.

By contrast, at a lock, the model needs to be sure. Locks have one right answer, and all other tokens in the distribution are distractors. At locks, we want low temperature.

The problem is that we can’t adjust temperature on the fly like this - it’s a parameter we provide once for the entire inference, for the entire response. In practice we end up compromising, but it’s not optimal.

So you might think we could just make a better rule, right? Like if the distribution looks like this, then change the temperature like that, maybe ignore the tail as defined by a certain probability, things like that. But that’s not very ML of us, right? That’s not very Bitter Lesson, putting human-engineered rules on a thing that learns. Instead, what if we could teach the model to change its distributions? After all, producing the right distribution is already what we train models to do!

What if there was some way to train models to dynamically adjust temperature, top-k, and top-p, so that we get distributions more like this - flatter at forks, more peaked at locks, and always ignoring distractors? Or if it’s not literally learning different T, k, and p values, at least learning to change distributions as if it did learn different T, k, and p.

In the interest of reducing variables, the researchers try out a much simpler model: a finite state machine (FSM). An FSM is not a transformer. It’s barely even a model. Very concretely, each node on here is a set of sixteen numbers, each representing the probability of going in one of sixteen possible directions. They call the directions “tokens” to match LLM terminology, which is why you see “tok” and then a number on all the arrows here.

The idea here is they can construct trajectories to only be made of forks and locks. Like starting from the root, tokens 0 and 1 are both viable, so there is a genuine choice to make. But then after the choice, from either Fork-A or Fork-B, there is only one right choice to make - there are three locks in a row.

The way the model operates is by selecting from its “tokens” in the same way an LLM selects from the tokens in its vocabulary: take the raw probabilities, adjust with temperature and top-k and top-p, then pick based on these final probabilities. So you’re isolating just the impacts of inference parameters and training from all the other LLM stuff, and you have this fake, tightly controlled scenario.

Importantly, you can train this FSM using cross-entropy loss on sequences of tokens, the same way you would train an LLM.
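To give a feel for how small this toy is, here is a heavily simplified sketch of one fork followed by three locks. The layout, the 0.8 probability mass on viable tokens, and the PASS rule are my own assumptions for illustration, and it samples from the raw probabilities without temperature or truncation.

```python
import random

NUM_TOKENS = 16

def make_node(good_tokens, good_mass):
    """A node is 16 probabilities: good_mass split over good tokens, the rest over distractors."""
    bad = NUM_TOKENS - len(good_tokens)
    return [
        good_mass / len(good_tokens) if t in good_tokens else (1 - good_mass) / bad
        for t in range(NUM_TOKENS)
    ]

# Hypothetical layout: a fork (tokens 0 and 1 both viable), then three locks (only token 2 is right).
good = [{0, 1}, {2}, {2}, {2}]
nodes = [make_node(ok, 0.8) for ok in good]

def sample_trajectory(rng):
    """Draw one token per node; PASS only if every token was a viable one."""
    tokens = [rng.choices(range(NUM_TOKENS), weights=node, k=1)[0] for node in nodes]
    passed = all(tok in ok for tok, ok in zip(tokens, good))
    return tokens, passed

rng = random.Random(0)
trajectories = [sample_trajectory(rng) for _ in range(1000)]
pass_rate = sum(passed for _, passed in trajectories) / len(trajectories)
```

Every trajectory is just a short list of token ids, which is exactly the kind of sequence you can feed to a cross-entropy loss.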

So what is their crazy method, what information exactly are they gonna train their toy model on?

You might think they can use the PASS or FAIL information. That would be perfectly reasonable, given we typically do SFT on good examples, or if you’re still in the RL mindset where we’re thinking in rewards. But that wouldn’t be teaching us anything new!

Instead, they’re just going to self-distill, to take a bunch of trajectories, whether PASS or FAIL, and train on them. The key variables to adjust are the inference parameters: temperature, top-k, and top-p.

What they show is that if you train on trajectories with the right inference parameters, regardless of the quality of the data, you can teach the model to shape its probabilities better.

This is what they actually observe, not just an illustrative example. For the lock nodes, the model learns to drop distractors and put more weight on the dominant token. For the fork node, the model learns to spread out more amongst the options left after truncation.

I want to give a quick intuition on why this should work before we get to the empirical results.

Let’s start with self-distillation at T=1, with no truncation. All we’re doing then is getting unaltered samples from the model and reinforcing those. But since we’re not changing the samples at all, or filtering them like we would if quality were a concern, then training on them should, on average, produce no change.

However, with SSD, we specifically avoid T=1, and we do use truncation, i.e. top-k and top-p. So now the synthetic samples are different from what the model normally would say, so there is something to learn from.

So really SSD is just reinforcing and sharpening instincts the model already has. If the distribution looks like a lock, truncating puts even more probability on the top token, and the imbalance is already large enough that temperature shouldn’t change much. If the distribution looks like a fork, temperature spreads out probability more, but truncation ensures you don’t explore too far down the tail.
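Here is that intuition as a sketch. Cross-entropy training on samples pulls the model toward whatever distribution the samples came from, so at each position the training target is effectively the shaped distribution. The lock and fork distributions and the settings T=1.2, top-k=2, top-p=0.9 are invented for illustration and are not the paper's values.

```python
def shape(probs, temperature, top_k, top_p):
    """Apply temperature, then top-k, then top-p, and renormalize what is left."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    scaled = [s / total for s in scaled]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    k_mass = sum(scaled[i] for i in order)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += scaled[i] / k_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(scaled[i] for i in keep)
    return [scaled[i] / kept_mass if i in keep else 0.0 for i in range(len(scaled))]

lock = [0.97, 0.01, 0.01, 0.01]        # made-up lock: one dominant token
fork = [0.45, 0.40, 0.05, 0.05, 0.05]  # made-up fork: two viable tokens plus a tail

lock_target = shape(lock, temperature=1.2, top_k=2, top_p=0.9)
fork_target = shape(fork, temperature=1.2, top_k=2, top_p=0.9)
unchanged = shape(fork, temperature=1.0, top_k=len(fork), top_p=1.0)
```

With these numbers the lock collapses onto its top token, the fork loses its tail while its two viable tokens move slightly closer together, and the T=1, no-truncation case hands back the original distribution, which is why there would be nothing to learn from it.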

It’s not gonna help if the model is trash in the first place. And it does bank on there being some untapped potential within the model to then tap. But on the flip side, it’s embarrassingly simple and practically free compared to producing high-quality training data.

Now let’s take a closer look at the empirical results.

For all models they tested, in basically every segment on the two versions of LiveCodeBench, SSD helps. If you’re not familiar with pass@1 and pass@5, the number is just how many chances a model gets to solve the problem. Pass@1 measures accuracy in the way an end user would expect, like of course we want the one response we get to be correct. Pass@5 is more a reflection of model potential, like is it able to get the right answer at all within five tries.
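For reference, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). I am assuming that convention here; I have not confirmed it is exactly what this paper uses, and the sample counts below are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=10, c=3, k=1)  # 0.3: plain accuracy
p5 = pass_at_k(n=10, c=3, k=5)  # about 0.917: chance that 5 tries include a pass
```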

Two trends to call out. First, within the instruct models, the smarter models benefited more. The authors don’t explain why, but I think it boils down to the whole “self-distillation” thing. Like if we think SSD’s function is to bring out the potential of the model, then smarter models likely have more potential. It’s actually pretty similar to how RLVR works, if you believe in the elicitation hypothesis, which I do. I wish they had actually done something similar here, where they measured pass@256 or some other really high number in order to see whether SSD was doing the same sort of internal optimization. We know it can’t be learning from the data in the normal way, given how this SFT data is created and how poor the quality can be, at least by traditional measures.

Second, the models generally improved more on the harder problems. Again, the authors don’t provide a direct explanation, but if we think back to the forks vs locks thing, we’d expect harder problems to have more forks. Like on an easy problem, not everything is a lock of course, but the distribution of positions is gonna be more on the lock side. I imagine for harder problems there will be way more forks, so if SSD helps with forks and locks, you can see why it might help with harder problems more.

Of course this doesn’t scale indefinitely; at some point, a problem is just impossibly hard for a model, and no amount of training is gonna help. So that might be starting to bite on the hard problems for Qwen3-4B for example.

To drive home the point about how the training data isn’t operating in the conventional way, they really dial up the temperature on some of it. As you can see on the left, a temperature of 2 can lead to complete gibberish, let alone unrunnable code.

And yet, if you train on this data, where 62% of outputs contain no extractable code at all, you still improve.

To give you an idea of how much better SSD is compared to just picking the optimal temperature, they do a sweep over temperature values on different benchmarks and models. In every case, the best temperature for the untrained model is worse than the performance of the trained model.

I also think this graph is helpful for providing an idea of temperature’s impact. 0.5 to 1.4 is a pretty big range, but for most cases the difference isn’t very big, and the big differences themselves fall in unpredictable patterns. Like why is there a seven-point drop from 0.9 to 1 for the bottom-right graph? Very strange.

My Takeaways

This is the elicitation hypothesis but with trash data
- How well it works depends on how much potential the model has left to bring out
Surely there is a cap to this work
The result is more instructive than useful
What are we losing?
- Leans more into priors
- So the model may be worse past a certain difficulty - just sharpens the boundary
