Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

or, Save the Whale's Compute

Tim Dingman

Jun 08, 2026

Originally presented as a live talk on February 18, 2026

Paper · Repo

Background

Let’s start with DeepSeek, the authors of this paper. They are one of the two top-tier labs in China, along with the Qwen team. Where Qwen shoots for quantity though, DeepSeek goes for quality.

Just look at the models DeepSeek has released here on their Hugging Face page. It’s been over a year since they released their last major model, DeepSeek V3, there at the bottom. It was apparently updated in March last year but it originally came out at the end of 2024. And you may remember DeepSeek R1, the model that briefly crashed the stock market in January 2025. They made two minor releases in 2025, and they released an OCR model for turning images of text into text, but that’s about it. Compare that to the Qwen page, where they’ve released models across pretty much all modalities in both major and minor versions in the last year, and you’ll see what I mean.

Fundamentally it’s all because of resource constraints. For one, China overall is way behind the US in terms of available compute. They also don’t have access to the latest and greatest NVIDIA chips. So they’re working with fewer chips from a generation behind. That means every lab in China has to get clever.

Since Qwen is part of Alibaba, they still have relative abundance within the Chinese AI ecosystem. That’s why they’re able to produce models of many types and sizes and so on. But DeepSeek is different. They are an outgrowth of a hedge fund. They have always been more of a science experiment, and even though the hedge fund does well they don’t have the budget of an Alibaba. So DeepSeek is the cleverest of all these clever Chinese labs.

And because they’re so clever and so focused on R&D, they end up producing some pretty impressive papers. Probably the most impressive one is this Mixture of Experts paper, from all the way back in January 2024. That’s incredibly early in the history of the MoE architecture, which is totally dominant now in SOTA models. While they didn’t invent the idea, which actually dates to well before the modern Gen AI era, they laid a lot of the groundwork for how those SOTA models do MoE today.

First, they made lots of smaller experts and selected several per token, rather than having a few bigger experts and only picking one or two. If you remember Mixtral, which was 8 experts of 7B parameters each where only two were active per token, you know how big of a change DeepSeekMoE was. The smaller experts allowed more specialization and fine-grained knowledge.

Second, they introduced a shared expert, which is always active no matter which other experts are selected. The one shared expert tends to learn basic things like sentence structure or formatting, leaving room in all the other experts for more complex knowledge.

The graphic on the right shows both changes. In a traditional MoE, you might have N experts and pick 2. In this example, they bump it out to 2N experts and have 4 active: 1 shared expert and 3 selected experts.
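
To make those two changes concrete, here is a minimal sketch of a layer with one shared expert and top-3 routing over 7 routed experts, matching the 2N = 8 example with N = 4. Everything in it is an assumption for illustration: the experts are single random linear maps, the router is a plain dot product, and the softmax is taken over the selected experts only. It is not DeepSeek’s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8          # toy hidden size (assumption)
NUM_ROUTED = 7   # 7 routed + 1 shared = 8 experts, i.e. 2N with N = 4
TOP_K = 3        # routed experts selected per token

def make_expert():
    """A toy expert: one random linear map (real experts are small MLPs)."""
    weight = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)
    return lambda x: weight @ x

shared_expert = make_expert()
routed_experts = [make_expert() for _ in range(NUM_ROUTED)]
router = rng.normal(size=(NUM_ROUTED, DIM))  # one scoring row per routed expert

def moe_forward(x):
    """Run one token vector through the shared expert plus its top-k routed experts."""
    scores = router @ x
    chosen = np.argsort(scores)[-TOP_K:]           # indices of the k best-scoring experts
    gates = np.exp(scores[chosen] - scores[chosen].max())
    gates /= gates.sum()                           # softmax over the chosen experts only
    out = shared_expert(x)                         # the shared expert is always active
    for gate, idx in zip(gates, chosen):
        out = out + gate * routed_experts[idx](x)
    return out, sorted(chosen.tolist())

token = rng.normal(size=DIM)
output, active = moe_forward(token)
print(active)        # which 3 of the 7 routed experts fired for this token
print(output.shape)  # (8,)
```

Only 4 of the 8 expert weight matrices are touched for this token, which is where the compute savings come from.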

Now in modern MoEs they’ve turned these knobs all the way up. Like in this paper they have 64 experts with 8 active, for a sparseness of 8, whereas in the most recent DeepSeek model they have 256 experts with 8 active, for a sparseness of 32.

And remember, this was all born of resource constraints, not pure performance gains. The virtue of MoE models is they trade expensive compute for cheap memory. So like in this paper, their model is 16B parameters but only has 2.8B active per token, swapping experts in and out of memory as they’re selected. Performance matched that of Llama 2 7B, which takes more than twice the compute to run. MoE is more complex and finicky, but if you’re really skilled engineers like the folks at DeepSeek, you can handle the complexity to unlock efficient performance.
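
The numbers from the last two slides are easy to check by hand. The sketch below just redoes the arithmetic, under the rough assumption that per-token compute scales with the number of active parameters.

```python
def sparsity(total_experts, active_experts):
    """How many experts exist for every expert that actually runs."""
    return total_experts / active_experts

print(sparsity(64, 8))    # 8.0  (the DeepSeekMoE paper's setup as described above)
print(sparsity(256, 8))   # 32.0 (the most recent DeepSeek model as described above)

total_params_b = 16.0     # total parameters, in billions
active_params_b = 2.8     # parameters used per token, in billions
dense_params_b = 7.0      # Llama 2 7B, where every parameter is active

active_fraction = active_params_b / total_params_b
compute_ratio = dense_params_b / active_params_b  # assumes compute ~ active parameters

print(f"{active_fraction:.1%} of the MoE model is active per token")
print(f"the dense 7B model needs about {compute_ratio:.1f}x the compute")
```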

Keep those themes of sparsity and complexity for performance in mind. There’s a reason they call their findings “a new axis of sparsity”.

Now let’s get into the topics of this paper. I want to start from a human, intuitive point of view and then transition into the technical.

First I want to highlight the different types of memory in humans. People talk about short-term vs long-term memory, but the popular notion of short-term memory is confused; it’s actually on a pretty short timescale as the graphic shows, and the better term is “working memory”. An example of working memory is if I give you a phone number to remember and repeat back to me. Most people have a working memory of 5-7 things. In the phone number example, a number without an area code is 7 digits, and if it’s an area code you’re familiar with you don’t need to keep all the digits in working memory, you have it in long-term memory.

Now within long-term memory, I want to distinguish between declarative and procedural memory. Declarative is stuff you can accurately describe and verbalize, like past events or facts. Procedural is stuff you can do but may not be able to explain well, “muscle memory” and the like. Declarative memory is much more amenable to written records and is easy to look up, while procedural memory is more diffuse. That split is going to be important in how the authors structured the Engram model.

Here’s a little illustration of working vs long-term memory. Since working memory is difficult to expand, we get smarter and more capable by putting things in long-term memory and then referencing them in working memory. So again, thinking back to my phone number example, the area codes in your home state are in long-term memory, so if someone gives you a phone number you can just point to that set of three digits rather than taking up three slots in working memory.

By analogy, working memory is like compute for a model, whereas long-term memory is like parameters. If you can look up the right parameters and swap them in, then you can avoid constructing the same ideas from scratch with your compute. That’s going to save compute for the things that require it, like handling new information and doing reasoning.

Thinking further about memory and meaning, let’s recall another slide from that agent memory paper in January. It shows how text turns into tokens, then into token IDs, then finally into embeddings. Embeddings are literally vectors, just a series of numbers, but more abstractly they represent the meaning of a token. And as we know from natural language, words and concepts and proper nouns etc. have more than just their strict meaning: there are also connotations, synonyms, personal associations, many many layers.

All LLMs have this single-token embeddings table. It’s how the model turns words into vectors it can then process into a next token prediction. But you could also do the same thing with groups of tokens instead of individual tokens. Like you could have one row in an embeddings table for “Alexander the Great” as one group, one n-gram as researchers call it, in this case a 3-gram since it has 3 words in it.

Of course you can’t have a row for every possible combination of three tokens, there are just too many combinations. But there could be ways around that with enough cleverness.
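
To put a number on “too many”, here is a quick sketch. It treats whole words as tokens and assumes a vocabulary of 100,000 tokens; both are simplifications for illustration, not figures from the paper.

```python
def ngrams(tokens, n):
    """Every run of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Alexander", "the", "Great", "rode", "into", "battle"]
print(ngrams(tokens, 3)[0])  # ('Alexander', 'the', 'Great')

vocab_size = 100_000               # assumed vocabulary size
rows_for_single_tokens = vocab_size
rows_for_all_3grams = vocab_size ** 3

print(f"{rows_for_single_tokens:,}")  # 100,000
print(f"{rows_for_all_3grams:,}")     # 1,000,000,000,000,000
```

A table with a quadrillion rows is not happening, and almost all of those rows would be combinations that never show up in real text anyway.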

As for how LLMs build meaning from n-grams now, here’s an example from a related paper about how models piece together context as they process input layer by layer. The middle column, “Latent State Translation”, is a model’s own description of its “thoughts” so to speak at that point in its processing.

The input phrase here is going to be “Diana, Princess of Wales”. And at the start, after the first few layers of attention + feedforward, the model has really only picked up the very basic fact that this input relates to a country in the UK and in Europe. In the fourth layer it picks up on princess, in the fifth layer it connects princess and Wales, and then finally in layer 6 it recognizes that hey, we’re actually talking about a specific historical person with that title who has other facts associated with her, like her lifespan and her work.

That’s helpful and where we want to end up of course, but notice how much work went into putting together something the model already knew and was not context-dependent in any way. Like you could look up that exact information in a barebones encyclopedia. Maybe there’s a way to cut out that reconstituting process and just get the fully constructed information right away.

The Paper

Well that’s exactly what they did here with Engram. This figure is from later in the paper but it illustrates the idea pretty well: if you find that the last two or three tokens formed a known entity or phrase or something, just look up that meaning rather than reconstituting it from the knowledge in your parameters.

So in the first example, “Great” is bright red because the 3-gram “Alexander the Great” is well known. Same with most of the tokens in his horse’s name.

Then in the second example, the “way” in “By the way” is also quite red, because that’s a stock phrase and doesn’t need to be interpreted as individual words or with much attention to other context.

It even works in Chinese. In the first example, the first red block is the end of a famous phrase, The Four Great Inventions. You can see after the colon there are four list entries, and each one of them ends in red blocks too. For reference, the four inventions are papermaking, the compass, gunpowder, and printing.

This example is for 3-grams as I mentioned, but they also did 2-grams. They tried 4-grams but it barely helped, so apparently most proper nouns and stock phrases and the like are two or three tokens.

Okay, we’ve arrived at the architecture piece, which is going to take the most explaining.

First I want to situate us. At core we are still working with a transformer, still seeing the core operations:

Embedding, which turns tokens into meanings
Attention, which calculates how each token relates to every other token and establishes the meaning of the entire input
Feed forward, which processes that meaning into a prediction for the next token. In this model it happens to be a MoE setup.

The big thing we’ve added is this Engram piece, also with a slight tweak to the embeddings which we can ignore. Let’s take each new step in order.

So the Engram starts with three inputs: the result of the previous transformer block, which is called “Input Hidden” here; then there’s the 2-gram; and the 3-gram.

Input Hidden will come into play later. For the 2-gram and 3-gram, we’re going to turn them into row IDs for our embedding tables using something called a hash. So the 2-gram and 3-gram are each going to become h different embedding rows.

Then we’re going to put all those rows together into one bigger row, specifically a row that is 2h times as wide as each individual row. That’s called concatenation. That one big row stores all the meanings our embedding tables learned.
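
The hash-then-concatenate step can be sketched in a few lines. The hash function, table sizes, and the choice of one table per (n-gram size, hash) pair are all assumptions made for illustration; the paper’s actual hashing scheme is not reproduced here.

```python
import numpy as np

NUM_HASHES = 4      # "h" in the text; the value is an assumption
TABLE_ROWS = 1_000  # rows per table, tiny compared to a real model
ROW_DIM = 16        # width of one embedding row

rng = np.random.default_rng(0)
# One toy table per (n-gram size, hash index) pair.
tables = {
    (n, j): rng.normal(size=(TABLE_ROWS, ROW_DIM))
    for n in (2, 3)
    for j in range(NUM_HASHES)
}

def row_id(ngram, seed):
    """Deterministic toy hash: maps a tuple of token IDs to a row index."""
    value = seed + 1
    for token_id in ngram:
        value = (value * 1_000_003 + token_id) % 2_147_483_647
    return value % TABLE_ROWS

def fetch_memory(token_ids):
    """Look up the trailing 2-gram and 3-gram under every hash, then concatenate."""
    parts = []
    for n in (2, 3):
        ngram = tuple(token_ids[-n:])
        for j in range(NUM_HASHES):
            parts.append(tables[(n, j)][row_id(ngram, seed=j)])
    return np.concatenate(parts)

memory = fetch_memory([17, 512, 9001])  # made-up token IDs
print(memory.shape)  # (128,), i.e. 2 * NUM_HASHES * ROW_DIM
```

Two different n-grams can land on the same row under one hash, which is the price of not storing every combination. Using several hashes makes it unlikely that they collide under all of them at once.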

After that we’re going to basically copy the attention playbook. But whereas attention looks at how each part of the input relates to every other part, Engram looks at how each part of the input relates to these meanings we’ve fetched from the embedding tables. So maybe in this example the prior context is about Greek history, or about famous horses, or about literary devices in myth. That’s going to get mixed in with these context-free meanings we have about Alexander the Great, using that attention technique.

Now not all 2-grams and 3-grams are going to be meaningful of course. In fact most of them won’t have a static, context-free meaning the way proper nouns or common phrases do. If the fetched meanings are dissimilar from the constructed meanings, then the scaled dot product is going to be low, because the dot product measures similarity. So then you end up ignoring the fetched meanings. That’s fine because the Engram operation is pretty cheap, and the gain you get when the fetched meanings are similar to the constructed meanings outweighs the cost.
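
Here is a minimal sketch of that “ignore it if it doesn’t fit” behavior. It assumes the fetched row has already been projected into a key and a value of the same width as the hidden state, and it squashes the scaled dot product with a sigmoid so the gate sits between 0 and 1. Both choices are assumptions for illustration rather than details taken from the slides.

```python
import numpy as np

def gate(hidden, memory_key):
    """Scaled dot product between the constructed and fetched meanings, squashed to (0, 1)."""
    scale = np.sqrt(hidden.shape[-1])
    score = float(hidden @ memory_key) / scale
    return 1.0 / (1.0 + np.exp(-score))

def gated_memory(hidden, memory_key, memory_value):
    """Let the fetched meaning through only as far as the gate allows."""
    return gate(hidden, memory_key) * memory_value

rng = np.random.default_rng(0)
hidden = rng.normal(size=64)        # what the model has constructed so far
memory_value = rng.normal(size=64)  # what the lookup returned

agreeing_key = hidden.copy()        # fetched meaning lines up with the context
conflicting_key = -hidden           # fetched meaning points the other way

print(gate(hidden, agreeing_key))     # close to 1: the memory is used
print(gate(hidden, conflicting_key))  # close to 0: the memory is ignored
```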

Anyway, you do a quick convolution at the end for stability and then add that Engram output to Input Hidden. Now your attention and feed-forward layers have this extra meaning and can skip the work they would have had to do to calculate that same meaning.
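
The last step, sketched under heavy assumptions: the talk only says “a quick convolution”, so the depthwise, causal, 3-tap kernel below is a guess at the general shape, not the paper’s exact layer. The part that does come straight from the description is the final line, where the Engram output is added back onto Input Hidden.

```python
import numpy as np

def causal_depthwise_conv(x, kernel):
    """x: (seq_len, dim), kernel: (k, dim).
    Each position mixes itself with the k - 1 positions before it, per channel."""
    k = kernel.shape[0]
    padded = np.concatenate([np.zeros((k - 1, x.shape[1])), x], axis=0)
    return np.stack(
        [(padded[t:t + k] * kernel).sum(axis=0) for t in range(x.shape[0])]
    )

def add_engram(input_hidden, gated_memory, kernel):
    """Smooth the gated memory over nearby positions, then add it to Input Hidden."""
    return input_hidden + causal_depthwise_conv(gated_memory, kernel)

SEQ_LEN, DIM, KERNEL_SIZE = 5, 4, 3   # toy sizes (assumptions)
rng = np.random.default_rng(0)
input_hidden = rng.normal(size=(SEQ_LEN, DIM))
gated_memory = rng.normal(size=(SEQ_LEN, DIM))
kernel = rng.normal(size=(KERNEL_SIZE, DIM))

output_hidden = add_engram(input_hidden, gated_memory, kernel)
print(output_hidden.shape)  # (5, 4), same shape as Input Hidden
```

Because the result has the same shape as Input Hidden, the attention and feed-forward layers that follow do not need to change at all.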

So that’s how it works for one transformer block with Engram. Not every transformer block needs Engram though, like in this example they show it at layers 2 and 4, although real models are dozens of layers deep so this is not totally indicative. In fact, the full model they train has 30 layers, and they put Engram blocks at layers 2 and 15, for reasons we’ll see in a bit. Earlier is generally better though since the memory lookup work is most helpful when the model is still piecing meanings together.

This diagram is also helpful because it reinforces the point that Engram memories do NOT live on the GPU, on the device as they say here. Instead, the Engram tables are either in RAM or on the hard drive, both of which are far cheaper than GPU compute or memory. Again, there’s a throughline with their MoE paper; they’ve traded a little of an expensive resource, compute, for a lot of a cheap resource. If compute is your constraint then you want to make as many versions of that trade as you can.

Here’s another view of how having this Engram block to look up memories helps. On the left we have a graph that basically shows how far each layer’s representation is from the final one, so lower means closer. In general your representation gets closer and closer to the final one as you keep processing, which makes sense. The interesting bit here is that the Engram lines, in yellow and red, are lower than the normal model’s line in blue. That means the Engram blocks give the model a head start sometimes, like a little boost, on its way to the final representation.

The sort of blocky diagrams on the rest of the slide tell a similar tale. We don’t need to get into what CKA means, but basically the lighter colors mean more similarity. So you can see that the early Engram layers are more similar to middle layers of the baseline model, with jumps at layers 2 and 15 where the Engram blocks are.

So it seems like Engram helps. The question is, how much? To measure that they did a few things.

One version of this question is, if I only have a certain number of parameters, how should I split them between Engram and the rest of the blocks? That’s what the graph on the left checks. The allocation ratio, rho, is the % of parameters allocated to the rest of the blocks. So at 100%, you have no Engram, just the baseline model. At 0% you would have just Engram which doesn’t really make sense, so they only investigate down to 40%. So they train all these test models with different allocation ratios, and they measure the loss at two different amounts of training. At both levels of training, you see the same pattern, where something like 80% is optimal. That means a full 20% of parameters in the optimal model go to Engram.
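
As a quick worked example of what rho means, here is the split for a hypothetical budget of 10B parameters. The budget is made up for illustration. Only the definition of rho and the 100%, 80%, and 40% points come from the slide.

```python
def split_budget(total_params, rho_percent):
    """rho is the share of the budget that goes to the rest of the blocks;
    whatever is left over goes to the Engram tables."""
    rest_of_model = total_params * rho_percent // 100
    engram = total_params - rest_of_model
    return rest_of_model, engram

budget = 10_000_000_000  # hypothetical 10B-parameter budget

for rho_percent in (100, 80, 40):
    rest, engram = split_budget(budget, rho_percent)
    print(f"rho = {rho_percent}%: {rest / 1e9:.0f}B for the rest, {engram / 1e9:.0f}B for Engram")
```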

Another version of this question is, how much can I scale Engram? Like keep the other parameters fixed, just allow for bigger and bigger embedding tables. That’s what the graph on the right addresses. What it finds is no sign of a limit in the range they tested: the more space you allocate for memory, the better your model does. It might become impractical, but if you can support or afford it, it seems to always help.

Of course the ultimate test of how much Engram helps is how much better the end model is.

Here we compare four models, all with the same number of active parameters and thus the same demands on inference compute: a dense 4B, a MoE 27B as we’ve been seeing for the baseline, Engram 27B to keep the same total parameters as the baseline, and Engram 40B where we keep the same active parameters and experts as Engram 27B but really bump out the Engram part, more than tripling those parameters.

Keep in mind that these are base models, not instruct models, so all they’ve done is pretraining, no post-training.

Anyway, as you can see Engram 27B beats the baseline model on every benchmark, sometimes by 4 or 5 points. As we saw with that U-curve on the previous slide, allocating ~20% of parameters to Engram is going to really help performance.

What’s interesting is that Engram 40B is not uniformly better. For several benchmarks it’s less than a point better, which I consider basically equal, and for a couple of them Engram 27B is better by more than a point. The authors attribute the somewhat mixed record to the shared limit on pretraining tokens, which is likely too low for a 40B model. Also, when you bump out the Engram parameters without bumping out the other parameters, you get away from that optimal allocation ratio.

One area of performance they pay special attention to is long-context, which typically means 32k or higher, even up to 1M in rare cases. They examine long context because it tests their theory that Engram allows the model to refocus effort from entity or phrase recognition in local contexts, which Engram now handles, to reasoning and global contexts.

Here they deploy two benchmarks, LongPPL and RULER, both on 32k token sequences. We don’t need to examine every metric, but the three groups are helpful to know: that’s perplexity, which is a measure of surprise; NIAH, short for “needle in a haystack”, where the model has to fetch one or more words or phrases from the long context; and then the other tasks, which are frankly less artificial than NIAH, for example question answering.
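
Since perplexity comes up a lot, here is the plain textbook version in a few lines: take the probability the model gave to each token that actually came next, average the negative logs, and exponentiate. This is ordinary perplexity on made-up probabilities, not the specific LongPPL recipe.

```python
import math

def perplexity(true_token_probs):
    """exp of the average negative log-probability assigned to the tokens that actually occurred."""
    avg_neg_log_prob = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(avg_neg_log_prob)

confident = [0.9, 0.8, 0.95, 0.85]   # the model mostly saw these tokens coming
surprised = [0.1, 0.05, 0.2, 0.1]    # the model was caught off guard

print(perplexity(confident))  # close to 1: very little surprise
print(perplexity(surprised))  # much higher: a lot of surprise
```

One way to read the number: a perplexity of about 4 means the model was as unsure as if it were picking evenly among 4 options at every step.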

So compared to the baseline model, with 50k pretraining steps and 1.63 pretraining loss, some version of Engram pretty much always wins. The one with the most pretraining steps does the best unsurprisingly, but the one with fewer pretraining steps yet equal pretraining loss to the baseline model generally does second-best. So it seems like you can get more performance for the same amount of pretraining, or use less pretraining to get the same performance.

Finally, they do what’s called an ablation, basically taking away some of the stuff you added to see how important it actually was. In this graph they ablate the Engram component, so like they train the complete model with two Engram blocks and then just delete the Engram blocks.

Here’s the result, grouped by benchmark skill. The more reasoning-heavy tasks like reading comprehension and commonsense reasoning take minor hits, while factual recall tasks fall off a cliff. This validates two claims: one, Engram offloads a lot of factual memory but not process memory; two, that offloading lets the rest of the model focus more on other work.

My Takeaways

I suspect this will follow the same path as DeepSeekMoE: appreciated for good engineering but not really “in the water” until DeepSeek releases a ~SOTA model with Engram
- One of their researchers confirmed on Twitter that DeepSeek V4, whose release is imminent, will not have Engram
Post-training an Engram model will present new challenges
- Recall that the models in this paper are only base models
- If Engram becomes the new standard, we will need to track such differences
Could make the most sense on edge devices, where power management is an additional concern
- 1B params is ~2 GB at full quality (i.e. in BF16 format, before quantization)
- The Engram parameters (~20% of the total) can live in storage rather than RAM
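
For the edge-device point, the arithmetic looks like this, using the 1B-parameter example and the ~20% Engram share from the bullets above. BF16 stores each parameter in 2 bytes. The split is a back-of-the-envelope sketch, not a measurement.

```python
BYTES_PER_PARAM_BF16 = 2  # BF16 uses 16 bits per parameter

def footprint_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

total_params = 1_000_000_000
engram_share = 0.20  # the ~20% allocation discussed earlier

total_gb = footprint_gb(total_params)
engram_gb = total_gb * engram_share   # could sit in storage
in_ram_gb = total_gb - engram_gb      # still has to sit in RAM

print(f"total:      {total_gb:.1f} GB")
print(f"in storage: {engram_gb:.1f} GB")
print(f"in RAM:     {in_ram_gb:.1f} GB")
```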
