MAI-Thinking-1: Building a Hill-Climbing Machine
or, Microsoft Enters the SOTA Wars
Originally presented as a live talk on June 10, 2026
Background
So before this announcement, I think most people weren’t aware Microsoft worked on any models at all. Certainly they use models, like in their various Copilot products, but those are pretty much all models from other providers.
But actually, one arm of Microsoft has been releasing models for some time now. That arm is Microsoft Research, and they have a good reputation for their Phi series of models in particular. The Phi Family, as the image calls them, is a wide array of research-size models, runnable and trainable on relatively modest hardware, and covering a lot of different areas. The Phi models aren’t great generalists, but they have historically done well on code and STEM work.
Microsoft Research has a lot of other models on their Hugging Face page, and they release a decent number of papers as well, but at least in my mind, I have historically considered “Microsoft model” and “Phi” synonymous.
But this paper and this release are from a different arm: Microsoft AI. And to talk about Microsoft AI, we have to talk about this man: Mustafa Suleyman.
Mustafa Suleyman is a giant of the field. In 2010 he and Demis Hassabis cofounded DeepMind, which Google acquired in 2014. DeepMind is the original seat of Google’s own generative AI efforts, so already in his bio we can credit him as the father, or perhaps the grandfather, of Gemini.
Fast forward to early 2022. Suleyman leaves Google, joins a VC firm, then quickly leaves to cofound a new lab with Reid Hoffman called Inflection AI. Keep in mind, this is before ChatGPT came out in late 2022, but well after GPT-3 came out, in mid 2020. So folks in the know are seeing the pace of AI progress is starting to really pick up - inflecting, if you will.
In 2023, Inflection launched its ChatGPT competitor, a chatbot named Pi, powered by the Inflection series of models. Inflection-2 came out in November 2023, Inflection-2.5 came out in March 2024, but by then it was already clear OpenAI was well ahead on both capabilities and chatbot market share.
So on March 19th, 2024, their lead investor bailed them out: Microsoft hired Suleyman and most of the 70-person team to become Microsoft AI. They also paid a huge “non-exclusive license” fee. So this was one of the first of what I like to call “fake acquisitions”: a bigger company scooping out their target talent from a smaller company, delivering a big payout, and leaving the remains of the smaller company alone. This is the same playbook Amazon used on Adept and that Google used on both Character AI and Windsurf.
This playbook does share some characteristics with the investment Meta made in Scale in 2025, with Alexandr Wang and a few other folks leaving to form Meta Superintelligence Labs, but the other transactions were “licensing fees” with no economic upside for the bigger company. Meta actually has a stake in Scale’s success, and Alex is on the board. Obviously the pattern-matching led to some rough vibes at the time of the transaction, but Scale is very much still a going concern, while the other companies I mentioned are shells of their former selves.
Anyway, now two years later Suleyman and his Inflection crew, with a lot of other talent on board, have finally released their first models.
The parallel thread here, and the subtext to a lot of this, is the relationship between Microsoft and OpenAI.
Microsoft first invested in OpenAI all the way back in 2019, before the release of GPT-3, as the exclusive cloud provider for OpenAI. At the time, Microsoft execs felt they had fallen behind Google on machine learning, given the latter’s high-profile acquisition of DeepMind. So as part of the investment, Microsoft got a license to anything OpenAI built. It also gave Microsoft 75% of OpenAI’s profits, up to the amount they originally invested. Pretty wild deal in retrospect huh?
So Microsoft continued deepening the partnership, with more big investments in 2021 and 2023, and deployment of GPT-4 across tons of Microsoft product surfaces. And with Microsoft as the exclusive provider of compute, any increasing demand for GPT meant increasing demand for Azure, and thus increasing profit for Microsoft.
Then in November 2023 came the near-acquisition moment: the OpenAI board fired Sam Altman on a Friday, and over the weekend Microsoft offered to hire Sam and whatever staff wanted to move over from OpenAI, but within days Sam was back in the driver’s seat. As we know now, that left the door open for Suleyman at the future Microsoft AI.
Since then, much of the partnership has come undone. Microsoft is no longer the exclusive provider of compute. Their ownership stake is down, although owning roughly a quarter of a $1T company ain’t bad. And Microsoft’s IP and royalties arrangement with OpenAI is looser than before. Now with the release of the MAI models, it’s almost a frenemy relationship. And I’m sure Suleyman would love to best GPT one day.
Now briefly on the science side, I do want to cover one thing, which is loss.
Loss is the way you quantify how good your model is in many stages of training. It’s a single number that you set out to optimize, to get as low as possible via training.
For this paper, the form of loss we need to discuss is negative log likelihood, or NLL. NLL is what we use during pretraining and during supervised fine tuning (SFT).
Conceptually, NLL looks at the likelihood your model gave for the actual next token in your training data. If your model predicted the correct next token with 100% probability, the loss is 0. As the probability it gave the correct next token approaches 0%, the loss approaches infinity.
NLL is a nice way to measure performance because it is fast and objective compared to more intuitive forms of benchmarking, like human evals or leaderboards. It is also continuous between pretraining and SFT, so you can compare apples to apples. It also works for any text you think is trustworthy: just feed in some and see if the next predicted word is what your text says is next. That’s quite different from the training and eval data we make here, where we spend a lot of time ensuring quality and diversity and fit to spec and all that. Of course there are advantages to the other approaches and the other data like we make, and we’ll see MAI use them.
So for all these reasons, NLL is the consistent ruler we will see the MAI folks use to measure performance. Although occasionally we will see its cousin, “bits per byte” (BPB), which accounts for the differences in tokenizers.
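To make NLL and BPB concrete, here is a minimal sketch in Python. The probabilities, the sample text, and the four-token split are all made up for illustration; none of it comes from the paper's code. It assumes the usual definitions: NLL is the negative natural log of the probability given to the correct token, and BPB is the total NLL converted to bits and divided by the number of UTF-8 bytes.

```python
import math

def mean_nll(probs_of_correct_token):
    """Average negative log likelihood, in nats, over a sequence of tokens."""
    total = -sum(math.log(p) for p in probs_of_correct_token)
    return total / len(probs_of_correct_token)

def bits_per_byte(probs_of_correct_token, text):
    """Total NLL converted from nats to bits, divided by the UTF-8 byte count."""
    total_nats = -sum(math.log(p) for p in probs_of_correct_token)
    return total_nats / (math.log(2) * len(text.encode("utf-8")))

# Hypothetical probabilities the model assigned to each actual next token.
probs = [0.5, 0.25, 1.0, 0.5]
# Hypothetical text that a tokenizer split into those four tokens (13 bytes).
text = "hill climbing"

print(mean_nll(probs))             # about 0.693, i.e. ln 2
print(bits_per_byte(probs, text))  # about 0.308, i.e. 4 bits over 13 bytes
```

Notice that BPB divides by bytes rather than tokens, so two models with different tokenizers can be scored on the same text and compared directly.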
The Paper
Before we start, just note that I had to skip a lot of things in the interest of time, particularly the serving and training infrastructure. If you want that detail, it’s all in the paper - they were quite thorough.
Let’s start with architecture.
MAI-Base-1 is a 1T parameter model, with 35B parameters active, across 78 layers and with 512 total experts, 8 of which are active at a time, for a sparsity ratio of 64. That is all very similar to many other modern models where we have architecture details. For example, Kimi K2.6 is 1T total parameters and 32B active parameters. DeepSeek V4 Pro is larger but in the same ballpark. So far so good.
Where it starts to go off the beaten path is in the details of the layers.
So for attention, which is where the model forms a holistic picture of the input, they switch between local and global attention, in a 5:1 ratio. So that means for 5 attention layers, the attention is only looking 512 tokens back, and then for the sixth layer it looks all the way back to the first token, assuming we’re within the model’s 256k context window. Switching between local and global is common, but 5:1 is on the more extreme end.
The weirder thing, which I have never seen before, is switching between dense and mixture-of-experts feed-forwards - the layer where all the processing or “thinking” happens. They’re switching it up every time, so in a 1:1 ratio. They claim that interleaving dense and super sparse MoE like this yields the same results as a less-sparse MoE throughout, but faster, which makes sense given how successful interleaving attention types has been.
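Here is a rough sketch of what that layer layout could look like in Python. The 78 layers, the 5:1 local-to-global ratio, the 512-token window, and the 1:1 dense-to-MoE ratio are from the talk. Which layer in each group of six is the global one, whether the stack starts on a dense or an MoE layer, and whether the window counts the current token are all my assumptions.

```python
from collections import Counter

NUM_LAYERS = 78
LOCAL_WINDOW = 512

def layer_plan(num_layers=NUM_LAYERS):
    """Return one (attention_type, feed_forward_type) pair per layer."""
    plan = []
    for i in range(num_layers):
        # Assumption: every sixth layer is global, the five before it are local.
        attention = "global" if (i + 1) % 6 == 0 else "local"
        # Assumption: the stack starts dense and alternates with MoE.
        ffn = "moe" if i % 2 == 1 else "dense"
        plan.append((attention, ffn))
    return plan

def can_attend(query_pos, key_pos, attention, window=LOCAL_WINDOW):
    """Causal visibility check for one query/key pair."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    if attention == "global":
        return True   # global layers see back to the first token
    # Assumption: the local window includes the current token.
    return query_pos - key_pos < window

plan = layer_plan()
print(Counter(attention for attention, _ in plan))
print(Counter(ffn for _, ffn in plan))
```

Under these assumptions the plan works out to 65 local and 13 global attention layers, and 39 dense and 39 MoE feed-forward layers.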
Speaking of MoE, on the right side they diagram theirs, which is also a bit weird. So they’re showing their 512 experts, and we know they pick 8 of them each time, but they have this down and up projection thing and these yellow boxes around the experts.
Here’s what’s going on with that: they are compressing the inputs before going to the experts. That’s what the down projector does. But routing, which is where you select experts to activate, is super sensitive. So they split out those functions: the router uses the uncompressed input to pick which experts to use, and the dispatcher relays the compressed input to the selected experts. Then they combine all that info and uncompress it.
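To show the split between router and dispatcher, here is a toy version in plain Python. It is only a sketch of the idea as I described it, not their implementation: the sizes are tiny, each expert is a single small matrix with a tanh, and the gate weights come from a softmax over the selected experts' scores, all of which are my assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

# Toy sizes for readability; the talk cites 512 experts with 8 active.
D_MODEL, D_COMPRESSED, NUM_EXPERTS, TOP_K = 8, 4, 16, 2

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(v - peak) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

router_w = rand_matrix(NUM_EXPERTS, D_MODEL)    # router reads the full-width input
down_proj = rand_matrix(D_COMPRESSED, D_MODEL)  # compresses before dispatch
up_proj = rand_matrix(D_MODEL, D_COMPRESSED)    # uncompresses after combining
experts = [rand_matrix(D_COMPRESSED, D_COMPRESSED) for _ in range(NUM_EXPERTS)]

def moe_layer(x):
    # 1. Route on the uncompressed input.
    scores = matvec(router_w, x)
    chosen = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)[:TOP_K]
    gates = softmax([scores[e] for e in chosen])
    # 2. Dispatch the compressed input to the selected experts only.
    compressed = matvec(down_proj, x)
    combined = [0.0] * D_COMPRESSED
    for gate, e in zip(gates, chosen):
        expert_out = [math.tanh(v) for v in matvec(experts[e], compressed)]
        combined = [c + gate * o for c, o in zip(combined, expert_out)]
    # 3. Combine in the compressed space, then project back up.
    return matvec(up_proj, combined), chosen

token = [random.uniform(-1, 1) for _ in range(D_MODEL)]
output, chosen = moe_layer(token)
print(chosen, len(output))
```

The point to notice is that `scores` is computed from `x` while the experts only ever see `compressed`, so the sensitive routing decision keeps full precision and the expensive expert work runs at the smaller width.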
Here’s a bit about how they settled on their MoE setup.
What they’re showing here is how performance improves on a series of benchmarks given more training and more experts. So the x axis is FLOPs, which is the standard unit of training compute. The y axis is something they call “efficiency gain”, or EG for short, which is basically how much better your experiment is compared to your baseline. And the different colors show different numbers of total experts - they only activate 8 every time.
The upshot is that more total experts improves performance, and training them for longer sometimes improves performance. But more parameters means more cost and more latency, so they compromise on 512.
This one decision per se is not critical for understanding the whole model, but it’s emblematic of how the MAI team approached all of their decisions: empirically, with lots of rigorous optimization experiments. It’s a refreshing paper for its rigor, and also for its openness.
By the way, the weighted average in the bottom-right weights code at 50%, math and STEM at 17.5% each, general at 10%, and multilingual at 5%. So you can really see how much they’re optimizing for code and reasoning compared to everything else.
Worth noting these benchmarks include at least some human-generated data from external vendors - a sign that demand for human data in GenAI isn’t going away.
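Going back to that weighted average for a second, here is the arithmetic as a quick Python sketch. The weights are the ones I just listed. The per-domain scores are invented purely to show the calculation.

```python
# Domain weights as described in the talk.
WEIGHTS = {
    "code": 0.50,
    "math": 0.175,
    "stem": 0.175,
    "general": 0.10,
    "multilingual": 0.05,
}

def weighted_average(per_domain_scores):
    """Combine per-domain scores into one number using the weights above."""
    return sum(WEIGHTS[d] * per_domain_scores[d] for d in WEIGHTS)

# Made-up efficiency gains per domain, for illustration only.
example_gains = {"code": 1.2, "math": 1.1, "stem": 1.1, "general": 1.0, "multilingual": 1.0}

print(weighted_average(example_gains))  # about 1.135
print(WEIGHTS["code"] + WEIGHTS["math"] + WEIGHTS["stem"])  # 85% goes to code, math and STEM
```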
Here’s another example of that optimization loop and rigor.
What they’re showing here is how different mixes of training data impact performance, as measured by NLL on two different validation sets - that’s the x and y axes. Each dot is a different data mix, and each color is a different size of model they trained in order to test the mixes.
They’re only showing a few sizes of model here, so there’s maybe a few dozen dots on here, but they mention that in total they trained several thousand models of between 760M and 4B active parameters.
One particular aspect they pay attention to is how different datasets interact, both within their domain but also across domains.
Like here for example, they’re measuring performance on a graduate-level physics benchmark, which a vendor created for them for this purpose. You’ll see our old friend NLL on the y axis, so lower is better here.
What each of the six graphs shows is how adding the given data type impacts performance. Like before, each dot is a model trained on one specific data mix.
In this example, math and STEM data helped, including “web PDFs” - those are PDFs of scientific papers. General web content didn’t matter, and code actually hurt. I personally was surprised by that, because generally code data helps with math performance, and vice versa, and physics is mostly math. But that’s why you run the tests! Anyway, hopefully you’re getting a sense of how a team ends up training several thousand models in their pursuit of one final model.
Relatedly, even though this much training takes a lot of infrastructure work, we just don’t have time to cover it. But if you want to know all the gnarly bits about managing GPUs and various forms of parallelism, it’s in there.
So after all that fiddling, here’s where they net out on pretraining mix.
Keep in mind this is all naturally generated - none of it is synthetically made for the purpose of training data. They even take great pains to identify and remove suspected AI-generated text from scraped sources. However, given the high share of code and the pervasiveness of coding agents, I suspect a decent share of the data is ultimately from models.
Speaking of code, it’s over half the data! And it gets one of the highest multiples there in the right corner, meaning they replay the data about two times. That’s kind of a proxy for usefulness. So apparently STEM data, books and journals, and especially math data were all very helpful.
30T tokens in pretraining is pretty good by the way. That’s about the same pretraining corpus as DeepSeek V4. To go much higher than that you’d need to start generating a lot of synthetic data expressly for pretraining purposes, which is a gamble and thus was not where MAI wanted to start.
Now after pretraining there’s midtraining, kind of a smaller and more focused round of pretraining where they only take the cream of the crop. It’s also typically where people extend the context window.
Both cases are true here. They filter heavily for quality and rebalance domains a bit, then they extend context out to 256k tokens. That adds another 3.5T tokens or so.
Here’s the actual loss graph from their final training run, just to give you an idea of what researchers actually look at. We have NLL on the y axis, and we see how it pretty quickly flattens out, with very gradual gains for the vast majority of the training run.
Kind of wild to think how much quality improvement and polish comes from this seemingly small drop in loss. Also, if they had more tokens, they probably would have gotten even better results. But as we mentioned, new pretraining tokens are hard to come by these days.
So once pretraining and midtraining are done, you end up with your base model. In case you’re not familiar, base models are basically like autocompletes, they can’t do chat. Chat and other abilities come during post-training, which we’ll look at after this slide.
Base models are actually not that common to release anymore, so the comparison selection is limited, but they did their best here and picked reasonable competition - you’ll see the Kimi and DeepSeek models I mentioned before for example. They also included an earlier version of their model to show how they improved.
As for what they’re testing on, the top-right is code from their infrastructure, while the other three are benchmarks built by external vendors - again, a signal of ongoing demand for vendor-built eval data.
So with their capable pretrained and midtrained model in hand, MAI are going to use reinforcement learning (RL) to post-train three specialist teacher models: SWE & Agentic, STEM, and Helpfulness & Safety. Then those three teachers are going to use SFT to distill down into one single student, which will get another round of RL training to form their final model, MAI-Thinking-1.
We will tackle each aspect in turn, but one thing I want to say here is how common this sort of pipeline is becoming. For folks who joined the DeepSeek V4 paper review, you’ll have seen this before. The question is, why? Why not just keep training your one model?
The biggest reason is actually right there on the diagram: you can train the teachers in parallel. Assuming you have the compute for it, which Microsoft clearly does, it speeds up overall training time to do these separate skills in parallel and then come together at the end.
A couple fun notes about their RL scheme.
First of all, they’re using GRPO basically, kind of the standard RL algorithm, but with one neat trick they call “adaptive entropy control”, shown here. Basically, if they find their model isn’t exploring enough, if the entropy of its next-token probability distribution is getting too low, they flip on this switch that lets the model make bigger moves on each training step. The top graph is entropy, with symbol H, and they’re targeting an average of 0.3. The bottom graph is their switch, which they flip on when H is too low and generally flip off when H is too high.
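Here is one way that switch could work, as a hedged Python sketch. The 0.3 entropy target is from the talk. The low and high thresholds, the use of natural-log entropy, and the idea that "bigger moves" means a looser upper clipping bound on the policy update are all my assumptions, not details I can confirm from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class EntropySwitch:
    """On/off switch with hysteresis around a target entropy of 0.3."""

    def __init__(self, low=0.25, high=0.35):  # hypothetical thresholds
        self.low = low
        self.high = high
        self.on = False

    def update(self, mean_entropy):
        if mean_entropy < self.low:
            self.on = True    # not exploring enough: allow bigger moves
        elif mean_entropy > self.high:
            self.on = False   # exploring plenty: back to normal
        return self.on        # in between, keep the previous state

def clip_upper_bound(switch_on, base=0.2, boosted=0.3):
    """Hypothetical knob: a wider upper clip range while the switch is on."""
    return boosted if switch_on else base

switch = EntropySwitch()
trace = [0.4, 0.2, 0.3, 0.36]  # made-up mean entropies across training steps
states = [switch.update(h) for h in trace]
print(states)  # [False, True, True, False]
print([clip_upper_bound(s) for s in states])
```

The hysteresis matters: with a single threshold the switch would flap on and off whenever entropy hovers near the target.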
Second, they control maximum response length based on how hard a problem is. So if it’s a very easy problem, maybe only an 8k token budget. If it’s a really hard problem, they let the model go up to 128k tokens in its response. That’s like 100k words for one response! They also penalize length in the reward.
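A minimal sketch of that idea in Python, with heavy caveats. The 8k and 128k endpoints are from the talk. Measuring difficulty by pass rate, interpolating linearly between the endpoints, and the shape and weight of the length penalty are all hypothetical choices of mine.

```python
MIN_BUDGET = 8_000     # token budget for the easiest problems
MAX_BUDGET = 128_000   # token budget for the hardest problems

def token_budget(pass_rate):
    """Map an estimated pass rate (1.0 = easy, 0.0 = hard) to a response budget.

    Linear interpolation is an assumption made for illustration.
    """
    difficulty = 1.0 - pass_rate
    return round(MIN_BUDGET + difficulty * (MAX_BUDGET - MIN_BUDGET))

def reward(correct, num_tokens, budget, length_weight=0.1):
    """Correctness reward minus a hypothetical penalty for using up the budget."""
    base = 1.0 if correct else 0.0
    return base - length_weight * min(num_tokens / budget, 1.0)

print(token_budget(1.0))          # 8000
print(token_budget(0.0))          # 128000
print(reward(True, 4_000, 8_000)) # correct answer using half its budget
```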
Now over the course of their RL, they lock in progress every so often by distilling the model onto itself. In other words, they take their somewhat improved model, make SFT training data out of it, and then train the base model on it. That’s what all the stars are, points of self-distillation. And the different colors are different models they distill onto. This lets them lock in progress and work off a clean slate, and also look for potentially bad behaviors in the SFT data that they can filter out - things like language switching or reward hacking.
It requires on the order of 1M SFT examples to effectively distill from the teacher to the student, but they get them “for free” by just keeping the correct rollouts from RL. Although weirdly, they report that even keeping the incorrect rollouts, like the model attempts that don’t get the right answer, still improves the student model in training.
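Turning rollouts into SFT data is simple enough to sketch in a few lines of Python. The `Rollout` fields and the output format are hypothetical, and the `keep_incorrect` flag is just there to mirror the two options I mentioned.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    response: str
    correct: bool  # whether the RL verifier or judge accepted the answer

def to_sft_examples(rollouts, keep_incorrect=False):
    """Reuse RL rollouts as supervised examples, optionally keeping failures."""
    return [
        {"prompt": r.prompt, "completion": r.response}
        for r in rollouts
        if r.correct or keep_incorrect
    ]

rollouts = [
    Rollout("2 + 2 = ?", "4", True),
    Rollout("2 + 2 = ?", "5", False),
    Rollout("Capital of France?", "Paris", True),
]

print(len(to_sft_examples(rollouts)))                       # 2
print(len(to_sft_examples(rollouts, keep_incorrect=True)))  # 3
```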
Now let’s look at each of the three teachers quickly in turn.
This figure is their pipeline for producing QA pairs from natural sources for the STEM teacher’s mix. They also buy QA pairs from vendors, but of course we don’t get any details about the vendor processes.
Anyway, there’s a ton of work represented here, but I wanted to call out the scoring part in particular, since it applies to any QA pairs, even ones built by external vendors, so it could be a hint at their quality control process for purchased data.
I want to call out the Consensus Grading step. Basically, they’re doing a lot of attempts with a good model and seeing what the consensus answer is, then having a judge model compare that consensus answer against the ground truth from the QA pair. To me that sounds workable for most of the difficulty distribution, but for the hardest problems I suspect even a good judge will get it wrong, so I wonder if they’re leaving something on the table here. Or perhaps that tip of the difficulty distribution just isn’t present in their original sources.
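Here is a small Python sketch of consensus grading as I understand it. The majority vote and the comparison against ground truth follow the description above. The normalization and the exact-match stand-in for the judge model are assumptions, and what they do with the result afterwards is not something I'm claiming to know.

```python
from collections import Counter

def normalize(answer):
    return answer.strip().lower()

def consensus_answer(attempts):
    """Return the most common answer and the share of attempts that agree."""
    counts = Counter(normalize(a) for a in attempts)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(attempts)

def exact_match_judge(consensus, ground_truth):
    """Stand-in for the LLM judge: plain string equality after normalizing."""
    return consensus == normalize(ground_truth)

def grade(attempts, ground_truth, judge=exact_match_judge):
    answer, agreement = consensus_answer(attempts)
    return {
        "consensus": answer,
        "agreement": agreement,
        "matches_truth": judge(answer, ground_truth),
    }

# Made-up attempts from a strong model on one QA pair.
attempts = ["42", " 42", "41", "42 "]
print(grade(attempts, "42"))
```

The `agreement` value is where my worry about the hardest problems shows up: when the attempts are split, the consensus itself is shaky before the judge even gets involved.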
This teacher also covers code, which they say they leaned on vendors for more, with 160k total coding problems.
The second teacher is for SWE and general agentic work.
Since they’re doing agents, they need environments, which each consist of a task, a sandbox for the agent to safely work in, an initial state, and rewards at the end. They’re going to decide those rewards with a combination of rules (i.e. verifiable states) and LLM judges (for e.g. task interpretation, helpfulness, and trajectory quality).
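As a data structure, an environment like that might look something like this Python sketch. The four ingredients come from the description above, with the sandbox reduced to a plain state dictionary. The field names, the weighted sum of rule and judge scores, and the 0.7 weight are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Environment:
    task: str                           # what the agent is asked to do
    initial_state: dict                 # stand-in for the sandbox's starting state
    rule_check: Callable[[dict], bool]  # verifiable end-state check
    judge: Callable[[list], float]      # stand-in for an LLM judge, returns 0..1
    rule_weight: float = 0.7            # hypothetical mixing weight

    def reward(self, final_state, trajectory):
        """Blend the rule-based check with the judge's score."""
        rule_score = 1.0 if self.rule_check(final_state) else 0.0
        judge_score = self.judge(trajectory)
        return self.rule_weight * rule_score + (1 - self.rule_weight) * judge_score

env = Environment(
    task="Write the quarterly summary to report.txt",
    initial_state={"files": []},
    rule_check=lambda state: "report.txt" in state["files"],
    judge=lambda trajectory: 0.5,  # fixed score in place of a real judge call
)

print(env.reward({"files": ["report.txt"]}, ["write report.txt"]))  # about 0.85
print(env.reward({"files": []}, ["gave up"]))                       # about 0.15
```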
For the SWE tasks, they harvested from GitHub, filtering 102M PRs all the way down to just 266k, across 94k repos. Those are all the PRs that have fully fleshed-out issues, with all passing tests, and reproducible environments.
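To give a feel for how aggressive that funnel is, here is a tiny Python sketch. The three criteria and the counts are from the talk. The `PullRequest` fields are hypothetical names for those criteria, not anything from their pipeline.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    repo: str
    has_detailed_issue: bool   # linked issue is fully fleshed out
    tests_pass: bool           # all tests pass
    env_reproducible: bool     # environment can be rebuilt

def keep(pr):
    """A PR survives only if it meets all three criteria."""
    return pr.has_detailed_issue and pr.tests_pass and pr.env_reproducible

candidates = [
    PullRequest("org/a", True, True, True),
    PullRequest("org/a", True, False, True),
    PullRequest("org/b", False, True, True),
]
print([pr.repo for pr in candidates if keep(pr)])  # ['org/a']

survival_rate = 266_000 / 102_000_000  # share of PRs that made it through
prs_per_repo = 266_000 / 94_000        # average surviving PRs per repo
print(f"{survival_rate:.2%}, {prs_per_repo:.1f} PRs per repo")
```

So roughly a quarter of a percent of PRs survive, at a bit under three PRs per repo on average.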
For the general tasks, their major goal was getting the model to pick the right tools. So they flooded the zone, offering 50+ API and MCP tools to the model, and even including some prompts that didn’t require any tool use at all. They did a lot of the data generation synthetically, 130k+ tasks across 150+ environments, with diversity across task, environment, personas etc.
The last teacher is Helpfulness & Safety, a real grab bag of more subjective stuff, which is in orange at the top.
I want to call out a couple things here. One is how they view human vs synthetic data. Human data has better complexity, like harder or many-step tasks that actually make sense and aren’t contrived, but synthetic data has better coverage - you can design and program in all the distributions you want your synthetic data generator to touch.
Another is the change in rewarding. As you move from left to right on this table, you get more and more subjective. Like Instruction Following for example is sometimes verifiable by rule even, for example on word count, whereas style is rarely that way. Even within a consistent method of rewarding, namely with LLM judges, the details change: rubrics for Instruction Following, yet a simple 0-2 scale for Style.
Lastly, one interesting trick not on this table: they estimate the distribution of appropriate response lengths based on the prompt, then penalize the model for anything outside of that. Yet another tool to combat overlong responses.
So once you have all three teachers, it’s time to distill onto the student, here labeled “Consolidated Model”.
As before, they’re going to reuse the RL rollouts as SFT, and they’re going to filter for relevant characteristics for each teacher, like correctness for STEM and style heuristics for Helpfulness & Safety.
Also as before, the mix of examples from each teacher is a matter of optimization. They find the share of examples is all that matters, not share of tokens, but they report both and it’s wild to see how disproportionate the mix can get while still being effective.
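To see how share of examples and share of tokens can diverge, here is a Python sketch with invented numbers. The three teacher names are real, but the example counts and average lengths are made up for illustration and are not the paper's figures.

```python
# Hypothetical distillation mix: counts and lengths are invented.
mix = {
    "swe_agentic": {"examples": 100_000, "avg_tokens": 30_000},
    "stem": {"examples": 300_000, "avg_tokens": 10_000},
    "helpfulness_safety": {"examples": 600_000, "avg_tokens": 1_000},
}

def shares(mix):
    """Compute each source's share of examples and share of tokens."""
    total_examples = sum(m["examples"] for m in mix.values())
    total_tokens = sum(m["examples"] * m["avg_tokens"] for m in mix.values())
    return {
        name: {
            "example_share": m["examples"] / total_examples,
            "token_share": m["examples"] * m["avg_tokens"] / total_tokens,
        }
        for name, m in mix.items()
    }

for name, s in shares(mix).items():
    print(f"{name}: {s['example_share']:.0%} of examples, {s['token_share']:.0%} of tokens")
```

In this made-up mix, the short Helpfulness & Safety examples are 60% of the examples but only about 9% of the tokens, which is the kind of lopsidedness I mean.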
After distillation, there’s a final round of RL, mostly polish - a lot of the Helpfulness & Safety data, with a bit of the other two and a bit of long context, just to preserve those abilities.
Now we can get to results.
The MAI model doesn’t win on any benchmarks, but it’s competitive. Similar to how Meta’s Muse Spark was good but not amazing, just to prove they could do it. I wanted to compare benchmarks on the two but there’s basically no overlap, and Muse Spark isn’t available by API so even the benchmark maintainers can’t run it.
MAI picked Sonnet 4.6 as their point of comparison, and on the bottom table that seems to hold up. Their model isn’t out yet though, so I can’t compare vibes, which is increasingly crucial given how hard it is to measure some of these differences.
Finally, since they are good scientists, they do some human evals in addition to running programmatic benchmarks.
They actually call out their vendor, Surge AI, by name - a real sign of respect. Otherwise there’s not a ton new to report here; they use the standard RLHF task setup of several pointwise criteria on a 0-2 scale, then a pairwise rating for overall preference on a 7-point Likert scale, i.e. 1 = “Response A was much better” and 7 = “Response B was much better”.
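For anyone who hasn't worked with that pairwise scale, here is how a batch of ratings might be summarized in Python. The 1-to-7 scale is from the talk. The ratings are invented, and treating 4 as a tie is my assumption about the midpoint.

```python
def summarize(ratings):
    """Summarize 7-point pairwise ratings (1 = A much better, 7 = B much better)."""
    n = len(ratings)
    return {
        "mean": sum(ratings) / n,
        "a_wins": sum(r < 4 for r in ratings) / n,
        "ties": sum(r == 4 for r in ratings) / n,  # assumption: 4 is the neutral midpoint
        "b_wins": sum(r > 4 for r in ratings) / n,
    }

# Made-up ratings from eight comparisons.
ratings = [2, 4, 5, 6, 3, 7, 4, 5]
print(summarize(ratings))
# {'mean': 4.5, 'a_wins': 0.25, 'ties': 0.25, 'b_wins': 0.5}
```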
Again, they’re on par with or a bit better than Sonnet 4.6, generally a bit worse than Opus 4.6. A respectable first showing for MAI.
My Takeaways
This is the MSL playbook: build a good-enough model to prove you can do it
Meta’s Muse Spark isn’t really available outside of Meta products, and it may never be - or at least there’s no rush
MAI is in the same position
I predict MAI will primarily be for Microsoft product surfaces
This is what Amazon and Meta currently do
It’s also the fallback case for Google - they would keep building Gemini even if nobody used the API or subscribed to the chatbot
All the big tech cos are invested in either OpenAI or Anthropic or both anyway
In a way, OpenAI vs Anthropic is a proxy fight for the big tech cos
Microsoft’s demand for data should continue and increase
MAI has plenty of time to prove itself and all the financial runway it needs