Been testing these via their "pool" agent. It's fast, and the agent adheres to the ACP spec pretty well (better than codex, opencode, etc.), so it's a good experience in Zed.
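For context, ACP here is the Agent Client Protocol: JSON-RPC 2.0 exchanged as newline-delimited messages over the agent's stdio. A rough Python sketch of the editor side of the handshake; the "pool --acp" command line is made up and the method/field names are from my reading of the spec, so treat it as illustrative rather than authoritative:

    import json, subprocess

    # Spawn the agent; ACP messages flow over its stdin/stdout.
    # The command line here is hypothetical.
    agent = subprocess.Popen(["pool", "--acp"], stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, text=True)

    def send(msg):
        # One JSON-RPC 2.0 message per line.
        agent.stdin.write(json.dumps(msg) + "\n")
        agent.stdin.flush()

    # Handshake: advertise protocol version and client capabilities,
    # then read the agent's capabilities back.
    send({"jsonrpc": "2.0", "id": 1, "method": "initialize",
          "params": {"protocolVersion": 1,
                     "clientCapabilities": {"fs": {"readTextFile": True}}}})
    print(json.loads(agent.stdout.readline()))

This exchange is roughly what "adheres to the spec" gets tested on: whether the agent answers these messages the way the editor expects.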
Judging just by the (lack of) inter-model variance, I don't think SWEBench-Pro does a very good job of representing model capability. Terminal-Bench seems more challenging and separates the wheat from the chaff.
Also, *ops work, which in my experience can actually be more complicated than SWE, is obviously underrepresented there.
Very cool to see more small open models being worked on!
One nit: I've seen on this homepage, and many others, this notion that the people behind the models are "working towards AGI".
I get that this is marketing speak, but transformers are not AGI, and they will never be AGI, so it'd be great if people stopped saying that as it sort of wears out the meaning of "working towards AGI".
> Transformers have approximate knowledge of many things. Is this not 'general'? Where is the goalpost here?
Of course not. That's like saying the Encyclopedia Britannica is AGI.
> What does AGI mean to you?
I would define AGI as human-like machine intelligence (or superior).
This is difficult for some people to understand because they don't understand what "human-like" means in the first place. Neuroscientists would be able to set some of these wayward computer scientists straight on this question.
But is that a hard requirement? Can a machine have Rat-like intelligence? Is all intelligence human-like (human-centric-mind-blindness-much?)?
> Of course not. That's like saying the Encyclopedia Britannica is AGI.
Well, I'd classify that as GK, general knowledge. Not artificial or intelligent.
Let's consider a definition of intelligence as the act of 'manipulating data'. Have you a better general definition of intelligence?
Yes.
> Can a machine have Rat-like intelligence?
Yes, and that would be closer to AGI than today's LLMs, because the fundamental principles and architecture are there.
What makes you think that? That there are no other patterns of intelligence?
I can see how that would be implied by my comments, so you're right to question that.
The principles that are found in the brain are what give the qualification of "AGI", not the brain itself, so it's possible there are other architectures that would qualify.
A few observations on LLMs that give the game away:
- They require releases. You get a single binary blob and that blob is forever stuck at its so-called "intelligence" level. It never learns anything new.
- They're stuck approaching the limit of human intelligence. This is because the technique cannot exceed human intelligence. I realize that OpenAI has made claims to the contrary, saying things like "oh our model found out some proof that was never proven before". This doesn't count; it's a side effect of training on the Internet. In fact, that proof probably did exist (in pieces) somewhere on the Internet; it just wasn't widely publicized.
So, you'll know it's AGI when you no longer see companies releasing new models. AGI won't require new models because the architecture will be what matters: whatever model you have will be constantly updating itself in real time, just like the human brain does (and every other brain).
And, you'll start to see the AIs actually outsmarting the smartest humans on the planet in every subject.
Agreed. The widespread anthropomorphizing is getting so tiring.
I blame it on the big companies in the space, but seeing intelligent folks regularly attributing intelligence to a complex autocomplete system is disappointing.
> but transformers are not AGI, and they will never be AGI
Like the claim "transformers are AGI", this needs proof, otherwise should be prefixed "I think". And honestly, positive proof is easier than negative proof (you just need to make one transformer model that is a AGI, whereas the never claim requires you to enumerated all possibilities).
That's like saying we should wait for positive proof of AGI from combustion engines. That'll never happen, no matter how much you tweak the engine. It's just not possible.
The negative proof is there in the definition itself. Transformers are not AGI, they're frozen human intelligence of the autocomplete variety. That can never be AGI and anyone who says otherwise doesn't understand transformers or AGI.
I like their honesty in benchmarks; looks like Qwen3.6 35B is outperforming their Laguna M.1 225B model.
Probably a testament to how good Qwen3.6 is, considering Qwen3.6-35B-A3B is not only ahead of their similar-weight-class XS.2 but also ahead of their M.1 (close to 10x bigger at 225B-A23B).
Interestingly, Gemma 4 26B-A4B and Qwen3.6 27B (dense) have been left out of the comparison.
The smaller models are becoming very good, and quantization techniques like importance weighting and TurboQuant on model weights let you run aggressively quantized versions (IQ2, TQ3_4S) on consumer hardware with surprisingly little loss in perplexity and quality.
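To make the importance-weighting part concrete: the idea is to weight each weight's rounding error by how much it matters (e.g., its average squared activation on calibration data) when picking the quantization scale. A toy numpy sketch of that core trick; this is my own illustration of the general technique, not the actual IQ2/TurboQuant code:

    import numpy as np

    def quantize_group(w, imp, bits=2, n_scales=64):
        # Importance-weighted round-to-nearest for one group of weights
        # sharing a single scale. imp = per-weight importance, e.g. the
        # mean squared activation observed on calibration data.
        qmax = 2 ** (bits - 1) - 1                   # symmetric integer range
        best_err, best = np.inf, None
        for s in np.linspace(0.5, 1.5, n_scales):    # search around the naive scale
            scale = s * np.max(np.abs(w)) / qmax
            q = np.clip(np.round(w / scale), -qmax - 1, qmax)
            err = np.sum(imp * (w - q * scale) ** 2)  # weighted reconstruction error
            if err < best_err:
                best_err, best = err, (q.astype(np.int8), scale)
        return best

    rng = np.random.default_rng(0)
    w = rng.normal(size=32).astype(np.float32)       # one 32-weight group
    imp = rng.uniform(0.1, 1.0, size=32)             # fake calibration importances
    q, scale = quantize_group(w, imp)
    print("weighted err:", np.sum(imp * (w - q * scale) ** 2))

llama.cpp's imatrix workflow is roughly this idea at scale: collect activation statistics on a calibration set first, then let the quantizer spend its few bits where the importances are high.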
The fact they're shipping the actual agent harness alongside the weights is the part that matters. Most labs dump the model and make you figure out the agent layer yourself. If it's the same runtime they use for RL training, it's actually been exercised in production rather than being some demo wrapper.
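For anyone who hasn't poked at one: the harness is mostly the tool loop around the model, plus all the unglamorous edge handling. A bare-bones sketch of the shape, with llm and tools as hypothetical stand-ins (a real harness adds sandboxing, output truncation, retries, and context management):

    import json

    def run_agent(llm, tools, task, max_steps=20):
        # llm(messages) -> assistant message dict, possibly containing a
        # tool_call; tools = {name: callable}. Both are stand-ins here.
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = llm(messages)
            messages.append(reply)
            call = reply.get("tool_call")
            if call is None:                  # no tool requested -> final answer
                return reply["content"]
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        return "step budget exhausted"

The RL point is exactly this loop: if the training trajectories were produced by the same code path, then the tool-call formats, truncation behaviour, and error messages the model saw in training match what it sees at inference.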
I usually score pretty well on colour perception tests, but distinguishing between those two purples made me doubt myself.
Very exciting times for local LLMs.