Previewing GPT‑5.6 Sol: a next-generation model (openai.com)

1,116 points|by minimaxir|2d ago|733 comments

System card: https://deploymentsafety.openai.com/gpt-5-6-preview

Comments (733)

1. ChrisArchitect|2d ago|context

Pre-official discussions:
https://news.ycombinator.com/item?id=48678789
https://news.ycombinator.com/item?id=48683021
2. rvz|2d ago|context

Other than the worst naming I have ever seen (Sol / Terra / Luna), the pricing is still expensive:
> GPT‑5.6 is priced per 1M tokens across three model sizes:
> Sol is $5 input / $30 output;
> Terra is $2.50 input / $15 output
> Luna is $1 input / $6 output.
The OpenAI casino has never been more ready to take your money on gambling even more tokens.
3. minimaxir|1d ago|context

Note that GPT 5.5 currently is $5 input / $30 output (short context) so Sol is in the same class, while Terra if the benchmarks are as claimed is indeed a half-price GPT 5.5 at comparable performance.
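
To make the price comparison concrete, here is a minimal sketch that prices one workload at the three rates quoted above. The rates are taken from the quote in comment 2; the workload size (2M input tokens, 0.5M output tokens) and the `workload_cost` helper are illustrative assumptions, and cache pricing is ignored.

```python
# Per-1M-token prices in USD as quoted in the thread: (input, output).
PRICES = {
    "Sol": (5.00, 30.00),
    "Terra": (2.50, 15.00),
    "Luna": (1.00, 6.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload, ignoring caching and discounts."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 2M input tokens and 0.5M output tokens.
    for name in PRICES:
        print(f"{name}: ${workload_cost(name, 2_000_000, 500_000):.2f}")
```

For this assumed workload the totals are $25.00 (Sol), $12.50 (Terra), and $5.00 (Luna), so Terra costs exactly half of Sol at any input/output mix.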
4. andrethegiant|1d ago|context

What don't you like about the naming?
5. lwansbrough|1d ago|context

I feel like going with Space + Latin is LLM-level creativity.
Edit: yeah. https://claude.ai/share/06fefe02-4299-44da-8c5a-42607f54ca77
6. arikrahman|1d ago|context

Can't buy cheaper as a selling point when Deepseek is basically free when hitting cache? Unsubsidized too, cloudflare and digital ocean can be the model provider for similar pricing.
7. Stitch4223|1d ago|context

With the $200/month plan I’ve never run into any limits or issues. The product can be used every day for extensive sessions and development. What is everyone doing that makes them talk about tokens versus dollars?
8. minimaxir|1d ago|context

If you've never hit the limits, why not do the $100/mo plan?
9. nsingh2|1d ago|context

From my own experience, and from what's on their checkout page, $100 is 5x base usage and $200 is 20x. If $100 were 10x, then I personally would drop down. They want people to go to the highest tier.
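
The tier arithmetic in this comment can be sketched as a price per unit of base usage. The 5x and 20x multiples are the commenter's own report, not verified figures, and `dollars_per_base_unit` is a hypothetical helper named for illustration.

```python
# Monthly price in USD and usage multiple relative to the base plan,
# as reported in the comment above (unverified).
PLANS = {
    "$100 plan": (100, 5),
    "$200 plan": (200, 20),
}


def dollars_per_base_unit(price: float, multiple: float) -> float:
    """Return what one multiple of base usage costs on a given plan."""
    return price / multiple


if __name__ == "__main__":
    for name, (price, multiple) in PLANS.items():
        print(f"{name}: ${dollars_per_base_unit(price, multiple):.2f} per 1x of base usage")
    # The commenter's hypothetical: a $100 plan that offered 10x usage.
    print(f"hypothetical $100 at 10x: ${dollars_per_base_unit(100, 10):.2f}")
```

Under these numbers the $100 plan costs $20 per unit of base usage and the $200 plan costs $10, which is why a 10x allowance at $100 would make the two tiers equally priced per unit.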
10. aeonik|1d ago|context

You can hit limits with $100 if you use it all day.
You can do it easily if you use it in fast mode.
I bet you could hit the limits of the $200/month using fast mode if you were using multiple sessions at the same time all day on fast mode.
The OpenAI tiers seem pretty well tuned.
I used to use the plus ($20/month), and that was good for a few sessions every once in a while.
But now that I'm using it to configure my network, monitoring, maintenance, I'm using it every day and I'm on the $100 plan. And I do pretty consistently hit the limits, but it's easy to pace myself.
I'm thinking about upgrading to $200/month though. It would be nice not to have to ration it.
11. ai_slop_hater|1d ago|context

I ran out of usage using GPT-5.5 and had to buy a second subscription. I now switched to GPT-5.4 which is basically 2x usage.
12. fph|1d ago|context

But let's put it in perspective: what you're paying them is more than the average salary in many poorer countries.
13. Stitch4223|1d ago|context

Fair. From a business perspective said amount is very reasonable in Europe / USA. For personal use it’s already different. Sometimes the answer is simple, thanks.
14. kingstnap|1d ago|context

Don't forget this.
> For GPT‑5.6 and later models, cache writes are billed at 1.25x the model’s uncached input rate
Charging for cache writes is cringe and literally only Anthropic did it. Anyway this does mean the "real" prices are +25% on top of what you wrote there.
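
As a rough sketch of what the quoted rule implies, the snippet below applies the 1.25x multiplier to the input rates quoted earlier in the thread. It assumes the surcharge applies only to input tokens that are written to the cache; cache-read pricing is not given in the thread, so it is left out. The function names are illustrative.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # from the quoted pricing note

# Uncached input rates in USD per 1M tokens, as quoted earlier in the thread.
INPUT_RATES = {"Sol": 5.00, "Terra": 2.50, "Luna": 1.00}


def cache_write_rate(model: str) -> float:
    """Return the USD per 1M tokens charged for tokens written to the cache."""
    return INPUT_RATES[model] * CACHE_WRITE_MULTIPLIER


def input_cost(model: str, uncached_tokens: int, cache_write_tokens: int) -> float:
    """Return the input-side USD cost; cache reads are out of scope (assumption)."""
    plain = uncached_tokens * INPUT_RATES[model]
    written = cache_write_tokens * cache_write_rate(model)
    return (plain + written) / 1_000_000


if __name__ == "__main__":
    for name in INPUT_RATES:
        print(f"{name}: cache writes at ${cache_write_rate(name):.3f} per 1M tokens")
```

On this reading, 25% is an upper bound on the input-side increase, reached only when every input token is written to the cache; output rates are not touched by the quoted rule.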
15. loufe|1d ago|context

"Next generation model"
If it was the next generation, why isn't it a major version change..?
16. ryangst_1|1d ago|context

LLM devs can't do version control
17. psychoslave|1d ago|context

Semantic is passé, word models moved to the next generation.
18. dominotw|1d ago|context

vibe versioning
19. cruffle_duffle|1d ago|context

To be fair, versioning has always been vibes based.
20. appplication|1d ago|context

Honestly LLMs are the ideal candidate for CalVer. It’s not like there’s any real API so there’s no backwards compatibility to maintain.
Even Apple adopted and standardized on it for their latest platform releases.
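
For readers unfamiliar with CalVer, here is a minimal sketch of a date-based version tag. The `YYYY.0M` layout is just one common CalVer convention chosen for illustration, and `calver_tag` is a hypothetical helper, not anything a model vendor is known to use.

```python
from datetime import date


def calver_tag(release_date: date) -> str:
    """Format a release date as a YYYY.0M calendar version, e.g. 2026.06."""
    return f"{release_date.year}.{release_date.month:02d}"


if __name__ == "__main__":
    # Hypothetical release dates, used only to show ordering.
    releases = [date(2026, 6, 9), date(2025, 11, 3), date(2026, 1, 20)]
    # Zero-padded months make plain string sorting chronological.
    print(sorted(calver_tag(d) for d in releases))
```

The tag carries no compatibility promise at all, which is the commenter's point: it only says when a release shipped.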
21. andy12_|1d ago|context

I think it makes more sense to make it so that major versions are different pretraining runs, and minor versions are simply the same pretraining run that was finetuned to different degrees. But it seems that that isn't cool anymore.
22. Kiro|1d ago|context

LLM versioning is entirely feelings driven. The ideal versioning is probably just names.
23. kaizenite|1d ago|context

Because if it sucks, they can just default to "It was a minor version change anyways"
24. goldenarm|1d ago|context

They could hold the GPT-6 name for the IPO
25. GTP|1d ago|context

Some assume it was to try to slip under the radar and avoid being limited by the government as they did with Fable.
26. therepanic|1d ago|context

By all appearances, they did not succeed in doing so.
27. HarHarVeryFunny|1d ago|context

AFAIK there is no difference between "generation" and "version". Version naming/numbering depends on how good it turns out to be, and competition. If the competition releases something then you need to push something out too.
Calling it 5.6 creates the least possible expectations, and therefore more potential for positive feedback.
The Sol/Terra/Luna naming is interesting. I wonder what Anthropic are considering for their next models? "Terminator", "Armageddon"?
28. wincy|1d ago|context

You gotta check out the new ChatGPT 6.3 Betelgeuse bro
29. rolph|1d ago|context

Heliopause
30. cyral|1d ago|context

If they called it 6.0 and it wasn't AGI, you'd see a lot of complaining here too
31. tasuki|1d ago|context

What is AGI? (I know what the shortcut expands to, I'm curious about your definition. Don't the current models fit?)
32. ChrisLTD|1d ago|context

If it's a new generation why isn't it GPT-6?
33. win311fwg|1d ago|context

It does not introduce incompatibilities with earlier 5.x models? Frontier models are at a point now that there will never be a need for another major version bump, aside from those chasing marketing gimmicks. They are smart enough to adapt.
34. ChrisLTD|1d ago|context

What would it mean to be incompatible with the other 5.x models?
35. paxys|1d ago|context

New request/response schema, new capabilities, or really anything that would break your existing workflows if you changed “5.5” to “5.6” in your application.
There have been many leaps forward in the past - tool calling, reasoning, agentic loops etc. 5.6 doesn’t have any of this. More intelligence doesn’t necessarily warrant a major version bump.
36. jurgenburgen|1d ago|context

Only speaks Klingon
37. peab|1d ago|context

not true. multimodality is still far from being solved
38. malnourish|1d ago|context

A major bump will be warranted if/when we can truly separate prompt from data.
39. win311fwg|1d ago|context

That is a different product line. It may be recorded as a version bump for marketing purposes, as already mentioned, but semantically begins at 0.
40. charcircuit|1d ago|context

Why would incompatibilities have anything to do with a major version bump?
41. alcasa|1d ago|context

They forgot how to do pretraining.
42. cleaning|1d ago|context

5.5 was a new pretraining run.
43. paxys|1d ago|context

Given the expectations everyone has created GPT-6 has to pretty much be AGI.
44. tasuki|1d ago|context

What is your definition of AGI that the current LLMs don't fit?
45. paxys|1d ago|context

As the old saying goes, I’ll know it when I see it. The current 5.x generation isn’t it.
46. gordonhart|1d ago|context

Autonomously Generating Income (which is why it will never be released to the general public)
47. koolala|1d ago|context

Hopefully it stands for AC Generation Improvements. If it prioritizes income it will bleed the planet dry. It needs to solve how expensive our cost is on the planet first or its entire existence was a mistake.
48. ThrowawayTestr|1d ago|context

When it understands why 6 7 is funny
49. isomorphic_duck|1d ago|context

Continual Learning? Why is this even a question? Isn’t it a well-known glaring issue with the current models? They cannot learn/adapt to new skills (in any permanent sense) once they are deployed.
50. FromTheFirstIn|1d ago|context

You’d have to really stretch the definition of AGI to make the current models fit
51. LordDragonfang|1d ago|context

The definition has already been stretched to not fit the previous models. There is no meaningful, static definition that significantly predates current capabilities.
There's a reason why ai xrisk doomers had to come up with the term ASI.
I would seriously suggest that everyone take a look at the wikipedia page for AGI from the month before ChatGPT was released, compare it to the current version, and not come to that conclusion.
https://en.wikipedia.org/w/index.php?title=Artificial_genera...
52. FromTheFirstIn|1d ago|context

The first sentence is “understand or learn any intellectual task that a human can.” Whatever you think of the benefits of LLMs, they don’t understand and they can only learn during the training period and with very minor adjustments in post training. So, no I don’t think any of these models are generally intelligent.
53. LordDragonfang|1d ago|context

> they don’t understand
I have not seen any instance of this frequently-made assertion which is at all justified. It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence (they clearly can comprehend even complex tasks well enough to execute on them, and if you won't call that "understanding", you're playing word games rather than stating an objective fact).
Likewise, agents can literally come to a greater understanding of a problem through trial and error, and there are plenty of mechanisms to retain that knowledge. If you don't want to call that "learning", you're just making a choice to define it in a way more restrictive than how we use it for humans, and intentionally making communication more difficult.
54. mellosouls|1d ago|context

> It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence
"Understanding" has enough philosophical leeway in its use to allow at least the possibility of sentience as a prerequisite.
This is where the discussion about LLM capabilities becomes genuinely difficult, and dismissing that difficulty as "word games" or "spirituality vs evidence" is not helpful.
55. FromTheFirstIn|1d ago|context

Agents are always combining the same underlying weights to their inputs, relying on the same maps of semi-semantic space and the relationships between those that it was leaning towards at training time. The fact that it’s successful in making lots of people have an Eliza effect doesn’t make it understand something. It’s simulating understanding based on an enormous corpus of text, much of which is people working through things or sharing an understanding of something. Unless you believe that all intellectual activity is about finding the space between words you shouldn’t believe LLMs have any chance at understanding anything.
56. knollimar|1d ago|context

The "it's not X it's Y" where Y and X are the same indicates a lack of understanding.
57. mellosouls|1d ago|context

From that same page:
> Various criteria for intelligence have been proposed (most famously the Turing test) but to date, there is no definition that satisfies everyone
58. 0x696C6961|1d ago|context

Always one goalpost away from what we have.
59. UltraSane|1d ago|context

AGI should be able to do every job a human can do using a computer at least as well as the average human.
60. LordDragonfang|1d ago|context

That's already been true for a while, you're overestimating the average human. They just have different failure modes.
61. Davidzheng|1d ago|context

And what is it worse at than an average human today that can be done on a computer?
62. UltraSane|1d ago|context

almost everything? AGI has to be able to completely replace a human in any information worker role indefinitely.
63. virgildotcodes|1d ago|context

I think you're speeding past the word "average" in the sentence. I'd argue that current frontier models already exceed the abilities of average humans across the majority of tasks you can do on a computer, although you might be able to argue that they tend to be a bit slower?
That latter part is debatable though - have you seen a non-technical person try to figure out something new on a computer?
64. UltraSane|1d ago|context

"I'd argue that current frontier models already exceed the abilities of average humans" for things that fit in their context window, sure, but LLMs can't learn over time the way humans can. One example: LLMs are very good at writing a few thousand lines of code, but they absolutely cannot write coherent million-line codebases. By average human I meant the average skill level for the job. AGI would need to be able to pass an interview, get hired, and then perform well enough to not get fired.
65. Davidzheng|1d ago|context

Yeah it's not true that for every job, it is better than median worker of that job. But it is conceivable that for almost all jobs it is already better than the median human (not just workers of that job).
66. isomorphic_duck|1d ago|context

You have to understand that the median human is terrible at (almost) everything. Humans, the only examples of general intelligence we know, are economically valuable precisely because they can train themselves to specialise at a (relatively) narrow task over time. You don’t measure how good a coding model is by how well it programs relative to Doctors, or how well it can prove theorems relative to baristas, or how well it can write coherent novels relative to programmers. That would be a dumb metric.
67. tasuki|21h ago|context

> Humans, the only examples of general intelligence we know
Our intelligence only seems "general" to us, because we're viewing it through our own eyes. Our "intelligence" is specialized to our survival, and we're terrible at most tasks outside that scope.
68. isomorphic_duck|5h ago|context

We operate and think about subjects like Higher Topos Theory, Information Geometry and Algebraic Topology, which are several layers of abstractions removed from anything that can be termed as a skill “specialised to our survival”.
69. Davidzheng|1d ago|context

But in any case, I think more than 10% of information workers today can be replaced by current-generation models indefinitely.
70. ChrisLTD|1d ago|context

It's decent at rote coding tasks, but I haven't seen these things be reliable enough outside of that specific task to make the claim that it can do the work of any information worker.
71. UltraSane|19h ago|context

https://www.linkedin.com/pulse/announcing-aa-briefcase-bench...
AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.
Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%.
73. leumon|1d ago|context

> We plan to make them more broadly available to people using ChatGPT, Codex, and the API soon.
I hope this means Fable will then also get released again.
74. lanthissa|1d ago|context

why would it? if you're the us gov, sam & greg are your good boys giving you 25m
and dario's your naughty boy who you don't agree with politically.
Let 5.6 free, keep fable chained and anthropic instantly sees rev loss and has to cave.
75. osti|1d ago|context

Sol? Looks like OpenAI is jealous of Anthropic's good model naming ability and wants to emulate it.
76. dominotw|1d ago|context

sol has no soul
77. taytus|1d ago|context

It's missing u
78. alcasa|1d ago|context

They should have used Fighter Jet codenames instead. The MiG-15 one has a nice ring to it.
79. arizen|1d ago|context

Sol Goodman
80. MrCheeze|1d ago|context

TBF, they did it first with ada/babbage/curie/davinci. "Sol" is a much weaker branding, though.
81. ddp26|1d ago|context

I'm going to pre-register my prediction that GPT-5.6 Sol is significantly behind Claude Fable 5, as evaluated by general consensus once time has passed for people to get familiar with both.
82. hmate9|1d ago|context

What is this prediction based on?
83. gpm|1d ago|context

I suspect the same just based on their versioning scheme fwiw.
84. jstummbillig|1d ago|context

solid
85. Onavo|1d ago|context

Fable is allegedly a massive model (estimates between 6-10+ trillion, with a few hundred billion active). If 5.6 is just an incremental upgrade over 5.5 (at the same model size) then it won't be able to fully compete with Fable just yet.
86. ddp26|1d ago|context

Based on my conjecture that Anthropic is ahead on AI research, and that OpenAI doesn't know how to make Fable-class models.
87. minimaxir|1d ago|context

I suspect GPT-5.6 Sol will at-the-least be affordable.
88. MostlyStable|1d ago|context

"Affordable" depends on what you need. When a task is able to be achieved by two different calibers of model, it's obviously more cost effective to use the less capable model, in the same way that you wouldn't hire a math PhD to do simple addition.
If what you need is only possible with the more capable model then the "affordability" of the less capable model is sort of irrelevant. If what you need is a novel mathematical proof, it doesn't matter that a high school student is "more affordable". You need the math PhD.
As "old" models get more and more capable, it's going to be an increasingly important skill to be able to adequately recognize when a task requires a frontier model and when it doesn't, so that the less capable (and therefore cheaper) model can be used.
89. Y_Y|1d ago|context

Affordable? I'd settle for available.
90. dimgl|1d ago|context

why
91. chanbam|1d ago|context

Because he likes attention and wants to feel special
92. CuriouslyC|1d ago|context

Claude will win on "vibes" and it'll be close in coding but considering how incremental Fable is above 5.5 in terms of overall smarts, there's no way 5.6 isn't considerably smarter on the whole.
93. simianwords|1d ago|context

I’m countering this prediction by stating that Fable and Sol will be somewhat similar - this has always been the trend and I see no reason why this should stop now.
94. HarHarVeryFunny|1d ago|context

OpenAI may have a model in the works that is similar next-gen size and architecture to Fable, but this isn't necessarily it. I'd guess that 5.6 was more of a hasty reaction to Mythos - same base model (same size, same price) as 5.5 but with additional post-training to make it more competitive with Mythos/Fable in some benchmarks.
Mythos/Fable is supposedly next generation in size vs Opus, and is rumored to have some architectural innovation in terms of dynamic routing/compute, possibly only fully enabled with Fable which at $10/50 is still twice the price of Sol 5.6's $5/30, but a big reduction from Mythos preview which had been an astronomical $30/150 possibly due to the dynamic routing not yet having been enabled.
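
The "twice the price" comparison can be checked with a small sketch. The figures below are the per-1M-token prices as stated in this comment (unverified), and `price_ratio` is a hypothetical helper.

```python
# (input, output) prices in USD per 1M tokens, as stated in the comment above.
PRICES = {
    "Sol 5.6": (5, 30),
    "Fable": (10, 50),
    "Mythos preview": (30, 150),
}


def price_ratio(model: str, baseline: str) -> tuple[float, float]:
    """Return (input ratio, output ratio) of `model` relative to `baseline`."""
    model_in, model_out = PRICES[model]
    base_in, base_out = PRICES[baseline]
    return model_in / base_in, model_out / base_out


if __name__ == "__main__":
    for model, baseline in [("Fable", "Sol 5.6"), ("Mythos preview", "Fable")]:
        in_ratio, out_ratio = price_ratio(model, baseline)
        print(f"{model} vs {baseline}: {in_ratio:.2f}x input, {out_ratio:.2f}x output")
```

With these figures Fable is 2x Sol on input but about 1.67x on output, so "twice the price" holds exactly only for input-heavy workloads, while the Mythos preview is 3x Fable on both sides.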
95. ddp26|1d ago|context

Is this the trend? There have been various points where one of Anthropic or OpenAI was substantially ahead. Sure, many times they're close, but now doesn't seem like one of them.
96. nharada|1d ago|context

Is that the correct comparison? Fable is twice the price
97. ddp26|1d ago|context

Fair point.
98. mccoyb|1d ago|context

When will GPT-5.6 Protomolecule drop? Me and the boys on Eros can't wait to get our hands on it!
99. slopinthebag|1d ago|context

I'm excited for GPT-5.7 Pneumonoultramicroscopicsilicovolcanoconiosis, hope they drop it soon
100. dodslaser|1d ago|context

GPT-5.8 Llanfairpwllgwyngyll
101. w4yai|1d ago|context

You mean Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch ?
102. derwiki|1d ago|context

… do you folks listen to Soft Skills Engineering? This has been a running joke on that podcast for a while
103. wasting_time|1d ago|context

What is happening. I feel like I'm getting an aneurysm reading these comments.
104. Schiendelman|1d ago|context

It's the name of a place in Wales, which has made it a running joke for decades!
105. da_grift_shift|1d ago|context

For me, it's GPT-5.9 Year of the Whisper-Quiet Maytag Dishmaster
106. slopinthebag|1d ago|context

I think Aramco GPT Coca Cola 6.0 will be a step change.
107. baq|1d ago|context

Musk steals Dario and they both train Epic on Mars. US Space Force promptly finds oil on Mars and launches an armada in the next window. In the meantime rocks painted black drop on Mar-a-Lago.
108. Schiendelman|1d ago|context

Oh man, here inside Ganymede I'm way more excited about the GPT-5.7 Io experiment! Hopefully it won't blow up in our faces!
109. static_motion|1d ago|context

Beltalowda!
110. bijowo1676|1d ago|context

Waiting for @simonw to report on this, before I read and try it
111. claudeIsDown|1d ago|context

I would love to see a more descriptive review from simonw instead of just SVG generations.
112. simonw|1d ago|context

I try! https://simonwillison.net/2026/Jun/9/claude-fable-5/ and https://simonwillison.net/2026/Jun/11/fable-is-relentlessly-...
113. lossolo|1d ago|context

He is not an ML researcher or engineer, he is a passionate AI enthusiast blogger. He mostly does SVGs and other low effort checks (sometimes with major flaws, as people have pointed out a few times in the HN comments). Properly evaluating the model across all fronts requires a deep understanding of LLMs, how they work, the trade offs behind new architectures and the relevant research papers. It also takes a lot of time to build a proper evaluation framework so basically you can't just vibe code that if you want something that is solid.
114. HPMOR|1d ago|context

He created Django, what do you mean he's not an engineer? Also 'low-effort??' his posts are extremely in-depth, clearly very thought through with a significant amount of time and energy. Additionally he does perform multifaceted checks across LLMs in many of his other blog posts.
115. lossolo|1d ago|context

> He created Django, what do you mean he's not an engineer?
I specifically said that he is not an ML engineer (emphasis on ML), so I'm not sure what Python web frameworks have to do with anything.
> Also 'low-effort??' his posts are extremely in-depth, clearly very thought through with a significant amount of time and energy
And yes, low effort. Pelican was low effort, his Fable test was low effort, his HN filter etc. Read the discussion in the comments under the Fable test, it's not just my opinion. There was also another example a few months ago. You can search for it, I don't keep track of these things.
I discussed this with him directly after he called himself an "ML expert" in comments.
This is a classic case of the Gell Mann amnesia effect. I read ML papers and work with ML, but to people outside the industry, his writing can look "extremely in-depth" even though it really isn't. People I work with have the same opinion.
> clearly very thought through with a significant amount of time and energy. Additionally he does perform multifaceted checks across LLMs in many of his other blog posts.
I have never seen an article by him about any model that I would describe that way.
And the most revealing sign that he is not an expert is the type of questions he asks and the mistakes he sometimes makes in the comments here. They show why he is not capable of doing any technically in depth evaluation (at least with his current knowledge level).
If you actually want to learn something as a layperson, read articles written by ML PhDs like Sebastian Raschka or watch Stephen from Welch Labs etc. that are directed at general audience.
116. algoth1|1d ago|context

We at HN: https://xkcd.com/2501/ to basically say that I think you might be considering low-effort what’s actually an attempt at simplifying - which is arguably higher effort
117. lossolo|1d ago|context

> you might be considering low-effort what’s actually an attempt at simplifying - which is arguably higher effort
I'm not saying that simplifying complex topics is low-effort, good simplification can obviously require a lot of work and I fully agree here.
What I meant is more that some of these tests feel methodologically sloppy: they are too shallow, miss important technical context, do not control for enough variables, etc., yet the conclusions are sometimes presented, let's just say, too strongly, as I don't want to be too harsh.
118. algoth1|1d ago|context

Oh, I see. That’s entirely correct. I think the pelican test is more of a meme at this point, similar to Ethan’s Otter on an airplane for video models
119. shwaj|1d ago|context

> ML researcher or engineer
The charitable reading is that they meant “ML researcher or ML engineer” with the latter meaning, I guess, an engineer who works on developing LLMs not just using them.
120. lossolo|1d ago|context

Yes, thank you.