NewsLab
Jun 28 17:12 UTC

Previewing GPT‑5.6 Sol: a next-generation model (openai.com)

1,116 points|by minimaxir|733 comments
System card: https://deploymentsafety.openai.com/gpt-5-6-preview

Comments (733)

  1. 1. ChrisArchitect||context
  2. 2. rvz||context
    Other than the worst naming I have ever seen (Sol / Terra / Luna), the pricing is still expensive:

    > GPT‑5.6 is priced per 1M tokens across three model sizes:

    > Sol is $5 input / $30 output;

    > Terra is $2.50 input / $15 output

    > Luna is $1 input / $6 output.

    The OpenAI casino has never been more ready to take your money on gambling even more tokens.

  3. 3. minimaxir||context
    Note that GPT 5.5 is currently $5 input / $30 output (short context), so Sol is in the same class, while Terra, if the benchmarks are as claimed, is indeed a half-price GPT 5.5 at comparable performance.
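
As a rough illustration of the per-1M-token prices quoted above, here is a minimal sketch that prices a single request on each tier. The rates are copied from the quoted announcement; the function name `request_cost` and the example workload are hypothetical, and cache pricing or long-context rates are not modeled.

```python
# Prices in USD per 1M tokens (input, output), copied from the quoted announcement.
PRICES_PER_MTOK = {
    "Sol": (5.00, 30.00),
    "Terra": (2.50, 15.00),
    "Luna": (1.00, 6.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-1M-token rates."""
    input_rate, output_rate = PRICES_PER_MTOK[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical workload (an assumption, not from the thread):
# 200k input tokens and 20k output tokens per request.
for model in PRICES_PER_MTOK:
    print(f"{model}: ${request_cost(model, 200_000, 20_000):.2f}")
```

Because Terra's input and output rates are both exactly half of Sol's, any workload costs half as much on Terra regardless of its input/output mix.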
  4. 4. andrethegiant||context
    What don't you like about the naming?
  5. 5. lwansbrough||context
    I feel like going with Space + Latin is LLM-level creativity.

    Edit: yeah. https://claude.ai/share/06fefe02-4299-44da-8c5a-42607f54ca77

  6. 6. arikrahman||context
    Can "cheaper" really be a selling point when DeepSeek is basically free on cache hits? And that's unsubsidized too: Cloudflare and DigitalOcean can be the model provider at similar pricing.
  7. 7. Stitch4223||context
    With the $200/month plan I’ve never run into any limits or issues. The product can be used every day for extensive sessions and development. What is everyone doing that makes them talk about tokens versus dollars?
  8. 8. minimaxir||context
    If you've never hit the limits, why not do the $100/mo plan?
  9. 9. nsingh2||context
    From my own experience and what's on their checkout page, $100 is 5x base usage and $200 is 20x. If $100 were 10x, I personally would drop down. They want people to go to the highest tier.
  10. 10. aeonik||context
    You can hit limits with $100 if you use it all day.

    You can do it easily if you use it in fast mode.

    I bet you could hit the limits of the $200/month plan too if you ran multiple sessions at the same time in fast mode all day.

    The OpenAI tiers seem pretty well tuned.

    I used to use Plus ($20/month), and that was good for a few sessions every once in a while.

    But now that I'm using it to configure my network, monitoring, maintenance, I'm using it every day and I'm on the $100 plan. And I do pretty consistently hit the limits, but it's easy to pace myself.

    I'm thinking about upgrading to $200/month though. It would be nice not to have to ration it.

  11. 11. ai_slop_hater||context
    I ran out of usage using GPT-5.5 and had to buy a second subscription. I've now switched to GPT-5.4, which is basically 2x usage.
  12. 12. fph||context
    But let's put it in perspective: what you're paying them is more than the average salary in many poorer countries.
  13. 13. Stitch4223||context
    Fair. From a business perspective said amount is very reasonable in Europe / USA. For personal use it’s already different. Sometimes the answer is simple, thanks.
  14. 14. kingstnap||context
    Don't forget this.

    > For GPT‑5.6 and later models, cache writes are billed at 1.25x the model’s uncached input rate

    Charging for cache writes is cringe and literally only Anthropic did it. Anyway this does mean the "real" prices are +25% on top of what you wrote there.
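
To make the effect of the quoted 1.25x cache-write rate concrete, here is a minimal sketch. It assumes the multiplier applies only to input tokens written to the cache, and that output tokens and other input tokens are billed at the normal rates. Cache-read pricing is not given in the thread and is not modeled; the function names and the example token counts are hypothetical.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # from the quoted pricing note


def cost_usd(input_rate, output_rate, cache_write_tokens,
             other_input_tokens, output_tokens,
             write_multiplier=CACHE_WRITE_MULTIPLIER):
    """Cost of one request; rates are USD per 1M tokens.

    Assumption: only tokens written to the cache carry the multiplier.
    """
    cache_write_cost = cache_write_tokens * input_rate * write_multiplier
    other_input_cost = other_input_tokens * input_rate
    output_cost = output_tokens * output_rate
    return (cache_write_cost + other_input_cost + output_cost) / 1_000_000


def surcharge_fraction(input_rate, output_rate, cache_write_tokens,
                       other_input_tokens, output_tokens):
    """Relative increase versus billing every input token at the plain rate."""
    with_writes = cost_usd(input_rate, output_rate, cache_write_tokens,
                           other_input_tokens, output_tokens)
    baseline = cost_usd(input_rate, output_rate, cache_write_tokens,
                        other_input_tokens, output_tokens,
                        write_multiplier=1.0)
    return with_writes / baseline - 1.0


# Hypothetical request at the quoted Sol rates ($5 input / $30 output):
# 100k input tokens, all written to the cache, and 10k output tokens.
print(f"{surcharge_fraction(5.0, 30.0, 100_000, 0, 10_000):.1%}")
```

Under these assumptions, +25% is an upper bound that is reached only when a request consists entirely of cache-written input; any output tokens or non-cached input dilute the surcharge.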

  15. 15. loufe||context
    "Next generation model"

    If it was the next generation, why isn't it a major version change..?

  16. 16. ryangst_1||context
    LLM devs can't do version control
  17. 17. psychoslave||context
    Semantic is passé, word models moved to the next generation.
  18. 18. dominotw||context
    vibe versioning
  19. 19. cruffle_duffle||context
    To be fair, versioning has always been vibes based.
  20. 20. appplication||context
    Honestly LLMs are the ideal candidate for CalVer. It’s not like there’s any real API so there’s no backwards compatibility to maintain.

    Even Apple adopted and standardized on it for their latest platform releases.

  21. 21. andy12_||context
    I think it makes more sense to make it so that major versions are different pretraining runs, and minor versions are simply the same pretraining run that was finetuned to different degrees. But it seems that that isn't cool anymore.
  22. 22. Kiro||context
    LLM versioning is entirely feelings driven. The ideal versioning is probably just names.
  23. 23. kaizenite||context
    Because if it sucks, they can just default to "It was a minor version change anyways"
  24. 24. goldenarm||context
    They could hold the GPT-6 name for the IPO
  25. 25. GTP||context
    Some assume it was to try to slip under the radar and avoid being limited by the government as they did with Fable.
  26. 26. therepanic||context
    By all appearances, they did not succeed in doing so.
  27. 27. HarHarVeryFunny||context
    AFAIK there is no difference between "generation" and "version". Version naming/numbering depends on how good it turns out to be, and competition. If the competition releases something then you need to push something out too.

    Calling it 5.6 creates the least possible expectations, and therefore more potential for positive feedback.

    The Sol/Terra/Luna naming is interesting. I wonder what Anthropic are considering for their next models? "Terminator", "Armageddon"?

  28. 28. wincy||context
    You gotta check out the new ChatGPT 6.3 Betelgeuse bro
  29. 29. rolph||context
    Heliopause
  30. 30. cyral||context
    If they called it 6.0 and it wasn't AGI, you'd see a lot of complaining here too
  31. 31. tasuki||context
    What is AGI? (I know what the shortcut expands to, I'm curious about your definition. Don't the current models fit?)
  32. 32. ChrisLTD||context
    If it's a new generation why isn't it GPT-6?
  33. 33. win311fwg||context
    It does not introduce incompatibilities with earlier 5.x models? Frontier models are at a point now that there will never be a need for another major version bump, aside from those chasing marketing gimmicks. They are smart enough to adapt.
  34. 34. ChrisLTD||context
    What would it mean to be incompatible with the other 5.x models?
  35. 35. paxys||context
    New request/response schema, new capabilities, or really anything that would break your existing workflows if you changed “5.5” to “5.6” in your application.

    There have been many leaps forward in the past - tool calling, reasoning, agentic loops etc. 5.6 doesn’t have any of this. More intelligence doesn’t necessarily warrant a major version bump.

  36. 36. jurgenburgen||context
    Only speaks Klingon
  37. 37. peab||context
    not true. multimodality is still far from being solved
  38. 38. malnourish||context
    A major bump will be warranted if/when we can truly separate prompt from data.
  39. 39. win311fwg||context
    That is a different product line. It may be recorded as a version bump for marketing purposes, as already mentioned, but semantically begins at 0.
  40. 40. charcircuit||context
    Why would incompatibilities have anything to do with a major version bump?
  41. 41. alcasa||context
    They forgot how to do pretraining.
  42. 42. cleaning||context
    5.5 was a new pretraining run.
  43. 43. paxys||context
    Given the expectations everyone has created GPT-6 has to pretty much be AGI.
  44. 44. tasuki||context
    What is your definition of AGI that the current LLMs don't fit?
  45. 45. paxys||context
    As the old saying goes, I’ll know it when I see it. The current 5.x generation isn’t it.
  46. 46. gordonhart||context
    Autonomously Generating Income (which is why it will never be released to the general public)
  47. 47. koolala||context
    Hopefully it stands for AC Generation Improvements. If it prioritizes income it will bleed the planet dry. It needs to solve how expensive our cost is on the planet first or its entire existence was a mistake.
  48. 48. ThrowawayTestr||context
    When it understands why 6 7 is funny
  49. 49. isomorphic_duck||context
    Continual Learning? Why is this even a question? Isn’t it a well-known glaring issue with the current models? They cannot learn/adapt to new skills (in any permanent sense) once they are deployed.
  50. 50. FromTheFirstIn||context
    You’d have to really stretch the definition of AGI to make the current models fit
  51. 51. LordDragonfang||context
    The definition has already been stretched to not fit the previous models. There is no meaningful, static definition that significantly predates current capabilities.

    There's a reason why ai xrisk doomers had to come up with the term ASI.

    I would seriously suggest that everyone take a look at the wikipedia page for AGI from the month before ChatGPT was released, compare it to the current version, and not come to that conclusion.

    https://en.wikipedia.org/w/index.php?title=Artificial_genera...

  52. 52. FromTheFirstIn||context
    The first sentence is “understand or learn any intellectual task that a human can.” Whatever you think of the benefits of LLMs, they don’t understand and they can only learn during the training period and with very minor adjustments in post training. So, no I don’t think any of these models are generally intelligent.
  53. 53. LordDragonfang||context
    > they don’t understand

    I have not seen any instance of this frequently-made assertion which is at all justified. It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence (they clearly can comprehend even complex tasks well enough to execute on them, and if you won't call that "understanding", you're playing word games rather than stating an objective fact).

    Likewise, agents can literally come to a greater understanding of a problem through trial and error, and there are plenty of mechanisms to retain that knowledge. If you don't want to call that "learning", you're just making a choice to define it in a way more restrictive than how we use it for humans, and intentionally making communication more difficult.

  54. 54. mellosouls||context
    > It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence

    "Understanding" has enough philosophical leeway in its use to allow at least the possibility of sentience as a prerequisite.

    This is where the discussion about LLM capabilities becomes genuinely difficult, and dismissing that difficulty as "word games" or "spirituality vs evidence" is not helpful.

  55. 55. FromTheFirstIn||context
    Agents are always combining the same underlying weights to their inputs, relying on the same maps of semi-semantic space and the relationships between those that it was leaning towards at training time. The fact that it’s successful in making lots of people have an Eliza effect doesn’t make it understand something. It’s simulating understanding based on an enormous corpus of text, much of which is people working through things or sharing an understanding of something. Unless you believe that all intellectual activity is about finding the space between words you shouldn’t believe LLMs have any chance at understanding anything.
  56. 56. knollimar||context
    The "it's not X it's Y" where Y and X are the same indicates a lack of understanding.
  57. 57. mellosouls||context
    From that same page:

    > Various criteria for intelligence have been proposed (most famously the Turing test) but to date, there is no definition that satisfies everyone

  58. 58. 0x696C6961||context
    Always one goalpost away from what we have.
  59. 59. UltraSane||context
    AGI should be able to do every job a human can do using a computer at least as well as the average human.
  60. 60. LordDragonfang||context
    That's already been true for a while, you're overestimating the average human. They just have different failure modes.
  61. 61. Davidzheng||context
    And what is it worse at than an average human today that can be done on a computer?
  62. 62. UltraSane||context
    almost everything? AGI has to be able to completely replace a human in any information worker role indefinitely.
  63. 63. virgildotcodes||context
    I think you're speeding past the word "average" in the sentence. I'd argue that current frontier models already exceed the abilities of average humans across the majority of tasks you can do on a computer, although you might be able to argue that they tend to be a bit slower?

    That latter part is debatable though - have you seen a non-technical person try to figure out something new on a computer?

  64. 64. UltraSane||context
    "I'd argue that current frontier models already exceed the abilities of average humans": for things that fit in their context window, sure, but LLMs can't learn over time the way humans can. One example: LLMs are very good at writing a few thousand lines of code, but they absolutely cannot write coherent million-line codebases. By average human I meant the average skill level for the job. AGI would need to be able to pass an interview, get hired, and then perform well enough to not get fired.
  65. 65. Davidzheng||context
    Yeah it's not true that for every job, it is better than median worker of that job. But it is conceivable that for almost all jobs it is already better than the median human (not just workers of that job).
  66. 66. isomorphic_duck||context
    You have to understand that the median human is terrible at (almost) everything. Humans, the only examples of general intelligence we know, are economically valuable precisely because they can train themselves to specialise at a (relatively) narrow task over time. You don’t measure how good a coding model is by how well it programs relative to Doctors, or how well it can prove theorems relative to baristas, or how well it can write coherent novels relative to programmers. That would be a dumb metric.
  67. 67. tasuki||context
    > Humans, the only examples of general intelligence we know

    Our intelligence only seems "general" to us, because we're viewing it through our own eyes. Our "intelligence" is specialized to our survival, and we're terrible at most tasks outside that scope.

  68. 68. isomorphic_duck||context
    We operate and think about subjects like Higher Topos Theory, Information Geometry and Algebraic Topology, which are several layers of abstractions removed from anything that can be termed as a skill “specialised to our survival”.
  69. 69. Davidzheng||context
    But in any case, I think more than 10% of information workers today can be replaced by current-generation models indefinitely.
  70. 70. ChrisLTD||context
    It's decent at rote coding tasks, but I haven't seen these things be reliable enough outside of that specific task to make the claim that it can do the work of any information worker.
  71. 71. UltraSane||context
    https://www.linkedin.com/pulse/announcing-aa-briefcase-bench...

    AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.

    Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%.

  73. 73. leumon||context
    > We plan to make them more broadly available to people using ChatGPT, Codex, and the API soon.

    I hope this means Fable will also get released again.

  74. 74. lanthissa||context
    Why would it? If you're the US government, Sam and Greg are your good boys giving you 25M,

    and Dario is your naughty boy who you don't agree with politically.

    Let 5.6 free, keep Fable chained, and Anthropic instantly sees revenue loss and has to cave.

  75. 75. osti||context
    Sol? Looks like OpenAI is jealous of Anthropic's good model naming ability and wants to emulate it.
  76. 76. dominotw||context
    sol has no soul
  77. 77. taytus||context
    It's missing u
  78. 78. alcasa||context
    They should have used fighter jet codenames instead. The MiG-15 one has a nice ring to it.
  79. 79. arizen||context
    Sol Goodman
  80. 80. MrCheeze||context
    TBF, they did it first with ada/babbage/curie/davinci. "Sol" is a much weaker branding, though.
  81. 81. ddp26||context
    I'm going to pre-register my prediction that GPT-5.6 Sol is significantly behind Claude Fable 5, as evaluated by general consensus once time has passed for people to get familiar with both.
  82. 82. hmate9||context
    What is this prediction based on?
  83. 83. gpm||context
    I suspect the same just based on their versioning scheme fwiw.
  84. 84. jstummbillig||context
    solid
  85. 85. Onavo||context
    Fable is allegedly a massive model (estimates range from 6 to 10+ trillion parameters, with a few hundred billion active). If 5.6 is just an incremental upgrade over 5.5 (at the same model size) then it won't be able to fully compete with Fable just yet.
  86. 86. ddp26||context
    Based on my conjecture that Anthropic is ahead on AI research, and that OpenAI doesn't know how to make Fable-class models.
  87. 87. minimaxir||context
    I suspect GPT-5.6 Sol will at the least be affordable.
  88. 88. MostlyStable||context
    "Affordable" depends on what you need. When a task is able to be achieved by two different calibers of model, it's obviously more cost effective to use the less capable model, in the same way that you wouldn't hire a math PhD to do simple addition.

    If what you need is only possible with the more capable model then the "affordability" of the less capable model is sort of irrelevant. If what you need is a novel mathematical proof, it doesn't matter that a high school student is "more affordable". You need the math PhD.

    As "old" models get more and more capable, it's going to be an increasingly important skill to be able to adequately recognize when a task requires a frontier model and when it doesn't, so that the less capable (and therefore cheaper) model can be used.

  89. 89. Y_Y||context
    Affordable? I'd settle for available.
  90. 90. dimgl||context
    why
  91. 91. chanbam||context
    Because he likes attention and wants to feel special
  92. 92. CuriouslyC||context
    Claude will win on "vibes" and it'll be close in coding but considering how incremental Fable is above 5.5 in terms of overall smarts, there's no way 5.6 isn't considerably smarter on the whole.
  93. 93. simianwords||context
    I’m countering this prediction by stating that Fable and Sol will be somewhat similar - this has always been the trend and I see no reason why this should stop now.
  94. 94. HarHarVeryFunny||context
    OpenAI may have a model in the works that is similar next-gen size and architecture to Fable, but this isn't necessarily it. I'd guess that 5.6 was more of a hasty reaction to Mythos - same base model (same size, same price) as 5.5 but with additional post-training to make it more competitive with Mythos/Fable in some benchmarks.

    Mythos/Fable is supposedly next generation in size vs Opus, and is rumored to have some architectural innovation in terms of dynamic routing/compute, possibly only fully enabled with Fable. At $10/50, Fable is still twice the price of Sol 5.6's $5/30, but a big reduction from the Mythos preview, which had been an astronomical $30/150, possibly because the dynamic routing had not yet been enabled.

  95. 95. ddp26||context
    Is this the trend? There have been various points where one of Anthropic or OpenAI was substantially ahead. Sure, many times they're close, but now doesn't seem like one of them.
  96. 96. nharada||context
    Is that the correct comparison? Fable is twice the price
  97. 97. ddp26||context
    Fair point.
  98. 98. mccoyb||context
    When will GPT-5.6 Protomolecule drop? Me and the boys on Eros can't wait to get our hands on it!
  99. 99. slopinthebag||context
    I'm excited for GPT-5.7 Pneumonoultramicroscopicsilicovolcanoconiosis, hope they drop it soon
  100. 100. dodslaser||context
    GPT-5.8 Llanfairpwllgwyngyll
  101. 101. w4yai||context
    You mean Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch ?
  102. 102. derwiki||context
    … do you folks listen to Soft Skills Engineering? This has been a running joke on that podcast for a while
  103. 103. wasting_time||context
    What is happening. I feel like I'm getting an aneurysm reading these comments.
  104. 104. Schiendelman||context
    It's the name of a place in Wales, which has made it a running joke for decades!
  105. 105. da_grift_shift||context
    For me, it's GPT-5.9 Year of the Whisper-Quiet Maytag Dishmaster
  106. 106. slopinthebag||context
    I think Aramco GPT Coca Cola 6.0 will be a step change.
  107. 107. baq||context
    Musk steals Dario and they both train Epic on Mars. US Space Force promptly finds oil on Mars and launches an armada in the next window. In the meantime rocks painted black drop on Mar-a-Lago.
  108. 108. Schiendelman||context
    Oh man, here inside Ganymede I'm way more excited about the GPT-5.7 Io experiment! Hopefully it won't blow up in our faces!
  109. 109. static_motion||context
    Beltalowda!
  110. 110. bijowo1676||context
    Waiting for @simonw to report on this, before I read and try it
  111. 111. claudeIsDown||context
    I would love to see a more descriptive review from simonw instead of just SVG generations.
  112. 112. simonw||context
  113. 113. lossolo||context
    He is not an ML researcher or engineer, he is a passionate AI enthusiast blogger. He mostly does SVGs and other low effort checks (sometimes with major flaws, as people have pointed out a few times in the HN comments). Properly evaluating the model across all fronts requires a deep understanding of LLMs, how they work, the trade offs behind new architectures and the relevant research papers. It also takes a lot of time to build a proper evaluation framework so basically you can't just vibe code that if you want something that is solid.
  114. 114. HPMOR||context
    He created Django, what do you mean he's not an engineer? Also 'low-effort??' his posts are extremely in-depth, clearly very thought through with a significant amount of time and energy. Additionally he does perform multifaceted checks across LLMs in many of his other blog posts.
  115. 115. lossolo||context
    > He created Django, what do you mean he's not an engineer?

    I specifically said that he is not an ML engineer (emphasis on ML), so I'm not sure what Python web frameworks have to do with anything.

    > Also 'low-effort??' his posts are extremely in-depth, clearly very thought through with a significant amount of time and energy

    And yes, low effort. Pelican was low effort, his Fable test was low effort, his HN filter etc. Read the discussion in the comments under the Fable test, it's not just my opinion. There was also another example a few months ago. You can search for it, I don't keep track of these things.

    I discussed this with him directly after he called himself an "ML expert" in comments.

    This is a classic case of the Gell-Mann amnesia effect. I read ML papers and work with ML, but to people outside the industry, his writing can look "extremely in-depth" even though it really isn't. People I work with have the same opinion.

    > clearly very thought through with a significant amount of time and energy. Additionally he does perform multifaceted checks across LLMs in many of his other blog posts.

    I have never seen an article by him about any model that I would describe that way.

    And the most revealing sign that he is not an expert is the type of questions he asks and the mistakes he sometimes makes in the comments here. They show why he is not capable of doing any technically in depth evaluation (at least with his current knowledge level).

    If you actually want to learn something as a layperson, read articles written by ML PhDs like Sebastian Raschka or watch Stephen from Welch Labs etc. that are directed at a general audience.

  116. 116. algoth1||context
    We at HN: https://xkcd.com/2501/ to basically say that I think you might be considering low-effort what’s actually an attempt at simplifying - which is arguably higher effort
  117. 117. lossolo||context
    > you might be considering low-effort what’s actually an attempt at simplifying - which is arguably higher effort

    I'm not saying that simplifying complex topics is low-effort, good simplification can obviously require a lot of work and I fully agree here.

    What I meant is more that some of these tests feel methodologically sloppy: they are too shallow, miss important technical context, do not control for enough variables, etc. Yet the conclusions are sometimes presented, let's just say, too strongly, as I don't want to be too harsh.

  118. 118. algoth1||context
    Oh, I see. That’s entirely correct. I think the pelican test is more of a meme at this point, similar to Ethan’s Otter on an airplane for video models
  119. 119. shwaj||context
    > ML researcher or engineer

    The charitable reading is that they meant “ML researcher or ML engineer” with the latter meaning, I guess, an engineer who works on developing LLMs not just using them.

  120. 120. lossolo||context
    Yes, thank you.