NewsLab
Apr 29 01:42 UTC

GPT-5.5 (openai.com)

1,575 points | by rd | 1,052 comments

Comments (1052)

  1. 1. luqtas||context
    they are using ethical training weights this time!!! /j
  2. 2. meetpateltech||context
  3. 3. applfanboysbgon||context
    If there's a bingo card for model releases, "our [superlative] and [superlative] model yet" is surely the free space.
  4. 4. xnx||context
    "our newest and most expensive model yet"
  5. 5. tom1337||context
    Do "our [superlative] and [superlative] [product] yet" and you have pretty much every product launch
  6. 6. SequoiaHope||context
    I love when Apple says they’re releasing their best iPhone yet so I know the new model is better than the old ones.
  7. 7. taspeotis||context
  8. 8. sigmoid10||context
    That's at least genuine to some degree. Like, ok, good to know it's not officially a step back... But stuff like "smallest notch ever in an iPhone" is outright misleading consumers when there are other brands out there that easily beat them.
  9. 9. SequoiaHope||context
    It’s genuine but hilarious to consider the alternative where a new iPhone is not quite as good as the old model and apple states as much.
  10. 10. ertgbnm||context
    can't wait for "our worst and dumbest model yet"
  11. 11. Nition||context
    Apple should have used that one for the 2016 MacBook.
  12. 12. wiseowise||context
    "Best iPhone ever"
  13. 13. ZeroCool2u||context
    Benchmarks are favorable enough that they're comparing to non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation this time, beyond "bigger model, better results"?
  14. 14. qsort||context
    It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.

    Will be interesting to try.

  15. 15. minimaxir||context
    The more interesting part of the announcement than "it's better at benchmarks":

    > To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.

    The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain that I wish were tested more than standard benchmarks are. In my experience, Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performancemaxxing, and has an increasing incentive to keep doing so given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
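OpenAI doesn't publish the heuristics Codex wrote, but "optimally partition and balance work" is classically a greedy scheduling problem. A minimal sketch of one such heuristic, longest-processing-time-first assignment, with all names and costs hypothetical:

```python
import heapq

def lpt_partition(costs, n_workers):
    """Greedy longest-processing-time partitioning: assign each work item,
    heaviest first, to the currently least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]  # min-heap of (load, worker)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for i, cost in sorted(enumerate(costs), key=lambda x: -x[1]):
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + cost, w))
    return assignment

# Example: balance 6 requests of varying (hypothetical) token cost across 2 GPUs
costs = [7, 3, 5, 2, 6, 1]
groups = lpt_partition(costs, 2)
loads = {w: sum(costs[i] for i in items) for w, items in groups.items()}
```

In real serving systems the "cost" would be predicted from traffic patterns rather than known up front, which is where the weeks of production-traffic analysis would come in.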

  16. 16. amrrs||context
    Honestly, the problem with these claims is that they're anecdotal: how can someone reproduce this? I love when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!
  17. 17. minimaxir||context
    In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and it passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).

    A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
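The test proposed above can be sketched as a small harness: time each candidate implementation, but only count the ones that first pass a shared test suite. The function names and the toy Fibonacci candidates below are purely illustrative stand-ins for agent-written code:

```python
import time

def benchmark(candidates, test_cases, repeats=100):
    """Time each candidate implementation, counting only those that
    first pass every shared test case (failures are disqualified)."""
    results = {}
    for name, fn in candidates.items():
        if not all(fn(arg) == expected for arg, expected in test_cases):
            results[name] = None  # wrong output: speed doesn't matter
            continue
        start = time.perf_counter()
        for _ in range(repeats):
            for arg, _ in test_cases:
                fn(arg)
        results[name] = time.perf_counter() - start
    return results

# Toy stand-ins for two "agent-written" implementations of the same spec
def fib_naive(n):
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

def fib_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

tests = [(0, 0), (1, 1), (10, 55), (20, 6765)]
timings = benchmark({"naive": fib_naive, "iterative": fib_iter}, tests)
```

A real version would also need fixed hardware, warm-up runs, and statistical treatment of the timings, but the core idea is just "correctness gate, then stopwatch."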

  18. 18. squibonpig||context
    Yeah but like what if they're sorta embellishing it or just lying? That's the issue with not being reproducible.
  19. 19. jstanley||context
    Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
  20. 20. girvo||context
    That's easily explained by those being two different people with two different opinions?
  21. 21. 2goomba1stage||context
    And together they make one single community that's effectively NEVER happy.
  22. 22. theptip||context
    The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.

    OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper.

    It’s an engineering result, not a scientific one.

  23. 23. xiphias2||context
    There's already KernelBench which tests CUDA kernel optimizations.

    On the other hand, all companies know that optimizing their own infrastructure / models is the critical path for "winning" against the competition, so you can bet they are serious about it.

  24. 24. dash2||context
    Is that true? I would have guessed research breakthroughs might be a more plausible way to win.
  25. 25. xtracto||context
    So, I'm working on some high-performance data processing in Rust. I had hit some performance walls and needed to improve on the scale of 100x or more.

    I remembered the famous FizzBuzz Intel code-golf optimizations and gave it to Gemini Pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were veerry cool.

    LLMs do not stop amazing me every day.

  26. 26. ativzzz||context
    I like that they waited for opus 4.7 to come out first so they had a few days to find the benchmarks that gpt 5.5 is better at
  27. 27. eknkc||context
    Well, anecdotally, 5.4 was already better than Opus 4.7, so it should not have been hard.
  28. 28. wahnfrieden||context
    I like that Anthropic rushed 4.7 out to get a couple days of coverage before 5.5 hit
  29. 29. spprashant||context
    Everything since that launch to this release has been a PR disaster for Anthropic.
  30. 30. dandaka||context
    I can argue that disaster started mid-4.6, when they started juggling with rate limits while hitting uptime problems. Great we have some healthy competition and waiting for the next move from Deepmind.
  31. 31. gck1||context
    Correct. Anthropic has been on disaster train since January and they can't seem to get off that train.
  32. 32. nullbyte||context
    82.7% on Terminal Bench is crazy
  33. 33. toephu2||context
    Is it? There are 5 other models near ~80% and it was achieved in March... which in AI-world seems like a century ago.

    https://www.tbench.ai/leaderboard/terminal-bench/2.0

  34. 34. ejpir||context
    those are not verified. I've tried forgecode and I cannot believe they didn't do something to influence the benchmarks
  35. 35. GodelNumbering||context
    Yup, they were found to be sneaking the answer key using agents.md

    https://debugml.github.io/cheating-agents/#sneaking-the-answ...

  36. 36. astlouis44||context
    A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools

    The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.

    It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, despite the fact that it's not even a game engine, just a web rendering library.

  37. 37. ZeWaka||context
    I personally don't think the gameplay itself is that impressive.
  38. 38. 0x62||context
    FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.

    It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.

    In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.

    Excited to test 5.5 and see how it is in practice.

  39. 39. CSMastermind||context
    > It still struggles to create shaders from scratch

    Oh just like a real developer

  40. 40. accrual||context
    Much respect for shader developers, it's a different way of thinking/programming
  41. 41. import||context
    I've been using Claude for the same context and it's been doing really well with GLSL since around last September.
  42. 42. Pym||context
    One struggle I'm having (with Claude) is that most of what it knows about Three.js is outdated. I haven't used GPT in a while, is the grass greener?

    Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?

  43. 43. vunderba||context
    I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.

    It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.

    Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.

  44. 44. kingstnap||context
    The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.

    What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.

      Game created by Pietro Schirano, CEO of MagicPath
    
      Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
      - Think step by step, take a deep breath. Repeat the question back before answering.
      - Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
      -Then write all the code. Make the game low-poly but beautiful.
      - Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
      - You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.
  45. 45. irthomasthomas||context
    > Think Step By Step

    What is this, 2023?

    I feel like this was generated by a model tapping in to 2023 notions of prompt engineering.

  46. 46. tantalor||context
    It comes across as an elaborate, sparkly motivational cat poster.

    *BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E

  47. 47. skolskoly||context
  48. 48. bredren||context
    The prompt did not specify advanced gameplay.

    I do not see instructions that assist with task decomposition and keep the agent ~"motivated" to stay aligned over long periods as cargo-culting.

    See up thread for anecdotes [1].

    > Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.

    I see this as a portrayal of the strength of 5.5, since it suggests the ability to be assigned this clearly important role to ~one shot requests like this.

    I've been using a CLI-AI-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.

    This has allowed my workflows to float above the ups and downs of model performance.

    That said, having the AI do the planning for a big request like this internally is not good outside a demo.

    That's because you want the AI's planning to be part of the historical context, available for forensics when there are stalls, unwound details, or other unexpected issues anywhere along the way.

    [1] https://news.ycombinator.com/item?id=47879819

  49. 49. skirano||context
    Pietro here, I just published a video of it: https://x.com/skirano/status/2047403025094905964?s=20
  50. 50. ahoka||context
    "take a deep breath"

    OMFG

  51. 51. jameshart||context
    Claude would check to see if it had any breathing skills, if it doesn't find any it would start installing npm modules for breathing.
  52. 52. torginus||context
    It's weird how people pep-talk the AI. If my Jira tickets looked like this, I would throw a fit.

    I guess these people think they have special prompt-engineering skills, and that doing it like this is better than giving the AI a dry list of requirements (fwiw, they might even be right).

  53. 53. mattgreenrocks||context
    It’s not surprising to me that the same crowd that cheers for the demise of software engineering skills invented its own notion of AI prompting skills.

    Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.

  54. 54. eiksjs||context
    Indeed it is so utterly cringe.
  55. 55. dr_kiszonka||context
    That's quite similar to the AI Studio's prompt. You are a world-class frontend engineer...
  56. 56. eloisant||context
    Yes, this is cargo cult.

    This remind me of so called "optimization" hacks that people keep applying years after their languages get improved to make them unnecessary or even harmful.

    Maybe at one point it helped to write prompts in this weird way, but with all the progress going on both in the models and the harness if it's not obsolete yet it will soon be. Just crufts that consumes tokens and fills the context window for nothing.

  57. 57. dataviz1000||context
    LLMs cannot do spatial reasoning. I haven't tried with GPT; however, Claude cannot solve a Rubik's Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved, but it got stuck. At $20 a run, it's prohibitively expensive.

    The point is that if we can prompt an LLM to reason about three dimensions, we can likely apply that to math problems it currently isn't able to solve.

    I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.

  58. 58. snet0||context
    How are you handing the cube state to the model?
  59. 59. dataviz1000||context
    Does this answer the question?

    Opus 4.6 got the cross and started to get several pieces on the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages.

    https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...

    edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in three dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend; the LLM can't solve it. However, given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.

  60. 60. osti||context
    Can't they write a script to solve rubik cubes?
  61. 61. Jensson||context
    That doesn't test whether the model can follow and execute a dynamic plan reliably.
  62. 62. libraryofbabel||context
    I wonder if the difficulties LLMs have with "seeing" complex detail in images are muddying the problem here. What if you hand it the cube state in text form? (You could try ASCII art if you want a middle ground.)

    If you want to isolate the issue, try getting the LLM itself to turn the images into a text representation of the cube state and check for accuracy. If it can’t see state correctly it certainly won’t be able to solve.
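A text form of the cube state is easy to sketch. The encoding below uses the common URFDLB facelet-string convention; the function names and the exact layout are illustrative, not from any particular tool:

```python
from collections import Counter

# Six faces in URFDLB order, nine stickers per face, read row by row.
SOLVED = "U" * 9 + "R" * 9 + "F" * 9 + "D" * 9 + "L" * 9 + "B" * 9

def is_plausible_state(state):
    """Cheap sanity check for an LLM's transcription of a cube:
    54 stickers, exactly nine of each color, centers where they belong."""
    if len(state) != 54 or Counter(state) != Counter(SOLVED):
        return False
    # The center sticker (index 4 of each face) identifies the face.
    return all(state[f * 9 + 4] == "URFDLB"[f] for f in range(6))

def ascii_cube(state):
    """Render one face per line, as a text form a model can read back."""
    return "\n".join(
        "URFDLB"[f] + ": " + " ".join(state[f * 9:(f + 1) * 9])
        for f in range(6)
    )
```

Running the model's own transcription through a check like `is_plausible_state` would separate perception failures (it can't read the cube) from reasoning failures (it can't plan the moves).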

  63. 63. Torkel||context
    *yet
  64. 64. embedding-shape||context
    > I should release my Rubiks Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.

    Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night" so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!

  65. 65. Melatonic||context
    What about a model designed for robotics and vision? Seems like an LLM trained on text would inherently not be great for this.

    DeepMinds other models however might do better?

  66. 66. versteegen||context
    Interesting (would like to hear more), but solving a Rubik's Cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.
  67. 67. William_BB||context
    > what's needed is knowledge of solution techniques

    That's definitely in the training data

  68. 68. dataviz1000||context
    > how to make consistent progress towards a solution

    A 7-year-old child can learn six sequences of a few moves and, over a weekend, solve a Rubik's Cube. It is a solved algorithm, something LLMs should be very, very good at. What they can't do is reason about spatial relationships.

  69. 69. variodot||context
    I’ve had a similar experience building a geometry/woodworking-flavored web app with Three.js and SVG rendering. It’s been kind of wild how quickly the SOTA models let me approach a new space in spatial development and rendering 3d (or SA optimization approaches, for that matter). That said, there are still easy "3d app" mistakes it makes like z-axis flipping or misreading coordinate conventions. But these models make similar mistakes with CSS and page awareness. Both require good verification loops to be effective.
  70. 70. dataviz1000||context
    I think there is a pattern: it has a hard time with the temporal and the spatial.

    Temporal: I had a research project where the LLM had no concept of preventing data from the future from leaking in. I eventually had to create a wall clock and an agent that would step through every line of code and justify, in writing, that line's logic and why no data from the future of the wall clock could leak.

    Spatial: I created a canvas for rendering a thinking model's attention and feedforward layers as data-visualization animations. It was having a hard time working with it until, after searching GitHub repositories, I pointed Opus 4.7 to some ancient JavaScript code [0] about projecting 3D to 2D. It worked perfectly, with pan and zoom, in one shot after that.

    No matter how hard I tried before that, I couldn't get it to stack all the layers correctly. It must have needed reminding of all the parts of projecting 3D to 2D, because it could not figure out how to position the layers on its own.

    There is a ton of information burnt into the weights during training, but the model cannot reason about it. When it does work well with spatial and temporal problems, it is more sleight of hand than an ability to generalize.

    People ask, why not just do reinforcement learning? But that can't generalize the way an LLM can. I'm thinking about doing the Rubik's Cube challenge because if people can solve that, it might open up solutions for temporal and spatial problems generally.

    [0] https://jakesgordon.com/writing/javascript-racer-v1-straight...
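The trick in that old pseudo-3D racer code is ordinary perspective projection: divide by depth, then map to pixel coordinates. A minimal sketch, with all parameter names and screen dimensions hypothetical:

```python
def project(x, y, z, focal=1.0, width=800, height=600):
    """Perspective-project a camera-space point onto the screen:
    scale by focal/depth, then map to pixel coordinates."""
    if z <= 0:
        return None  # behind the camera
    scale = focal / z
    screen_x = width / 2 + scale * x * width / 2
    screen_y = height / 2 - scale * y * height / 2  # screen y grows downward
    return screen_x, screen_y

# A point on the optical axis lands at the screen center; the same
# lateral offset shrinks toward the center as the point recedes.
center = project(0, 0, 10)
near = project(2, 0, 5)
far = project(2, 0, 20)
```

Stacking layers along z is then just applying this per layer, which is presumably what the model could reproduce once shown the reference but couldn't derive on its own.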

  71. 71. holoduke||context
    I bet I can even do it with the smallest gemma 4 model using a prompt of max 500 characters.
  72. 72. mindhunter||context
    A friend is building Jamboree[1] (prev name "Spielwerk") for iOS. An app to build and share games. They're all web based so they're easy to share.

    [1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...

  73. 73. nemo44x||context
    It’s like all these things, though: it’s not a real production-worthy product. It’s a super-demo. It looks amazing until you realize there are many months of work to make it something of quality and value.

    I think people are starting to catch on to where we really are right now. Future models will be better, but we are entering a trough of disillusionment, and this attitude will be widespread in a few months.

  74. 74. peder||context
    > It really seems like we could be at the dawn of a new era similiar to flash

    We've been there for a while.... creativity has been the primary bottleneck

  75. 75. objektif||context
    Are there faster mini/nano versions as well?
  76. 76. tedsanders||context
    Not this time, no.
  77. 77. abi||context
    Usually, those get released a few weeks later.
  78. 78. jdw64||context
    GPT is really great, but I wish the GPT desktop app supported MCP as well.

    You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.

  79. 79. throwaway911282||context
    Use codex app
  80. 80. jryio||context
    Their 'Preparedness Framework'[1] is 20 pages and looks ChatGPT generated, I don't feel prepared reading it.

    https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...

  81. 81. cmrdporcupine||context
    Not rolled out to my Codex CLI yet, but some users on Reddit claiming it's on theirs.
  82. 82. cynicalpeace||context
    It's possible that "smarter" AI won't lead to more productivity in the economy. Why?

    Because software and "information technology" generally didn't increase productivity over the past 30 years.

    This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.

    But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.

    AI's main application has been in the information space so far. If that continues, I doubt you will get more productivity from it.

    If you give AI a body... well, maybe that changes.

  83. 83. aiaiai177||context
    Downvoted by the AI Nazis. They are running a tight ship before the IPOs.
  84. 84. cbg0||context
    I downvoted it because it doesn't add anything useful to the conversation, and I don't own any AI stock.
  85. 85. cynicalpeace||context
    It's a hypothesis that "smarter" AI models, i.e. GPT-5.5, may not be a great boon to productivity. Given that this is the raison d'être of AI models and of improving them, I don't see why it is any less useful than any other discussion.
  86. 86. aerhardt||context
    > "information technology" generally didn't increase productivity

    Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - rest is pen and paper.

  87. 87. cynicalpeace||context
    Productivity metrics were better when businesses were run on just pen and paper. Of course, there could be many confounding factors, but there are also many reasons why this could be so. Just a few hypotheses:

    - Pen and paper become a limiting factor on bureaucratic BS

    - Pen and paper are less distracting

    - Pen and paper require more creative output from the user, as opposed to screens which are mostly consumptive

    etc etc

  88. 88. theLiminator||context
    > Productivity metrics were better when businesses were run on just pen and paper

    What metrics are these?

  89. 89. cynicalpeace||context
    Productivity growth. If you take rolling averages from this chart, it clearly demonstrates higher productivity growth before the adoption of software. This is a well-established fact in econ circles.

    https://fred.stlouisfed.org/graph/?g=1V79f

  90. 90. simianwords||context
    I think this is a classic case of reading into specific arguments too deeply without understanding what they really mean in the grand picture. A few points that easily counter this argument:

    - If it were true that software paradoxically reduces productivity, you could just start a competing company that doesn't use software. Obviously this is ridiculous: the top 20 companies by market cap are mostly software-based, and every other non-IT company is heavily invested in software.

    - If you say the problem is at the country level: it is obvious that every country that has digitised has had higher productivity and GDP growth. Take Italy vs. the USA, for instance.

    - If you say the problem is even more global, take the whole world: GDP per capita is still much higher since the IT revolution (and so are other metrics).

    If you still think there's something more to it, you are probably deep in some conspiracy rabbit hole.

  91. 91. eiksjs||context
    Is there a way to mute people who are clearly AI boosters? ^
  92. 92. simianwords||context
    ? you are literally commenting on the release of a new model from OpenAI in a tech focused community. Have you considered what should be normal here?
  93. 93. cynicalpeace||context
    The data clearly shows that productivity growth is flat or even declining. What is your accounting of why software hasn't offset those numbers?
  94. 94. simianwords||context
    You don't have a counterfactual to suggest that it would have continued increasing had it not been for technology. Is there _any_ credible economist who suggests that we might have higher productivity without tech?
  95. 95. cynicalpeace||context
    There is no counterfactual needed. Productivity growth has declined, despite the expectation that software would accelerate productivity. I'm asking you why this didn't happen.
  96. 96. simianwords||context
    There is a counterfactual needed because it is not clear whether the growth would not have declined even more without Software.

    Again I'm asking - is there a single credible economist who says that the growth would have been higher without technology?

  97. 97. cynicalpeace||context
    I'm not even proposing that growth would have been higher without "technology". I said information technology has not increased productivity growth compared to the past. This is an observation of fact.
  98. 98. simianwords||context
    > Productivity metrics were better when businesses were run on just pen and paper

    This is what you said.

  99. 99. cynicalpeace||context
    Again, that is a simple observation of fact. No counterfactual needed. I said it had confounding factors, and I offered hypotheses

    I asked you for alternative hypotheses and you've offered none.

  100. 100. aerhardt||context
    Of course a counterfactual is needed: without one, there is no clear separation of causes from effects, which the productivity metrics on their own do not establish. This is also widely known and discussed in econ circles in the face of this very data.
  101. 101. ewrs||context
    It's quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.

    But exerting less effort also conditions you to be weaker, and less able to engage the brain deeply and grind as hard as you once did. This is bad.

    Which effect dominates? Difficult to say.

    This is absolutely possible: there was a time when physical exertion was the norm and nobody was overweight. That isn't the case anymore, is it?

  102. 102. hol4b||context
    25 years of shipping software, and IT absolutely increased productivity - just not for everyone, not everywhere. Some workflows got 10x faster, others got slower from meetings about the new tools.

    AI feels the same. I'm shipping indie apps solo now that would have needed a small team five years ago. But in bigger orgs I see people spending 20 minutes verifying 15-minute AI output that used to be a 30-minute task they'd just do. Depends where you sit.

  103. 103. tedsanders||context
    Just as a heads up, even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

    (I work at OpenAI.)

  104. 104. pixel_popping||context
    Can't wait! Thanks, guys. PS: when you drop a new model, it would be smart to reset weekly (or at least session) limits :)
  105. 105. cmrdporcupine||context
    Limits were just reset two days ago.
  106. 106. wahnfrieden||context
    And yet there was an outage last night
  107. 107. lawgimenez||context
    And they're having an outage right now.
  108. 108. pietz||context
    OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation to happen whenever something unrelated happens. It would piss me off if I were in their place and I really don't want them to stop.
  109. 109. cactusplant7374||context
    There is absolutely nothing wrong with asking or suggesting. They are adults. I'm sure they can handle it.
  110. 110. pixel_popping||context
    The suggestion wasn't about general limit resets when there are bugs or outages; it's that it would be commercially useful to let users try new models even when they have already reached their weekly limits.
  111. 111. Petersipoi||context
    Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
  112. 112. gardenhedge||context
    Strange behaviour to defend a company like this. Reset the limits openai!
  113. 113. qsort||context
    Great stuff! Congrats on the release!
  114. 114. motoboi||context
    Please next time start with azure foundry lol thanks!
  115. 115. vlovich123||context
    Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest would be to include an ETA but that's presumably difficult since it's hard to guess in case the rollout has issues.
  116. 116. moralestapia||context
    Why would you be confused?

    The UI tells you which model you're using at any given time.

  117. 117. ModernMech||context
    I don't see what model I'm using on the Codex web interface, where is that listed?
  118. 118. dude250711||context
    With Anthropic, newer models often lead to quality degradation. Will you keep GPT 5.4 available for some time?
  119. 119. endymi0n||context
    Did you guys do anything about GPT's motivation? I tried to use the GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it to do its job. I had the most hilarious dialogues along the lines of "You stopped, X would have been next." - "Yeah, I'm sorry, I failed. I should have done X next." - "Well, how about you just do it?" - "Yep, I really should have done it now." - "Do X, right now, this is an instruction." - "I didn't. You're right, I have failed you. There's no apology for that."

    I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.

  120. 120. pixel_popping||context
    GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))