That's at least genuine to some degree. Like, ok, good to know it's not officially a step back... But stuff like "smallest notch ever in an iPhone" outright misleads consumers when there are other brands out there that easily beat them.
Benchmarks are favorable enough that they're comparing against non-OpenAI models again. Interesting that tokens/second is similar to 5.4. Maybe there's some genuine innovation beyond "bigger model, better" this time?
It's behind Opus 4.7 in SWE-Bench Pro, if you care about that kind of thing. It seems on-trend, even though benchmarks are less and less meaningful for the stuff we expect from models now.
The part of the announcement more interesting than "it's better at benchmarks":
> To better utilize GPUs, Codex analyzed weeks’ worth of production traffic patterns and wrote custom heuristic algorithms to optimally partition and balance work. The effort had an outsized impact, increasing token generation speeds by over 20%.
The ability of agentic LLMs to improve computational efficiency/speed is a highly impactful domain I wish were tested with more than benchmarks. In my experience Opus is still much better than GPT/Codex in this respect, but given that OpenAI is getting material gains out of this type of performancemaxxing, and has an increasing incentive to keep at it given cost/capacity issues, I wonder if OpenAI will continue optimizing for it.
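For a sense of what "custom heuristic algorithms to optimally partition and balance work" could look like in miniature, here's a sketch of a classic greedy balancing heuristic (longest-processing-time-first). This is purely my illustration, not OpenAI's code; `WorkItem` and `estimatedTokens` are made-up names:

```typescript
// Hypothetical sketch: greedily assign work to the least-loaded GPU,
// placing the most expensive items first (LPT heuristic).
interface WorkItem {
  id: string;
  estimatedTokens: number; // assumed cost model, e.g. from traffic analysis
}

function partition(items: WorkItem[], gpuCount: number): WorkItem[][] {
  const bins: WorkItem[][] = Array.from({ length: gpuCount }, () => []);
  const load: number[] = new Array(gpuCount).fill(0);
  // Sort descending by cost so big items are placed while bins are empty.
  const ordered = [...items].sort((a, b) => b.estimatedTokens - a.estimatedTokens);
  for (const item of ordered) {
    const target = load.indexOf(Math.min(...load)); // least-loaded bin
    bins[target].push(item);
    load[target] += item.estimatedTokens;
  }
  return bins;
}
```

The real system presumably works off live traffic statistics rather than static estimates, but the shape of the problem (bin packing under an estimated cost model) is the same.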
Honestly, the problem with these claims is how anecdotal they are: how would someone reproduce this? I love it when labs go beyond traditional benchies like MMLU and friends, but these kinds of statements don't help much either, unless it's a proper controlled study!
In a sense it's better than a benchmark: it's a practical, real-world, highly quantifiable improvement, assuming there are no quality regressions and it passes all test cases. I have been experimenting with this workflow across a variety of computational domains and have achieved consistent results with both Opus and GPT. My coworkers have independently used Opus for optimization suggestions on services in prod, and they've led to much better performance (3x in some cases).
A more empirical test would be good for everyone (i.e. on equal hardware, give each agent the goal to implement an algorithm and make it as fast as possible, then quantify relative speed improvements that pass all test cases).
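A minimal sketch of what such a harness could check, correctness first, then timing (all names here are placeholders; a real eval would also need warm-up runs, variance reporting, and pinned hardware):

```typescript
// Hypothetical eval harness: reject an agent's implementation unless it
// passes every test case, then report mean wall-clock time per run.
type Impl = (input: number[]) => number[];

interface TestCase {
  input: number[];
  expected: number[];
}

function passesAll(impl: Impl, cases: TestCase[]): boolean {
  return cases.every(({ input, expected }) => {
    const out = impl([...input]); // copy so impls can't mutate shared state
    return out.length === expected.length && out.every((v, i) => v === expected[i]);
  });
}

function meanMillis(impl: Impl, input: number[], runs = 50): number {
  const start = performance.now();
  for (let i = 0; i < runs; i++) impl([...input]);
  return (performance.now() - start) / runs;
}

// Relative speedup of candidate B over candidate A, counted only if both pass.
function speedup(a: Impl, b: Impl, cases: TestCase[], workload: number[]): number {
  if (!passesAll(a, cases) || !passesAll(b, cases)) throw new Error("failed tests");
  return meanMillis(a, workload) / meanMillis(b, workload);
}
```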
Oh, come on, if they do well on benchmarks people question how applicable they are in reality. If they do well in reality people complain that it's not a reproducible benchmark...
The tension here is that what customers need to reproduce is this result on their own problem. To measure this you need extensive evals on private data.
OpenAI simply won’t share the data you need to reproduce this in the way you’d hope for an academic paper. It’s an engineering result, not a scientific one.
There's already KernelBench, which tests CUDA kernel optimizations.
On the other hand, all companies know that optimizing their own infrastructure/models is the critical path to "winning" against the competition, so you can bet they are serious about it.
So, I'm working on some high-performance data processing in Rust. I had hit some performance walls and needed improvements on the scale of 100x or more.
I remembered the famous FizzBuzz Intel code-golf optimizations, and gave it to Gemini Pro, along with my code and instructions to "suggest optimizations similar to those, maybe not so low level, but clever", and its suggestions were veerry cool. LLMs do not stop amazing me every day.
I'd argue the disaster started mid-4.6, when they started juggling rate limits while hitting uptime problems. Great that we have some healthy competition; waiting for the next move from DeepMind.
A playable 3D dungeon arena prototype built with Codex and GPT models. Codex handled the game architecture, TypeScript/Three.js implementation, combat systems, enemy encounters, HUD feedback, and GPT‑generated environment textures. Character models, character textures, and animations were created with third-party asset-generation tools.
The game that this prompt generated looks pretty decent visually. A big part of this is likely due to the fact that the meshes were created using a separate tool (probably Meshy, Tripo.ai, or similar) and not generated by 5.5 itself.
It really seems like we could be at the dawn of a new era similar to Flash, where any gamer or hobbyist can generate game concepts quickly and instantly publish them to the web. Three.js in particular is really picking up as the primary way to design games with AI, in spite of the fact that it's not even a game engine, just a web rendering library.
FWIW I've been experimenting with Three.js and AI for the last ~3 years, and noticed a significant improvement in 5.4 - the biggest single generation leap for Three.js specifically. It was most evident in shaders (GLSL), but also apparent in structuring of Three.js scenes across multiple pages/components.
It still struggles to create shaders from scratch, but is now pretty adequate at editing existing shaders.
In 5.2 and below, GPT really struggled with "one canvas, multiple page" experiences, where a single background canvas is kept rendered over routes. In 5.4, it still takes a bit of hand-holding and frequent refactor/optimisation prompts, but is a lot more capable.
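For anyone who hasn't tried it, a minimal sketch of the pattern (my own illustration, with a toy route switcher standing in for a real router): the renderer and scene live outside the router, and a "route change" only toggles which group is visible, so the canvas never tears down.

```typescript
import * as THREE from "three";

// One persistent background canvas for the whole app.
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  60, window.innerWidth / window.innerHeight, 0.1, 100
);
camera.position.z = 5;

// Route-specific content lives in groups; the routes here are made up.
const routeGroups: Record<string, THREE.Group> = {
  "/": new THREE.Group(),
  "/about": new THREE.Group(),
};
for (const group of Object.values(routeGroups)) {
  group.visible = false;
  scene.add(group);
}

// A "navigation" is just a visibility swap - no canvas teardown/rebuild.
function setRoute(path: string): void {
  for (const [p, group] of Object.entries(routeGroups)) {
    group.visible = p === path;
  }
}

setRoute("/");
renderer.setAnimationLoop(() => renderer.render(scene, camera));
```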
Excited to test 5.5 and see how it is in practice.

Oh just like a real developer

Have you tried any skills like cloudai-x/threejs-skills that help with that? Or built your own?
I’ve had a lot of success using LLMs to help with my Three.js based games and projects. Many of my weird clock visualizations relied heavily on it.
It might not be a game engine, but it’s the de facto standard for doing WebGL 3D. And since it’s been around forever, there’s a massive amount of training data available for it.
Before LLMs were a thing, I relied more on Babylon.js, since it’s a bit higher level and gives you more batteries included for game development.
The meshes look interesting, but the gameplay is very basic. The tank one seems more sophisticated with the flying ships and whatnot.
What's strange is that this Pietro Schirano dude seems to write incredibly cargo cult prompts.
Game created by Pietro Schirano, CEO of MagicPath
Prompt: Create a 3D game using three.js. It should be a UFO shooter where I control a tank and shoot down UFOs flying overhead.
- Think step by step, take a deep breath. Repeat the question back before answering.
- Imagine you're writing an instruction message for a junior developer who's going to go build this. Can you write something extremely clear and specific for them, including which files they should look at for the change and which ones need to be fixed?
- Then write all the code. Make the game low-poly but beautiful.
- Remember, you are an agent: please keep going until the user's query is completely resolved before ending your turn and yielding back to the user. Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved. You must be prepared to answer multiple queries and only finish the call once the user has confirmed they're done.
- You must plan extensively in accordance with the workflow steps before making subsequent function calls, and reflect extensively on the outcomes of each function call, ensuring the user's query and related sub-requests are completely resolved.

What is this, 2023?

I feel like this was generated by a model tapping into 2023 notions of prompt engineering.

*BELIEVE!* https://www.youtube.com/watch?v=D2CRtES2K3E
I do not see instructions to assist in task decomposition and agent ~"motivation" to stay aligned over long periods as cargo culting.
See up thread for anecdotes [1].
> Decompose the user's query into all required sub-requests and confirm that each one is completed. Do not stop after completing only part of the request. Only terminate your turn when you are sure the problem is solved.
I see this as a showcase of 5.5's strength, since it suggests the model can be assigned this clearly important role and ~one-shot requests like this.
I've been using a CLI-AI-first task tool I wrote to process complex "parent" or "umbrella" tasks into decomposed subtasks and then execute on them.
This has allowed my workflows to float above the ups and downs of model performance.
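The tool isn't public as far as I know, so this is just a guess at the shape of that workflow; `runAgent` is a stand-in for whatever CLI/API call actually executes a subtask:

```typescript
// Hypothetical sketch of an umbrella-task runner: decompose, execute each
// subtask with a focused agent call, and keep every step in the transcript
// so the plan and its outcomes remain available for later forensics.
interface UmbrellaTask {
  title: string;
  subtasks: string[]; // produced by a decomposition step
}

async function runUmbrella(
  task: UmbrellaTask,
  runAgent: (prompt: string) => Promise<string>
): Promise<string> {
  const transcript: string[] = [`# ${task.title}`];
  for (const subtask of task.subtasks) {
    const result = await runAgent(subtask); // one focused call per subtask
    transcript.push(`## ${subtask}\n${result}`);
  }
  return transcript.join("\n\n"); // durable record, not hidden internal planning
}
```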
That said, having the AI do the planning for a big request like this internally is not good outside a demo. You want the AI's planning to be part of the historical context, available for forensics when there are stalls, unwound details, or other unexpected issues at any point along the way.

[1] https://news.ycombinator.com/item?id=47879819
OMFG

It's weird how people pep talk the AI - if my Jira tickets looked like this, I would throw a fit.
I guess these people think they have special prompt engineering skills, and that doing it like this is better than giving the AI a dry list of requirements (fwiw, they might even be right).
It’s not surprising to me that the same crowd that cheers for the demise of software engineering skills invented its own notion of AI prompting skills.
Too bad they can veer sharply into cringe territory pretty fast: “as an accomplished Senior Principal Engineer at a FAANG with 22 years of experience, create a todo list app.” It’s like interactive fanfiction.
This reminds me of so-called "optimization" hacks that people keep applying years after their languages have been improved to make them unnecessary or even harmful.
Maybe at one point it helped to write prompts in this weird way, but with all the progress going on in both the models and the harness, if it's not obsolete yet it soon will be. Just cruft that consumes tokens and fills the context window for nothing.
LLMs cannot do spatial reasoning. I haven't tried with GPT; however, Claude cannot solve a Rubik's Cube no matter how much I try with prompt engineering. I got Opus 4.6 to get ~70% of the puzzle solved, but it got stuck. At $20 a run it's prohibitively expensive.
The point is that if we can prompt an LLM to reason about 3 dimensions, we will likely be able to apply that to math problems it currently isn't able to solve.
I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
Opus 4.6 got the cross and started to get several pieces onto the correct faces. It couldn't reason past this. You can see the prompts and all the turn messages here: https://gist.github.com/adam-s/b343a6077dd2f647020ccacea4140...
edit: I can't reply to the message below. The point isn't whether we can solve a Rubik's Cube with a Python script and tool calls. The point is whether we can get an LLM to reason about moving things in 3 dimensions. The prompt is a puzzle in the way that a Rubik's Cube is a puzzle. A 7-year-old child can learn 6 moves and figure out how to solve a Rubik's Cube in a weekend; the LLM can't solve it. However, given the correct prompt, can an LLM solve it? The prompt is the puzzle. That is why it is fun and interesting. Plus, it is a spatial problem, so if we solve that we solve a massive class of problems, including huge swathes of mathematics the LLMs can't touch yet.
I wonder if the difficulties LLMs have with “seeing” complex detail in images are muddying the problem here. What if you hand it the cube state in text form? (You could try ASCII art if you want a middle ground.)

If you want to isolate the issue, try getting the LLM itself to turn the images into a text representation of the cube state and check it for accuracy. If it can’t see the state correctly, it certainly won’t be able to solve it.
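A sketch of what that text form could look like, with one move implemented so state can be tracked and checked deterministically. The facelet layout and row conventions here are my assumptions, not the commenter's MCP server:

```typescript
// Hypothetical facelet representation: six 3x3 faces of sticker colors.
type Face = string[][]; // face[row][col], row 0 = top as you look at the face
interface Cube {
  U: Face; D: Face; F: Face; B: Face; L: Face; R: Face;
}

// Rotate a face 90 degrees clockwise: new[r][c] = old[2 - c][r].
const rotateCW = (f: Face): Face =>
  f[0].map((_, c) => [f[2][c], f[1][c], f[0][c]]);

// A clockwise U turn: the U face rotates, and the top rows of the side
// faces cycle F -> L -> B -> R -> F (under the usual facelet convention).
function moveU(cube: Cube): Cube {
  const { U, D, F, B, L, R } = cube;
  return {
    U: rotateCW(U),
    D,
    F: [R[0], F[1], F[2]], // F's top row now comes from R
    L: [F[0], L[1], L[2]],
    B: [L[0], B[1], B[2]],
    R: [B[0], R[1], R[2]],
  };
}
```

With state transitions this explicit, you can hand the model the exact cube state after every turn and verify its claimed moves mechanically.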
> I should release my Rubik's Cube MCP server with the challenge to see if someone can write a prompt to solve a Rubik's Cube.
Do it, I'm game! You nerdsniped me immediately and my brain went "That sounds easy, I'm sure I could do that in a night", so I'm surely not alone in being almost triggered by what you wrote. I bet I could even do it with a local model!

DeepMind's other models, however, might do better?

That's definitely in the training data
Interesting (would like to hear more), but solving a Rubik's Cube would appear to be a poor way to measure spatial understanding or reasoning. Ordinary human spatial intuition lets you think about how to move a tile to a certain location, but not really how to make consistent progress towards a solution; what's needed is knowledge of solution techniques. I'd say what you're measuring is 'perception' rather than reasoning.
> how to make consistent progress towards a solution
A 7-year-old child can learn six sequences of a few moves and solve the Rubik's Cube over a weekend. It is a solved algorithm, something an LLM should be very, very good at. What it can't do is reason about spatial relationships.
I’ve had a similar experience building a geometry/woodworking-flavored web app with Three.js and SVG rendering. It’s been kind of wild how quickly the SOTA models let me approach a new space in spatial development and rendering 3d (or SA optimization approaches, for that matter). That said, there are still easy "3d app" mistakes it makes like z-axis flipping or misreading coordinate conventions. But these models make similar mistakes with CSS and page awareness. Both require good verification loops to be effective.
I think there is a pattern. It has a hard time with temporal and spatial reasoning.
Temporal: I had a research project where the LLM had no concept of preventing data from the future from leaking in. I eventually had to create a wall clock and an agent that would step through every line of code and, for each line, write out its logic and why no data from after the wall clock could leak in.
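A sketch of the guard that "wall clock" idea suggests (all names here are illustrative, not the actual project): every read goes through a simulated clock, so future rows physically can't be returned.

```typescript
// Hypothetical point-in-time store for backtests: data timestamped after
// the simulation's "now" is invisible, so it cannot leak into decisions.
interface Row {
  timestamp: number; // epoch millis
  value: number;
}

class PointInTimeStore {
  constructor(
    private rows: Row[],       // assumed sorted by timestamp
    private now: () => number  // the simulation wall clock
  ) {}

  visible(): Row[] {
    const t = this.now();
    return this.rows.filter(r => r.timestamp <= t);
  }

  latest(): Row | undefined {
    const v = this.visible();
    return v.length > 0 ? v[v.length - 1] : undefined;
  }
}
```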
Spatial: I created a canvas for rendering a thinking model's attention and feedforward layers for data visualization animations. It was having a hard time working with it until I pointed Opus 4.7 to some ancient JavaScript code [0] about projecting 3D to 2D that I found after searching GitHub repositories. It worked perfectly, with pan and zoom, in one shot after that.
No matter how hard I tried, I couldn't get it to stack all the layers correctly. It must have just remembered all the parts for projecting 3D to 2D, because it could not figure out how to position the layers.
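The core of that old technique is small; here's the pseudo-3D projection idea from memory (a sketch, not the actual linked racer code):

```typescript
// Hypothetical minimal perspective projection: world point -> screen point.
interface Vec3 { x: number; y: number; z: number }

function project(
  p: Vec3,
  camera: Vec3,   // camera position in world space
  focal: number,  // focal length / distance to the projection plane
  width: number,
  height: number
) {
  const depth = p.z - camera.z; // distance in front of the camera
  const scale = focal / depth;  // the perspective divide
  return {
    x: width / 2 + (p.x - camera.x) * scale,
    y: height / 2 - (p.y - camera.y) * scale,
    scale, // handy for sizing stacked layers/sprites by depth
  };
}
```

Stacking layers is then just sorting by depth and drawing back to front.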
There is a ton of information burnt into the weights during training, but the model can not reason about it. When it does work well with spatial and temporal problems, it is more sleight of hand than an ability to generalize.
People say: why not just do reinforcement learning? That can't generalize in the same way an LLM can. I'm thinking about doing the Rubik's Cube because if people can solve that, it might open up solutions for working on temporal and spatial problems.

[0] https://jakesgordon.com/writing/javascript-racer-v1-straight...

[1] https://apps.apple.com/uz/app/jamboree-game-maker/id67473110...
It’s like all these things, though: it’s not a real production-worthy product. It’s a super-demo. It looks amazing until you realize there are many months of work to make it something of quality and value.
I think people are starting to catch on to where we really are right now. Future models will be better, but we are entering a trough of disillusionment, and this attitude will be widespread in a few months.

We've been there for a while.... creativity has been the primary bottleneck
GPT is really great, but I wish the GPT desktop app supported MCP as well.
You can kind of use connectors like MCP, but having to use ngrok every time just to expose a local filesystem for file editing is more cumbersome than expected.

https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbdde...
It's possible that "smarter" AI won't lead to more productivity in the economy. Why?
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has been long known as Solow's productivity paradox. There's lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
It's a hypothesis that "smarter" AI models, i.e. GPT-5.5, may not be a great boon to productivity. Given that this is the raison d'être of AI models and of improving them, I don't see why it is any less useful than any other discussion.
> "information technology" generally didn't increase productivity
Do you think it'd be viable to run most businesses on pen and paper? I'll give you email and being able to consume informational websites - the rest is pen and paper.
Productivity metrics were better when businesses were run on just pen and paper. Of course, there could be many confounding factors, but there are also many reasons why this could be so. Just a few hypotheses:
- Pen and paper become a limiting factor on bureaucratic BS
- Pen and paper are less distracting
- Pen and paper require more creative output from the user, as opposed to screens, which are mostly consumptive

etc etc
What metrics are these?

Productivity growth: https://fred.stlouisfed.org/graph/?g=1V79f. If you take rolling averages from that chart, it clearly demonstrates higher productivity growth before the adoption of software. This is a well-established fact in econ circles.
I think this is a classic case of reading specific arguments too deeply without understanding what they really mean in the grand picture. A few points that easily disprove this argument:
- If it were true that software paradoxically reduces productivity, you could just start a competing company that doesn't use software. Obviously this is ridiculous: the top 20 companies by market cap are mostly software-based, and every other non-IT company is heavily invested in software.
- If you say the problem is at the country level: it is obvious that every country that has digitised has had higher productivity and GDP growth. Take Italy vs the USA, for instance.
- If you are saying that the problem is even more global, take the whole world: GDP per capita growth is still pretty high since the IT revolution (and so are other metrics).
If you still think there's something more to it, you are probably deep in some conspiracy rabbit hole
You don't have a counterfactual to suggest that it would have continued increasing had it not been for technology. Is there _any_ credible economist who suggests that we might have higher productivity without tech?
There is no counterfactual needed. Productivity growth has declined, despite the expectation that software would accelerate productivity. I'm asking you why this didn't happen.
I'm not even proposing that growth would have been higher without "technology". I said information technology has not increased productivity growth compared to the past. This is an observation of fact.
Of course a counterfactual is needed, absent clear separation of causes and links to effects, neither of which the productivity metrics on their own establish. This is also widely known and talked about in econ circles in the face of this very data.

Again I'm asking - is there a single credible economist who says that growth would have been higher without technology?

This is what you said.

I asked you for alternative hypotheses and you've offered none.
It's quite possible the use of LLMs means that we are using less effort to produce the same output. This seems good.
But exerting less effort also conditions you to be weaker, and less able to engage the brain deeply and grind as hard as you once did. This is bad.
Which effect dominates? Difficult to say.
Of course this is absolutely possible. There was a time when physical exertion was the norm and nobody was overweight. That isn't the case anymore, is it?
25 years of shipping software, and IT absolutely increased productivity - just not for everyone, not everywhere. Some workflows got 10x faster, others got slower from meetings about the new tools.

AI feels the same. I'm shipping indie apps solo now that would have needed a small team five years ago. But in bigger orgs I see people spending 20 minutes verifying 15-minute AI output that used to be a 30-minute task they'd just do. Depends where you sit.
Just as a heads-up: even though GPT-5.5 is releasing today, the rollout in ChatGPT and Codex will be gradual over many hours so that we can make sure service remains stable for everyone (same as our previous launches). You may not see it right away, and if you don't, try again later in the day. We usually start with Pro/Enterprise accounts and then work our way down to Plus. We know it's slightly annoying to have to wait a random amount of time, but we do it this way to keep service maximally stable.

(I work at OpenAI.)
OpenAI has been very generous with limit resets. Please don't turn this into a weird expectation that kicks in whenever something unrelated happens. It would piss me off if I were in their place, and I really don't want them to stop.
The suggestion wasn't about general limit resets when there are bugs or outages; it's that it would be commercially useful to let users try new models when they have already reached their weekly limits.
Sorry but why should we care if very reasonable suggestions "piss [them] off"? That sounds like a them problem. "Them" being a very wealthy business. I think OpenAI will survive this very difficult time that GP has put them through.
Conceivably you could have a public-facing dashboard of the rollout status to reduce confusion, or even make it visible directly in the UI that the model is there but not yet available to you. The fanciest option would be to include an ETA, but that's presumably difficult, since it's hard to guess in case the rollout has issues.

The UI tells you which model you're using at any given time.
Did you guys do anything about GPT's motivation? I tried to use the GPT-5.4 API (at xhigh) for my OpenClaw after the Anthropic Oauthgate, but I just couldn't drag it into doing its job. I had the most hilarious dialogues, along the lines of: "You stopped, X would have been next." - "Yeah, I'm sorry, I failed. I should have done X next." - "Well, how about you just do it?" - "Yep, I really should have done it now." - "Do X, right now, this is an instruction." - "I didn't. You're right, I have failed you. There's no apology for that."
I literally wasn’t able to convince the model to WORK, on a quick, safe and benign subtask that later GLM, Kimi and Minimax succeeded on without issues. Had to kick OpenAI immediately unfortunately.
GPT 5.4 is really good at following precise instructions but clearly wouldn't innovate on its own (except if the instructions clearly state to innovate :))
https://deploymentsafety.openai.com/gpt-5-5
Will be interesting to try.
https://www.tbench.ai/leaderboard/terminal-bench/2.0
https://debugml.github.io/cheating-agents/#sneaking-the-answ...