DeepSeek v4 (api-docs.deepseek.com)

2,086 points|by impact_sy|4d ago|1,601 comments

https://api-docs.deepseek.com/

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...

Comments (1601)

1. luyu_wu|4d ago|context

For those who didn't check the page yet, it just links to the API docs being updated with the upcoming models, not the actual model release.
2. talim|4d ago|context

Weights are on Huggingface FWIW. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/tree/main
3. cmrdporcupine|4d ago|context

My submission here https://news.ycombinator.com/item?id=47885014 done at the same time was to the weights.
dang, probably the two should be merged and that be the link
4. culi|4d ago|context

there's no pinging. Someone's gotta email dang
5. cmrdporcupine|4d ago|context

beh. instead of merging they just marked mine as dupe, even tho it was submitted at same time and had (for a long time) about the same votes and a better target page
6. seanobannon|4d ago|context

Weights available here: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
7. BoorishBears|4d ago|context

https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-Base https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro-Base
And we got new base models, wonderful, truly wonderful
8. nthypes|4d ago|context

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
Model was released and it's amazing. Frontier level (better than Opus 4.6) at a fraction of the cost.
9. sergiotapia|4d ago|context

The dragon awakes yet again!
10. kindkang2024|4d ago|context

There appears a flight of dragons without heads. Good fortune.
That's literally what the I Ching calls "good fortune."
Competition, when no single dragon monopolizes the sky, brings fortune for all.
11. rapind|4d ago|context

Pop?
12. onchainintel|4d ago|context

How does it compare to Opus 4.7? I've been immersed in 4.7 all week participating in the Anthropic Opus 4.7 hackathon and it's pretty impressive even if it's ravenous from a token perspective compared to 4.6
13. greenknight|4d ago|context

The thing is, it doesn't need to beat 4.7. It just needs to do somewhat well against it.
This is free... as in you can download it, run it on your systems and finetune it to be the way you want it to be.
14. johnmaguire|4d ago|context

... if you have 800 GB of VRAM free.
15. inventor7777|4d ago|context

I remember reading that some new frameworks have been coming out to allow Macs to stream weights of huge models live from fast SSDs and produce quality output, albeit slowly. Apart from that... good luck finding that much available VRAM haha
16. p1esk|4d ago|context

Do you think a lot of people have “systems” to run a 1.6T model?
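A back-of-the-envelope check on the figures in this subthread: memory for the weights alone is roughly parameter count times bytes per parameter. The sketch below assumes the 1.6T total parameter count quoted above; the bit widths are illustrative guesses rather than the released checkpoint's actual storage format, and KV cache, activations, and runtime overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


PARAMS = 1.6e12  # total parameter count quoted in the thread

# Illustrative precisions only; not a claim about how the weights are shipped.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS, bits):,.0f} GB")
```

Under these assumptions, the 800 GB figure mentioned earlier corresponds to roughly 4 bits per parameter.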
17. applfanboysbgon|4d ago|context

No, but businesses do. Being able to run quality LLMs without your business, or business's private information, being held at the mercy of another corp has a lot of value.
18. choldstare|4d ago|context

Not really - on-prem LLM hosting is extremely labor- and capital-intensive
19. applfanboysbgon|4d ago|context

But can be, and is, done. I work for a bootstrapped startup that hosts a DeepSeek v3 retrain on our own GPUs. We are highly profitable. We're certainly not the only ones in the space, as I'm personally aware of several other startups hosting their own GLM or DeepSeek models.
20. wuschel|4d ago|context

Why a retrain? What are you using the model for?
21. forrestthewoods|4d ago|context

What type of system is needed to self host this? How much would it cost?
22. p1esk|4d ago|context

Depends on how fast you want it to be. I’m guessing a couple of $10k Mac Studio boxes could run it, but probably not fast enough to enjoy using it.
23. disiplus|4d ago|context

Depends how many users you have and what is "production grade" for you but like 500k gets you an 8x B200 machine.
24. fragmede|4d ago|context

One GB200 NVL72 from Nvidia would do it. $2-3 million, or so. If you're a corporation, say Walmart or PayPal, that's not out of the question.
If you want to go budget corporate, 7 x H200 is just barely going to run it, but all in, $300k ought to do it.
25. gloflo|4d ago|context

How many users can you serve with that?
26. fragmede|4d ago|context

For the H200, between 150-700. The GB200 gets you something like 2-10k users.
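To put the numbers in this subthread side by side, here is a small sketch that divides the quoted hardware prices by the quoted user counts. All figures are the commenters' rough estimates, and the calculation ignores power, hosting, and staffing, so treat the result as upfront hardware cost per user only.

```python
def cost_per_user(hardware_cost_usd: float, users: int) -> float:
    """Upfront hardware cost divided by the number of users served."""
    if users <= 0:
        raise ValueError("users must be positive")
    return hardware_cost_usd / users


# (low cost, high cost) and (low users, high users), as quoted in the thread.
SETUPS = {
    "7x H200": ((300_000, 300_000), (150, 700)),
    "GB200 NVL72": ((2_000_000, 3_000_000), (2_000, 10_000)),
}

for name, ((cost_lo, cost_hi), (users_lo, users_hi)) in SETUPS.items():
    best = cost_per_user(cost_lo, users_hi)   # cheapest box, most users
    worst = cost_per_user(cost_hi, users_lo)  # priciest box, fewest users
    print(f"{name}: ${best:,.0f} to ${worst:,.0f} of hardware per user")
```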
27. forrestthewoods|4d ago|context

Whoa. How on earth can one system serve 2000 potentially concurrent users?
28. CamperBob2|4d ago|context

$20K worth of RTX 6000 Blackwell cards should let you run the Flash version of the model.
29. CJefferson|4d ago|context

To me, the important thing isn't that I can run it, it's that I can pay someone else to run it. I'm finding Opus 4.7 seems to be weirdly broken compared to 4.6, it just doesn't understand my code, breaks it whenever I ask it to do anything.
Now, at the moment, I can still use 4.6 but eventually Anthropic are going to remove it, and when it's gone it will be gone forever. I'm planning on trying Deepseek v4, because even if it's not quite as good, I know that it will be available forever, I'll always be able to find someone to run it.
30. muyuu|4d ago|context

Yep, it's wild how little emphasis there is on control and replicability in these posts.
Already these models are useful for a myriad of use cases. It's really not that important if a model can 1-shot a particular problem or draw a cuter pelican on a bike. Past a degree of quality, process and reliability are so much more important for anything other than complete hands-off usage, which in business is not something you're really going to do.
The fact that my tool may be gone tomorrow, and this actually has happened before, with no guarantees of a proper substitute... that's a lot more of a concern than a point extra in some benchmark.
31. kelseyfrog|4d ago|context

What's the hardware cost to running it?
32. slashdave|4d ago|context

"if you have to ask..."
33. redox99|4d ago|context

Probably like 100 USD/hour
34. bbor|4d ago|context

I was curious, and some [intrepid soul](https://wavespeed.ai/blog/posts/deepseek-v4-gpu-vram-require...) did an analysis. Assuming you do everything perfectly and take full advantage of the model's MoE sparsity, it would take:
- To run at full precision: "16–24 H100s", giving us ~$400-600k upfront, or $8-12/h from [us-east-1](https://intuitionlabs.ai/articles/h100-rental-prices-cloud-c...).
- To run with "heavy quantization" (16 bits -> 8): "8xH100", giving us $200K upfront and $4/h.
- To run truly "locally"--i.e. in a house instead of a data center--you'd need four 4090s, one of the most powerful consumer GPUs available. Even that would clock in around $15k for the cards alone and ~$0.22/h for the electricity (in the US).
Truly an insane industry. This is a good reminder of why datacenter capex since 2023 has eclipsed the Manhattan Project, the Apollo program, and the US interstate system combined...
35. zargon|4d ago|context

That article is a total hallucination.
"671B total / 37B active"
"Full precision (BF16)"
And they claim they ran this non-existent model on vLLM and SGLang over a month and a half ago.
It's clickbait keyword slop filled in with V3 specs. Most of the web is slop like this now. Sigh.
36. oceanplexian|4d ago|context

All these numbers are peanuts to a mid-sized company. A place I worked at used to spend a couple million just for a support contract on a NetApp.
10 years from now that hardware will be on eBay for any geek with a couple thousand dollars and enough power to run it.
37. onchainintel|4d ago|context

Completely agree, not suggesting it needs to, just genuinely curious. Love that it can be run locally though. Open source LLMs punching back pretty hard against proprietary ones in the cloud lately in terms of performance.
38. libraryofbabel|4d ago|context

> you can download it, run it on your systems
In theory, sure, but as others have pointed out you need to spend half a million on GPUs just to get enough VRAM to fit a single instance of the model. And you’d better make sure your use case makes full 24/7 use of all that rapidly-depreciating hardware you just spent all your money on, otherwise your actual cost per token will be much higher than you think.
In practice you will get better value from just buying tokens from a third party whose business is hosting open weight models as efficiently as possible and who make full use of their hardware. Even with the small margin they charge on top you will still come out ahead.
39. hsbauauvhabzb|4d ago|context

Sure, but that’s an incredibly short term viewpoint.
40. oceanplexian|4d ago|context

There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.
And that GPU wouldn’t run one instance, the models are highly parallelizable. It would likely support 10-15 users at once, if a company oversubscribed 10:1 that GPU supports ~100 seats. Amortized over a couple years the costs are competitive.
41. libraryofbabel|4d ago|context

> There are a lot of companies who would gladly drop half a million on a GPU to have private inference that Anthropic or OpenAI can’t use to steal their data.
Obviously, and certainly companies do run their own models because they place some value on data sovereignty for regulatory or compliance or other reasons. (Although the framing that Anthropic or OpenAI might "steal their data" is a bit alarmist - plenty of companies, including some with _highly_ sensitive data, have contracts with Anthropic or OpenAI that say they can't train future models on the data they send them and are perfectly happy to send data to Claude. You may think they're stupid to do that, but that's just your opinion.)
> the models are highly parallelizable. It would likely support 10-15 users at once.
Yes, I know that; I understand LLM internals pretty well. One instance of the model in the sense of one set of weights loaded across X number of GPUs; of course you can then run batch inference on those weights, up to the limits of GPU bandwidth and compute.
But are those 100 users you have on your own GPUs using the GPUs evenly across the 24 hours of the day, or are they only using them during 9-5 in some timezone? If so, you're leaving your expensive hardware idle for 2/3 of the day and the third party providers hosting open weight models will still beat you on costs, even without getting into other factors like they bought their GPUs cheaper than you did. Do the math if you don't believe me.
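The "do the math" argument can be sketched in a few lines. The numbers below are placeholders: the $500k figure and the 10:1 oversubscription come from this subthread, while the two-year amortization window is an assumption. Power, staffing, and resale value are left out.

```python
HOURS_PER_YEAR = 24 * 365


def cost_per_busy_hour(hardware_cost_usd: float, years: float, utilization: float) -> float:
    """Amortized hardware cost per hour in which the GPUs are actually in use."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hardware_cost_usd / (years * HOURS_PER_YEAR * utilization)


def seats(concurrent_users: int, oversubscription: int) -> int:
    """Seats supported when only 1 in `oversubscription` users is active at once."""
    return concurrent_users * oversubscription


around_the_clock = cost_per_busy_hour(500_000, years=2, utilization=1.0)
office_hours = cost_per_busy_hour(500_000, years=2, utilization=1 / 3)

print(f"24/7 use:        ${around_the_clock:,.2f} per busy hour")
print(f"8 hours per day: ${office_hours:,.2f} per busy hour")
print(f"Seats at 10 concurrent users, 10:1 oversubscription: {seats(10, 10)}")
```

Leaving the hardware idle for two thirds of the day triples the effective hourly cost, which is the gap a provider with round-the-clock load can undercut.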
42. dannyw|4d ago|context

There's stuff like SOC controls and enterprise contracts with enforceable penalties if clauses are breached. ZDR is a thing.
The most significant value of open source models comes from being able to fine-tune; with a good dataset and limited scope, a finetune can be crazily worth it.
43. rvz|4d ago|context

It is more than good enough and has effectively caught up with Opus 4.6 and GPT 5.4 according to the benchmarks.
It's about 2 months behind GPT 5.5 and Opus 4.7.
As long as it is cheap to run for the hosting providers and it is frontier level, it is a very competitive model and impressive against the others. I give it 2 years maximum for consumer hardware to run models that are 500B - 800B quantized on their machines.
It should be obvious now why Anthropic really doesn't want you to run local models on your machine.
44. colordrops|4d ago|context

What's going to change in 2 years that would allow users to run 500B-800B parameter models on consumer hardware?
45. DiscourseFan|4d ago|context

I think it's just an estimate
46. indigodaddy|4d ago|context

But the question remains
47. snovv_crash|4d ago|context

With the ability of the Qwen3.6 27B, I think in 2 years consumers will be running models of this capability on current hardware.
48. deaux|4d ago|context

Vibes > Benchmarks. And it's all so task-specific. Gemini 3 has scored very well in benchmarks for very long but is poor at agentic use cases. A lot of people prefer Opus 4.6 to 4.7 for coding despite benchmarks, much more than I've seen before (4.5->4.6, 4->4.5).
Doesn't mean Deepseek v4 isn't great, just benchmarks alone aren't enough to tell.
49. spaceman_2020|4d ago|context

Tbh I was more productive with 4.6 than ever before and if AI progress locks in permanently at 4.6 tier, I’d be pretty happy
50. doctoboggan|4d ago|context

Is it honestly better than Opus 4.6 or just benchmaxxed? Have you done any coding with an agent harness using it?
If its coding abilities are better than Claude Code with Opus 4.6 then I will definitely be switching to this model.
51. madagang|4d ago|context

Their Chinese announcement says that, based on internal employee testing, it is not as good as Opus 4.6 Thinking, but is slightly better than Opus 4.6 without Thinking enabled.
52. mchusma|4d ago|context

I appreciate this, makes me trust it more than benchmarks.
53. deaux|4d ago|context

That's super interesting, isn't Deepseek in China banned from using Anthropic models? Yet here they're comparing it in terms of internal employee testing.
54. renticulous|4d ago|context

They use a VPN for access. Even Google DeepMind uses Anthropic. There was a fight within Google as to why only DeepMind is allowed to use Claude while the rest of Google can't.
55. computably|4d ago|context

> That's super interesting, isn't Deepseek in China banned from using Anthropic models? Yet here they're comparing it in terms of internal employee testing.
I don't see why Deepseek would care to respect Anthropic's ToS, even if just to pretend. It's not like Anthropic could file and win a lawsuit in China, nor would the US likely ban Deepseek. And even if the US gov would've considered it, Anthropic is on their shitlist.
56. ibic|4d ago|context

In case people wonder where the announcement is (you can easily translate it via browser if you don't read Chinese): https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg
It's still a "preview" version atm.
57. anentropic|4d ago|context

Who uses Opus without thinking though...?
58. bokkies|4d ago|context

Apparently glm5.1 and the latest qwen coder are as good as opus 4.6 on benchmarks. So I tried both seriously for a week (glm Pro using CC, and qwen using qwen companion). Thought I could save $80 a month. Unfortunately after 2 days I had switched back to Max. The problems were the speed (slower on both, although qwen is much faster), the errors (stupid layout mistakes, inserting 2 footers then refusing to remove one, not seeing obvious problems in screenshots & major f-ups of functionality), not being able to view URLs properly, etc. I'll give deepseek a go but I suspect it will be similar. The model is only half the story. Also been testing gpt5.4 with codex and it is very nearly as good as CC... better on long running tasks running in background. Not keen on ChatGPT codex 'personality' so will stick to CC for the most part.
59. NitpickLawyer|4d ago|context

> (better than Opus 4.6)
There we go again :) It seems we have a release each day claiming that. What's weird is that even deepseek doesn't claim it's better than opus w/ thinking. No idea why you'd say that but anyway.
Dsv3 was a good model. Not benchmaxxed at all, it was pretty stable where it was. Did well on tasks that were ood for benchmarks, even if it was behind SotA.
This seems to be similar. Behind SotA, but not by much, and at a much lower price. The big one is being served (by ds themselves now, more providers will come and we'll see the median price) at 1.74$ in / 3.48$ out / 0.14$ cache. Really cheap for what it offers.
The small one is at 0.14$ in / 0.28$ out / 0.028$ cache, which is pretty much "too cheap to matter". This will be what people can run realistically "at home", and should be a contender for things like haiku/gemini-flash, if it can deliver at those levels.
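To make those prices concrete, here is a rough cost sketch. It assumes the quoted prices are USD per million tokens and that "cache" means the rate for cached input tokens; neither is spelled out in the comment. The token counts in the example are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per one million tokens (assumed unit)."""
    input_usd: float
    output_usd: float
    cached_input_usd: float


# Figures as quoted in the comment above.
PRO = Pricing(input_usd=1.74, output_usd=3.48, cached_input_usd=0.14)
FLASH = Pricing(input_usd=0.14, output_usd=0.28, cached_input_usd=0.028)


def request_cost(p: Pricing, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost of one request; `cached_tokens` is the part of the input billed at the cache rate."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * p.input_usd
        + cached_tokens * p.cached_input_usd
        + output_tokens * p.output_usd
    )
    return total / 1_000_000


# Hypothetical agent turn: 60k-token prompt (50k of it cached), 2k tokens out.
for name, pricing in (("big", PRO), ("small", FLASH)):
    print(f"{name}: ${request_cost(pricing, 60_000, 2_000, cached_tokens=50_000):.4f}")
```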
60. slopinthebag|4d ago|context

Anthropic fans would claim God itself is behind Opus by 3-6 months and then willingly be abused by Boris and one of his gaslighting tweets.
LMAO
61. NitpickLawyer|4d ago|context

> Anthropic fans ...
I have no idea why you'd think that, but this is straight from their announcement here (https://mp.weixin.qq.com/s/8bxXqS2R8Fx5-1TLDBiEDg):
> According to evaluation feedback, its user experience is better than Sonnet 4.5, and its delivery quality is close to Opus 4.6's non-thinking mode, but there is still a certain gap compared to Opus 4.6's thinking mode.
This is the model creators saying it, not me.
62. 0xbadcafebee|4d ago|context

I don't think we need to compare models to Opus anymore. Opus users don't care about other models, as they're convinced Opus will be better forever. And non-Opus users don't want the expense, lock-in or limits.
As a non-Opus user, I'll continue to use the cheapest fastest models that get my job done, which (for me anyway) is still MiniMax M2.5. I occasionally try a newer, more expensive model, and I get the same results. I have a feeling we might all be getting swindled by the whole AI industry with benchmarks that just make it look like everything's improving.
63. kmarc|4d ago|context

This resonates with me a lot.
I do some stuff with gemini flash and Aider, but mostly because I want to avoid locking myself into a walled garden of models, UIs and company
64. post-it|4d ago|context

What do you run these on? I've gotten comfortable with Claude but if folks are getting Opus performance for cheaper I'll switch.
65. slopinthebag|4d ago|context

Try Charm Crush first, it's a native binary. If it's unbearable, try opencode, just with the knowledge your system will probably be pwned soon since it's JS + NPM + vibe coding + some of the most insufferable devs in the industry behind that product.
If you're feeling frisky, Zed has a decent agent harness and a very good editor.
66. post-it|4d ago|context

I've downloaded Zed but haven't used it much, maybe this is my sign. Thanks!
67. oceanplexian|4d ago|context

You can just use Claude Code with a few env vars, most of these providers offer an Anthropic compatible API
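As a sketch of what "a few env vars" might look like: to the best of my knowledge Claude Code reads `ANTHROPIC_BASE_URL` and `ANTHROPIC_AUTH_TOKEN` from the environment. The endpoint URL and the `PROVIDER_API_KEY` variable name below are placeholders, so check your provider's docs for the real values.

```python
import os
from typing import Dict, Mapping


def claude_code_env(base_url: str, api_key: str, base: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return a copy of `base` with Claude Code pointed at another provider."""
    env = dict(base)
    env["ANTHROPIC_BASE_URL"] = base_url    # provider's Anthropic-compatible endpoint
    env["ANTHROPIC_AUTH_TOKEN"] = api_key   # provider's API key
    return env


env = claude_code_env(
    base_url="https://provider.example/anthropic",   # placeholder, not a real endpoint
    api_key=os.environ.get("PROVIDER_API_KEY", ""),  # hypothetical variable name
)
print("Claude Code would talk to:", env["ANTHROPIC_BASE_URL"])

# To actually launch the CLI with these settings (assuming `claude` is on PATH):
#   import subprocess
#   subprocess.run(["claude"], env=env)
```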
68. ind-igo|4d ago|context

Agree with your assessment, I think after models reached around Opus 4.5 level, it's been almost indistinguishable for most tasks. Intelligence has been commoditized, what's important now is the workflows, prompting, and context management. And that is unique to each model.
69. wuschel|4d ago|context

This is not true for some cases, e.g. there are stark differences in the correctness of answers in certain types of casework.
70. vidarh|4d ago|context

Same for me. There are tasks when I want the smartest model. But for a whole lot of tasks I now default to Sonnet, or go with cheaper models like GLM, Kimi, Qwen. DeepSeek hasn't been in the mix for a while because their previous model had started lagging, but will definitely test this one again.
The tricky part is that the "number of tokens to good result" does absolutely vary, and you need a decent harness to make it work without too much manual intervention, so figuring out which model is most cost-effective for which tasks is becoming increasingly hard, but several are cost-effective enough.
71. szundi|4d ago|context

I don’t know what people are doing but Minimax produced 16 bug reports, of which 15 were false positives (literally a mistake).
In contrast ChatGPT 5.3 and also Opus have at least a 90% rate on this same project. (Embedded)
All other tests were the same. What are you doing with these models?
72. versteegen|4d ago|context

Which model's best depends on how you use it. There's a huge difference in behaviour between Claude and GPT and other models which makes some poor substitutes for others in certain use cases. I think the GPT models are a bad substitute for Claude ones for tasks such as pair-programming (where you want to see the CoT and have immediate responses) and writing code that you actually want to read and edit yourself, as opposed to just letting GPT run in the background to produce working code that you won't inspect. Yes, GPT 5.4 is cheap and brilliant but very black-box and often very slow IME. GPT-5.4 still seems to behave the same as 5.1, which includes problems like: doesn't show useful thoughts, can think for half an hour, says "Preparing the patch now" then thinks for another 20 min, gives no impression of what it's doing, reads microscopic parts of source files and misses context, will do anything to pass the tests including patching libraries...
73. sandGorgon|4d ago|context

Actually this is not the reason - the harness is significantly better. There is no comparable harness to Claude Code with skills, etc.
Opencode was getting there, but it seems the founders lost interest. Pi could be it, but it's very focused on OpenClaw. Even Codex cli doesn't have all of it.
Which harness works well with Deepseek v4?
74. darkwater|4d ago|context

What's the issue with OC? I tried it a bit over 2 months ago, when I was still on Claude API, and I actually liked it more than CC (i.e. the right sidebar with the plan and a tendency to ask fewer "security" questions than CC). Why is it so bad nowadays?
75. avereveard|4d ago|context

eh idk. until yesterday opus was the one that got spatial reasoning right (had to do some head pose stuff, neither glm 5.1 nor codex 5.3 could "get" it) and codex 5.3 was my champion at making UX work.
So while I agree mixed model is the way to go, opus is still my workhorse.
76. gunalx|4d ago|context

I find gemini pretty good at spatial reasoning.
77. avereveard|4d ago|context

Yeah but gemini has a hard time discussing solutions; it just jumps to implementation, which is great if it gets it right and not so great if it goes down the wrong path.
Not saying it is better or worse, but the way I personally prefer is to design in chat, to make sure all unknown unknowns are addressed
78. sandos|4d ago|context

Is Opus nerfed somehow in Copilot? I've tried it numerous times, it has never really wowed me. They seem to have awfully small context windows, but still. It's mostly their reasoning which has been off
Codex is just so much better, or the general GPT models.
79. specproc|4d ago|context

Opus just got killed in Copilot. I always found it great, FWIW.
https://github.blog/news-insights/company-news/changes-to-gi...
80. spaceman_2020|4d ago|context

I found Opus 4.7 to be actually worse than Opus 4.6 for my use case
Substantially worse at following instructions and overoptimized for maximizing token usage
81. bbor|4d ago|context

For the curious, I did some napkin math on their posted benchmarks and it racks up 20.1 percentage point difference across the 20 metrics where both were scored, for an average improvement of about 2% (non-pp). I really can't decide if that's mind blowing or boring?
Claude4.6 was almost 10pp better at answering questions from long contexts ("corpuses" in CorpusQA and "multiround conversations" in MRCR), while DSv4 was a staggering 14pp better at one math challenge (IMOAnswerBench) and 12pp better at basic Q&A (SimpleQA-Verified).
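For anyone wanting to redo the napkin math: a summed gap of 20.1 percentage points over 20 metrics averages out to about 1 pp per metric. Turning that into a relative (non-pp) improvement needs the baseline scores, which are not given here, so the 50% baseline below is purely hypothetical.

```python
def mean_gap_pp(total_gap_pp: float, num_metrics: int) -> float:
    """Average per-metric gap, in percentage points."""
    if num_metrics <= 0:
        raise ValueError("num_metrics must be positive")
    return total_gap_pp / num_metrics


def relative_improvement_pct(baseline_pct: float, gap_pp: float) -> float:
    """Relative improvement (in percent) of a gap measured against a baseline score."""
    if baseline_pct <= 0:
        raise ValueError("baseline_pct must be positive")
    return gap_pp / baseline_pct * 100


gap = mean_gap_pp(20.1, 20)  # figures from the comment above
print(f"Average gap: {gap:.3f} pp per metric")

HYPOTHETICAL_BASELINE = 50.0  # made-up average score, for illustration only
print(f"Relative to a {HYPOTHETICAL_BASELINE:.0f}% baseline: "
      f"{relative_improvement_pct(HYPOTHETICAL_BASELINE, gap):.2f}%")
```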
82. Quasimarion|4d ago|context

FWIW it's also like 10x cheaper.
83. creamyhorror|4d ago|context

No, the Deepseek V4 paper itself says that DS-V4-Pro-Max is close to Opus 4.5 in their staff evaluations, not better than 4.6:
> In our internal evaluation, DeepSeek-V4-Pro-Max outperforms Claude Sonnet 4.5 and approaches the level of Opus 4.5.
84. taosx|4d ago|context

Merge? https://news.ycombinator.com/item?id=47885014
85. gbnwl|4d ago|context

I’m deeply interested and invested in the field but I could really use a support group for people burnt out from trying to keep up with everything. I feel like we’ve already long since passed the point where we need AI to help us keep up with advancements in AI.
86. wordpad|4d ago|context

The players barely ever change. People don't have problems following sports, you shouldn't struggle so much with this once you accept top spot changes.
87. ehnto|4d ago|context

It is funny seeing people ping pong between Anthropic and ChatGPT, with similar rhetoric in both directions.
At this point I would just pick the one whose "ethics" and user experience you prefer. The difference in performance between these releases has had no impact on the meaningful work one can do with them, unless perhaps they are on the fringes in some domain.
Personally I am trying out the open models cloud hosted, since I am not interested in being rug pulled by the big two providers. They have come a long way, and for all the work I actually trust to an LLM they seem to be sufficient.
88. DiscourseFan|4d ago|context

I find ChatGPT annoying mostly
89. awakeasleep|4d ago|context

Open settings > personalization. Set it to efficient base style. Turn off enthusiasm and warmth. You’re welcome
90. 2ndorderthought|4d ago|context

Yea but even then it's still annoying. "It's not about the enthusiasm and warmth but the general tone"
91. layer8|4d ago|context

Setting “base style and tone” to “efficient” works fine for me.
92. dannyw|4d ago|context

Their financial projections, which their valuation and investor story are in large part built on, involve actually making money, and lots of money, at some point. That money has to come from somewhere.
93. gbnwl|4d ago|context

I didn't express this well but my interest isn't "who is in the top spot", and is more _why_ and _how_ various labs get the results they do. This is also magnified by the fact that I'm not only interested in hosted providers of inference but local models as well. What's your take on the best model to run for coding on 24GB of VRAM locally after the last few weeks of releases? Which harness do you prefer? What quants do you think are best? To use your sports metaphor it's more than following the national leagues but also following college and even high school leagues as well. And the real interest isn't even who's doing well but WHY, at each level.
94. renticulous|4d ago|context

Follow the AI newsletters. They bundle the news along with their Op-Ed and summarize it better.
95. anonymousDan|4d ago|context

Can you suggest some good ones?
96. yorwba|4d ago|context

https://jack-clark.net/
97. ayewo|3d ago|context

Thanks for this!
Link to direct newsletter subscription: https://importai.substack.com/
98. namnnumbr|4d ago|context

I really like latent.space and simonwillison.com.
Also (shameless self-promo) I publish a 2x weekly blog just to force myself to keep up: https://aimlbling-about.ninerealmlabs.com/treadmill/
99. stef25|4d ago|context

Tips on what newsletters are worth signing up for?
100. yorwba|4d ago|context

The technical report discussing the why and how is here: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main...
101. trueno|4d ago|context

holy shit im right there with you
102. satvikpendem|4d ago|context

Don't keep up. Much like with news, you'll know when you need to know, because someone else will tell you first.
103. vessenes|4d ago|context

This is only good advice if you don’t have the need to understand what’s happening on the edge of the frontier. If you do, then you’ll lose on compounding the knowledge from staying engaged with the major developments.
104. satvikpendem|4d ago|context

Not all developments are equal. Many are experimental branches of testing things out that usually get merged back into the core, so to speak. For example, I knew someone who was full into building their own harness and implementing the Ralph loop and various other things, spending a lot of time on it and now, guess what? All of that is in Claude Code or another harness and I didn't have to spend any amount of time on it because ultimately they're implementation details.
It's like ricing your Linux distro, sure it's fun to spend that time but don't make the mistake of thinking it's productive, it's just another form of procrastination (or perhaps a hobby to put it more charitably).
105. vessenes|2d ago|context

I agree that a full linux distro compile as a matter of practice is a waste of time. But, doing it a few times is good if you want to understand your tools.
I don’t believe that top tier engineers just skip learning things because they might turn out to be dead-ends or incorporated into tools by someone else; in my experience they tend to be extremely interested in things that seem like minutiae to others when working on the bleeding edge, often implementing their own systems just to more fully understand the problem space.
If it’s a day job for someone and they are not ambitious, fine. But we are at hacker news. I would bet 99%+ of top tier software talent could tell you practical experience with ralph loops this year, or a homegrown variety, simply because they are an attempt to solve a very real engineering problem (early exit, shitty code/incorrect responses, poor context window length and capacity), and top tier software people expect more control of their engineering environment, and success using their tools than they’d get by just saying ‘meh, whatever, I don’t get this and I’ll just wait it out.’
106. roughly|4d ago|context

This one’s been particularly hard to sit out because the executive and managerial class are absolutely mainlining this stuff and pushing it hard on the rest of the organization, and so whether or not I want to keep up, I need to, because my job is to actually make stuff work and this stuff is a borderline existential risk to the quality of the systems I’m responsible for and rely on.
107. hnfong|3d ago|context

Thus, in the situation you described, "someone else will tell you first" is your boss.
108. vrganj|4d ago|context

It honestly has all kinda felt like more of the same ever since maybe GPT4?
New model comes out, has some nice benchmarks, but the subjective experience of actually using it stays the same. Nothing's really blown my mind since.
Feels like the field has stagnated to a point where only the enthusiasts care.
109. ifwinterco|4d ago|context

For coding Opus 4.5 in q3 2025 was still the best model I've used.
Since then it's just been a cycle of the old model being progressively lobotomised and a "new" one coming out that if you're lucky might be as good as the OG Opus 4.5 for a couple of weeks.
Subjective but as far as I can tell no progress in almost a year, which is a lifetime in 2022-25 LLM timelines
110. dannyw|4d ago|context

Another annoyance (for more API use) is summarized/hidden reasoning traces. It makes prompt debugging and optimization much harder, since you literally don't have much visibility into the real thinking process.
111. _air|3d ago|context

Opus 4.5 was released on Nov 24 last year. It’s only been 5 months!
112. ifwinterco|3d ago|context

Wow you're right, okay not so bad then.
That brief two week period when Opus could eat entire tickets was simultaneously fantastic and a bit alarming
113. hnfong|3d ago|context

I don't trust the benchmarks either, so I maintained a set of benchmarks myself. I'm mostly interested in local models, and for the past 2 years they have steadily gotten better.
Can't argue with subjective experience, but if there were some tasks that you thought LLMs can't do two years ago, maybe try again today. You might be surprised.
114. dnnddidiej|4d ago|context

https://commoncog.com/how-to-make-sense-of-ai/
115. notatoad|3d ago|context

I’m very satisfied with being three months behind everything in AI. That’s a level that’s useful, the overhyped nonsense gets found out before I need to care, and it’s easy enough to keep up with.
116. jdeng|4d ago|context

Excited that the long awaited v4 is finally out. But feel sad that it's not multimodal native.
117. Alifatisk|4d ago|context

Was that expected?
118. fblp|4d ago|context

There's something heartwarming about the developer docs being released before the flashy press release.
119. onchainintel|4d ago|context

Insert obligatory "this is the way" Mando scene. Indeed!
120. necovek|4d ago|context

Where's the training data and training scripts since you are calling this open source?
Edit: it seems "open source" was edited out of the parent comment.