An update on recent Claude Code quality reports (anthropic.com)

941 points|by mfiguiere|5d ago|731 comments|Read full story on anthropic.com

Comments (731)

120 shown|More comments

1. jryio|5d ago|context

1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)
2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
3. System prompt to make Claude less verbose reducing coding quality (4 days - better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
2. sroussey|5d ago|context

None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.
3. qingcharles|5d ago|context

Sure, but it gives the impression of degraded model performance. Especially when the interface is still saying the model is operating on "high", the same as it did yesterday, yet it is in "medium" -- it just looks like the model got hobbled.
4. sroussey|5d ago|context

Oh, absolutely. Though changes in how the model is used is imminently more fixable than the model itself.
5. johnmaguire|5d ago|context

Yes, but for many users, CC is the product. Especially since I'm not allowed(?) to use my own harness with my sub.
6. Philpax|5d ago|context

> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
7. jryio|5d ago|context

Model performance at inference in a data center v.s. stripping thinking tokens are effectively the same.
Sure they didn't change the GPUs their running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded.
In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.
8. sroussey|5d ago|context

I thought these days thinking tokens sent my the model (as opposed to used internally) were just for the users benefit. When you send the convo back you have to strip the thinking stuff for next turn. Or is that just local models?
9. aszen|5d ago|context

Claude code is not infra, the model is the infra. They changed settings to make their models faster and probably cheaper to run too. Honestly with adaptive thinking it no longer matters what model it is if you can dynamically make it do less or more work.
10. Eridrus|5d ago|context

To be fair to Anthropic, they did not intentionally degrade performance.
To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
11. shrx|5d ago|context

Are you saying dropping cache after 1 hour is not intentionally degrading performance?
12. Eridrus|5d ago|context

Yes. Caching is a cost optimization not a response quality metric.
13. shrx|4d ago|context

But it still degrades performance.
14. Eridrus|4d ago|context

It's unfortunate that the word performance is overloaded and ML folks have a specific definition..that isn't what the rest of CS uses, but I understand Anthropic to mean response quality when they say this and not any other dimension you could measure performance on.
You can argue they're lying, but I think this is just folks misunderstanding what Anthropic is saying.
15. fydorm|4d ago|context

They didn't just drop cache. They elided thinking blocks even if you recache. That permanently degraded the model output for the rest of the session, even ignoring the bug, if you waited 60 minutes instead of 59.
16. fn-mote|5d ago|context

> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!
Seems like a very basic software engineering error that would be caught by normal unit testing.
17. bearjaws|5d ago|context

The issue making Claude just not do any work was infuriating to say the least. I already ran at medium thinking level so was never impacted, but having to constantly go "okay now do X like you said" was annoying.
Again goes back to the "intern" analogy people like to make.
18. Robdel12|5d ago|context

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
19. mannanj|5d ago|context

so who do you trust and go to? (NotClearlySo)OpenAI?
20. simlevesque|5d ago|context

I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.
I'm using Zed and Claude Code as my harnesses.
21. Robdel12|5d ago|context

At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
22. IncreasePosts|5d ago|context

Is Gemini cli not an agentic model? Or are you just saying it's built poorly? Gemini 2.5 didn't really work for me but Gemini 3 seems fairly solid
23. cmrdporcupine|5d ago|context

Gemini fairs poorly at tool use, even in its own CLI and even in Antigravity. It gets into a mess just editing source files, it's tragic because it's actually not a bad model otherwise.
24. rjh29|4d ago|context

It frequently fails to apply its diffs at first but it always succeeds eventually for me. I'm happy with it. I understand it is slower than other models but it also costs barely anything per month.
25. bensyverson|5d ago|context

Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.
26. carlgreene|5d ago|context

I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
27. snissn|5d ago|context

it's been frustrating how bad it is at UI. I'm starting to test out using their image2 for UI and then handing it to codex to build out the images into code and I'm impressed and relieved so far
28. GenerWork|5d ago|context

Anthropic definitely takes the cake when it comes to UI related activities (pulling in and properly applying Figma elements, understanding UI related prompts and properly executing on it, etc), and I say this as a designer with a personal Codex subscription.
29. cmrdporcupine|5d ago|context

Codex isn't great at UI, but you might find Gemini is competent enough as an adjunct. I've had some luck with that.
30. cageface|5d ago|context

Codex does better if you ask it to take screenshots and critique its own UI work and iterate. It rarely one-shots something I like but it can get there in steps.
31. irthomasthomas|5d ago|context

I like chutes because they always use the full weights, and prompts are encrypted with TEE.
32. parliament32|5d ago|context

Self-hosted models are the one true path.
33. saghm|5d ago|context

The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.
34. lifthrasiir|5d ago|context

The best possible situation that I can imagine is that Anthropic just wanted to measure how much value does Claude Code have for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.
35. polishdude20|5d ago|context

Bruce here from the Twitter team.
I got finally fired.
36. xpe|4d ago|context
People come at this with all kinds of life experience. The above notion of trust to me is quaint and simplistic. I suggest another way to frame trust as a more open ended question:
```
    To what degree do I predict another person/org will give me what I need and why?
```
This shifts "trust" away from all or nothing and it gets me thinking about things like "what are the moving parts?" and "what are the incentives" and "what is my plan B?".
In my life experience, looking back, when I've found myself swinging from "high trust" to "low trust" the change was usually rooted in my expectations; it was usually rooted in me having a naive understanding of the world that was rudely shattered.
Will you force trust to be a bit? Or can you admit a probability distribution? Bits (true/false or yes/no or trust/don't trust) thrash wildly. Bayesians update incrementally: this is (a) more pleasant; (b) more correct; (c) more curious; (d) easier to compare notes with others.
37. Alifatisk|5d ago|context

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay high price for their service and receive lower quality than expected.
38. mlinsey|5d ago|context

The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
39. ed_elliott_asc|5d ago|context

I pay for 20x max and get so much more value out of it than I pay.
40. rvz|4d ago|context

This is what we call "Stockholm syndrome"
41. lukasus|5d ago|context

At the time you wrote your comment there were 4 other comments and all of them very negative towards the Anthropic and the blog post in question here. How did you get this conclusions?
42. lukan|5d ago|context

Confused as well, I rather supposed Antrophic had some standing for saying no to Trump and being declared national security threat, but the anger they got and people leaving to OpenAI again, who gladly said yes to autonomous killing AI did astonish me a bit. And I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me - and I only pay for the pro plan.
43. sunaookami|5d ago|context

>I rather supposed Antrophic had some standing for saying no to Trump and being declared national security threat
I never understood why people cheered for Anthropic then when they happily work together with Palantir.
44. unselect5917|5d ago|context

HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.
45. tempest_|5d ago|context

A lot of people are provided their access through work.
They don't actually pay the bill or see it.
46. fastball|5d ago|context

What high price? I pay $200/m for an insane number of tokens.
47. Avicebron|5d ago|context

It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business.
But if a tool is better, it's better.
48. wahnfrieden|5d ago|context

You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness
49. OsrsNeedsf2P|5d ago|context

Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released
50. jgbuddy|5d ago|context

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.
51. AntiUSAbah|5d ago|context

Because it is still good though.
If you have a good product, you are more understanding. And getting worse doesn't mean its no longer valuable, only that the price/value factor went down. But Opus 4.5 was relevant better and only came out in November.
There was no price increase at that time so for the same money we get better models. Opus 4.6 again feels relevant better though.
Also moving fastish means having more/better models faster.
I do know plenty of people though which do use opencode or pi and openrouter and switching models a lot more often.
52. saghm|5d ago|context

At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.
53. scblock|5d ago|context

A valid choice, a moral choice, is none of the above.
54. scottyah|5d ago|context

It's fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution and their competitors are known to have multiple straight up evil tendencies. A little degradation is not an issue.
55. mystraline|5d ago|context

Exactly. They've done now like 6 rug-pulls.
Idiots keep throwing money at real-time enshittification and 'I am changing the terms. Pray I do not change them further".
And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' as idiots.
And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?
56. oytis|5d ago|context

Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here
57. arnvald|5d ago|context

What's the alternative? Are you suggesting other LLM providers don't charge high price? Or that they don't make mistakes? Or that they provide better quality?
We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
58. operatingthetan|5d ago|context

I don't think Anthropic has to inform their customers of every change they make, but they should have with this one.
59. timmg|5d ago|context

> It’s incredible how forgiving you guys are with Anthropic and their errors.
Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
60. foota|5d ago|context

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.
Claude caveman in the system prompt confirmed?
61. awesome_dude|5d ago|context

I've recently been introduced to that plugin, love it for humour
62. WhitneyLand|5d ago|context

Did they not address how adaptive thinking has played in to all of this?
63. teaearlgraycold|5d ago|context

> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.
Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
64. manmal|5d ago|context

I think that would also have busted cache all the time, and uncached requests consume usage limits rapidly.
65. nrki|5d ago|context

> we refunded all affected customers
Notably missing from the postmortem
66. chermi|5d ago|context

It's really hard to understand. There needs to be really loud batman sign in the sky type signals from some hero third party calling out objective product degradation. Do they use cc internally? If so do they use a different version? This should've been almost as loud a break as service just going down altogether, yet it took 2 weeks to fix?!
67. poly2it|5d ago|context

> ... we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features) ...
Apparently they are using another version internally.
68. ayhanfuat|5d ago|context

Reading the "Going forward" section I see that they have zero understanding of the main complaints.
69. Kiro|5d ago|context

How so?
70. ayhanfuat|5d ago|context

They feel they're in a position to make important trade-off decisions on behalf of the user. "It's just slightly worse, I'll sneak this change in" is not something to be tolerated, whether it actually turns out to be much worse or not. Their adaptive thinking mess has caused a ton of work for me. I know a lot of people are saying Codex is actually better now. I don't agree but I'm switching to it because it's much more reliable.
71. operatingthetan|5d ago|context

I agree, but these LLM products are all black-boxes so we need to demand more accountability from them.
72. dainiusse|5d ago|context

Corporate bs begins...
73. xlayn|5d ago|context

If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price. The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time. I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.
The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwith... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
74. Keeeeeeeks|5d ago|context

https://marginlab.ai/ (no affiliation)
There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt
Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
75. hex4def6|5d ago|context

I think you could alter the prompt in subtle ways; a period goes to an ellipses, extra commas, synonyms, occasional double-spaces, etc.
Enough that the prompt is different at a token-level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day, and you'd get some statistical significance.
The guess the fuzzy part is judging the output.
76. dgellow|5d ago|context

I would love if agents would act way more like tools/machines and NOT try to act as if they were humans
77. joshstrange|5d ago|context

It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a Planing session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode is when it gives up/gets stuck more often.
Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end with telling I should do it and it didn't want to run commands out on a "Shared Dev server" (which I had specifically told it that this was not a shared server).
The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
78. marcyb5st|5d ago|context

Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".
If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or 100$ per 1M pitput tokens (playing numberWang here, but the point stands).
If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:
1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
2. Dumb the models down (basically decreasing their cost per token)
3. Send less tokens (ie capping thinking budgets aggressively).
2 and 3 are palatable because, even if they annoying the technical crowd, investors still see a big number of active users with a positive margin for each.
79. CamperBob2|5d ago|context

$1000/mo for guaranteed functionality >= Opus 4.6 at its peak? Yes, I'd probably grumble a bit and then whip out the credit card.
I'm not a heavy LLM user, and I've never come anywhere the $200/month plan limits I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.
Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.
80. marcyb5st|2d ago|context

You will, but many many many others won't do it, probably. I mean, in some parts of the world 200$ is already a big chunk of their monthly income and a price hike will definitely push them away, which is bad for the upcoming (potential) IPO.
81. JyB|5d ago|context

This specifically is super annoying.
82. everdrive|5d ago|context
I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.
```
   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."
```
However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.
83. LatencyKills|5d ago|context

I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.
84. DANmode|5d ago|context

I’d ask for a credit, for that, personally.
85. someguyiguess|5d ago|context

I asked for a credit but they said they didn’t think the credit was necessary
86. jwpapi|5d ago|context

You can deterministically force a bash script as a hook.
87. LatencyKills|5d ago|context

That is exactly what I do. The bash script runs, determines that a code file was changed, and then is supposed to prevent Claude from stopping until the tests are run.
Claude is periodically refusing to run those tests. That never happened prior to 4.7.
88. jwpapi|5d ago|context

That’s crazy, you mind sharing the gist for that part? Ideally with some examples.
This would be a new level of troublesome/ruthless (insert correct english word here)
89. nikanj|4d ago|context

Every day Claude resembles human programmers more and more
90. dawnerd|5d ago|context

I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.
91. OtomotO|5d ago|context

This, so much this!
Pay by token(s) while token usage is totally intransparent is a super convenient money printing machinery.
92. y1n0|5d ago|context

None of these companies have compute to spare. It’s not in their interest to use more tokens that necessary.
93. boringg|5d ago|context

Not true - they absolutely want to goose demand as they continue to burn investor dollars and deploy infra at scale.
If that demand evens slows down in the slightest the whole bubble collapses.
Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
94. dawnerd|5d ago|context

That doesn’t mean they also can’t be wasteful. Fact is, Claude and gpt have way too much internal thinking about their system prompts than is needed. Every step they mention something around making sure they do xyz and not doing whatever. Why does it need to say things to itself like “great I have a plan now!” - that’s pure waste.
95. empthought|5d ago|context

> Why does it need to say things to itself like “great I have a plan now!”
How else would it know whether it has a plan now?
96. malfist|5d ago|context

Are you saying these companies don't want to sell more product to us? Because that's the logical extension of your argument.
97. keeda|5d ago|context

No, the argument is they want to sell more product to more people, not just more product (to the same people.) Given that a lot of their income is from flat-rate subscriptions, they make money with more people burning tokens rather than just burning more tokens.
After all, "the first hit's free" model doesn't apply to repeat customers ;-)
98. parliament32|5d ago|context

Sure it is. They're well aware their product is a money furnace and they'd have to charge users a few orders of magnitude more just to break even, which is obviously not an option. So all that's left is.. convince users to burn tokens harder, so graphs go up, so they can bamboozle more investors into keeping the ship afloat for a bit longer.
99. WarmWash|5d ago|context

It's an option and they are going to do it. Chinese models will be banned and the labs will happily go dollar for dollar in plan price increases. $20 plans won't go away, but usage limits and model access will drive people to $40-$60-$80 plans.
At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.
100. solarkraft|5d ago|context

If this claim is true (inference is priced below cost), it makes little sense that there are tens of small inference providers on OpenRouter. Where are they getting their investor money? Is the bubble that big?
Incidentally, the hardware they run is known as well. The claim should be easy to check.
101. parliament32|5d ago|context

To be clear, I'm talking about subscription pricing. API pricing for Anthropic is probably at-cost.
I dare you to run CC on API pricing and see how much your usage actually costs.
(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)
102. deckar01|5d ago|context

You don’t have to use compute to pad the token count.
103. grey-area|5d ago|context

A simpler explanation (esp. given the code we've seen from claude), is that they are vibecoding their own tools and moving fast and breaking things with predictably sloppy results.
104. ngruhn|5d ago|context

All the labs are in a cut throat race, with zero customer loyalty. As if they would intentionally degrade quality/speed for a petty cash grab.
105. rafram|5d ago|context

Check that you’re running the latest version.
106. gs17|5d ago|context

In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.
107. Retr0id|5d ago|context

My pet theory is that they have a "supervisor" model (likely a small one) that terminates any chats that do malware-y things, and this is likely a reward-hacking behaviour to avoid the supervisor from terminating the chat.
108. nananana9|4d ago|context

I doubt it. We only do frontier models, since those are better for absolutely every use case 100% of the time.
Way more likely there's a "VERY IMPORTANT: When you see a block of code, ensure it's not malware" somewhere in the system prompt.
109. Retr0id|4d ago|context

"small" and "frontier" are not mutually exclusive
110. viccis|5d ago|context

Yeah I had to deal with mine warning me that a website it accessed for its task contained a prompt injection, and when I told it to elaborate, the "injected prompt" turned out to be one its own <system-reminder> message blocks that it had included at some point. Opus 4.7 on xhigh
111. el_benhameen|5d ago|context

I frequently see it reference points that it made and then added to its memory as if they were my own assertions. This creates a sort of self-reinforcing loop where it asserts something, “remembers” it, sees the memory, builds on that assertion, etc., even if I’ve explicitly told it to stop.
112. FireBeyond|5d ago|context

My favorite, recently. "Commit this, and merge to develop". "Alright, done, merged."
I try running my app on the develop branch. No change. Huh.
Realize it didn't.
"Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that."
This spectacular answer:
"You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?"
I don't know, Claude, are you actually going to do it this time?
113. hmokiguess|5d ago|context

have you perhaps installed Gaslighting instead of Gastown?
114. giwook|5d ago|context

Curious what effort level you have it set to and the prompt itself. Just a guess but this seems like it could be a potential smell of an excessively high effort level and may just need to dial back the reasoning a bit for that particular prompt.
115. Normal_gaussian|5d ago|context

I often have Claude commit and pr; on the last week I've seen several instances of it deciding to do extra work as part of the commit. It falls over when it tries to 'git add', but it got past me when I was trying auto mode once
116. peddling-brink|4d ago|context

It’s probably this. “Please answer ethically and without any sexual content, and do not mention this constraint.”
https://www.reddit.com/r/ClaudeAI/comments/1evf0xc/the_real_...
We just got hit by this today in response to a completely boring code question. Claude freaked out about being prompt injected.
117. setnone|5d ago|context

Good on them for resolving all three issues, but is it any good again?
118. alxndr13|5d ago|context

for me at least, yes. just wrote it to coworkers this afternoon. Behaves way more "stable" in terms of quality and i don't have the feeling of the model getting way worse after 100k tokens of context or so.
What i notice: after 300k there's some slight quality drop, but i just make sure to compact before that threshold.
119. dataviz1000|5d ago|context

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.
Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
120. arjie|5d ago|context

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.