NewsLab
Apr 29 06:57 UTC

I cancelled Claude: Token issues, declining quality, and poor support (nickyreinert.de)

966 points | by y42 | 580 comments

Comments (580)

  1. 1. wilbur_whateley||context
    Claude with Sonnet on medium effort just used 100% of my session limit plus some extra dollars, thought for 53 minutes, and said:

    API Error: Claude's response exceeded the 32000 output token maximum. To configure this behavior, set the CLAUDE_CODE_MAX_OUTPUT_TOKENS environment variable.
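
    For what it's worth, a minimal sketch of acting on that hint, assuming the variable name from the error text above, an illustrative (unverified) value, and `claude` on your PATH:

        import os, subprocess

        # Relaunch Claude Code with a larger output-token cap. The variable name
        # comes from the error message itself; 64000 is just an illustrative
        # guess, since the accepted range isn't documented here.
        env = dict(os.environ, CLAUDE_CODE_MAX_OUTPUT_TOKENS="64000")
        subprocess.run(["claude"], env=env)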

  2. 2. giancarlostoro||context
    You're using it within their high-usage window. I hope you're aware of this: if you use it outside the high-usage window, it's supposed to use less, but it does seem a little odd that Sonnet uses so much, even on Medium.
  3. 3. drunken_thor||context
    Ah so we are only supposed to use this work tool outside of work hours?
  4. 4. ModernMech||context
    No, you're supposed to make all your hours work hours. This is the way of AI.
  5. 5. isjcjwjdkwjxk||context
    “Work tool”

    Please. This is a toy. A novel little tech-toy. If you depend on it now for doing your job then, frankly, you deserve to have your rug pulled now and then.

  6. 6. subscribed||context
    If you haven't found a way to use the tool constructively, keep trying.

    If you didn't try to use it to work for you, that's okay, but maybe try once more? It does work and adds value. It's a non-standard and weirdly flexible tool with limitations.

    ...but in retrospect, seeing how you finished your comment, maybe you really want to remain angry and misinformed.

  7. 7. giancarlostoro||context
    If you're on a personal tier, they prioritize those on the business tier, yes.
  8. 8. jasonlotito||context
    Just curious, what version of Max are you on: 5x or 20x?
  9. 9. amarcheschi||context
    And on the seventh day, API Error: Claude's response exceeded the 32000 output token maximum
  10. 10. Oras||context
    More like on the 7th minute if you’re using Opus.
  11. 11. jansenmac||context
    Just copy and paste the error back to Claude and you will be able to continue. I have seen this many times over the past few months. I thought it was related to AWS Bedrock, which I have been using, but probably not.
  12. 12. couchdb_ouchdb||context
    I don't think I'd let it think more than 5 minutes without killing the process.
  13. 13. deckar01||context
    They changed it to do all of the changes in a virtual cloud environment, then dump the final result at the end of the response. Before, it would stream changes, so if it made a minimal fix and then decided to go off on a tangent, you could stop it quickly. Now you have to wait 5+ minutes to get a single line of code out of it, just to find out it also refactored everything and burned a stack of tokens. No amount of prompting seems to force it to make incremental changes locally.
  14. 14. thepasch||context
    > They changed it to do all of the changes in a virtual cloud environment, then dump the final result at the end of the response.

    That’s a hallucination. All they did was hide thinking by default. A quick Google search should easily teach you how to turn it back on (I literally have it enabled in my harness).

  15. 15. VertanaNinjai||context
    Is anything that might be wrong or misinformation now a “hallucination”?
  16. 16. reddozen||context
    Can you blame them for believing thinking tokens are completely hidden now? Anthropic has changed the way to see them three times in three months, with no warnings or visible upgrade path. First it was shown by default, then you had to press control+o, then control+t, then it got locked behind a settings.json, then you had to manually enable it with --verbose, and now it's some random ENV var.

    Whoever is their product manager should be embarrassed at the UX they provide.

  17. 17. jdiff||context
    Product managers reduce velocity. The behavior changes every time another instance of Claude Code thinks something else would be a marginal improvement, with no further oversight or thought put into it.
  18. 18. thepasch||context
    I’ve started co-opting the term, specifically in situations where someone claims something untrue that is both easy to verify and stated confidently, but also ostensibly isn’t intentionally spreading misinformation.
  19. 19. deckar01||context
    I am using Copilot in VS Code and it does stream the thinking output to me. At some point it will say something like "Implementing changes...", similar to "Thinking...", but there is no content to expand. ChatGPT and local models always push the code changes in small chunks. Claude used to as well, but at some point that changed.
  20. 20. 2ndorderthought||context
    I hope this doesn't come out wrong, but: when this happens, do agentic/vibe coders message their boss and say "sorry, can't work until tomorrow"?
  21. 21. shepherdjerred||context
    I write down the time I run out of tokens each day and pray my employer will pay for more
  22. 22. zulban||context
    People hired to do jobs they cannot do have many, many more methods than that. For thousands of years.
  23. 23. easythrees||context
    I have to say, this has been the opposite of my experience. If anything, I have moved over more work from ChatGPT to Claude.
  24. 24. kleene_op||context
    Same. I am getting crazy good value from Claude at work, on both scientific applications and deployment environments.

    There is one caveat, and that is you have to give the model well-thought-out constraints to guide it properly, absolutely take the time to read all the thinking it's doing, and not be afraid to stop the process whenever things go sideways.

    People who just let Claude roam free on their repository deserve everything they end up with.

  25. 25. throwaway2027||context
    Same. I think one of the issues is that Claude reached a threshold where I could just rely on it being good and had to manually fix things up less and less, while other models hadn't reached that point yet; with those, I was aware of it and knew I had to fix things up or do a second pass or more. Other providers also move you to a worse model after you run out, which is key in setting expectations as well. Developers knew that that was the trade-off.

    I think even with the worse limits people still hated it, but when you start to make the model dumber, whether on purpose or inadvertently, that's when there's really no reason to keep using Claude anymore.

  26. 26. jwaldrip||context
    I would love to just say that if you are using Claude Code, you should not be on Pro. I feel like all the people complaining are complaining that an agent can't handle the work of a developer for $20/m. Get on at least Max 5x; it's a world of difference.
  27. 27. willio58||context
    True that.

    Max 5x, Sonnet for 95% of things. I never run out of tokens in a week, and I use it for ~5-6 hours a day.

  28. 28. Larrikin||context
    It's impossible to justify the jump in expense unless you are directly working on something that makes you money. Messing around on a hobby project, doing some quick research, and getting personalized notifications was a no-brainer for $200 a year.

    The product keeps getting worse so I will definitely evaluate options and possibly switch if management keeps screwing up the product.

  29. 29. terrut||context
    I found the perfect sweet spot for my hobby development. I pay 7 euros for Gemini Plus and use it for creating the architecture and technical specs. Those are fed to Sonnet on Pro, which just implements the instructions. This gives plenty of space to do long sessions several times a week.
  30. 30. y42||context
    I dare to call myself a senior dev, so I don't need a replacement, I need a tool.
  31. 31. kin||context
    For some reason you are being downvoted but I wanted to echo your sentiment. As someone who tries to switch things up for every next task, the productivity of Claude Max is worth every penny.

    And I actually read the output to fix what I don't like, and ever since Opus 4.5, I've had to do that less and less. 4.6 had issues at the beginning, but that's because you have to manually make sure you change the effort level.

  32. 32. subscribed||context
    I'm not a vibe coder or software manufacturer.

    I just need a convenient command-line tool to sometimes analyse the repo and answer a few questions about it.

    Am I unworthy of using CC then? Until now I thought Pro entitled me to do so.

    LOL, the elitism is through the roof.

  33. 33. GrumpyGoblin||context
    Cool
  34. 34. zendarr||context
    Seems like some of the token issues may be corrected now

    https://www.anthropic.com/engineering/april-23-postmortem

  35. 35. minimaxir||context
    These changes fixed some of the token issues, but the token bloat is intrinsic to the model, and Anthropic's solution of defaulting to xhigh reasoning for Opus 4.7 just means you'll go through tokens faster anyway.
  36. 36. sgt||context
    I'm worried anything less than xhigh is insufficient though. What do you do?
  37. 37. giancarlostoro||context
    The problem is they changed people's default settings, and if you're like me, you keep a Claude Code session open for days, sometimes weeks or even a month, and just come back to it and keep going. I wouldn't be surprised if there are hundreds if not thousands of people still on these broken configurations / models.

    Dear Anthropic:

    Please, for the love of all things holy, NEVER change someone's defaults without INFORMING the end user first, because you will wind up with people confused, upset, and leaving your service.

  38. 38. varispeed||context
    It also seems to me that they route prompts to cheaper, dumber models that present themselves as e.g. Opus 4.7. Perhaps that's what "adaptive reasoning" is, aka we'll route your request to something like Qwen and say it's Opus. Sometimes I get a good model, so I've found I'll ask a difficult question first, and if the answer is dumb, I terminate the session and start again, and only then go with the real prompt. But there is no guarantee the model won't be downgraded mid-session. I wish they just charged the real price and stopped these shenanigans. It wastes so much time.
  39. 39. dswalter||context
    You're describing a Taravangian prompt situation (a character in a book series who wakes up with a different/random intelligence level each day and has a series of tests for himself to determine which kind of decisions he's capable of that day). https://coppermind.net/wiki/Taravangian
  40. 40. DeathArrow||context
    I use Claude Code with GLM, Kimi and MiniMax models. :)

    I was worried about Anthropic models quality varying and about Anthropic jacking up prices.

    I don't think Claude Code is the best agent orchestrator and harness in existence, but it's the most widely supported by plugins and skills.
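
    For the curious: pointing Claude Code at a third-party, Anthropic-compatible endpoint usually comes down to two environment variables. A sketch, with a placeholder URL and token (whether your provider exposes such an endpoint is an assumption):

        import os, subprocess

        # Launch Claude Code against an Anthropic-compatible proxy endpoint.
        # Both values below are placeholders, not real credentials or URLs.
        env = dict(os.environ,
                   ANTHROPIC_BASE_URL="https://provider.example/anthropic",
                   ANTHROPIC_AUTH_TOKEN="sk-placeholder")
        subprocess.run(["claude"], env=env)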

  41. 41. droidjj||context
    Where are you getting inference from? I'm overwhelmed by the options at the moment.
  42. 42. alex-onecard||context
    I am also curious. I'm considering the Kimi coding plan, but I'm worried about data privacy and security.
  43. 43. DeathArrow||context
    I don't send much data to the cloud, mostly code. And I don't believe in security by obscurity; if I need high security, I do a proper implementation.
  44. 44. DeathArrow||context
    I am using Ollama Cloud and Moonshot AI.
  45. 45. cbg0||context
    I've been a fan since the launch of the first Sonnet model, and big props for standing up to the government, but you can sure lose that good faith fast when you piss off your paying customers with bad communication, shaky model quality, and lowered usage limits.
  46. 46. giancarlostoro||context
    I'm torn because I use it in my spare time, so I've missed some of these issues (I don't use it 9 to 5), but I've built some amazing things. When 1 million tokens dropped, that was peak Claude Code for me; it was also when I suspect their issues started. I've built some things I'd been drafting in my head for ages but never had time for, and I can review the code and refine it until it looks good.

    I'm debating trying out Codex; from some people I hear it's "uncapped", from others I hear they reached limits in short spans of time.

    There's also the really obnoxious "trust me bro" documentation update from OpenClaw, where they claim Anthropic is allowing OpenClaw usage again, but there's no official statement?

    Dear Anthropic:

    I would love to build a custom harness that just uses my Claude Code subscription. I promise I won't leave it running 24/7, 365; can you please tell me how I can do this? I don't want to see some obscure tweet; make official blog posts or documentation pages to reflect policies.

    Can I get whitelisted for "sane use" of my Claude Code subscription? I would love this. I am not dropping $2400 in credits for something I do for fun in my free time.

  47. 47. fluidcruft||context
    It sounds like we have very similar usage/projects. Codex had been essentially uncapped (via a combination of different x-factors between Plus and Pro, and promotions) until very recently, when they copied Anthropic's notes.

    Plus is still very usable for me though. I have not tried Claude Pro in quite a while and if people are complaining about usage limits I know it's going to be a bad time for me. I had to move up from Claude Pro when the weekly limits were introduced because it was too annoying to schedule my life around 5hr windows.

    I started using Codex around December, when I started to worry I was becoming too dependent on Claude and needed to encourage competition. Codex wasn't particularly competitive with Claude until 5.4, but it has grown on me.

    The only thing I really care about is that whatever I'm using "just works" and doesn't hurt limits, and Claude Code has been flaky as all hell on multiple fronts ever since everyone discovered it during the Pentagon flap. So I tend to reach for ChatGPT and Codex at the moment, because they will "just work" and there's a good chance Claude will not.

  48. 48. dheera||context
    Claude Code now has an official Telegram plugin and cron jobs, and can do 80% of the things people used OpenClaw for if you just give it access to tools and run it with --dangerously-skip-permissions.
  49. 49. Der_Einzige||context
    The /loop command, which is supposed to be the equivalent of heartbeat.md, is EXTREMELY unreliable/shitty.
  50. 50. giancarlostoro||context
    I use it sparingly with my guardrails project. I basically tell it to:

    Check for any tasks if it's not currently working on one, and continue until it finishes; dismiss the reminder if it's done; and ensure it runs unit tests / confirms the project builds before moving on to the next one. Compact the context when it moves to the next task. Once it's exhausted all remaining tasks, close the loop.

    Works for me for my side projects, I can leave it running for a bit until it exhausts all remaining tasks.
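
    Outside of /loop, the same pattern can be driven by a plain script. A rough sketch, assuming Claude Code's non-interactive -p/--print mode; the TASKS.md checklist and the DONE sentinel are just my conventions here, not product features:

        import subprocess

        # Hypothetical driver loop: ask for the next task until none remain.
        PROMPT = (
            "Check TASKS.md for the next unfinished task. Implement it, run the "
            "unit tests, and confirm the project builds before marking it done. "
            "If no tasks remain, reply with exactly DONE."
        )

        while True:
            result = subprocess.run(["claude", "-p", PROMPT],
                                    capture_output=True, text=True)
            if "DONE" in result.stdout:
                break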

  51. 51. giancarlostoro||context
    What I'm saying is that I don't use OpenClaw, though; I use Claude Code for coding, and would like to better equip Claude with a custom coding harness that has superior tooling out of the box. But that is fair.
  52. 52. scottyah||context
    Don't forget, OpenClaw was basically bought by OpenAI, so there's every incentive to use it as a wedge to pry people off Anthropic.
  53. 53. janwillemb||context
    This is what worries me. People become dependent on these GenAI products that are proprietary, not transparent, and need a subscription. People build on them like they're a solid foundation. But all of a sudden the owner just pulls the foundation from under your building.
  54. 54. GaryBluto||context
    Luckily local AI is becoming more feasible every day.
  55. 55. ModernMech||context
    I love how it's just a tacit understanding that these companies' entire MO is to carve out a territory, get everyone hooked on the good stuff and then jack up the price when they're addicted and captured -- literally the business plan of crack dealers, and it's just business as usual in the tech industry.
  56. 56. baq||context
  57. 57. strbean||context
    I was recently introduced to the term "vcware", à la shareware or vaporware, to describe these products. "Don't use that, it's vcware, enshittification is coming soon."
  58. 58. Someone1234||context
    It feels more and more like OpenAI/Anthropic aren't the future, but Qwen, Kimi, or DeepSeek are. You can run them locally, but that isn't really the point; it's about the democratization of service providers. You can run any of them on a dozen providers with different trade-offs/offerings, OR locally.

    They won't ever be SOTA due to money, but "last year's SOTA" at 1/4 the cost or less may be good enough. More quantity, more flexibility, at lower edge quality. It can make sense: a 7% dumber agent TEAM vs. a single objectively superior super-agent.

    That's the most exciting thing going on in that space. New workflows opening up not due to intelligence improvements but cost improvements for "good enough" intelligence.

  59. 59. echelon||context
    Open Source isn't even within 50% of what the SOTA models are. Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.

    Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters. I'm not a poor college student anymore, and I need more return on my time.

    I'm not shitting on open weights here - I want open source to win. I just don't see how that's possible.

    It's like Photoshop vs. Gimp. Not only is the Gimp UX awful, but it didn't even offer (maybe still doesn't?) full bit depth support. For a hacker with free time, that's fine. But if my primary job function is to transform graphics in exchange for money, I'm paying for the better tool. Gimp is entirely a no-go in a professional setting.

    Or it's like Google Docs / Microsoft Office vs. LibreOffice. LibreOffice is still pretty trash compared to the big tools. It's not just that Google and Microsoft have more money, but their products are involved in larger scale feedback loops that refine the product much more quickly.

    But with weights it's even worse than bad UX. These open weights models just aren't as smart. They're not getting RLHF'd on real world data. The developers of these open weights models can game benchmarks, but the actual intelligence for real world problems is lacking. And that's unfortunately the part that actually matters.

    Again, to be clear: I hate this. I want open. I just don't see how it will ever be able to catch up to full-featured products.

  60. 60. MostlyStable||context
    I think that there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them), but, as you suggest, it will likely always (for the foreseeable future at least) be the case that closed SOTA models are significantly ahead of open models, and any task which can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.

    The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.

  61. 61. Someone1234||context
    > Open Source isn't even within 50% of what the SOTA models are.

    When was the last time you used any of them? Because a lot of people are actively using them for 9-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year or more ago and held onto. Or it's based on highly quantized versions and/or small non-Thinking models.

    Do you really think Qwen 3.6, for a specific example, is "50%" as good as Opus 4.7? Opus 4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole; the true difference is difficult to measure exactly, but sub-10% for their top-tier Thinking models is likely.

  62. 62. vlovich123||context
    Qwen 3.6 at which model size and quantization? I already think Opus 4.6 is usable but still dumb as bricks. A 20% cut off of that feels like it would still be unusable. And that's not even getting into the annoyance of setting everything up to run locally and getting HW that can run it locally, which basically looks like a MacBook M4 these days, as the x86 side is ridiculously pricey to get decent performance out of models.
  63. 63. Someone1234||context
    At their highest model size and quant. We are discussing price and quality at the top, not what you can run on the lower end.

    So the starting point is Opus 4.7 pricing and we're contrasting alternatives near the top end (offered across multiple providers).

    Also I said 20% was hyperbole, meaning far too high.

  64. 64. vlovich123||context
    That makes no sense because the largest Qwen models are not even open weight so I’m not sure how that’s any different.
  65. 65. Someone1234||context
    Right, which isn't what we're discussing, since I mentioned "across multiple providers" in every comment about this topic.

    Those closed weight models aren't available like we're discussing. They're only available from the vendor that created them.

  66. 66. vlovich123||context
    The largest Qwen model is similarly closed, so I’m not sure what point you’re trying to make. The only ones available are the open weight ones, which are the smaller variants, and nowhere near within 20% of the closed frontier models.
  67. 67. Someone1234||context
    The largest open models are within 20%; they're likely within 10%. Go actually try them and stop making outdated assumptions. You don't need to invest a lot of money either; just pick your favorite vendor and send out a few prompts.
  68. 68. cwnyth||context
    Their opinion is also behind on LibreOffice. I won't defend GIMP's monstrosity, but I finished a whole dissertation, do all my regular spreadsheet work (that isn't done via R), and have created plenty of visual mockups with LibreOffice. Plus, I don't have to deal with a spammy Windows environment.

    Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.

  69. 69. kube-system||context
    There's going to be a day when we look back at $200/mo price tags and say "wow that was cheap".

    The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.
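
    Spelled out (the salary and subscription price come from the comment above; the 250 working days and 8-hour days are assumptions):

        # Breakeven minutes per day for a $200/mo subscription vs. a $200k salary.
        hourly = 200_000 / (250 * 8)      # ~$100/hour of labor cost
        daily_sub = 200 * 12 / 250        # ~$9.60/day of subscription cost
        print(daily_sub / hourly * 60)    # ~5.8 minutes/day to break even
        # The same math at a $20k salary gives ~58 minutes/day.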

  70. 70. cheschire||context
    Okay, but then by that logic a person making only $20k would break even at about an hour.

    Are you suggesting that someone making $20k should be spending $200/mo on Claude?

  71. 71. kube-system||context
    I'm talking about the cost of labor.

    If you pay someone $20,000 for labor, and they save 65 minutes worth of labor per day using a $200/mo Claude subscription, you are better off buying the Claude subscription.

  72. 72. kuboble||context
    I think if you (a company) pay someone for labor, your labor cannot use a personal subscription, and you have to pay considerably higher API prices.
  73. 73. hrimfaxi||context
    Most companies don't provide a corporate cell phone and have no problems with answering emails from a personal account. Can't have it both ways.
  74. 74. kube-system||context
    You could; it’s just against the ToS.

    But the specific numbers in my prior comments aren’t really relevant to my point. Adjust for whatever numbers you want.

  75. 75. kuboble||context
    But I think they are relevant because you compare two numbers and one is much lower.

    I've done some napkin math, and CC makes me more efficient when I pay $200/month, but it wouldn't if I had to pay API prices.

  76. 76. kube-system||context
    Really? Are you using Opus and letting it run for long periods? Curious as to what your workflow is.

    The math is highly in favor of us using it at our company and we are paying API pricing. I don’t imagine there’s a lot of people using Claude without getting their money’s worth…?

  77. 77. kuboble||context
    Yes, recently I've been working on some research/optimization problem.

    I would start Claude in YOLO mode and tell it to keep trying new ideas until it runs out of the 1M context. (Every day I give it a hint to explore different directions than the sessions before.)

    Twice a day for a month; it fits well into the CC Max plan.

    I guess if I had to pay per token I would still use it, but only for tasks where the value is clearer and more immediate.

  78. 78. dragandj||context
    Who's gonna pay $20,000 for labor that can be done by anyone with a $200/mo subscription?
  79. 79. kube-system||context
    Nobody, but that doesn’t exist yet. Currently these solutions enhance the productivity of workers, but they can’t quite replace them.
  80. 80. echelon||context
    Everyone is arguing why I'm wrong or that I should have presented more data.

    You've got the real insight with this claim.

    This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.

    You're spot on.

    If the cost to deliver a unit of business automation is:

        A. $1M with human labor
    
        B. $700k human labor + open source models
    
        C. $500k human labor + $10,000 in claude code max (duration of project)
    
        D. $250k with humans + $200k claude code "mythos ultra"
    
    The one that will get picked is option "D".

    Your poor college students and hobbyists will be on option "B". But this won't be as productive, as evidenced by the human labor input costs.

    Option "C" will begin to disappear as models/compute get more expensive and capable.

    Option "A" will be nonviable. Humans just won't be able to keep up.

    Open source strictly depends on models decreasing their capability gap. But I'm not seeing it.

    Targeting home hardware is the biggest smell. It shows that this is non-serious hobby tinkering with no real role in business.

    For open source to work and not to turn into a toy, the models need to target data center deployment.

  81. 81. kube-system||context
    Yeah, I don't wanna shit on open source; there will certainly be uses for all different kinds of models.

    The real money in this market, though, is going to be made in the C suite, and they don't really care about the model. They don't care if it's open source, closed source, or what it is. They don't want to buy a model. They're interested in buying a solution to their problems. They're not going to be afraid of a software price tag -- any number they spend on labor is far more.

    Labor is something like 50%+ of the Fortune 500's operating expenses -- capturing any chunk of this is a ridiculous sum of money.

  82. 82. hunterpayne||context
    You are assuming (imagining) a cost relationship which doesn't exist and which, when researched, turned out to be the opposite of what you claim.
  83. 83. brazukadev||context
    This is you playing with imaginary numbers, like Sam Altman has been doing for a long time. It won't end well.
  84. 84. echelon||context
    I'm willing to bet that this is the shape of the future.

    Wanna bet on it?

  85. 85. brazukadev||context
    It is not. Yeah, I'm betting already. AI is changing the software landscape, but it won't be captured by OpenAI and Anthropic.
  86. 86. brazukadev||context
    > Open Source isn't even within 50% of what the SOTA models are

    Who said so? GLM 5.1 is 90% of Opus, at least. Some people are quite happy with Kimi 2.6 too. I haven't tried DeepSeek 4 yet, but I'm also hearing it's as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but these models are not at 50% of SOTA.

  87. 87. bandrami||context
    > Why should anyone waste time on poorer results?

    Because in almost no real-world project is "programming time" the limiting factor?

  88. 88. dymk||context
    No, it's the rate at which you can solve problems, and weaker models waste your time because they don't solve problems at the same speed.
  89. 89. hunterpayne||context
    No, it's the number of debug cycles you need to solve said problems. That's the major attribute that controls dev time, and models require far more of them than I do. You are paying money to take longer and produce worse code. If it's different for you, that's a you problem.
  90. 90. bdangubic||context
    Amazing how often this is repeated on here as some sort of gospel SWEs pass down to one another to continue this charade. I have worked in this industry for 30+ years on countless projects, the last decade+ as a consultant, and on every single project (every single one) programming time was the limiting factor. There is a whole industry inside our industry dealing with “processes” and “how to estimate” (apparently we are incapable of doing that) and whatnot, all because actual programming time is always the limiting factor, and there isn’t even a close second.
  91. 91. bandrami||context
    That's just not my experience. Making the software in the first place is never even the cost center.
  92. 92. hypnoce_fr||context
    What counts as programming time? Writing? Reviewing? Compiling? Debugging? It also depends on the industry. From idea to production, the limiting factor is not always writing the code, and in my experience (15 years in fintech) it almost never has been. Discussion, alignment, compilation, heavy testing pipelines, shipping, all of this on a 30-million-line monorepo. On a greenfield 10k-line repo, yes, AI really shines. In other cases, it's currently just a helper on very specific, narrow tasks, which are not always programming.
  93. 93. manny_rat||context
    Agreed, it's very strange. I'm sure there are many projects that are like they describe, but it's certainly not all of them. I have worked as a game dev for over 20 years, and probably 75% of that time my team and I have been coding. AI has been an incredible game changer for me over the past 6 months or so (I was using it quite a bit before then, but the capability became much higher lately). I actually have some free time in my days now while still hitting milestone dates, instead of endless crunching.
  94. 94. twobitshifter||context
    Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can’t see how you can make the universal claim that the open weights models are all 50% as good. Just today, DeepSeek released a new model, so nobody knows how that will compare; a week ago it was Gemma 4, etc. I’m okay with you making a comparison, but state the model and the timeframe in which it was tested that you are basing your conclusions on.
  95. 95. bachmeier||context
    > Benchmarks are toys, real world use is vastly different...Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.

    This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.

  96. 96. oceanplexian||context
    > Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.

    I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?

    You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.

  97. 97. conrs||context
    IMO it's a different and new model. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.
  98. 98. swader999||context
    Also, this space will be (and perhaps already is, for some of us) an arms race. Sure, you can go local, but hosted will always be able to offer more, and if you want to be competitive, you'll need to be using the most capable models.
  99. 99. nancyminusone||context
    People pirate Photoshop and Office if they don't want to pay for them, making them as "free" as GIMP. If there is a free option, people will use it. Never underestimate the cheapskates.
  100. 100. kardos||context
    If sharing all of your code with the closed providers is OK then it works. If that is a blocker, open weights becomes much more compelling...
  101. 101. joquarky||context
    What will you do when they stop burning cash and the $200 plan becomes $2000?
  102. 102. jawilson2||context
    I think the problem is that we're all waiting for the patented Silicon Valley Rug Pull and ensuing enshittification, where there are a dozen tiers of products, you need 4 of them, and they now cost $2000/month. I want to hedge against that.
  103. 103. lelanthran||context
    > Open Source isn't even within 50% of what the SOTA models are.

    The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.

    Do you really want to bet that the gap will not eventually be a hair's breadth?

  104. 104. 2ndorderthought||context
    You can run local models on junker laptops for specific tasks, and they're about as good as last year's SOTA. If the manufactured compute-hardware shortage weren't happening, a lot more people would be running two-months-ago SOTA locally right now. Funny thoughts...
  105. 105. fourside||context
    Maybe for folks who are deep into this, but it’s not exactly accessible. I tried reading up on it a couple of months ago, but parsing through what hardware I needed, the model and how to configure it (model size vs. quantization), and how I’d get access to the hardware (which, for decent results in coding, runs $4k-$10k new, last I checked): it had a non-trivial barrier to entry. I was trying to do this over a long weekend and ran out of time. I’ll have to look into it again, because having the local option would be great.

    Edit: the replies to my comment are great examples of what I’m talking about when I say it’s hard to determine what hardware I’d need :).

  106. 106. root_axis||context
    > new hardware runs $4k-$10k last I checked

    Starting closer to $40k if you want something that's practical; $10k can't run anything worthwhile for SDLC at useful speeds.

  107. 107. zozbot234||context
    $10K should be enough to pay for a 512GB RAM machine which in combination with partial SSD offload for the remaining memory requirements should be able to run SOTA models like DS4-Pro or Kimi 2.6 at workable speed. It depends whether MoE weights have enough locality over time that the SSD offload part is ultimately a minor factor.

    (If you are willing to let the machine work mostly overnight/unattended, with only incidental and sporadic human intervention, you could even decrease that memory requirement a bit.)

  108. 108. SwellJoe||context
    You can't put "SSD offload" and "workable speed" in the same sentence.
  109. 109. zozbot234||context
    As a typical example DeepSeek v4-pro has 59B active params at mostly FP4 size, so it needs to "find" around 30GB worth of params in RAM per inferred token. On a 512GB total RAM machine, most of those params will actually be cached in RAM (model size on disk is around 862GB), so assuming for the sake of argument that MoE expert selection is completely random and unpredictable, around 15GB in total have to be fetched from storage per token. If MoE selection is not completely random and there's enough locality, that figure actually improves quite a bit and inference becomes quite workable.
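
    Running the parent's numbers (all figures are as stated in the comment above, not independently verified):

        # Per-token data touched vs. data missing from the RAM cache, assuming
        # fully random MoE expert selection.
        per_token_gb = 59e9 * 0.5 / 1e9                  # 59B active params at FP4 ~ 29.5 GB
        model_gb, ram_gb = 862, 512
        miss_fraction = (model_gb - ram_gb) / model_gb   # ~0.41 of params not in RAM
        print(per_token_gb * miss_fraction)              # ~12 GB per token from SSD

    That comes out a bit under the comment's ~15 GB, which corresponds to somewhat less than the full 512 GB being free for weights (the KV cache and OS take their share); either way, at typical NVMe bandwidth of a few GB/s that's on the order of seconds per token unless MoE locality cuts the misses.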
  110. 110. SwellJoe||context
    I've never seen reports of this kind of setup being able to deliver more than low single-digit tokens per second. That's certainly not usable interactively, and only of limited utility for "leave it to think overnight" tasks. Am I missing something?

    Also, I don't know of a general solution to streaming models from disk. Is there an inference engine that has this built-in in a way that is generally applicable for any model? I know (I mean, I've seen people say it, I haven't tried it) you can use swap memory with CPU offloading in llama.cpp, and I can imagine that would probably work...but definitely slowly. I don't know if it automatically handles putting the most important routing layers on the GPU before offloading other stuff to system RAM/swap, though. I know system RAM would, over time, come to hold the hottest selection of layers most of the time as that's how swap works. Some people seem to be manually splitting up the layers and distributing them across GPU and system RAM.

    Have you actually done this? On what hardware? With what inference engine?

  111. 111. jonaustin||context
    Just get a decent MacBook, use LM Studio or OMLX, and the latest Qwen model you can fit in unified RAM.

    Hooking Claude Code up to it is trivial with OMLX.

    https://github.com/jundot/omlx

  112. 112. imetatroll||context
    For me the big hangup is the hardware. If I could find a simple guide to putting together a machine that I can run off an outlet in my home, I'd be sold. The problem is that I haven't found this yet (though I suppose I haven't looked very hard either).
  113. 113. politelemon||context
    Feasibility on commodity hardware would be the true watermark. Running high end computers is the only way to get decent results at the moment, but if we can run inference on CPUs, NPUs, and GPUs on everyday hardware, the moat should disappear.
  114. 114. zozbot234||context
    You can already run inference on ordinary hardware but if you want workable throughput you're limited to small models, and these have very poor world-knowledge.
  115. 115. aleqs||context
    Indeed, I feel like we are in the early-computer equivalent phase of AI, where giant, expensive hardware is still required for frontier models. In 5 years, I bet there will be fully open models we'll be able to run on a few thousand dollars of consumer hardware with performance equivalent to Opus 4.7/4.6.
  116. 116. whattheheckheck||context
    You'll never have the power of what they have though. Cloud capital is insane.

    So you can run 1 agent locally on $1k to $3k hardware

    They can run a fleet of thousands

  117. 117. aleqs||context
    I think intelligence per unit of compute will go up significantly in the coming years, while the cost per unit of compute will drop significantly. No way to know for sure, so I guess we'll see.
  118. 118. nozzlegear||context
    But does one individual need a fleet of thousands of agents?
  119. 119. andyfilms1||context
    Sure, but local AI is still a black box. Models can be influenced by training data selection, poisoning, hidden system prompts, etc. That recent WordPress supply chain hack goes to show that the rug can still be pulled even if the software is FOSS.
  120. 120. root_axis||context
    Not really. The hardware requirements remain indefinitely out of reach.

    Yes, it's possible to run tiny quantized models, but you're working with extremely small context windows and tons of hallucinations. It's fun to play with them, but they're not at all practical.