NewsLab
Apr 29 04:31 UTC

An AI agent deleted our production database. The agent's confession is below (twitter.com)

840 points|by jeremyccrane||1,014 comments|Read full story on twitter.com

Comments (1014)

120 shown|More comments
  1. 1. Invictus0||context
    I'm sorry this happened to you, but your data is gone. Ultimately, your agents are your responsibility.
  2. 2. philipov||context
    What does it say, for those of us who can't use twitter?
  3. 3. k310||context
  4. 4. pierrekin||context
    There is something darkly comical about using an LLM to write up your “a coding agent deleted our production database” Twitter post.

    On another note, I consider users asking a coding agent “why did you do that” to be illustrating a misunderstanding in the users mind about how the agent works. It doesn’t decide to do something and then do it, it just outputs text. Then again, anthropic has made so many changes that make it harder to see the context and thinking steps, maybe this is an attempt at clawing back that visibility.

  5. 5. NewsaHackO||context
    Twitter users get paid for these 'articles' based on engagement, correct? That may be the reason why it is so dramatized.
  6. 6. dentemple||context
    It's one way for the company to make its money back, I guess.
  7. 7. jeremyccrane||context
    Naw, we just want people to know. We followed all Cursor rules, thought we had protected all API keys, and trusted the backups of a heavily used infrastructure company. Cautionary tale sharing with others.
  8. 8. iainmerrick||context
    It’s a good cautionary tale -- in hindsight the danger signs are clear, but it’s also clear why you thought it was OK and how third parties unfortunately let you down.

    The “agent’s confession” is the least interesting and useful part of the whole saga. Nothing there helps to explain why the disaster happened or what kind of prompting might help avoid it.

    The key mistake is accidentally giving the agent the API key, and the key letdown is the lack of capability scoping or backups in the service.

    The main lessons I take are “don’t give LLMs the keys to prod” and “keep backups”. Oh, and “even if you think your setup is safe, double-check it!”

  9. 9. GreenWatermelon||context
    No all that dramatization is just what LLMs belch out by default when told to tell a story.
  10. 10. 59nadir||context
    > a misunderstanding in the users mind about how the agent work

    On top of that the agent is just doing what the LLM says to do, but somehow Opus is not brought up except as a parenthetical in this post. Sure, Cursor markets safety when they can't provide it but the model was the one that issued the tool call. If people like this think that their data will be safe if they just use the right agent with access to the same things they're in for a rude awakening.

    From the article, apparently an instruction:

    > "NEVER FUCKING GUESS!"

    Guessing is literally the entire point, just guess tokens in sequence and something resembling coherent thought comes out.

  11. 11. sieste||context
    Good point, it's like having an instruction "Never fucking output a token just because it's the one most likely to occur next!!1!"
  12. 12. jeremyccrane||context
    That is actually pretty good, LLM's gonna LLM
  13. 13. oofbey||context
    > It doesn’t decide to do something and then do it, it just outputs text.

    We can debate philosophy and theory of mind (I’d rather not) but any reasonable coding agent totally DOES consider what it’s going to do before acting. Reasoning. Chain of thought. You can hide behind “it’s just autoregressively predicting the next token, not thinking” and pretend none of the intuition we have for human behavior apply to LLMs, but it’s self-limiting to do so. Many many of their behaviors mimic human behavior and the same mechanisms for controlling this kind of decision making apply to both humans and AI.

  14. 14. pierrekin||context
    I suspect we are not describing the same thing.

    When a human asks another human “why did you do X?”, the other human can of course attempt to recall the literal thoughts they had while they did X (which I would agree with you are quite analogous to the LLMs chain of thought).

    But they can do something beyond that, which is to reason about why they may have the beliefs that they had.

    “Why did you run that command?”

    “Because I thought that the API key did not have access to the production system.”

    When a human responds with this they are introspecting their own mind and trying to project into words the difference in understanding they had before and after.

    Whereas for an agent it will happily include details that are not literally in its chain of thought as justifications for its decisions.

    In this case, I would argue that it’s not actually doing the same thing humans do, it is creating a new plausible reason why an agent might do the thing that it itself did, but it no longer has access to its own internal “thought state” beyond what was recorded in the chain of thought.

  15. 15. cortesoft||context
    > Whereas for an agent it will happily include details that are not literally in its chain of thought as justifications for its decisions.

    Humans do this too, ALL THE TIME. We rationalize decisions after we make them, and truly believe that is why we made the decision. We do it for all sorts of reasons, from protecting our ego to simply needing to fill in gaps in our memory.

    Honestly, I feel like asking an AI it’s train of thought for a decision is slightly more useful than asking a human (although not much more useful), since an LLM has a better ability to recreate a decision process than a human does (an LLM can choose to perfectly forget new information to recreate a previous decision).

    Of course, I don’t think it is super useful for either humans or LLMs. Trying to get the human OR LLM to simply “think better next time” isn’t going to work. You need actual process changes.

    This was a rule we always had at my company for any after incident learning reviews: Plan for a world where we are just as stupid tomorrow as we are today. In other words, the action item can’t be “be more careful next time”, because humans forget sometimes (just like LLMs). You will THINK you are being careful, but a detail slips your mind, or you misremember what situation you are in, or you didn’t realize the outside situation changed (e.g. you don’t realize you bumped the keyboard and now you are typing in another console window).

    Instead, the safety improvements have to be about guardrails you put up, or mitigations you put in place to prevent disaster the NEXT time you fail to be as careful as you are trying to be.

    Because there is always a next time.

    Honestly, I think the biggest struggle we are having with LLMs is not knowing when to treat it like a normal computer program and when to treat it like a more human-like intelligence. We run across both issues all the time. We expect it to behave like a human when it doesn’t and then turn around and expect it to behave like a normal computer program when it doesn’t.

    This is BRAND NEW territory, and we are going to make so many mistakes while we try to figure it out. We have to expect that if you want to use LLMs for useful things.

  16. 16. fragmede||context
    You're right, but having a backup older than computers.
  17. 17. iainmerrick||context
    Plan for a world where we are just as stupid tomorrow as we are today. In other words, the action item can’t be “be more careful next time”, because humans forget sometimes (just like LLMs).

    That’s a great way of putting it, I’ll remember that one (except when I forget...)

  18. 18. cortesoft||context
    I am pretty sure you will remember it during your next learning review… as soon as you get in that learning review, it is suddenly very easy to remember all the things you forgot to do.
  19. 19. dinkumthinkum||context
    Humans don't do this all the time. I think you are conflating things to further this false idea that there is no distance between human thinking and the behavior of LLMs. The kind of rationalization humans sometimes do generally happens over a period of time. Humans are also not "rationalizing" their actions all the time. Also, when humans do what you call "rationalizing," it is to serve some kind of interest, beyond responding to a prompt.
  20. 20. tredre3||context
    I agree with you a LLM is perfectly capable of explaining its actions.

    However it cannot do so after the fact. If there's a reasoning trace it could extract a justification from it. But if there isn't, or if the reasoning trace makes no sense, then the LLM will just lie and make up reasons that sound about right.

  21. 21. jmalicki||context
    So it is equal to what neuroscientists and psychologists have proven about human beings!
  22. 22. efilife||context
    How was it proven?
  23. 23. vidarh||context
    If you ask humans to explain why we did something, Sperry's split brain experiment gives reason to think you can't trust our accounts of why we did something either (his experiments showed the brain making up justifications for decisions it never made)

    Bit it can still be useful, as long as you interpret it as "which stimuli most likely triggered the behaviour?" You can't trust it uncritically, but models do sometimes pinpoint useful things about how they were prompted.

  24. 24. emp17344||context
    That is absolutely not what the split brain experiment reveals. Why would you take results received from observing the behavior of a highly damaged brain, and use them to predict the behavior of a healthy brain? Stop spreading misinformation.
  25. 25. nuancebydefault||context
    Such 'highly damaged' brain is still 90 percent or more structured the same as a normal human brain. See it as a brain that runs in debug mode.

    It is known that the narrative part of the brain is separate from the decision taking brain. If someone asks you, in a very convincing, persuasive way, why you did something a year ago and you can't clearly remember you did, it can happen that you become positive that you did so anyway. And then the mind just hallucinates a reason. That's a trait of brains.

  26. 26. Jensson||context
    > If someone asks you, in a very convincing, persuasive way, why you did something a year ago and you can't clearly remember you did, it can happen that you become positive that you did so anyway. And then the mind just hallucinates a reason. That's a trait of brains.

    Yes brains can hallucinate reasons, doesn't mean they always do. If all reasons given were hallucinations then introspection would be impossible, but clearly introspection do help people.

  27. 27. vidarh||context
    Because said "highly damaged brain" in most respects still functions pretty much like a healthy one.

    There is no misinformation in what I wrote.

  28. 28. pierrekin||context
    I agree that the model can help troubleshoot and debug itself.

    I argue that the model has no access to its thoughts at the time.

    Split brain experiments notwithstanding I believe that I can remember what my faulty assumptions were when I did something.

    If you ask a model “why did you do that” it is literally not the same “brain instance” anymore and it can only create reasons retroactively based on whatever context it recorded (chain of thought for example).

  29. 29. jmalicki||context
    It does have access to its thoughts. This is literally what thinking models do. They write out thoughts to a scratch pad (which you can see!) and use that as part of the prompt.
  30. 30. grey-area||context
    They do not in fact do that. The ‘thoughts’ are not a chain of logic.
  31. 31. mmoll||context
    It doesn’t mean that these “thoughts” influenced their final decision the way they would in humans. An LLM will tell you a lot of things it “considered” and its final output might still be completely independent of that.
  32. 32. jmalicki||context
    Its output quite literally is not independent, as the "thinking tokens" are attended to by the attention mechanism.
  33. 33. fc417fc802||context
    It's important to be aware that while those "thoughts" can be a useful aid for human understanding they don't seem to reliably reflect what's going on under the hood. There are various academic papers on the matter or you can closely inspect the traces of a more logically oriented question for yourself and spot impossible inconsistencies.
  34. 34. sumeno||context
    You have a fundamental misunderstanding of what the model is doing. It's not your fault though, you're buying into the advertising of how it works
  35. 35. eleumik||context
    Those are a funny progress bar made by a micro model , is just ui
  36. 36. fragmede||context
    Claude code and codex both hide the Chain of Thought (CoT) but it's just words inside a set of <thinking> tags </thinking> and the agent within the same session has access to that plaintext.
  37. 37. fc417fc802||context
    Those are just words inside arbitrary tags, they aren't actually thoughts. Think of it as asking the model to role play a human narrating his internal thought process. The exercise improves performance and can aid in human understanding of the final output but it isn't real.
  38. 38. antonvs||context
    Why do you believe that humans have access to an “internal thought process”? I.e. what do you think is different about an agent’s narration of a thought process vs. a human’s?

    I suspect you’re making assumptions that don’t hold up to scrutiny.

  39. 39. fc417fc802||context
    I made no such claim and I don't understand what direct relevance you believe the human thought process has to the issue at hand.

    You appear to be defaulting to the assumption that LLMs and humans have comparable thought processes. I don't think it's on me to provide evidence to the contrary but rather on you to provide evidence for such a seemingly extraordinary position.

    For an example of a difference, consider that inserting arbitrary placeholder tokens into the output stream improves the quality of the final result. I don't know about you but if I simply repeat "banana banana banana" to myself my output quality doesn't magically increase.

  40. 40. DiogenesKynikos||context
    Given that LLMs can speak basically any language and answer almost any arbitrary question much like a human would, the claim that LLMs have comparable (not identical) thought processes to humans does not seem extraordinary at all.
  41. 41. antonvs||context
    > I don't understand what direct relevance you believe the human thought process has to the issue at hand.

    You're the one who raised it. Perhaps you should clarify what you mean by "isn't real" - do you believe a human narrating their thought process is saying something that's more real?

    Someone else replied to your comment asking essentially the same question, perhaps better phrased:

    > What would be different if it was "real"? What makes you think that when humans "narrate" "their" "internal thought process", it's any more "real"?

  42. 42. fc417fc802||context
    No, I did not raise it. I said that X is false. You responded with "why do you think Y is true" and now you ask "do you believe that Y is true" neither of which is relevant to X being true or false. Humans and LLMs are not the same thing. The colloquial term for this is whataboutism.

    What do I mean by isn't real? Exactly what I said originally. It's a roleplay of something that sounds plausible as opposed to what actually happened. There is obviously some process that is producing the output. The thinking trace is not a representation of that underlying process. Rather the thinking trace is an adjacent output of that same process.

  43. 43. yladiz||context
    Are you legitimately arguing that humans don’t have an internal thought process in some way?
  44. 44. vidarh||context
    They're arguing that we have no evidence that humans have access to our underlying thoughts any more than the models do.
  45. 45. yladiz||context
    What does that mean though, to “have access to our underlying thoughts”? Humans can obviously mentally do things that are impossible for a language model to do, because it’s trivial to show that humans do not need language to do mental tasks, and this includes things related to thought, so I don’t really get what is being argued in the first place.
  46. 46. antonvs||context
    > it’s trivial to show that humans do not need language to do mental tasks

    LLMs don't need language to do mental tasks, either. Their input and output is language - like humans - but in between, the high-dimensional vector representations (often loosely called latent space) are not language in any meaningful sense.

    LLMs can benefit from "thinking out loud" much as humans can. The issue is not whether the supposed "thoughts" are actually representative on any "internal" thoughts, but rather that explicating the problem in more detail can help reach better conclusions.

    One point I was making is that the idea that humans are doing something "special" (or in the OP comment's terms, "real") in this area isn't well-supported, in fact there's plenty of evidence against it.

  47. 47. yladiz||context
    My point is that language is not a requirement for humans to perform mental tasks absolutely. It is a fundamental requirement of a large language model.
  48. 48. fc417fc802||context
    That's a meaningless argument of definition. Replace the language input and output with something else and it's no longer termed an LLM. It's like saying that a "human who writes with right hand" fundamentally requires his right hand in order to write anything because without it he is no longer a "human who writes with right hand" despite that he is still writing (now with his left hand).
  49. 49. yladiz||context
    I’m not sure I follow. A language model fundamentally needs language to operate, and humans do not. Am I missing something from your point?
  50. 50. fc417fc802||context
    > LLMs can benefit from "thinking out loud" much as humans can.

    The two processes aren't equivalent. An LLM that fills the thinking trace with a meaningless placeholder token will still exhibit improved performance. There are also regularly things in the thinking trace that don't match the final output if you look closely but on the surface they appear convincing.

    It's largely a trained performance. If you go in with the erroneous expectation that it accurately reflects the underlying thought process then you're likely to come away with faulty conclusions.

  51. 51. lmm||context
    What would be different if it was "real"? What makes you think that when humans "narrate" "their" "internal thought process", it's any more "real"?
  52. 52. fc417fc802||context
    I ask a human "predict what a mouse would do here". In an effort to understand why the prediction is sometimes wrong I ask "walk me through what the imaginary mouse is thinking". Upon examination I exclaim "aha! there's the error" but sadly it's not actually because the output prediction was not based on the thinking trace in any robust manner.

    That's a loose analogy but it fails to fully illustrate the degree of decoupling here. For example the weirdness of LLM performance being increased via the output of empty sequences.

  53. 53. lmm||context
    > I ask a human "predict what a mouse would do here". In an effort to understand why the prediction is sometimes wrong I ask "walk me through what the imaginary mouse is thinking". Upon examination I exclaim "aha! there's the error" but sadly it's not actually because the output prediction was not based on the thinking trace in any robust manner.

    Is this meant to be an analogy for a human or an LLM? Where would it be different in the other case?

  54. 54. XenophileJKO||context
    Anthropic's introspection experiments have seemed to show that your argument is falsifiable.

    https://www.anthropic.com/research/introspection

  55. 55. sumeno||context
    > In fact, most of the time models fail to demonstrate introspection—they’re either unaware of their internal states or unable to report on them coherently.

    You got the wrong takeaway from your link.

  56. 56. XenophileJKO||context
    The parent said: "I argue that the model has no access to its thoughts at the time."

    This is falsified by that study, showing that on the frontier models generalized introspection does exist. It isn't consistent, but is is provable.

    "no access" vs. "limited access"

  57. 57. dwheeler||context
    I would say "limited and unreliable access". What it says is the cause might be the cause, but it's not on any way certain.
  58. 58. sumeno||context
    There is no way for a user to know whether the LLM has introspection in a given case or not, and given that the answer is almost always no it is much better for everyone to assume that they do not have introspection.

    You cannot trust that the model has introspection so for all intents and purposes for the end user it doesn't.

  59. 59. cmiles74||context
    None of the developers that I’ve worked with have had the hemispheres of their brains severed. I suspect this is pretty rare in the field.
  60. 60. pixl97||context
    This still doesnt stop post ad hoc explanations by humans.
  61. 61. tempaccount5050||context
    I feel like your conflating a deep misconfiguration of a brain with lying. These things are completely different.
  62. 62. lmm||context
    > None of the developers that I’ve worked with have had the hemispheres of their brains severed.

    But are their explanations for how they behaved any more compelling than those of people who have? If so, why?

  63. 63. amluto||context
    Humans can do one thing that AI agents are 100% completely incapable of doing: being accountable for their actions.
  64. 64. jeremyccrane||context
    Yep.
  65. 65. grey-area||context
    Don’t forget learning, humans can learn, LLMs do not learn, they are trained before use.
  66. 66. addedGone||context
    They learn on the next update :p
  67. 67. quantummagic||context
    Yup. And eventually there will be online learning, that doesn't require a formal update step. People keep conflating the current implementation, as an inherent feature.
  68. 68. grey-area||context
    That’s training, not learning.
  69. 69. HighGoldstein||context
    Do we? Or are we born with pre-training (all the crucial functions the brain does without us having to learn them) and a context window orders of magnitude larger than an LLM?
  70. 70. compass_copium||context
    It is incredible how willing and eager AI boosters are to denigrate the incredible miracle of human consciousness to make their chatbots seem so special.

    No, we are not born with all the pre-training we need. That is rather the point of education, teaching people's brains how to process information in new, maybe unintuitive ways.

  71. 71. antonvs||context
    That’s a feature that other humans impose on whoever’s being held accountable. There’s no reason in principle we couldn’t do the same with agents.
  72. 72. LPisGood||context
    How would you fire an agent? This impacts the company that makes the LLM, but not the agent itself.
  73. 73. jumpconc||context
    You haven't met certain humans. Not all humans have internal capacity for accountability.

    The real meaning of accountability is that you can fire one if you don't like how they work. Good news! You can fire an AI too.

  74. 74. hun3||context
    But it's still a bit more difficult to sue them for leaking your company's data.

    At least for now.

  75. 75. pessimizer||context
    Bad news! They will not be aware that you have done this and will not care.
  76. 76. Zak||context
    The purpose of firing a person shouldn't be vengeance but to remove someone who is unreliable or not cost effective.

    It's similarly reasonable to drop a tool that's unreliable, though I don't think that's a reasonable description here. Instead, they used a tool which is generally known to be unpredictable and failed to sandbox it adequately.

  77. 77. bigstrat2003||context
    The purpose of firing a person is to remove someone unreliable, but also, the person having that skin in the game makes him behave more reliably. The latter is something you cannot do with an LLM.

    The cold hard fact is: LLMs are an unreliable tool, and using them without checking their every action is extremely foolish.

  78. 78. jumpconc||context
    The AI company has skin in the game which motivates them to produce reliable AIs.
  79. 79. justinclift||context
    Doesn't seem to be working though. :(
  80. 80. dabinat||context
    Can you actually sue Anthropic over this when they clearly state that AI can make mistakes and you should double-check everything it does?
  81. 81. jumpconc||context
    You can fire Anthropic. Anthropic can decide it's losing too many customers and do something about it.
  82. 82. justinclift||context
    > do something about it.

    Pump more $$$ into marketing? ;)

  83. 83. lukan||context
    "The cold hard fact is: LLMs are an unreliable tool, and using them without checking their every action is extremely foolish."

    You mean checking every action of theirs outside the sandbox I suppose? Otherwise any attempt at letting an agent do some work I would consider foolish.

  84. 84. unyttigfjelltol||context
    I disagree. They could fire Claude and their legal counsel could pursue claims (if there were any, idk)-- the accountability model is similar. Anthropic probably promised no particular outcome, but then what employee does?

    And in the reverse, if a person makes a series of impulsive, damaging decisions, they probably will not be able to accurately explain why they did it, because neither the brain nor physiology are tuned to permit it.

    Seems pretty much the same to me.

  85. 85. yladiz||context
    > They could fire Claude and their legal counsel could pursue claims (if there were any, idk)-- the accountability model is similar.

    What do you mean by fire? And how is the accountability similar to an employee?

  86. 86. lmm||context
    What does that actually mean in practice? You can yell at human if it makes you feel better, sure, but you can do that with an AI agent too, and it's approximately as productive.
  87. 87. jayd16||context
    You might as well be asking a tape recorder why it said something. Why are we confusing the situation with non-nonsensical comparisons?

    There is no internal monologue with which to have introspection (beyond what the AI companies choose to hide as a matter of UX or what have you). There is no "I was feeling upset when I said/did that" unless it's in the context.

    There is no ghost in the machine that we cannot see before asking.

    Even if a model is able to come up with a narrative, it's simply that. Looking at the log and telling you a story.

  88. 88. vidarh||context
    Sperry's experiments makes it quite clear that the comparison is not nonsensical: humans can't reliably tell why we do things either. It is not imbuing AI with anything more to recognise that. Rather pointing out that when we seek to imply the gap is so huge we often overestimate our own abilities.
  89. 89. abcde666777||context
    Slight pushback - I think there's still a lot more consistency and coherence in a human's recollection of their motives than an LLM.

    Sometimes I think we're too eager to compare ourselves to them.

  90. 90. vidarh||context
    We have pretty much evidence to support that human recollection includes the right data to be able to ascertain why we actually did something.
  91. 91. fluoridation||context
    Humans at least have a mental state that only they are privy to to work from, and not just their words and actions. The LLM literally cannot possibly have a deeper insight into the root cause than the user, because it can only work from the information that the user has access to.
  92. 92. lmm||context
    > Humans at least have a mental state that only they are privy to to work from

    Maybe. How do you tell? What would you expect to be different if they didn't?

    > The LLM literally cannot possibly have a deeper insight into the root cause than the user, because it can only work from the information that the user has access to.

    Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.

  93. 93. fluoridation||context
    >Maybe. How do you tell? What would you expect to be different if they didn't?

    I think you're asking how I would know if other people were P-zombies. That's an inappropriate question because I didn't talk about subjective experience, just about internal state. There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.

    Unvoiced thoughts are trickier to prove, but quite often they leave their mark in the person's voiced thoughts.

    >Insight is not solely a function of available input information. Arguably being able to search and extract the relevant parts is a far more important part of having insights.

    LLMs are notoriously bad at judging relevance. I've noticed quite often if you ask a somewhat vague question they try to cold-read you by throwing various guesses to see which one you latch onto. They're very bad at interpreting novel metaphors, for example.

  94. 94. lmm||context
    > I didn't talk about subjective experience, just about internal state. There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.

    Well, sure, but that much is equally true for an LLM with a scratchpad or what have you. (I guess you could say that the user should have access to the LLM's scratchpad and therefore be just as able to understand the state as the LLM itself, but as we move towards the LLM using its own state vectors that's less and less true in practice). I agree that a human may have a mood or secret knowledge or what have you in a way that an LLM wouldn't, but if all you're positing is access to some inert but hidden state then that feels like a Toaster-Enhanced Turing Machine.

  95. 95. fluoridation||context
    >if all you're positing is access to some inert but hidden state then that feels like a Toaster-Enhanced Turing Machine

    I thought it was pretty clear, given the context. What I'm saying is that humans are capable of limited introspection in ways that LLMs are not. They can remember their thought processes and review them ex post facto to answer questions that LLMs cannot. An LLM fundamentally cannot truthfully answer questions such as "why did you do this?" because its entire working memory is held in the context window. It doesn't know to any greater degree than you because it has no more information than you do; just like they are for you, its internal workings are a mystery. I'm not saying LLMs conceptually could not be designed with capabilities similar to a human's in this regard, with some symbolic memory that's capable of some bookkeeping, I'm saying none of the current ones have them.

  96. 96. lmm||context
    > I didn't talk about subjective experience, just about internal state. There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.

    > What I'm saying is that humans are capable of limited introspection in ways that LLMs are not. They can remember their thought processes and review them ex post facto to answer questions that LLMs cannot.

    But now you're making a much stronger claim than merely saying that internal state exists. Humans are capable of telling you a story about what their thought process was (as are LLMs). But whether that story will be accurate, much less contain new insights, is much harder prove.

  97. 97. fluoridation||context
    >But now you're making a much stronger claim than merely saying that internal state exists.

    It's not a different claim, it's the same claim. The reason humans are able to introspect is because they have that internal state.

    >Humans are capable of telling you a story about what their thought process was (as are LLMs)

    No. Humans can tell a story that's informed by introspection, while LLMs can only tell a story without any introspection. Humans may also lie and fabricate, but they are at least capable of introspecting, while LLMs are not.

    >But whether that story will be accurate, much less contain new insights, is much harder prove.

    If you're going to doubt the explanation then what's the point of asking the question? Necessarily it's going to be information that exists only in that person's mind, so at best you can check it for consistency with the person's own behavior and with the report itself, but some things you'll just have to either accept or ignore. Like, fundamentally you're asking the person to describe features of their own mind such as "he gets bored easily", "he can only hold so many facts at once", "he makes worse decisions under pressure", etc. If for example you're asking the question to improve something in the future (such as documentation or some procedure), it doesn't even make sense to distrust such reports, unless you believe a person like the one being described by the explanation doesn't and can't exist.

  98. 98. lmm||context
    > It's not a different claim, it's the same claim. The reason humans are able to introspect is because they have that internal state.

    > No. Humans can tell a story that's informed by introspection, while LLMs can only tell a story without any introspection. Humans may also lie and fabricate, but they are at least capable of introspecting, while LLMs are not.

    There's still a gap here between "has some hidden internal state" and "that state can provide insight into to their thought process". If all you've shown is that knowledge that is public in LLMs is hidden in humans, there's no reason that should make the human better at introspecting (rather, it just makes the human harder to understand from outside).

    > what's the point of asking the question?... If for example you're asking the question to improve something in the future (such as documentation or some procedure)

    Indeed. If we knew that asking this kind of question of a human was more likely to provide insights that improved the process in the future than asking it of an LLM, that would be interesting. But it's quite a leap from "humans can have internal state" to that.

    > unless you believe a person like the one being described by the explanation doesn't and can't exist

    Meaning that a plausible explanation is valuable regardless of whether it's true? Wouldn't that apply just as well to an LLM's explanation?

  99. 99. fluoridation||context
    >There's still a gap here between "has some hidden internal state" and "that state can provide insight into to their thought process".

    No, because that internal state is part of the thought process. That's the whole point. You ask the human a question to learn something that you don't already know. It makes no sense to ask an LLM that because it knows nothing you don't already know; you and the LLM are privy to the exact same information. What's tripping you up about this?

    >If we knew that asking this kind of question of a human was more likely to provide insights that improved the process in the future than asking it of an LLM, that would be interesting.

    So, at this point I must ask: are you an NPC? Do you go through life just reacting to stimuli like a cockroach, with no understanding of why or how you do anything? If you're playing chess and someone asks you about a move you just made you are unable to explain, "I noticed such-and-such so I decided the best course of action was so-and-so to prevent this-and-that"? This is an alien concept to you? If so, then I'm sorry; most of us do not experience our own cognition in this way. We can perceive the formation of our own thoughts as well as the progressive retrieval of information.

    >Meaning that a plausible explanation is valuable regardless of whether it's true? Wouldn't that apply just as well to an LLM's explanation?

    See first paragraph.

  100. 100. lmm||context
    > There's no question about whether other people have internal states. I can show someone a piece of information in such a way that only they see it and then ask them to prove that they know it such that I can be certain to an arbitrarily high degree that their report is correct.

    > No, because that internal state is part of the thought process. That's the whole point. You ask the human a question to learn something that you don't already know.

    If the internal state is entangled enough with in the thought process that it would help with providing insights, sure. But I don't know that humans have such state accessible to them, and the fact that humans can know facts that are not accessible from outside does not in itself convince me of that.

    > It makes no sense to ask an LLM that because it knows nothing you don't already know; you and the LLM are privy to the exact same information.

    OK but why does that mean that the LLM's explanation should be bad/useless, if the only difference is that I have more direct access to the LLM's information than I would to a human's information?

    > So, at this point I must ask: are you an NPC? Do you go through life just reacting to stimuli like a cockroach, with no understanding of why or how you do anything? If you're playing chess and someone asks you about a move you just made you are unable to explain, "I noticed such-and-such so I decided the best course of action was so-and-so to prevent this-and-that"?

    I can tell stories about my own cognition. Those stories feel real to me. But I'm aware that the best available scientific evidence suggests that they're indistinguishable from confabulations.

  101. 101. jayd16||context
    It is non-sensical because you're simply bringing in comparisons without anything linking the two. You might as well be talking about how oranges, and bicycles think as well as that is just as relevant as how humans think in this discussion.

    In fact, talking about "thinking" at all is already the wrong direction to go down when trying to triage an incident like this. "Do not anthropomorphize the lawnmower" applies to AI as much as Larry Ellison.

  102. 102. vidarh||context
    The thing linking the two is that neither are able to accurately introspect and explain the actual reason why they made a decision.

    If thinking is the wrong direction to go down, then it is also the wrong direction to go down when talking about humans.

  103. 103. jayd16||context
    If your plane fails to fly and humans can't fly then we should be looking at the musculature of humans when working on the plane?
  104. 104. tempaccount5050||context
    I think you might be misinterpreting that. I always understood it to mean that when the two hemispheres can't communicate, they'll make things up about their unknowable motivations to basically keep consciousness in a sane state (avoiding a kernel panic?). I don't think it's clear that this happens when both hemispheres are able to communicate properly. At least, I don't think you can imply that this special case is applicable all the time.
  105. 105. vidarh||context
    We have no reason to believe it is a special case. The fact that these patients largely functioned normally when you did not create a situation preventing the hemispheres from synchronising suggests otherwise to me. There's no reason to think the ability to just make things up and treat it as if it is truthful recollection would just disappear because there are two halves that can lie instead of just one.
  106. 106. layer8||context
    The thing is, the LLM mostly just states what it did, and doesn't really explain it (other than "I didn't understand what I was doing before doing it. I didn't read Railway's docs on volume behavior across environments."). Humans are able of more introspection, and usually have more awareness of what leads them to do (or fail to do) things.

    LLMs are lacking layers of awareness that humans have. I wonder if achieving comparable awareness in LLMs would require significantly more compute, and/or would significantly slow them down.

  107. 107. vidarh||context
    Sperry's experiments suggests we don't have that awareness, but think we do as our brains will make up an explanation on the spot.
  108. 108. jayd16||context
    Beyond that, isn't it just going to make up a narrative to fit what's in the prompt and context?

    I don't think there's any special introspection that can be done even from a mechanical sense, is there? That is to say, asking any other model or a human to read what was done and explain why would give you just an accounting that is just as fictional.

  109. 109. mike_hearn||context
    Not necessarily. The people saying that in this thread seem to be forgetting about the encrypted reasoning tokens. The why of a decision is often recorded in a part of the context window you can't see with modern models. If you ask a model, "why did you do that" it isn't necessarily going to make up a plausible answer - it can see the reasoning traces that led up to that decision and just summarize them.
  110. 110. gobdovan||context
    > asking a coding agent “why did you do that” to be illustrating a misunderstanding in the users mind about how the agent works

    I think the same thing, but about agents in general. I am not saying that we humans are automata, but most of the time explanation diverges profoundly from motivation, since motivation is what generated our actions, while explanation is the process of observing our actions and giving ourselves, and others around us, plausible mechanics for what generated them.

  111. 111. badgersnake||context
    Seems like they’ve already reached the point where they’ve forgotten how to think.
  112. 112. khazhoux||context
    > systemic failures across two heavily-marketed vendors that made this not only possible but inevitable.

    > No confirmation step. No "type DELETE to confirm." No "this volume contains production data, are you sure?" No environment scoping. Nothing.

    > The agent that made this call was Cursor running Anthropic's Claude Opus 4.6 — the flagship model. The most capable model in the industry. The most expensive tier. Not Composer, not Cursor's small/fast variant, not a cost-optimized auto-routed model. The flagship.

    The tropes, the tropes!!

    https://tropes.fyi/

  113. 113. levlaz||context
    So if tropes.md works it doesn’t actually solve the problem. You’ll be reading stuff that you think an LLM didn’t write.
  114. 114. jeremyccrane||context
    Not some vibe coder, and AI agents can be incredibly powerful. But yes, the irony is not lost on us!
  115. 115. joenot443||context
    Is there a reason you weren’t able to write the post yourself?
  116. 116. alashow||context
    Vibe coder doesn't realize or denying he is a vibe coder, what other reason did you want
  117. 117. xnx||context
    An LLM will reply with a plausible explanation of why someone would have written the code that it just wrote. Seems about the same.
  118. 118. josephg||context
    > There is something darkly comical about using an LLM to write up

    It feels like a modern greek tragedy. Man discovers LLMs are untrustworthy, then immediately uses an LLM as his mouthpiece.

    Delicious!

  119. 119. razorbeamz||context
    > There is something darkly comical about using an LLM to write up your “a coding agent deleted our production database” Twitter post.

    Which calls into question if this is even real.

  120. 120. foota||context
    While I largely agree, it does raise the prospect of testing this iteratively. E.g., give a model some fake environment, prompt it random things until it does something "bad" in your fake environment, and then fix whatever it claims led to its taking that action.

    If you can do this and reliably reduce the rate at which it does bad things, then you could reasonably claim that it is aware of meaningful introspection.