I'd guess the same has always been true for READMEs / human dev docs. Of course it doesn't transfer directly, but it still feels incredible to be in an age where we can measure such (previously) theoretical things with synthetic programmers.
Interesting that they saw a 100% read rate for AGENTS.md. In my test repo (lower down in the thread), nested AGENTS.md files were occasionally missed by VS Code Copilot. That put me off investing much effort in nesting AGENTS.md files within the repo, and I've been focusing on agent skills instead.
IME, multiple (good) AGENTS.md files are even better. I mostly see them only at the root of a repository, but I spread more into important subdirectories, where they act as a table of contents and SparkNotes for that part of the tree. Putting focused AGENTS.md files in important places has been even more helpful for me than a single root file.
Bonus points if you can force them into context without needing the agent to make a tool call, triggered by the agent touching the files or systems near them (my homegrown agent has this feature); something like the sketch below.
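A minimal sketch of that proximity-based injection, assuming a harness that exposes some hook on file access; the class and hook names here are hypothetical, not any real framework's API. The idea: when the agent touches a file, walk up the directory tree, collect every AGENTS.md along the way, and splice any not-yet-seen ones into context without the agent having to ask.

```python
from pathlib import Path

class AgentsMdInjector:
    """Collects AGENTS.md files near a touched path, deduplicated per session."""

    def __init__(self, repo_root: str):
        self.repo_root = Path(repo_root).resolve()
        self.already_injected: set[Path] = set()

    def docs_for(self, touched_file: str) -> list[str]:
        """Return AGENTS.md contents relevant to a touched file, root-most first."""
        path = Path(touched_file).resolve().parent
        chain = []
        # Walk from the touched file's directory up to the repo root.
        while self.repo_root in path.parents or path == self.repo_root:
            chain.append(path / "AGENTS.md")
            if path == self.repo_root:
                break
            path = path.parent
        contents = []
        for candidate in reversed(chain):  # root-level docs first
            if candidate.exists() and candidate not in self.already_injected:
                self.already_injected.add(candidate)
                contents.append(candidate.read_text())
        return contents

# Hypothetical usage inside a harness's file-access hook:
# for doc in injector.docs_for(path_the_agent_just_read):
#     context.append(system_message(doc))
```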
I suspect the harness (AGENTS.md, skills, and similar scaffolding) should be abstracted away for better overall performance. This article doesn't really go into detail about model preferences, but some other benchmarks show that different models have different preferences for how to use certain tools (probably related to their post-training material), and that should really be managed invisibly to me as the end user.
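One way that abstraction could look, as a hedged sketch: a registry of per-model profiles that the harness consults so the end user never sees the difference. The model names and preference fields here are illustrative assumptions, not measured facts.

```python
from dataclasses import dataclass

@dataclass
class HarnessProfile:
    prefers_parallel_tool_calls: bool
    context_docs_style: str   # e.g. "single-root-agents-md" vs "nested"
    search_tool: str          # e.g. "grep" vs "semantic-index"

# Hypothetical profiles; in practice these would be learned from benchmarks
# rather than hand-written.
PROFILES = {
    "model-a": HarnessProfile(True, "nested", "grep"),
    "model-b": HarnessProfile(False, "single-root-agents-md", "semantic-index"),
}

def configure_harness(model_name: str) -> HarnessProfile:
    # Fall back to a conservative default for unknown models.
    return PROFILES.get(
        model_name,
        HarnessProfile(False, "single-root-agents-md", "grep"),
    )
```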
Also curious how well LLMs can self-reflect in a loop: here's how the previous iteration went, here's what didn't go well, here's feedback from the human; now how do I modify the docs I use so I actually do better next time?
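That loop could look something like this sketch, assuming a generic `llm(prompt)` callable and some run/feedback harness; every name here is hypothetical.

```python
def reflect_and_revise(docs: str, transcript: str, human_feedback: str, llm) -> str:
    """Ask the model to rewrite its own working docs given last run's evidence."""
    prompt = (
        "You maintain the AGENTS.md-style docs you rely on while coding.\n"
        f"Current docs:\n{docs}\n\n"
        f"Transcript of the previous attempt:\n{transcript}\n\n"
        f"Human feedback on that attempt:\n{human_feedback}\n\n"
        "Identify what went wrong, then output a revised version of the docs "
        "that would have prevented those mistakes. Output only the new docs."
    )
    return llm(prompt)

# Hypothetical outer loop: run, collect feedback, revise, repeat.
# for _ in range(n_iterations):
#     transcript = run_task(docs)
#     feedback = get_human_feedback(transcript)
#     docs = reflect_and_revise(docs, transcript, feedback, llm)
```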
I know you can somewhat hill-climb on this via DSPy, but that's hard to generalize.