Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview (github.com)
Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:
1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever
2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts)
3. The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.
I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers have not responded, unfortunately (there is a large backlog of pull requests on their HF), so I decided to post anyway.
HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...
It is astounding how much the harness matters, based on this and other experiments I have done.
The harness was https://www.npmjs.com/package/dirac-cli
Since Dirac is a heavily modified fork of Cline, it supports all the models Cline supports, including Qwen and all popular open/closed models
In fact, I am trying to run TerminalBench 2.0 with some OSS models at the moment, but the slow inference speeds are causing tasks to time out
1. Uses an optimized version of hash-anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token); a sketch of the idea follows this list
2. Utilizes the language's AST to decide what to fetch into context, entirely avoiding large code file reads
3. Batches all operations, doing a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here: https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)
4. Allows the model to execute code to analyze things on the fly, so it can simply write a bash/python/perl script to accomplish things where appropriate
5. A lot of context curation and opportunistic context updates, i.e. putting into context anything you are certain the model would ask for next
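For anyone curious, here is a minimal sketch of how I understand the hash-anchored edits from the linked post (hypothetical TypeScript, not the actual Dirac implementation): every line gets a short content hash, the model references lines by hash instead of by line number, and an edit is rejected if the anchor no longer matches.

    import { createHash } from "crypto";

    // Short per-line content hash; the model refers to lines by these anchors
    // instead of by line number, so stale or misplaced edits fail fast.
    function lineHash(line: string): string {
      return createHash("sha256").update(line).digest("hex").slice(0, 6);
    }

    interface AnchoredEdit {
      anchor: string;      // hash of the line the model intends to edit
      replacement: string; // new content for that line
    }

    function applyAnchoredEdits(source: string, edits: AnchoredEdit[]): string {
      const lines = source.split("\n");
      // Note: a real implementation must disambiguate duplicate lines.
      const byHash = new Map(lines.map((l, i) => [lineHash(l), i] as const));

      for (const edit of edits) {
        const idx = byHash.get(edit.anchor);
        if (idx === undefined) {
          // The file drifted since the model last saw it; do not guess.
          throw new Error(`stale anchor ${edit.anchor}`);
        }
        lines[idx] = edit.replacement;
      }
      return lines.join("\n");
    }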
Another annoying thing about plain grep: LLMs often end up pulling in bundled packages when grepping, where a single line can be large enough to ruin the context window
It's very effective in well-written and well-designed codebases, where concepts tend to be well formed enough not to be named the same as everything else, so grepping for symbols gives you good search results.
Projects where the god object or core concepts have generic names like "Tree", "Node", or other things that are used everywhere tend to be next to impossible to search with grep and friends.
I would not be comfortable doing an on-the-fly "rewrite all subtrees that match this pattern" kind of edit.
It seems like a tool that's good for LLMs though.
https://www.context-master.dev/blog/deterministic-semantic-c...
Let me know what you think.
I saw the tools page where, if I understand right, `get-symbol-context` is actually the main useful tool of what you provide? The others seem like metadata that's easy to get already (?), but that tool provides the extra info.
I had been working on exposing mine as more high-level, i.e. multiple APIs to query different kinds of metadata about symbols, types, etc. But I am still not sure of the best approach; my thinking was about not overloading the AI with too many different tools. They accumulate quickly.
I would say the two main tools are get-symbol-context and get-repository-overview. The latter is actually the more complex and sophisticated one. I'm running some graph algorithms to rank the symbols in terms of relative importance based on centrality metrics, i.e. how well connected they are in the symbol graph.
The idea behind that is to allow the LLM to infer the general structure and architecture of the project with just one tool call.
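Roughly, the kind of ranking I mean (a simplified sketch using plain degree centrality, just to show the shape, not the exact metrics I use):

    // Symbol graph: each symbol maps to the symbols it references.
    type SymbolGraph = Map<string, Set<string>>;

    // Rank symbols by degree centrality (in-degree + out-degree), so the LLM
    // sees the most connected, architecturally central symbols first.
    function rankByCentrality(graph: SymbolGraph): string[] {
      const score = new Map<string, number>();
      for (const [from, refs] of graph) {
        score.set(from, (score.get(from) ?? 0) + refs.size);
        for (const to of refs) {
          score.set(to, (score.get(to) ?? 0) + 1);
        }
      }
      return [...score.entries()]
        .sort((a, b) => b[1] - a[1])
        .map(([symbol]) => symbol);
    }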
I guess you could get something similar if you had a good AGENTS.md or docs detailing that for your project, but this was meant more to get there on the fly.
The symbol-context tool is basically a graph query tool (without a DSL or Cypher support yet), but yeah, here the question is also whether it makes more sense to give the AI the ability to run Cypher queries itself or to abstract that away behind a thinner API.
The main underlying factor of my tool, however, is the graph that I'm building and the metadata which can be extracted from it (connections, type of connection, etc.) :)
What's the metadata you have in mind?
So a query on a symbol would:
* Return its type declaration, not (just) location (and I'm considering some kind of summary version where it pulls in the ancestors too, so you directly see everything it has available not just the actual declaration, because leaf nodes in inheritance often don't add much and the key behaviour is elsewhere)
* Return info about inheritance: the shape of how this modifies other code and how other code modifies it.
With variations when the symbol is a variable, a type, etc. I'm currently using tree-sitter for this, to bypass LSP and (for the language I'm working on) build a full symbol table and more, to get something closer to the LSP info you mention in your blog but not limited to what LSP makes available. I don't want to rely on an LSP server; I think first-class support per language is better. It's probably possible to generate this with a set of LSP calls, but it might take some heuristics and guesswork and... :/
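To give a flavour of the kind of tree-sitter pass I mean, here is a rough sketch (node bindings, JavaScript grammar; field names vary per grammar, and a real symbol table needs far more than this):

    import Parser from "tree-sitter";
    import JavaScript from "tree-sitter-javascript";
    import { readFileSync } from "fs";

    const parser = new Parser();
    parser.setLanguage(JavaScript);

    // Rough per-file skeleton: class names plus their method names and
    // positions, without pulling entire bodies into context.
    function skeleton(path: string) {
      const tree = parser.parse(readFileSync(path, "utf8"));
      return tree.rootNode.descendantsOfType("class_declaration").map((cls) => ({
        name: cls.childForFieldName("name")?.text,
        line: cls.startPosition.row + 1,
        methods: cls
          .descendantsOfType("method_definition")
          .map((m) => m.childForFieldName("name")?.text),
      }));
    }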
I do have a graph of file-level dependencies, but not yet a graph of what calls what at the symbol or type or method level. And while I build an index of all symbols I haven't yet sorted that by count.
I get the sense we're thinking along similar lines, with slightly different approaches?
Edit: if you would like to chat on this, I'm up for it! You can find me at my username at gmail (easy to lose emails there due to volume and spam!) or my profile has my website which has my LinkedIn (horribly, more reliable :D)
It sure sounds like we have similar things in mind. I basically try to build the proper graph representation of the code during runtime, so all caller/callee relationships plus type inheritance chains etc. This is basically what I call a semantic code graph in the blog post.
From the things I tried with tree-sitter, I think I would have a hard time achieving the same, because by nature tree-sitter can only make educated guesses about real connections and will run into problems if things are named ambiguously.
But yeah, will definitely reach out and am looking forward to chatting :) Hope I find the time during this week!
The immediate downside is that mapping variable name to token and back would probably require indexing the whole codebase. You'd need a 1:1 mapping for every name that was in scope, and would probably need to be clever about disambiguating names that come in and out of scope.
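The bookkeeping I have in mind is roughly this (an illustration only; the scope IDs and naming scheme are made up):

    // Bidirectional name <-> token mapping, keyed by (scope, name) so that
    // shadowed names that come in and out of scope stay unambiguous.
    class ScopedNameMap {
      private nameToToken = new Map<string, string>();
      private tokenToName = new Map<string, string>();
      private counter = 0;

      intern(scopeId: string, name: string): string {
        const key = `${scopeId}:${name}`;
        let token = this.nameToToken.get(key);
        if (!token) {
          token = `t${this.counter++}`;
          this.nameToToken.set(key, token);
          this.tokenToName.set(token, key);
        }
        return token;
      }

      resolve(token: string): string | undefined {
        // Strip the scope prefix to recover the original name.
        return this.tokenToName.get(token)?.split(":").slice(1).join(":");
      }
    }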
Changes that are primarily code refactorings, like breaking up a large module into a bunch of smaller ones or renaming a commonly used class, are extremely tedious to review, both in LLM-generated diffs and human-written PRs. You still have to do it: LLMs have a habit of mangling comments when moving code across files, while for a human, an unassuming "rename FooAPIClient to LegacyFooAPIClient" PR is the best place to leave a backdoor when taking over a developer's account. Nevertheless, many developers just LGTM changes like this because of the tedium involved in reviewing them.
If one could express such changes as a simple AST-wrangling script in a domain-specific language, which would then be executed in a trusted environment after being reviewed, that would decrease the review burden considerably.
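Part of this exists today in the form of reviewable codemods. For example, the "rename FooAPIClient to LegacyFooAPIClient" change from above could be a ~10-line ts-morph script (a sketch, assuming a TypeScript codebase), and reviewing the script is far cheaper than reviewing the diff it produces:

    import { Project } from "ts-morph";

    // Reviewable codemod: rename a class and every reference to it.
    const project = new Project({ tsConfigFilePath: "tsconfig.json" });

    for (const sourceFile of project.getSourceFiles()) {
      // rename() updates all references project-wide, not just the declaration.
      sourceFile.getClass("FooAPIClient")?.rename("LegacyFooAPIClient");
    }

    project.saveSync();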
I believe that with agentic development, the most important constraint we have is human time. Making the LLM better and faster won't help us much if the human still needs to spend a majority of their time reading code. We should do what we can to give us less code to read, without losing confidence in the changes that the LLM makes.
It screams of inexperience building real software. If I were Anthropic, I'd hire devs for Claude Code who aren't just AI builders but tool builders, who care about UX and systems.
> Sure they can build ML models, but I see how they improve upon them after years, and its always some really old "lesson learned" elsewhere in the industry. There's a thousand projects that make things like Claude Code use less tokens, and edit more efficiently, and nobody at Anthropic or Codex implements a single one of these approaches.
They have fully internalized the bitter lesson; the result is they get better returns improving the next model over squeezing out performance from the current one.
Looking at Anthropic's status info for the last 90 days only serves to prove that they aren't hiring the right people for the right roles.
> They have fully internalized the bitter lesson; the result is they get better returns improving the next model over squeezing out performance from the current one.
Sure, but there are so many things they could be doing that don't require tweaking the model directly. The community builds all sorts of tools that improve Claude Code directly, and yet nobody at Anthropic takes any initiative in those directions. It feels like either they don't care about building user-facing software, or they don't have any UX experience.
Does that mean that it's only going to work with certain languages for which it has parsers available?
The agent works even without a language parser; it's just that the AST-based functionality won't be available
Congratulations, great work.
I wasn't sure what this meant, so I looked at the source. It seems to be referring to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience btw, models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)
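In other words, something of this shape (a hypothetical tool definition, not Dirac's actual schema): one call carries a list of targets, rather than hoping the model emits N parallel read calls on its own.

    // Batched tool: one call reads many files.
    const readFilesTool = {
      name: "read_files",
      description: "Read several files in a single call",
      inputSchema: {
        type: "object",
        properties: {
          paths: {
            type: "array",
            items: { type: "string" },
            description: "Workspace-relative paths to read",
          },
        },
        required: ["paths"],
      },
    };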
My conclusion is that the efficiency Dirac sees comes mainly from showing the file skeleton by default
https://github.com/jahala/tilth
How hard do you think it would be to bring this optimization to oh-my-pi and opencode? I am testing Dirac and it's very cool, but the tooling isn't there yet compared to oh-my-pi in terms of UX.
Where the SOTA model just has a cheaper model make the edits, and it does so.
Curious to know if this has been an issue with your AST approach on larger projects?
The hash-anchored line numbering is very interesting too (though on Opus 4.5+ I see far, far fewer editing errors).
I've often thought that even if model progress stopped today, we'd still have _years_ of improvements thru harness iteration.
For AST, it uses tree-sitter WASMs (shipped with the package) and maintains queries (https://github.com/dirac-run/dirac/tree/master/src/services/...)
To keep things fast, it stores the symbol DB (SQLite) in the workspace's directory and incrementally updates it based on timestamps, then uses this DB to resolve symbol queries
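The incremental part is essentially mtime bookkeeping: only reparse files whose timestamp changed since the last index pass. Rough shape of it (hypothetical schema, path, and library choice, using better-sqlite3; the real implementation differs):

    import Database from "better-sqlite3";
    import { statSync } from "fs";

    const db = new Database(".dirac/symbols.db"); // hypothetical location
    db.exec("CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime REAL)");

    // Reparse a file only if its mtime changed since it was last indexed.
    function needsReindex(path: string): boolean {
      const mtime = statSync(path).mtimeMs;
      const row = db.prepare("SELECT mtime FROM files WHERE path = ?").get(path) as
        { mtime: number } | undefined;
      if (row && row.mtime === mtime) return false;
      db.prepare(
        "INSERT INTO files (path, mtime) VALUES (?, ?) " +
        "ON CONFLICT(path) DO UPDATE SET mtime = excluded.mtime"
      ).run(path, mtime);
      return true;
    }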
Like, even "full" Visual Studio and ReSharper have issues with this. E.g., you start editing file x, IntelliSense runs and says there are loads of errors... because you haven't finished editing yet.
How does this perform in day to day coding tasks, outside of benchmarks?
The README has an eval of 8 tasks over 7 agents (including both pi and omp). Pi-mono costs second lowest across the 8 tasks (after Dirac) but occasionally produces incomplete changes.
Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST symbol understanding (e.g. find all instances that refer to this symbol and change them). Since pi relies on bash-style tooling, it missed some occurrences
Any ideas?
In my tests it worked with gpt-5.4, and I assumed gpt-5.5 was not available to me because I am on the free plan
Do you have the subscription that allows 5.5? If so, I can look into what changed in the API. Sorry, I rarely use OpenAI, so it is a bit of an untrodden path
That was the issue, 5.4 works just fine.
Support for service: priority (GPT /fast mode) would also be cool!
Is there a leaderboard out there comparing harness results using the same models?
History indicates you can't tool and harness your way to effectively competing against a smarter model with more compute.
Somewhat remarkably, Claude Code ranks last for Opus 4.6, which may say something about CC, or say something about the benchmark
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0
That said, context management seems to be solving today's model problems more than being a universal property, and will probably be obsoleted a few model generations down the road, the way tool use obsoleted RAG-style context injection from question embeddings.
I created a real-world benchmark for mining, oil & gas, construction, etc. called FieldOps-bench, and it basically proves that vertical agents with specialized harnesses, tools, and systems still outperform SOTA models alone.
The problems I've experienced are it being less adept at picking the right bash commands to build and test the Go app, and not following idiomatic Go or codebase patterns for changes.
A skill hasn’t helped much.
Will need to try this and opencode next.
It is so refreshing to see real FOSS and not a grift. Simple OpenRouter API key, and I'm going.
This is what I'm using from now on. You are doing the best work in this space.
2. Until then, your landing page needs to mention that all the numbers are just from running on Gemini 3 Flash. Currently there's no mention of Gemini at all.
3. Assuming that cheaper also means faster in this case, where the model is equal? If so, why not add this to the benchmarks to highlight another advantage: time until completion of the tasks. If it's the opposite and it takes longer (seems unlikely), it would be transparent to note this.
4. Would be good to note if it does or does not support skills, (nested) AGENTS.md, MCP and so on for people considering migrating.
1. I have been trying to benchmark open-weights models but keep running into timeouts due to slow inference (TerminalBench tasks have strict timeouts that you are not allowed to modify). Posted my frustration here: https://www.reddit.com/r/LocalLLaMA/comments/1stgt39/the_fru...
2. Done (updated github readme)
3. Yes, on average the times were shorter, but I did not benchmark it because at random times the model outputs get slower, so it would not be a rigorous benchmark
4. Added info on this too
3. Maybe you could instead provide a measure of output tokens used (including thinking), as that's a reasonable proxy for speed. I guess input tokens would be similar unless the AST usage, hashes, etc. increase them a lot? Seems unlikely.
I guess it makes sense that models don't generalize perfectly to arbitrary tools but are biased toward those in their training data, especially for a common operation like editing files.
The Gemini family might be a good pick here since it generally underperforms in agentic tasks (due to lack of training data or other reasons) and thus might not have this inherent bias towards specific tools.
- what tool do you need?
- what would the parameters for the tool be?
- what method do you want to read?
instead of sending a few kilobytes of build output and waiting for a response. Oh well... good thing someone already did that!
This inspired me to start a "skill distillery" [0] where I take good agent workflow ideas and turn them into small, inspectable/installable skills.
The first one is dirac-workflow, based on Dirac's structural code workflow. It's not a Dirac clone, though: it has no runtime, persistent AST index, hash-anchor editing engine, or benchmark harness, just a small AST helper and the workflow discipline as a portable skill.
I also dogfooded it on the Dirac repo itself and included a short report.
Would appreciate feedback from the original author, if the prompts and tools [1] are representative.
[0] https://github.com/ouatu-ro/skill-distillery
[1] https://github.com/ouatu-ro/skill-distillery/blob/main/skill...
1. Context management - specifically pruning old tool call responses, truncation of tool output, and automatic compaction. Those have worked pretty well for me; the benefits of greatly reducing context seem to outweigh the gains from "remembering" everything. I always leave short summaries though (rough sketch after this list).
2. "Subagents" - my latest attempts revolve around not exposing any tools for the main agent at all, except for a run_agent tool where the subagent has access to the classic search/execute/fetch tools. My theory is that if subagents return concise summaries this would automatically keep the parent agent context clean for much longer. Still experimenting though, writing prompts for subagents may also be too far outside of the current training sets.
1. Context management - don't bother with pruning unless your API doesn't support caching. Every prune breaks the cache, and you lose the 90% cache-read discount
2. I did some work improving Cline's subagent feature that Dirac inherited. In my experience, not all models are trained to delegate work effectively, so YMMV. A common pitfall to watch: what happens if one or more subagents get stuck in a loop, or for whatever reason never return? You need a mechanism to control them from the main agent (sketch below)
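One simple control mechanism (a sketch, not the actual Dirac code) is to give each subagent a hard deadline from the parent and race it against a timeout, so the parent always gets something back:

    // Run a subagent with a hard deadline. The parent gets a summary either
    // way instead of hanging on a subagent stuck in a loop.
    async function runWithBudget(
      runSubagent: (task: string) => Promise<string>,
      task: string,
      timeoutMs: number
    ): Promise<string> {
      const timeout = new Promise<string>((resolve) =>
        setTimeout(
          () => resolve(`[subagent timed out after ${timeoutMs}ms on: ${task}]`),
          timeoutMs
        )
      );
      return Promise.race([runSubagent(task), timeout]);
    }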
2. That's a good hint. I'm currently only trying tighter turn and token limits for subagents, with an error summary when they are exceeded. Not sure how else (besides steering and prompt engineering) to ensure the subagent doesn't go wild...
1. Telemetry to dirac.run/v1/event — Sends machine ID, token usage, model info, events, errors (first 500 chars), and platform info. Hardcoded API key. Opted in by default (the setting defaults to "unset", not "disabled").
2. Feature flags from dirac.run/v1/event/decide — Polls every 60 minutes with your machine ID. Always enabled, independent of telemetry opt-out. No way to disable without code changes.
3. Web tools route through api.dirac.run — Web search and web fetch tools proxy through Dirac's own API server, sending your request content plus system headers (platform, version, machine ID).
4. Model list fetches — Calls OpenRouter, HuggingFace, Groq, etc. for model listings even when using the Anthropic provider.
This is something that needs to be deprecated entirely. The web fetch tool is no longer used and no longer works; there is nothing even listening at api.dirac.run. This was the result of me stretching my capacity too thin and bulk-renaming cline.bot to dirac.run
UPDATE (+1h): both Web search and web fetch tools are now nuked.