Show HN: Caliper – pass@k reliability testing for Claude Code and Codex skills (github.com)
I also faced the same problem, so I tried to build something lightweight to stop doing that: Caliper.
It's a local, lightweight harness that runs a skill k times in isolated environments and gives you a pass@k score (how many of those k runs succeeded). With a non-deterministic technology, you can't just say "it worked once". You need to know how many times it passed out of k.
You define success in a YAML spec. I picked YAML to have a schema while keeping it readable for a human. You can use an LLM judge, a Python assertion, or both.
Here's a simple evaluation example for JSON extraction. You write this in a YAML file:
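As a rough illustration of the score described above (successes divided by k), here is a minimal sketch. The function name `pass_rate` and the list-of-booleans input are assumptions made for the example, not Caliper's actual code or API.

```python
def pass_rate(results):
    """Fraction of runs that succeeded, given one boolean per run.

    Hypothetical helper: it mirrors the "how many of the k runs passed"
    definition from the post, nothing more.
    """
    if not results:
        raise ValueError("need at least one run")
    # True counts as 1 and False as 0, so sum() is the number of passes.
    return sum(results) / len(results)


# Five runs, three of which passed.
runs = [True, False, True, True, False]
print(f"{sum(runs)}/{len(runs)} = {pass_rate(runs):.0%}")
```

A single run can only ever report 0% or 100%, which is why "it worked once" says so little about reliability.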
tasks:
  - name: Extracts action items as clean JSON
    prompt: "Read /tmp/transcript.txt and write the
      action items to /tmp/actions.json."
    expect: "A valid JSON array where every item has
      owner, task, due. No markdown fences."
    assert: |
      import json
      items = json.load(open("/tmp/actions.json"))
      assert isinstance(items, list)
      assert all({"owner","task","due"} <= i.keys()
                 for i in items)
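Before spending agent runs, it can help to try the assertion itself against hand-written payloads. The sketch below wraps the same two checks from the spec in a function and feeds it sample data through a temp file. The helper names `check_action_items` and `passes` are hypothetical, and this runs outside Caliper entirely.

```python
import json
import os
import tempfile


def check_action_items(path):
    """Same two checks as the assert block in the spec above."""
    with open(path) as f:
        items = json.load(f)
    assert isinstance(items, list)
    assert all({"owner", "task", "due"} <= i.keys() for i in items)


def passes(payload):
    """Write payload to a temp JSON file and report whether the check accepts it."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(payload, f)
    try:
        check_action_items(f.name)
        return True
    except (AssertionError, AttributeError):
        # AttributeError covers list items that are not objects (no .keys()).
        return False
    finally:
        os.unlink(f.name)


good = [{"owner": "Ana", "task": "Send notes", "due": "Friday"}]
print(passes(good))               # complete item
print(passes([{"owner": "Ana"}])) # missing task and due
print(passes([]))                 # empty array
```

One thing this kind of dry run surfaces: an empty array satisfies both checks, since `all()` over nothing is true. If "no action items" should count as a failure, the spec would need an extra assertion such as a length check.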
Then you run it with the CLI: caliper run extract-actions.eval.yaml --k 5 --baseline
What's cool about the --baseline flag is that it re-runs everything without the skill, so you can see whether the skill is doing the work or the base agent was going to pass anyway:
ID      Task                            k(5)   pass@k
task-1  Extracts action items as JSON   5/5    100% PASS

With skill   100%
No skill      60%
Delta        +40%
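The delta in the output above is the pass rate with the skill minus the pass rate without it. A minimal sketch of that arithmetic follows. The function name `skill_delta` and the boolean-list inputs are assumptions for illustration, not how Caliper computes it internally.

```python
def skill_delta(with_skill, without_skill):
    """Difference in pass rate, in percentage points (hypothetical helper)."""
    def rate(results):
        return sum(results) / len(results)

    # round() absorbs float noise such as 40.00000000000001.
    return round((rate(with_skill) - rate(without_skill)) * 100)


# Numbers from the sample output: 5/5 with the skill, 3/5 without.
print(skill_delta([True] * 5, [True, True, True, False, False]))

# A skill that breaks a task the base agent already handled.
print(skill_delta([False] * 5, [True] * 5))
```

A delta of 0 means the base agent would have passed just as often on its own, and a negative delta means the skill made things worse.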
Most models get the JSON right most of the time (JSON extraction was already solved two years ago). But that's exactly it: "most of the time" is the bug. The delta shows how much the skill actually helped. (It's sometimes 0%, sometimes -100%!)

I also created two skills so you can get started right away with your favorite harness, e.g. Claude Code, Codex or Pi:
- evaluate-skill: run and manage evals without leaving your workflow
- grill-skill: reads your SKILL.md, interviews you about what "good" looks like, writes a 3-task spec (happy path, edge case, adversarial), and runs it
You can install the skill with the command: npx skills@latest add edonadei/caliper
For now I support claude-code, codex, pi, claude-api, and openai-api. The agent and the judge can run on separate backends, so you can run a skill on one and judge with another.
GitHub: https://github.com/edonadei/caliper
PyPI: https://pypi.org/project/caliper-eval/
Of course, it's a first step. I think the autorater layer can be vastly improved, and there's room for more handholding when creating and iterating on evaluation specs, and for supporting more harnesses. Why not even fold this layer into a bigger self-improvement system?
If you're also building agentic evaluations, I'm genuinely interested to hear how you are handling that.