How many tokens does it save?
Every figure is measured on a real string: the tool's actual JSON output versus the source an agent would otherwise read, tokenized with a real BPE tokenizer. The suites are systematic sweeps over RoselineMCP's own source, so the distribution is representative, weak cases included.
What each tool returns vs. reading the file
Where the file outline wins — and where it can't
The first run of this benchmark caught the search_symbols outline costing tokens: it repeated the file path and fully-qualified name on every symbol. That was fixed: the outline now returns a lean projection, and the median went from −45% to +30%.
What remains is a limit, not a bug. Outline wins big on body-heavy files (Program.cs +89%, CodeFixService.cs +76%) but loses on declaration-only files (IDiagnosticFilterService.cs −92%), because an interface is already just signatures and the outline has nothing to strip. Read those directly; outline the big ones.
Two baselines, honestly
B1 (whole-file) = the file(s) an agent must open: realistic but generous. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model): a conservative floor. “Weak” = tasks that saved under 25% vs B1.
| Tool | Scenario | n | median | pooled | range | vs grep | weak |
|---|---|---|---|---|---|---|---|
| search_symbols | File outline instead of reading the file | 53 | 30% | 39% | −92%…89% | — | 25/53 |
| get_symbol_info | Symbol metadata instead of reading the file | 282 | 69% | 84% | −44%…97% | −539% | 13/282 |
| get_symbol_info | Go-to-definition (includeSource=true) — the honest weaker case | 282 | 53% | 53% | −165%…97% | −693% | 91/282 |
| find_references | Reference list instead of reading every referencing file | 50 | 85% | 85% | 21%…97% | −75% | 2/50 |
| find_implementations | Implementation list instead of reading candidate files | 10 | 83% | 92% | 57%…97% | — | 0/10 |
| get_call_graph | Caller list instead of reading the caller files | 70 | 83% | 81% | 6%…96% | — | 2/70 |
| rename_symbol | Emitting a diff instead of rewriting whole files (illustrative) | 3 | 84% | 81% | 70%…88% | — | 0/3 |
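To make the sign convention in the table concrete, here is a minimal Python sketch of the per-task arithmetic. It assumes a per-task saving is computed as 1 − tool ÷ baseline, mirroring the pooled formula given in the methodology; the function names and token counts are hypothetical and are not taken from the benchmark harness.

```python
def savings(tool_tokens: int, baseline_tokens: int) -> float:
    """Fraction of baseline tokens saved; negative means the tool cost more."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - tool_tokens / baseline_tokens


def is_weak(tool_tokens: int, b1_tokens: int, threshold: float = 0.25) -> bool:
    """'Weak' per the table legend: the task saved under 25% versus B1."""
    return savings(tool_tokens, b1_tokens) < threshold


# Hypothetical token counts, chosen only to show the arithmetic.
print(f"{savings(110, 1000):+.0%}")  # tool output far smaller than the file
print(f"{savings(192, 100):+.0%}")   # tool output larger than the file
print(is_weak(192, 100))             # a negative saving always counts as weak
```

Read this way, a figure such as −92% means the tool output was roughly 1.92 times the size of the file it was meant to replace.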
Reading the two baselines
- vs. whole files (B1): the read tools cut 54–91% — a file is far bigger than the one fact you needed from it.
- vs. an optimal grep (B2): the tools are roughly comparable in raw tokens (find_references ≈ grep -C3). Their edge over grep is structure and precision (resolved symbols, cross-project references, real call edges), not raw count. get_symbol_info with includeSource=true (~55%) is shown separately: you get the body back, so it doesn't pretend to replace reading code you're about to edit.
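The B2 baseline can be pictured as follows. This is a rough Python sketch of a "relevant lines ±3" window, assuming overlapping windows are merged so no line is counted twice; the helper name and sample file are hypothetical, and the actual harness may build B2 differently (real grep -C3 also prints separator lines between groups, which this sketch omits).

```python
def targeted_lines(source: str, hits: list[int], context: int = 3) -> str:
    """Return each hit line plus `context` lines on either side.

    `hits` are 1-based line numbers. Overlapping windows are merged,
    so a line shared by two windows appears only once.
    """
    lines = source.splitlines()
    keep: set[int] = set()
    for hit in hits:
        lo = max(1, hit - context)
        hi = min(len(lines), hit + context)
        keep.update(range(lo, hi + 1))
    return "\n".join(lines[i - 1] for i in sorted(keep))


# Hypothetical 100-line file with two nearby hits and one far away.
source = "\n".join(f"line {n}" for n in range(1, 101))
b1 = source                                      # B1: the whole file
b2 = targeted_lines(source, hits=[10, 12, 80])   # B2: hits plus/minus 3 lines

print(len(b1.splitlines()), len(b2.splitlines()))  # 100 lines versus 16
```

The gap between the two line counts is why B1 is the generous baseline and B2 the conservative one.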
Output tokens: a diff, not a rewrite
A different axis: the tokens an agent must emit. rename_symbol returns a unified diff instead of the full text of every file it touches, a pooled 81% reduction across 3 illustrative renames.
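The effect is easy to reproduce in miniature. The sketch below uses Python's standard difflib on a made-up 60-line file and compares character counts as a stand-in for tokens; the file contents, symbol names, and paths are hypothetical, and this is not how rename_symbol itself produces its diff.

```python
import difflib

# Hypothetical 60-line file in which the renamed symbol appears on two lines.
before = [f"    var x{n} = Compute({n});\n" for n in range(60)]
before[5] = "    var total = OldName(1);\n"
before[40] = "    return OldName(total);\n"
after = [line.replace("OldName", "NewName") for line in before]

# Option 1: emit the whole rewritten file.
full_rewrite = "".join(after)

# Option 2: emit only a unified diff (difflib defaults to 3 context lines).
diff = "".join(
    difflib.unified_diff(before, after, fromfile="a/Example.cs", tofile="b/Example.cs")
)

# Character counts are used here as a rough proxy for output tokens.
reduction = 1 - len(diff) / len(full_rewrite)
print(f"rewrite: {len(full_rewrite)} chars, diff: {len(diff)} chars, reduction: {reduction:.0%}")
```

The diff grows with the number of changed lines, not with file size, so the reduction is largest when a rename touches a few lines in big files.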
- Every measured number is a real string. The tool output is the exact JSON the MCP tool serializes (System.Text.Json, indented). The baseline is the actual bytes of the source an agent would otherwise read.
- Tokens are counted with the cl100k_base BPE tokenizer (Microsoft.ML.Tokenizers, the gpt-4 encoding) — a documented, reproducible proxy for Claude's tokenizer, which is not published as a library. Character counts are included so nothing hinges on one tokenizer.
- Two baselines: B1 (whole-file) = the full text of the file(s) an agent must open to answer, matching how coding agents actually read. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model), a conservative lower bound on savings.
- The headline pools clear navigation wins (outline, get_symbol_info metadata, find_references, find_implementations, get_call_graph). It excludes get_symbol_info includeSource=true (shown separately as the weaker case) and edit output (an output-token, not context, axis).
- The tools ran against RoselineMCP's own solution (dogfooding). The symbol and file suites are systematic sweeps over every candidate — not a hand-picked selection — so the distribution (min/median/mean/max) is representative, weak cases included.
- Pooled savings weight by size (1 − Σtool ÷ Σbaseline); median savings weight every task equally. Both are reported because they answer different questions.
- The tokenizer is a proxy. Claude's exact counts differ, but code tokenizes similarly across modern BPE tokenizers, so the order of magnitude holds.
- Whole-file (B1) is the realistic baseline but the generous one: a disciplined agent that greps first lands nearer B2. Real agent behavior is between the two — both are shown.
- These tools save tokens on navigation and orientation. They do NOT remove the need to read code you are about to edit in depth; get_symbol_info(includeSource=true) makes that explicit.
- find_references / find_implementations baselines assume you read whole referencing files to be as complete as the tool (which searches the whole solution). If those references sit in large test files, the whole-file baseline is large — the targeted (B2) column keeps that honest.
- Results are specific to this codebase and its file sizes. A repo of tiny files saves less; a repo of large files saves more. Re-run on your own solution to get your numbers: dotnet run --project RoselineMCP.TokenBenchmark.
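The difference between pooled and median savings described above can be seen with three made-up tasks. This Python sketch applies the stated pooled formula (1 − Σtool ÷ Σbaseline) and, as an assumption, takes the median over per-task savings of 1 − tool ÷ baseline; the token counts are hypothetical.

```python
from statistics import median

# Hypothetical (tool_tokens, baseline_tokens) pairs, one per task.
tasks = [(50, 100), (30, 100), (200, 4000)]

# Pooled: 1 - sum(tool) / sum(baseline). Large tasks dominate the result.
pooled = 1 - sum(t for t, _ in tasks) / sum(b for _, b in tasks)

# Median: every task counts once, whatever its size.
per_task = [1 - t / b for t, b in tasks]
med = median(per_task)

print(f"pooled {pooled:.0%}, median {med:.0%}")  # pooled 93%, median 70%
```

One large task pulls the pooled figure well above the median, which is why the two columns in the table can differ for the same tool.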
Run it on your own solution
The harness loads the solution once via MSBuild, runs the real services, tokenizes their output, and writes the JSON this page renders.
git clone https://github.com/Atypical-Consulting/RoselineMCP
cd RoselineMCP
dotnet run --project RoselineMCP.TokenBenchmark -c Release
Output: website/src/data/benchmark-results.json, stamped with the commit and date. Tokenizer: cl100k_base (Microsoft.ML.Tokenizers, gpt-4), a proxy for Claude's tokenizer.