Measured · reproducible

How many tokens does it save?

Every figure is a real string: the tool's actual JSON output versus the source an agent would otherwise read, tokenized with a real BPE tokenizer. The suites are systematic sweeps over RoselineMCP's own source, so the distribution is representative. Weak cases included.

81%

pooled reduction · size-weighted

73%

median per task

934,530

tokens saved on the sample

1,148,463

baseline · reading the files

Target RoselineMCP.sln · RoselineMCP · Files 53 · Symbols 282 · Tokenizer cl100k_base · Commit dab3b42 · Date 2026-07-03

Savings by tool

What each tool returns vs. reading the file

Median savings per task vs. reading the whole file. Positive = tokens saved; negative = tokens added.

Where the file outline wins — and where it can't

The first run of this benchmark caught the search_symbols outline costing tokens: it repeated the file path and fully qualified name on every symbol. That was fixed. The outline now returns a lean projection, and the median went from −45% to +30%.

What remains is a limit, not a bug. Outline wins big on body-heavy files (Program.cs +89%, CodeFixService.cs +76%) but loses on declaration-only files (IDiagnosticFilterService.cs −92%), because an interface is already just signatures. Read those directly; outline the big ones.

Full results

Two baselines, honestly

B1 (whole-file) = the file(s) an agent must open, realistic but generous. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model), a conservative floor. “Weak” = tasks that saved under 25% vs B1.

Tool	Scenario	n	median	pooled	range	vs grep	weak
search_symbols	File outline instead of reading the file	53	30%	39%	−92%…89%	—	25/53
get_symbol_info	Symbol metadata instead of reading the file	282	69%	84%	−44%…97%	−539%	13/282
get_symbol_info	Go-to-definition (includeSource=true) — the honest weaker case	282	53%	53%	−165%…97%	−693%	91/282
find_references	Reference list instead of reading every referencing file	50	85%	85%	21%…97%	−75%	2/50
find_implementations	Implementation list instead of reading candidate files	10	83%	92%	57%…97%	—	0/10
get_call_graph	Caller list instead of reading the caller files	70	83%	81%	6%…96%	—	2/70
rename_symbol	Emitting a diff instead of rewriting whole files (illustrative)	3	84%	81%	70%…88%	—	0/3
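
To make the percentage columns concrete, here is a minimal sketch of how a single task's savings and its "weak" flag could be computed from two token counts. The function names and the sample counts are hypothetical and chosen only to land on the table's extremes; the formula follows the per-task form of 1 − tool ÷ baseline, and the 25% threshold is the one stated above.

```python
def savings(tool_tokens: int, baseline_tokens: int) -> float:
    """Fraction of baseline tokens saved. Negative means the tool output is larger."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 1.0 - tool_tokens / baseline_tokens


def is_weak(tool_tokens: int, baseline_tokens: int, threshold: float = 0.25) -> bool:
    """A task counts as weak when it saves less than `threshold` vs the baseline."""
    return savings(tool_tokens, baseline_tokens) < threshold


# Hypothetical counts (not taken from the benchmark data).
print(f"{savings(110, 1000):+.0%}")  # tool output is 11% of the file
print(f"{savings(192, 100):+.0%}")   # tool output is 1.92x the file
```

Read this way, a figure of −92% means the tool response was nearly twice the size of the file it replaced, which is why such tasks are counted as weak.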

Reading the two baselines

vs. whole files (B1): the read tools cut 54–91% — a file is far bigger than the one fact you needed from it.
vs. an optimal grep (B2): the tools are roughly comparable in raw tokens (find_references ≈ grep -C3). Their edge over grep is structure and precision — resolved symbols, cross-project references, real call edges — not raw count.
get_symbol_info with includeSource=true (53% median) is shown separately: you get the body back, so it doesn't pretend to replace reading code you're about to edit.
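
The B2 (targeted) baseline can be pictured as "every matching line plus three lines either side". The sketch below is one way to model that, assuming 0-based line indices and merging overlapping windows. The function name is hypothetical, and unlike real grep -C3 it omits the `--` separators between non-adjacent groups. It is not the harness's implementation.

```python
def targeted_baseline(lines: list[str], hit_lines: list[int], context: int = 3) -> str:
    """Return each hit line plus `context` lines on either side, overlaps merged."""
    keep: set[int] = set()
    for hit in hit_lines:
        lo = max(0, hit - context)               # clamp at start of file
        hi = min(len(lines) - 1, hit + context)  # clamp at end of file
        keep.update(range(lo, hi + 1))
    return "\n".join(lines[i] for i in sorted(keep))


# Hypothetical 20-line file with two nearby hits whose windows overlap.
source_lines = [f"line {i}" for i in range(20)]
snippet = targeted_baseline(source_lines, hit_lines=[5, 7])
print(len(snippet.splitlines()))  # lines 2..10 are kept
```

Tokenizing `snippet` instead of the whole file gives the smaller, grep-like baseline that the "vs grep" column compares against.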

Output tokens: a diff, not a rewrite

A different axis: the tokens an agent must emit. rename_symbol returns a unified diff instead of the full text of every file it touches, for a pooled 81% reduction across 3 illustrative renames.
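
The gap between a diff and a rewrite is easy to see on a toy file. The sketch below uses Python's standard difflib and a naive textual replace as a stand-in for a rename; it is not how rename_symbol works (that is a semantic rename), and the file contents, path, and symbol names are hypothetical. Character counts stand in for tokens here.

```python
import difflib


def rename_outputs(source: str, old: str, new: str, path: str = "Example.cs"):
    """Return (full_rewrite, unified_diff) for a naive textual rename."""
    renamed = source.replace(old, new)  # assumption: plain text replace, not semantic
    diff = "".join(
        difflib.unified_diff(
            source.splitlines(keepends=True),
            renamed.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    return renamed, diff


# Hypothetical 201-line file with a single occurrence of the symbol.
source = (
    "".join(f"// filler line {i}\n" for i in range(100))
    + "void OldName() { }\n"
    + "".join(f"// filler line {i}\n" for i in range(100, 200))
)
full, diff = rename_outputs(source, "OldName", "NewName")
print(len(full), len(diff))  # characters emitted: whole file vs diff
```

The diff carries only the changed line and a few lines of context, so its size tracks the number of edits rather than the size of the files touched.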

Methodology

Every measured number is a real string. The tool output is the exact JSON the MCP tool serializes (System.Text.Json, indented). The baseline is the actual bytes of the source an agent would otherwise read.
Tokens are counted with the cl100k_base BPE tokenizer (Microsoft.ML.Tokenizers, the gpt-4 encoding) — a documented, reproducible proxy for Claude's tokenizer, which is not published as a library. Character counts are included so nothing hinges on one tokenizer.
Two baselines: B1 (whole-file) = the full text of the file(s) an agent must open to answer, matching how coding agents actually read. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model), a conservative lower bound on savings.
The headline pools clear navigation wins (outline, get_symbol_info metadata, find_references, find_implementations, get_call_graph). It excludes get_symbol_info includeSource=true (shown separately as the weaker case) and edit output (an output-token, not context, axis).
The tools ran against RoselineMCP's own solution (dogfooding). The symbol and file suites are systematic sweeps over every candidate — not a hand-picked selection — so the distribution (min/median/mean/max) is representative, weak cases included.
Pooled savings weight by size (1 − Σtool ÷ Σbaseline); median savings weight every task equally. Both are reported because they answer different questions.
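
The difference between the pooled and median figures is easiest to see on a small example. The sketch below assumes each task is a (tool_tokens, baseline_tokens) pair; the three sample tasks and the function names are hypothetical, while the last line reuses the headline totals reported at the top of this page.

```python
from statistics import median


def pooled_savings(tasks):
    """Size-weighted: 1 - sum(tool) / sum(baseline)."""
    total_tool = sum(tool for tool, _ in tasks)
    total_base = sum(base for _, base in tasks)
    return 1.0 - total_tool / total_base


def median_savings(tasks):
    """Every task weighted equally: median of the per-task savings."""
    return median(1.0 - tool / base for tool, base in tasks)


# Hypothetical tasks: two small files (one a net loss) and one large file.
tasks = [(50, 100), (150, 100), (200, 4000)]
print(f"pooled {pooled_savings(tasks):.0%}, median {median_savings(tasks):.0%}")

# Headline check from the page's totals: 934,530 saved of a 1,148,463 baseline.
print(f"{934_530 / 1_148_463:.0%}")
```

In the toy data the one large file dominates the pooled figure, while the median reflects the typical task. That is the sense in which the two numbers answer different questions.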

Limitations

The tokenizer is a proxy. Claude's exact counts differ, but code tokenizes similarly across modern BPE tokenizers, so the order of magnitude holds.
Whole-file (B1) is the realistic baseline but the generous one: a disciplined agent that greps first lands nearer B2. Real agent behavior is between the two — both are shown.
These tools save tokens on navigation and orientation. They do NOT remove the need to read code you are about to edit in depth; get_symbol_info(includeSource=true) makes that explicit.
find_references / find_implementations baselines assume you read whole referencing files to be as complete as the tool (which searches the whole solution). If those references sit in large test files, the whole-file baseline is large — the targeted (B2) column keeps that honest.
Results are specific to this codebase and its file sizes. A repo of tiny files saves less; a repo of large files saves more. Re-run on your own solution to get your numbers: dotnet run --project RoselineMCP.TokenBenchmark.

Reproduce it

Run it on your own solution

The harness loads the solution once via MSBuild, runs the real services, tokenizes their output, and writes the JSON this page renders.

shell

git clone https://github.com/Atypical-Consulting/RoselineMCP
cd RoselineMCP
dotnet run --project RoselineMCP.TokenBenchmark -c Release

Output: website/src/data/benchmark-results.json, stamped with the commit and date. Tokenizer: cl100k_base (Microsoft.ML.Tokenizers, gpt-4) — a proxy for Claude's tokenizer.