RoselineMCP
Measured · reproducible

How many tokens does it save?

Every figure comes from a real string: the tool's actual JSON output versus the source an agent would otherwise read, tokenized with a real BPE tokenizer. The suites are systematic sweeps over RoselineMCP's own source, so the distribution is representative. Weak cases are included.

81%
pooled reduction · size-weighted
73%
median per task
934,530
tokens saved on the sample
1,148,463
baseline · reading the files
Target RoselineMCP.sln · RoselineMCP · Files 53 · Symbols 282 · Tokenizer cl100k_base · Commit dab3b42 · Date 2026-07-03
Savings by tool

What each tool returns vs. reading the file

[Bar chart: median savings per task vs. reading the whole file, axis from −50% to 100%. Positive = tokens saved; negative = tokens added.]

  • File outline (search_symbols, n=53): median 30%, pooled 39%, range −92%…89%
  • Symbol info, metadata (get_symbol_info, n=282): median 69%, pooled 84%, range −44%…97%
  • Symbol info, +source (get_symbol_info, n=282): median 53%, pooled 53%, range −165%…97%
  • Find references (find_references, n=50): median 85%, pooled 85%, range 21%…97%
  • Find implementations (find_implementations, n=10): median 83%, pooled 92%, range 57%…97%
  • Call graph, callers (get_call_graph, n=70): median 83%, pooled 81%, range 6%…96%

Where the file outline wins — and where it can't

The first run of this benchmark caught the search_symbols outline costing tokens: it repeated the file path and fully-qualified name on every symbol. That was fixed. The outline now returns a lean projection, and the median went from −45% to +30%.

What remains is a limit, not a bug. The outline wins big on body-heavy files (Program.cs +89%, CodeFixService.cs +76%) but costs tokens on declaration-only files (IDiagnosticFilterService.cs −92%), because an interface is already just signatures. Read those directly; outline the big ones.
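The effect of the lean projection can be illustrated with a minimal sketch. The field names, file path, and member names below are hypothetical and are not RoselineMCP's actual output schema; character counts stand in for tokens.

```python
import json

# Hypothetical data: the path, namespace, and member names are made up for illustration.
FILE = "src/Services/CodeFixService.cs"
NAMESPACE = "Example.Services.CodeFixService"
MEMBERS = ["ApplyFixAsync", "GetFixesAsync", "PreviewFixAsync", "Dispose"]

def verbose_outline(path, namespace, names):
    """Every symbol repeats the file path and its fully-qualified name."""
    return [
        {
            "name": name,
            "fullyQualifiedName": f"{namespace}.{name}",
            "filePath": path,
            "line": index + 10,
        }
        for index, name in enumerate(names)
    ]

def lean_outline(path, names):
    """The path is stated once; each symbol keeps only its short name and line."""
    return {
        "filePath": path,
        "symbols": [{"name": name, "line": index + 10} for index, name in enumerate(names)],
    }

verbose = json.dumps(verbose_outline(FILE, NAMESPACE, MEMBERS), indent=2)
lean = json.dumps(lean_outline(FILE, MEMBERS), indent=2)
print(f"verbose: {len(verbose)} chars, lean: {len(lean)} chars")
```

The repeated fields grow with the number of symbols, while the lean shape pays for the path once, which is why the gap widens on files with many members.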

Full results

Two baselines, honestly

B1 (whole-file) = the file(s) an agent must open: realistic but generous. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model): a conservative floor. “Weak” = tasks that saved under 25% vs B1.
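As a rough sketch of how the two baselines differ, the snippet below builds a B2-style excerpt by keeping each relevant line plus three lines of context on either side, merging overlapping windows the way grep -C3 does. The function name and the sample file are assumptions for illustration, not the benchmark harness's code.

```python
def targeted_lines(lines, hits, context=3):
    """Return the lines within `context` of any hit index, merging overlapping windows."""
    keep = set()
    for hit in hits:
        first = max(0, hit - context)
        last = min(len(lines) - 1, hit + context)
        keep.update(range(first, last + 1))
    return [lines[i] for i in sorted(keep)]

# A made-up 100-line file with three relevant lines (0-based indices).
source = [f"line {i}" for i in range(100)]
b1 = "\n".join(source)                                       # whole-file baseline
b2 = "\n".join(targeted_lines(source, hits=[10, 12, 80]))    # targeted baseline

print(f"B1: {len(b1)} chars, B2: {len(b2)} chars")
```

Because B2 is so much smaller than B1, any tool output measured against it shows lower savings, which is what makes it the conservative floor.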

| Tool | Scenario | n | median | pooled | range | vs grep | weak |
|---|---|---|---|---|---|---|---|
| search_symbols | File outline instead of reading the file | 53 | 30% | 39% | −92%…89% | | 25/53 |
| get_symbol_info | Symbol metadata instead of reading the file | 282 | 69% | 84% | −44%…97% | −539% | 13/282 |
| get_symbol_info | Go-to-definition (includeSource=true), the honest weaker case | 282 | 53% | 53% | −165%…97% | −693% | 91/282 |
| find_references | Reference list instead of reading every referencing file | 50 | 85% | 85% | 21%…97% | −75% | 2/50 |
| find_implementations | Implementation list instead of reading candidate files | 10 | 83% | 92% | 57%…97% | | 0/10 |
| get_call_graph | Caller list instead of reading the caller files | 70 | 83% | 81% | 6%…96% | | 2/70 |
| rename_symbol | Emitting a diff instead of rewriting whole files (illustrative) | 3 | 84% | 81% | 70%…88% | | 0/3 |

Reading the two baselines

  • vs. whole files (B1): the read tools cut 54–91% — a file is far bigger than the one fact you needed from it.
  • vs. an optimal grep (B2): the tools are roughly comparable in raw tokens (find_references vs. grep -C3). Their edge over grep is structure and precision (resolved symbols, cross-project references, real call edges), not raw count.
  • get_symbol_info with includeSource=true (~53%) is shown separately: you get the body back, so it doesn't pretend to replace reading code you're about to edit.

Output tokens: a diff, not a rewrite

A different axis: the tokens an agent must emit. rename_symbol returns a unified diff instead of the full text of every file it touches, for a pooled 81% reduction across 3 illustrative renames.
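To see why a diff is cheaper to emit than a rewrite, here is a small sketch using Python's standard difflib. The file contents and symbol names are invented, and character counts are used as a stand-in for tokens; this is not how rename_symbol itself is implemented.

```python
import difflib

# A made-up 200-line file in which the renamed symbol appears on two lines.
original = "\n".join(
    f"    var x{i} = OldName.Compute({i});" if i in (20, 150) else f"    // filler line {i}"
    for i in range(200)
) + "\n"
renamed = original.replace("OldName", "NewName")

# Unified diff with the default three lines of context around each change.
diff = "".join(
    difflib.unified_diff(
        original.splitlines(keepends=True),
        renamed.splitlines(keepends=True),
        fromfile="a/Example.cs",
        tofile="b/Example.cs",
    )
)

reduction = 1 - len(diff) / len(renamed)
print(f"full rewrite: {len(renamed)} chars, diff: {len(diff)} chars, reduction: {reduction:.0%}")
```

The diff size depends on the number of changed lines, not on file length, so the reduction grows with larger files and shrinks when a rename touches most lines.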

Methodology
  • Every measured number is a real string. The tool output is the exact JSON the MCP tool serializes (System.Text.Json, indented). The baseline is the actual bytes of the source an agent would otherwise read.
  • Tokens are counted with the cl100k_base BPE tokenizer (Microsoft.ML.Tokenizers, the gpt-4 encoding) — a documented, reproducible proxy for Claude's tokenizer, which is not published as a library. Character counts are included so nothing hinges on one tokenizer.
  • Two baselines: B1 (whole-file) = the full text of the file(s) an agent must open to answer, matching how coding agents actually read. B2 (targeted) = only the relevant lines ±3 (a grep -C3 model), a conservative lower bound on savings.
  • The headline pools clear navigation wins (outline, get_symbol_info metadata, find_references, find_implementations, get_call_graph). It excludes get_symbol_info includeSource=true (shown separately as the weaker case) and edit output (an output-token, not context, axis).
  • The tools ran against RoselineMCP's own solution (dogfooding). The symbol and file suites are systematic sweeps over every candidate — not a hand-picked selection — so the distribution (min/median/mean/max) is representative, weak cases included.
  • Pooled savings weight by size (1 − Σtool ÷ Σbaseline); median savings weight every task equally. Both are reported because they answer different questions.
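The difference between pooled and median savings, and the "weak" count, can be sketched as follows. The helper names and the four sample tasks are made up for illustration; only the headline totals in the last lines come from this page.

```python
from statistics import median

def savings(tool_tokens, baseline_tokens):
    """Per-task savings; negative when the tool output is larger than the baseline."""
    return 1 - tool_tokens / baseline_tokens

def pooled(tasks):
    """Size-weighted savings: 1 - sum(tool) / sum(baseline)."""
    return 1 - sum(tool for tool, _ in tasks) / sum(base for _, base in tasks)

def median_savings(tasks):
    """Every task counts equally, whatever its size."""
    return median(savings(tool, base) for tool, base in tasks)

def weak_count(tasks, threshold=0.25):
    """Tasks that saved less than the threshold against the baseline."""
    return sum(1 for tool, base in tasks if savings(tool, base) < threshold)

# Made-up (tool_tokens, baseline_tokens) pairs: three small files and one large one.
tasks = [(150, 100), (80, 100), (50, 100), (500, 10_000)]
print(f"pooled {pooled(tasks):.0%}, median {median_savings(tasks):.0%}, weak {weak_count(tasks)}/{len(tasks)}")

# Headline check: 934,530 tokens saved on a 1,148,463-token baseline.
headline = pooled([(1_148_463 - 934_530, 1_148_463)])
print(f"headline pooled {headline:.0%}")
```

In the sample, one large file dominates the pooled figure while the median reflects the typical small task, which is why the two numbers can sit far apart.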
Limitations
  • The tokenizer is a proxy. Claude's exact counts differ, but code tokenizes similarly across modern BPE tokenizers, so the order of magnitude holds.
  • Whole-file (B1) is the realistic baseline but the generous one: a disciplined agent that greps first lands nearer B2. Real agent behavior is between the two — both are shown.
  • These tools save tokens on navigation and orientation. They do NOT remove the need to read code you are about to edit in depth; get_symbol_info(includeSource=true) makes that explicit.
  • find_references / find_implementations baselines assume you read whole referencing files to be as complete as the tool (which searches the whole solution). If those references sit in large test files, the whole-file baseline is large — the targeted (B2) column keeps that honest.
  • Results are specific to this codebase and its file sizes. A repo of tiny files saves less; a repo of large files saves more. Re-run on your own solution to get your numbers: dotnet run --project RoselineMCP.TokenBenchmark.
Reproduce it

Run it on your own solution

The harness loads the solution once via MSBuild, runs the real services, tokenizes their output, and writes the JSON this page renders.

shell
git clone https://github.com/Atypical-Consulting/RoselineMCP
cd RoselineMCP
dotnet run --project RoselineMCP.TokenBenchmark -c Release

Output: website/src/data/benchmark-results.json, stamped with the commit and date. Tokenizer: cl100k_base (Microsoft.ML.Tokenizers, gpt-4) — a proxy for Claude's tokenizer.