
How I reduced token consumption by 80% with Claude Code

Tools and habits for using Claude Code sustainably: RTK to filter command output, Caveman to compress responses, CLAUDE.md as a context contract, and /compact to prevent sessions from ballooning.

Tags: ai, claude, tooling, productivity

After a few months of daily Claude Code use across multi-repo projects, I found myself staring at the API usage breakdown with some concern. Tokens weren’t scaling linearly with task complexity; they were scaling with response verbosity and accumulated context: long sessions, bash command output copied wholesale into context, and elaborate responses to simple requests.

I started measuring before intervening. The breakdown was clear: input tokens were dominated by command output (git diff, test runners, build logs) — large text blocks the model read to understand system state. Output tokens were dominated by verbose responses: explanations, summaries, code comments nobody asked for.

These are the solutions I adopted, in order of impact.

RTK: filter command output before it hits context

The quietest token consumer is bash command output. A git diff on a refactor can generate 10,000 tokens. A test runner printing every failed assertion can reach 25,000. All of this text enters the context — and the model reads it, even if 90% is redundant.

RTK (Rust Token Killer) is a CLI proxy that sits between the terminal and the model. It intercepts command output and filters it before it reaches the context: removes duplicate lines, truncates repetitive sections, aggregates similar results.
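To make the idea concrete, here is a minimal Python sketch of that kind of filtering: it collapses runs of identical lines and keeps only the head and tail of long output. This is not RTK's implementation; the function name `filter_output`, the `max_lines` default, and the marker formats are all assumptions made for illustration.

```python
from itertools import groupby


def filter_output(text: str, max_lines: int = 40) -> str:
    """Collapse runs of identical lines, then cap the total line count.

    Illustrative only: a hypothetical filter, not how RTK works internally.
    """
    collapsed = []
    for line, run in groupby(text.splitlines()):
        count = sum(1 for _ in run)
        # Keep one copy of a repeated line and note how often it occurred.
        collapsed.append(line if count == 1 else f"{line}  [x{count}]")

    if len(collapsed) <= max_lines:
        return "\n".join(collapsed)

    # Keep the head and the tail, where summaries and errors usually sit.
    head = max_lines // 2
    tail = max_lines - head
    omitted = len(collapsed) - max_lines
    marker = f"... {omitted} lines omitted ..."
    return "\n".join(collapsed[:head] + [marker] + collapsed[len(collapsed) - tail:])
```

Calling `filter_output(log_text, max_lines=20)` on a noisy build log would return at most 21 lines, including the omission marker. A real tool would need per-command rules (a diff and a test report are redundant in different ways), which this sketch does not attempt.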

brew install rtk
rtk init -g  # configure the global hook for Claude Code

After restarting, RTK works transparently — nothing changes in the workflow, but output is rewritten before being consumed. On a 30-minute session:

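| Command | Tokens before | Tokens after | Reduction |
|---------|---------------|--------------|-----------|
| git diff | 10,000 | 2,500 | -75% |
| Test runner | 25,000 | 2,500 | -90% |
| Full session | 118,000 | 23,900 | -80% |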

It’s a single Rust binary with no external dependencies and ~10ms of overhead per command. It supports over 100 commands: git, npm, cargo, docker, AWS CLI, and the main test frameworks.

Caveman: compress model responses

The second big consumer is the model’s own output. By default, Claude responds with complete sentences, restates context, adds explanations even when they’re not needed. On repetitive technical tasks — refactoring, renaming, adding tests — 60-70% of the words are filler.

Caveman is a Claude Code skill that compresses responses by stripping filler and keeping only the technical substance. It uses fragments instead of complete sentences, eliminates redundancy, and responds with the density of a code comment rather than a spoken explanation.

# Install as a Claude Code skill
mkdir -p ~/.claude/skills
# clone caveman into the skills directory

Four compression levels are available via /caveman:

  • lite — removes connective phrases, keeps structure
  • full — fragments, no complete sentences
  • ultra — dense output, almost mnemonic
  • wenyan — classical Chinese style, maximum compression

A concrete example. Response without Caveman:

“I analyzed the file and noticed that the createUser function doesn’t handle the case where the email is already present in the database. I added a check at the beginning of the function that verifies the email’s existence and throws an appropriate error if found. I also updated the tests to cover this scenario.”

With Caveman full:

createUser: added email existence check, throws if duplicate. Tests updated.

From 69 tokens to 19, a reduction of about 72% on this example. The average reduction on output is around 65%.
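The percentages quoted in this post are consistent with the usual definition of a relative reduction, (before - after) / before. A small Python sketch to check the figures; the helper name `reduction_pct` is hypothetical, and the token counts are the ones reported above, not new measurements.

```python
def reduction_pct(before: int, after: int) -> float:
    """Relative reduction in percent: 100 * (before - after) / before."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Token counts taken from the RTK table and the Caveman example in this post.
reported = {
    "git diff": (10_000, 2_500),
    "test runner": (25_000, 2_500),
    "full session": (118_000, 23_900),
    "caveman example": (69, 19),
}

for name, (before, after) in reported.items():
    print(f"{name}: {reduction_pct(before, after):.1f}%")
```

This prints 75.0%, 90.0%, 79.7% and 72.5%, so the single Caveman example sits somewhat above the 65% average.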

Caveman also has a /caveman-compress command for compressing memory files and CLAUDE.md — average 46% reduction on these files, which get loaded as system context on every session.

CLAUDE.md as a context contract

Every Claude Code session starts from scratch. Without persistent context, the model asks or infers: what’s the stack? What are the conventions? How do you build? Where are the key files?

The CLAUDE.md file in the project root is loaded automatically on every session. It’s the one place worth investing time writing well — not as documentation for a human, but as dense, structured context for the model.

An effective CLAUDE.md:

## Commands
| Task | Command |
|------|---------|
| Dev | pnpm dev |
| Build | pnpm build |
| Test | pnpm test |

## Stack
- Astro 6 + React 19 + TypeScript strict
- Tailwind CSS 4, Redux 5
- baseUrl: "src" → import from 'utils/foo' not '../../utils/foo'

## Key files
| File | Purpose |
|------|---------|
| src/App.tsx | OS desktop layout |
| src/reducers/index.ts | RootState |

## Conventions
- No comments unless WHY is non-obvious
- No any, strict mode
- Tailwind first, no inline styles

This block prevents dozens of questions per session and reduces errors from missing conventions. The context is written once, compressed with /caveman-compress, and loaded automatically.

Memory system: cross-session persistence

Claude Code supports a persistent memory system via .md files in .claude/projects/<path>/memory/. These files are loaded as part of the system prompt in subsequent sessions.

Useful types:

  • feedback: what works, what the model shouldn’t do
  • project: decisions made, rationale, deadlines
  • user: profile of who uses the tool (senior dev, main stack, area of expertise)

The MEMORY.md file is an index — one line per file, always loaded. Individual files are loaded when relevant.

Keeping these files short and compressed with /caveman-compress matters: every byte counts because it’s loaded as a system prefix on every call.
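One way to keep an eye on this is to estimate how many tokens the always-loaded files cost. The Python sketch below assumes roughly four characters per token, which is only a rule of thumb for English text and not a real tokenizer; the names `estimate_tokens` and `context_budget` are hypothetical, and the caller supplies the file paths because the memory directory depends on the project.

```python
from pathlib import Path

# Assumption: ~4 characters per token, a rough heuristic, not a tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def context_budget(paths: list[Path]) -> dict[str, int]:
    """Estimated token cost of each existing file, keyed by path."""
    return {
        str(path): estimate_tokens(path.read_text(encoding="utf-8"))
        for path in paths
        if path.is_file()  # silently skip files that do not exist
    }
```

For example, `context_budget([Path("CLAUDE.md"), *memory_dir.glob("*.md")])`, where `memory_dir` is a `Path` pointing at your project's memory directory, gives a per-file estimate you can compare before and after compression.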

/compact and /clear: managing the session lifecycle

A long session accumulates context: messages, output, iterations. After an hour on a complex problem, the context can contain failed attempts, intermediate outputs, and reasoning that’s no longer relevant.

/compact condenses the current conversation into a dense summary and continues from there. Use it proactively: don’t wait for the context to approach the limit, because by then the model starts losing important details. A good heuristic is to run it after each completed sub-task, or every 20-30 messages.

/clear resets everything. Useful when switching tasks cleanly — for example, after finishing a refactor and starting work on a completely different feature. Carrying refactor context into a feature session is pure noise.

The typical sequence:

[session start] → task A → /compact → task B → /compact → task C → /clear
[new session] → task D

This avoids the progressive quality degradation that happens when context is saturated with stale information.

The overall result

Putting all the layers together:

| Intervention | Tokens saved |
|--------------|--------------|
| RTK (command output) | -75–90% on commands |
| Caveman (model output) | ~-65% on output |
| CLAUDE.md + compressed memory | -40–50% on system prompt |
| Proactive /compact | avoids re-reading stale context |

This isn’t additive — the layers combine. The conservative estimate on a typical work session is a total reduction between 70% and 80% compared to an unoptimized session.
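A simple way to see why the percentages can't just be summed: each layer only acts on its own slice of the token budget, so the overall saving is a weighted average. The Python sketch below uses hypothetical shares (60% command output, 25% model output, 10% system prompt, 5% everything else); these are illustrative assumptions, not measurements from my sessions. The per-layer reductions are picked from within the ranges in the table above.

```python
def overall_reduction(shares: dict[str, float], reductions: dict[str, float]) -> float:
    """Weighted overall reduction, given each category's share of total tokens."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Categories without a listed reduction are assumed to be left untouched.
    return sum(share * reductions.get(name, 0.0) for name, share in shares.items())


# Hypothetical split of a session's tokens (assumption, for illustration only).
example_shares = {
    "command_output": 0.60,
    "model_output": 0.25,
    "system_prompt": 0.10,
    "other": 0.05,
}

# Per-layer reductions chosen from within the ranges in the table above.
example_reductions = {
    "command_output": 0.80,
    "model_output": 0.65,
    "system_prompt": 0.45,
}

total = overall_reduction(example_shares, example_reductions)
print(f"overall reduction: {total:.0%}")
```

With this particular split the result is about 69%. A session dominated even more heavily by command output would land closer to 80%, which is why the mix matters as much as the individual percentages.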

The point isn’t just cost. Fewer tokens in context also means more precise responses: the model doesn’t have to filter noise to find the relevant signal.