How memory compaction works in agents like Claude Code

It is a lot simpler than you might think

Mar 01, 2026

At its core, memory compaction in autonomous AI agents like Claude Code (Anthropic’s CLI-based coding agent) is an architectural solution to a fundamental limitation: Large Language Models (LLMs) have finite context windows. When an agent operates in a terminal (running commands, reading large files, and iteratively debugging), it generates massive amounts of text. If the agent simply appended every command and output to its prompt, it would quickly hit the token limit, become prohibitively expensive, and suffer from “lost in the middle” degradation, where the model attends poorly to material buried deep in a long context and can lose track of the original goal.

Here is a breakdown of how a technical implementation of memory compaction works in these systems.

1. The Trigger: Token Monitoring

The agent runs as a loop, rebuilding the prompt it sends to the LLM on every iteration. It maintains a running count of tokens in the current conversational payload. Compaction isn’t usually continuous; it’s triggered when that count crosses a threshold.

Soft Limit: For example, if the context window is 200k tokens, the agent might trigger compaction when the payload hits 150k tokens to ensure there is always room for the model’s generation and immediate tool outputs.
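A minimal sketch of the threshold check, assuming a crude characters-per-token heuristic in place of a real tokenizer. The names Message, estimate_tokens, and should_compact are hypothetical, and the 200k and 150k figures are simply the illustrative numbers from the example above, not confirmed Claude Code settings.

```python
from dataclasses import dataclass

CONTEXT_WINDOW = 200_000  # illustrative window size from the example above
SOFT_LIMIT = 150_000      # illustrative compaction threshold from the example above


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return (len(text) + 3) // 4


def payload_tokens(messages: list[Message]) -> int:
    """Running token count for the whole conversational payload."""
    return sum(estimate_tokens(m.content) for m in messages)


def should_compact(messages: list[Message], soft_limit: int = SOFT_LIMIT) -> bool:
    """True once the payload reaches the soft limit, leaving headroom in the window."""
    return payload_tokens(messages) >= soft_limit
```

In a real agent the count would more likely come from the model provider's tokenizer or from usage figures returned with each response, since a character heuristic can be far off for code and logs.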

2. Context Triage: Segmenting Memory

When compaction is triggered, the agent doesn’t just blindly summarize everything. It categorizes the current context into different tiers of importance:

The System Prompt & Core Directives (Keep): The fundamental instructions on how the agent operates.
The Working State (Keep): The original user request, the current working directory, and the immediate next steps the agent was planning to take.
Recent History (Keep): The last few turns of conversation (e.g., the last 3-5 commands and their outputs) so the agent maintains its immediate train of thought.
Older Episodic Memory (Target for Compaction): Older commands, file reads, and intermediate thoughts that led up to the current state.
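The four tiers above could be sketched roughly as follows. This is an illustration under simplifying assumptions: the first user turn is treated as the original request, "recent" means the last few turns, and the names Turn, Triage, and triage are hypothetical rather than anything taken from Claude Code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


@dataclass
class Triage:
    system: list[Turn]         # keep: core directives
    working_state: list[Turn]  # keep: the original request
    older: list[Turn]          # target for compaction
    recent: list[Turn]         # keep: immediate train of thought


def triage(turns: list[Turn], keep_recent: int = 4) -> Triage:
    system = [t for t in turns if t.role == "system"]
    rest = [t for t in turns if t.role != "system"]

    # Assumption: the first user turn carries the original goal.
    goal = next((t for t in rest if t.role == "user"), None)
    body = [t for t in rest if t is not goal]

    # Split the remaining history into an older part and a recent tail.
    cut = max(0, len(body) - max(0, keep_recent))
    return Triage(
        system=system,
        working_state=[goal] if goal is not None else [],
        older=body[:cut],
        recent=body[cut:],
    )
```

Only the older tier is handed to the compaction step; the other three are carried over verbatim.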

3. The Compaction Mechanisms

Once the target data is isolated, the agent applies one or more strategies to shrink it:

LLM-Assisted Summarization: The agent makes a separate, background LLM call. It passes the “Older Episodic Memory” to the model with a prompt like: “Summarize the actions taken and knowledge gained in these steps. Retain key file paths, variable names, and architectural decisions. Discard raw command outputs.” The massive raw text is replaced by a dense, token-efficient paragraph.
Tool Output Truncation: Agents often read files or run commands that output thousands of lines (e.g., a massive npm install log or a large git diff). During compaction, the agent might strip out the middle of these logs, keeping only the first and last few lines, or replace the entire log with a metadata tag like [File X read successfully: 850 lines].
Semantic Offloading (RAG): Instead of just deleting old raw context, advanced agents embed the raw outputs and push them into a local vector database. If the summarized context later lacks a specific detail, the agent can use a “search_memory” tool to retrieve the exact raw text it dropped earlier.
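Two of these strategies can be sketched in a few lines. The sketch below covers head-and-tail truncation and LLM-assisted summarization, with the model call passed in as a plain callable so that no particular SDK is assumed. The function names and the omission marker are hypothetical; the prompt text is the one quoted above. Semantic offloading is left out because it depends on an embedding model and a vector store.

```python
from typing import Callable

SUMMARY_PROMPT = (
    "Summarize the actions taken and knowledge gained in these steps. "
    "Retain key file paths, variable names, and architectural decisions. "
    "Discard raw command outputs."
)


def truncate_output(text: str, head: int = 5, tail: int = 5) -> str:
    """Keep the first `head` and last `tail` lines; replace the middle with a marker."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough, leave untouched
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


def compact_older(older: list[str], summarize: Callable[[str], str]) -> str:
    """Truncate each old entry, then ask the model (via `summarize`) for a dense summary.

    `summarize` is a stand-in for a separate LLM call: it takes a prompt string
    and returns the summary text.
    """
    transcript = "\n\n".join(truncate_output(entry) for entry in older)
    return summarize(f"{SUMMARY_PROMPT}\n\n{transcript}")
```

Truncating before summarizing keeps the background call itself cheap, which matters because the older history is by definition the largest part of the payload.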

4. Rebuilding the Context

After compaction, the agent stitches the payload back together. The new prompt sent to the LLM looks something like this:

[System Prompt]
[Goal: Fix the memory leak in auth.ts]
[Compacted History: I have explored the src/ directory, read auth.ts, and identified that the token listener is not unmounting. I attempted to fix it using a useEffect cleanup, but the test failed.]
[Recent Turn 1: Ran npm test]
[Recent Turn 1 Output: Failed: Jest timeout...]
[Current Turn: Awaiting next action]
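The stitching step is mostly string assembly. Here is a hedged sketch that rebuilds a payload in the same order as the example above; rebuild_context and its parameters are hypothetical names, and a real agent would more likely emit structured chat messages than a single string.

```python
def rebuild_context(
    system_prompt: str,
    goal: str,
    summary: str,
    recent_turns: list[tuple[str, str]],  # (action, output) pairs, oldest first
) -> str:
    """Assemble the post-compaction prompt: directives, goal, summary, then recent turns."""
    parts = [
        f"[System Prompt]\n{system_prompt}",
        f"[Goal]\n{goal}",
        f"[Compacted History]\n{summary}",
    ]
    for i, (action, output) in enumerate(recent_turns, start=1):
        parts.append(f"[Recent Turn {i}]\n{action}")
        parts.append(f"[Recent Turn {i} Output]\n{output}")
    parts.append("[Current Turn]\nAwaiting next action")
    return "\n\n".join(parts)
```

Keeping the goal ahead of the summary and the raw recent turns at the end puts the most important items at the edges of the prompt rather than in the middle.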

Why this matters for Claude Code

In coding environments, precision is everything. A naive summarization might drop the exact line number of a bug or the specific flags required for a build command. Tools like Claude Code have to be highly strategic, ensuring that intent (what we are trying to do) and environmental state (what files we have modified) survive the compaction process intact, while heavy, useless text (like successful compilation logs) is aggressively pruned.

The AI Architect — Designing systems that think.
