Agent Memory: Short-Term, Long-Term, Episodic
An LLM by itself is stateless — every call starts from a blank slate with no recollection of what happened before. For agents that need to hold multi-turn conversations, recall facts from weeks ago, or learn from past mistakes, memory is the critical infrastructure that bridges the gap between a one-shot text completion and a genuinely useful autonomous system.
This post dissects the major memory architectures used in modern LLM agent frameworks: conversation buffers, summary memory, vector-store-backed long-term retrieval, and episodic memory that captures entire task trajectories. We will walk through the theory behind each, compare trade-offs, and build working Python implementations with LangChain.
Memory Types Overview
Agent memory can be classified along two axes: retention span (how long information persists) and granularity (raw messages vs. compressed summaries vs. embeddings). The breakdown below maps the four canonical memory types onto these axes.
Short-Term Memory
- Raw message history (human + AI turns)
- Lives within a single session
- Grows linearly → hits context window limits
- Ideal for quick Q&A chatbots
Summary Memory
- LLM-generated running summary of conversation
- Constant token footprint regardless of turns
- Lossy — fine details may be dropped
- Good for long-running sessions
Long-Term (Vector Store)
- Embeds messages into a vector DB
- Retrieves via semantic similarity at query time
- Persists across sessions indefinitely
- Best for knowledge-heavy agents
Episodic Memory
- Stores entire task trajectories (state → action → result)
- Enables agents to learn from past successes/failures
- Structured retrieval by task similarity
- Essential for self-improving agents
Short-Term: Conversation Buffer
The simplest memory strategy is a conversation buffer — every user message and AI response is appended to a list and injected into the prompt verbatim. This is the default in most chatbot implementations and works well when conversations are short.
How It Works
On each turn the agent prepends the system prompt, appends all previous (role, content) pairs, and adds the latest user message. The entire history is sent to the LLM as a single prompt.
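Stripped of any framework, the loop is only a few lines. A minimal sketch (the function and variable names here are illustrative, not from any library):

```python
# Buffer memory in its rawest form: a growing list of (role, content) pairs
history: list[tuple[str, str]] = []

def build_prompt(system_prompt: str, user_message: str) -> list[dict]:
    """Assemble the full message list sent to the LLM on each turn."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": role, "content": content} for role, content in history]
    messages.append({"role": "user", "content": user_message})
    return messages

def record_turn(user_message: str, ai_response: str) -> None:
    """Append the latest exchange so the next turn sees it."""
    history.append(("user", user_message))
    history.append(("assistant", ai_response))
```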
Window Variant
To avoid blowing past the context window, a windowed buffer keeps only the last k turns. Older messages are silently dropped. This is cheap and predictable but the agent loses all context beyond the window boundary.
```python
from langchain.memory import ConversationBufferWindowMemory

# Keep the last 8 human+AI turn pairs
memory = ConversationBufferWindowMemory(k=8, return_messages=True)

# Add a turn
memory.save_context(
    {"input": "What is RLHF?"},
    {"output": "RLHF stands for Reinforcement Learning from Human Feedback..."},
)

# Retrieve the current buffer
messages = memory.load_memory_variables({})
print(messages["history"])  # Last 8 turns
```
Token Cost Analysis
Every turn re-sends the full buffer in the prompt, so token consumption is O(n²) across a conversation of n turns. If each turn averages t tokens, total input tokens ≈ t · n · (n+1) / 2. For a 50-turn conversation at 200 tokens per turn, that is roughly 255 k input tokens — an important cost consideration at scale.
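A quick sanity check on that arithmetic:

```python
# Each turn i re-sends all i-1 prior turns plus itself, so the total
# input tokens across n turns of ~t tokens each is t * n * (n + 1) / 2
t, n = 200, 50
total_input_tokens = t * n * (n + 1) // 2
print(total_input_tokens)  # 255000 ≈ 255k tokens
```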
Summary Memory
Summary memory tackles the O(n²) problem by maintaining a running natural-language summary of the conversation rather than the raw messages. After each turn the agent calls the LLM to update the summary, then discards the original messages.
The Compression Loop
At each step the framework executes a secondary LLM call:
- Input: Previous summary + new human/AI messages
- Prompt: "Progressively summarize the conversation, adding to the previous summary."
- Output: Updated summary (replaces the old one)
```python
from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

memory = ConversationSummaryMemory(
    llm=llm,
    return_messages=True,
    human_prefix="User",
    ai_prefix="Agent",
)

# After many turns the memory holds a ~200-token summary
# instead of thousands of tokens of raw conversation
memory.save_context(
    {"input": "Explain the difference between PPO and DPO."},
    {"output": "PPO is an online RL algorithm that optimizes a clipped ..."},
)

summary = memory.load_memory_variables({})
print(summary["history"])  # Compressed summary of all turns
```
✅ Advantages
- Constant memory footprint (~200–400 tokens)
- Can handle arbitrarily long conversations
- Keeps the "gist" of earlier exchanges
⚠️ Drawbacks
- Extra LLM call each turn → latency + cost
- Lossy — specific numbers, names, or code may vanish
- Summary quality depends on the summarizer model
Hybrid: Summary + Buffer
LangChain's `ConversationSummaryBufferMemory` blends both approaches. It keeps the last k messages in raw form and summarizes everything older. This gives the agent precise recall for recent turns and a compressed overview of the full history — the best of both worlds.
```python
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1500,  # Summarize once raw messages exceed this
    return_messages=True,
)
# Recent messages stay verbatim; older ones become a summary
```
Long-Term: Vector Store Memory
For agents that need to recall information across sessions — a customer's past tickets, a codebase agent's previous refactoring decisions, a research assistant's earlier literature reviews — we need persistent, searchable memory. Vector store memory embeds every message into a vector database and retrieves relevant fragments via semantic similarity at query time.
Architecture
The write path and read path are decoupled:
- Write: Every message is embedded and upserted into the vector store with metadata (timestamp, session ID, role).
- Read: Before each LLM call, the current query is embedded and the top-k most similar past messages are retrieved and injected into the prompt.
```python
from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Create a FAISS index backed by OpenAI embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts([" "], embedding=embeddings)  # seed with a placeholder doc
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

memory = VectorStoreRetrieverMemory(retriever=retriever)

# Save a memory fragment
memory.save_context(
    {"input": "Our Kubernetes cluster runs on GKE with 3 node pools."},
    {"output": "Got it. I'll remember that for deployment planning."},
)

# Weeks later — retrieve relevant memories
relevant = memory.load_memory_variables(
    {"prompt": "How should we deploy this new service?"}
)
print(relevant["history"])
# → Returns the Kubernetes context from the earlier session
```
When choosing an embedding model for long-term memory, prefer compact retrieval-tuned models (e.g. `text-embedding-3-small`) over larger general-purpose models. The embedding dimensionality directly affects storage cost and retrieval latency in production.
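If you use the text-embedding-3 family, the `dimensions` parameter is the main lever for that trade-off. A minimal sketch, assuming the `langchain_openai` wrapper exposes the parameter (it does in recent versions):

```python
from langchain_openai import OpenAIEmbeddings

# text-embedding-3 models accept a `dimensions` parameter that truncates
# the vector, trading a little recall for smaller storage and faster search
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=256,  # the model's default is 1536
)
```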
Metadata Filtering
Raw vector similarity alone is often insufficient. Production systems add metadata filters — restricting retrieval to a specific user, session, time range, or topic tag. This prevents the agent from accidentally surfacing another user's conversation in a multi-tenant system.
```python
# Metadata-filtered retrieval with Chroma
from langchain_community.vectorstores import Chroma

vectorstore = Chroma(
    collection_name="agent_memory",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

# Store with metadata
vectorstore.add_texts(
    texts=["User prefers Python over TypeScript for backends."],
    metadatas=[{"user_id": "u-42", "topic": "preferences"}],
)

# Retrieve only for this user
results = vectorstore.similarity_search(
    "What language should we use?",
    k=3,
    filter={"user_id": "u-42"},
)
```
Episodic Memory
While the previous memory types store what was said, episodic memory stores what happened. It captures full task trajectories — the sequence of states, actions, tool calls, and outcomes that an agent went through to accomplish (or fail at) a goal. This allows agents to learn from experience, reuse successful strategies, and avoid repeating mistakes.
Trajectory Structure
Each episode is a structured record capturing the goal, the actions taken, the resulting observations, the final outcome, and a post-hoc reflection.
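For concreteness, one such record might look like this (the field names match the episode dict used by the `EpisodicMemory` class below; the contents are invented for illustration):

```python
# Illustrative episode record; adapt the schema to your own agent
episode = {
    "goal": "Fix the memory leak in the image-processing worker",
    "actions": [
        {"tool": "run_profiler", "args": {"target": "worker.py"}},
        {"tool": "edit_file", "args": {"path": "worker.py"}},
    ],
    "observations": [
        "Profiler shows PIL Image objects are never released",
        "Added explicit img.close() in the processing loop",
    ],
    "outcome": "success",
    "reflection": "Unclosed PIL handles were the root cause; "
                  "profile before editing.",
}
```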
Retrieval by Task Similarity
When the agent faces a new task, it embeds the task description and retrieves the most similar past episodes. Successful episodes are injected as few-shot demonstrations; failed episodes can be injected as negative examples with a "do not repeat this mistake" framing.
```python
class EpisodicMemory:
    """Stores and retrieves full task trajectories."""

    def __init__(self, vectorstore, embeddings):
        self.vectorstore = vectorstore
        self.embeddings = embeddings

    def store_episode(self, episode: dict):
        # episode = {goal, actions, observations, outcome, reflection}
        text = (
            f"Goal: {episode['goal']}\n"
            f"Outcome: {episode['outcome']}\n"
            f"Reflection: {episode['reflection']}"
        )
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[{
                "outcome": episode["outcome"],
                "num_steps": len(episode["actions"]),
            }],
        )

    def recall(self, task_description: str, k: int = 3):
        # Retrieve the k most similar past episodes
        return self.vectorstore.similarity_search(task_description, k=k)
```
Step-Through Example
Consider a coding agent that previously debugged a memory leak. Stored as an episode, that experience becomes retrievable the next time a similar symptom appears.
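A sketch of the full loop, reusing the `EpisodicMemory` class and the example episode from above (the new task text is invented for illustration):

```python
# Store the finished debugging trajectory
episodic = EpisodicMemory(vectorstore, embeddings)
episodic.store_episode(episode)

# Weeks later, a related task arrives
docs = episodic.recall("The API server's memory usage keeps growing", k=1)
for doc in docs:
    print(doc.page_content)
# Goal: Fix the memory leak in the image-processing worker
# Outcome: success
# Reflection: Unclosed PIL handles were the root cause; profile before editing.

# Successful episodes like this are injected into the new prompt as
# few-shot demonstrations; failures become "do not repeat this" examples.
```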
Python Implementation
Let's build a complete agent that combines all three memory types — conversation buffer for the current session, summary memory for compression, and vector store for long-term recall.
Unified Memory Manager
```python
from dataclasses import dataclass
from typing import Dict

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryMemory,
    VectorStoreRetrieverMemory,
    CombinedMemory,
)
from langchain_community.vectorstores import FAISS


@dataclass
class MemoryConfig:
    buffer_k: int = 6        # Recent turns to keep verbatim
    vector_top_k: int = 4    # Long-term memories to retrieve
    embedding_model: str = "text-embedding-3-small"
    summary_llm: str = "gpt-4o-mini"


class AgentMemoryManager:
    """Orchestrates short-term, summary, and long-term memory."""

    def __init__(self, config: MemoryConfig = MemoryConfig()):
        self.config = config
        self.llm = ChatOpenAI(model=config.summary_llm, temperature=0)
        self.embeddings = OpenAIEmbeddings(model=config.embedding_model)

        # 1. Short-term buffer (last k turns)
        self.buffer = ConversationBufferWindowMemory(
            k=config.buffer_k,
            memory_key="recent_history",
            return_messages=True,
        )

        # 2. Summary memory (compressed older history)
        self.summary = ConversationSummaryMemory(
            llm=self.llm,
            memory_key="summary",
            return_messages=True,
        )

        # 3. Long-term vector store
        vs = FAISS.from_texts([" "], embedding=self.embeddings)
        retriever = vs.as_retriever(
            search_kwargs={"k": config.vector_top_k}
        )
        self.longterm = VectorStoreRetrieverMemory(
            retriever=retriever,
            memory_key="longterm_context",
        )

    def save(self, user_input: str, ai_output: str):
        ctx_in = {"input": user_input}
        ctx_out = {"output": ai_output}
        self.buffer.save_context(ctx_in, ctx_out)
        self.summary.save_context(ctx_in, ctx_out)
        self.longterm.save_context(ctx_in, ctx_out)

    def load(self, query: str) -> Dict:
        return {
            "recent": self.buffer.load_memory_variables({}),
            "summary": self.summary.load_memory_variables({}),
            "longterm": self.longterm.load_memory_variables(
                {"prompt": query}
            ),
        }
```
Wiring into an Agent
```python
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful coding assistant.
Use the following context from your memory:

CONVERSATION SUMMARY:
{summary}

RELEVANT LONG-TERM MEMORIES:
{longterm_context}"""),
    MessagesPlaceholder("recent_history"),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_openai_tools_agent(llm, tools=[], prompt=prompt)

mem = AgentMemoryManager()
executor = AgentExecutor(
    agent=agent,
    tools=[],
    memory=CombinedMemory(memories=[
        mem.buffer, mem.summary, mem.longterm
    ]),
    verbose=True,
)
```
For production use, persist the FAISS index to disk (`FAISS.save_local()`) and the summary to a database. In-memory state is lost on process termination.
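A minimal persistence round-trip for the FAISS index (the folder path is arbitrary; `allow_dangerous_deserialization` is required in recent `langchain_community` releases because the docstore is pickled):

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vs = FAISS.from_texts([" "], embedding=embeddings)

# ... agent runs, memories accumulate ...
vs.save_local("./faiss_memory")

# On the next process start, restore the index
restored = FAISS.load_local(
    "./faiss_memory",
    embeddings,
    allow_dangerous_deserialization=True,  # the docstore is a pickle file
)
```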
Choosing a Memory Strategy
The right memory architecture depends on your agent's use case, conversation length, latency budget, and infrastructure constraints. Use the decision matrix below as a starting point.
Chatbot (≤ 20 turns)
Recommendation: Buffer only
Simple, fast, zero extra LLM calls. The context window is large enough to hold the full conversation.
Long Session Agent
Recommendation: Summary + Buffer
Keep recent turns verbatim, summarize older history. Good balance of accuracy and token efficiency.
Knowledge Agent
Recommendation: Vector Store + Buffer
Cross-session recall via embeddings. Essential when the agent must remember facts from days or weeks ago.
Self-Improving Agent
Recommendation: All four types
Episodic memory for learning from past trajectories, plus buffer + summary + vector store for operational context.
Performance Comparison
| Strategy | Tokens/Turn | Latency | Cross-Session | Fidelity |
|---|---|---|---|---|
| Buffer | O(n) | ⚡ Lowest | ❌ | Perfect (verbatim) |
| Window Buffer | O(k) | ⚡ Lowest | ❌ | Recent only |
| Summary | O(1) | 🔶 +1 LLM call | ❌ | Lossy (gist) |
| Summary + Buffer | O(k) + summary | 🔶 +1 LLM call | ❌ | Good blend |
| Vector Store | O(k) | 🔶 +embed+search | ✅ | Top-k relevant |
| Episodic | O(k) | 🔶 +embed+search | ✅ | Trajectory-level |
Common Anti-Patterns
- Stuffing the entire vector store into the prompt — Always limit retrieval to top-k results. Injecting too many memories dilutes the signal and wastes tokens.
- No metadata filtering — Without user/session scoping, a multi-tenant agent will leak context between users. Always filter by `user_id` at minimum.
- Stale summaries — If the summarizer LLM hallucinates or drops important facts, the error compounds over time. Periodically validate summaries against ground truth.
- Ignoring embedding drift — If you change your embedding model, old vectors become incompatible. Re-embed all stored memories or use a versioned index (see the sketch below).
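One way to implement that versioned index, using Chroma-style metadata filters (the version tag and filter syntax are illustrative; adjust to your vector store):

```python
# Tag every vector with the model that produced it, and scope
# retrieval to vectors from the current embedding version only
EMBED_VERSION = "text-embedding-3-small@v1"  # hypothetical version tag

vectorstore.add_texts(
    texts=["User prefers Python over TypeScript for backends."],
    metadatas=[{"user_id": "u-42", "embed_version": EMBED_VERSION}],
)

results = vectorstore.similarity_search(
    "What language should we use?",
    k=3,
    filter={"$and": [
        {"user_id": {"$eq": "u-42"}},
        {"embed_version": {"$eq": EMBED_VERSION}},
    ]},
)
```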