
Agent Memory: Short-Term, Long-Term, Episodic

An LLM by itself is stateless — every call starts from a blank slate with no recollection of what happened before. For agents that need to hold multi-turn conversations, recall facts from weeks ago, or learn from past mistakes, memory is the critical infrastructure that bridges the gap between a one-shot text completion and a genuinely useful autonomous system.

This post dissects the major memory architectures used in modern LLM agent frameworks: conversation buffers, summary memory, vector-store-backed long-term retrieval, and episodic memory that captures entire task trajectories. We will walk through the theory behind each, compare trade-offs, and build working Python implementations with LangChain.

Memory Types Overview

Agent memory can be classified along two axes: retention span (how long information persists) and granularity (raw messages vs. compressed summaries vs. embeddings). The diagram below maps the four canonical memory types onto these axes.

[Diagram: the four memory types (conversation buffer, summary memory, vector store memory, episodic memory) mapped onto the retention-span and granularity axes]

Short-Term Memory

  • Raw message history (human + AI turns)
  • Lives within a single session
  • Grows linearly → hits context window limits
  • Ideal for quick Q&A chatbots

Summary Memory

  • LLM-generated running summary of conversation
  • Constant token footprint regardless of turns
  • Lossy — fine details may be dropped
  • Good for long-running sessions

Long-Term (Vector Store)

  • Embeds messages into a vector DB
  • Retrieves via semantic similarity at query time
  • Persists across sessions indefinitely
  • Best for knowledge-heavy agents

Episodic Memory

  • Stores entire task trajectories (state → action → result)
  • Enables agents to learn from past successes/failures
  • Structured retrieval by task similarity
  • Essential for self-improving agents
Key insight: Production agents almost always combine multiple memory types. A typical stack uses a conversation buffer for the current session, summary memory to compress older turns, and a vector store for cross-session recall.

Short-Term: Conversation Buffer

The simplest memory strategy is a conversation buffer — every user message and AI response is appended to a list and injected into the prompt verbatim. This is the default in most chatbot implementations and works well when conversations are short.

How It Works

On each turn the agent prepends the system prompt, appends all previous (role, content) pairs, and adds the latest user message. The entire history is sent to the LLM as a single prompt.

[Diagram: each user message is appended to the buffer [sys, h1, a1, h2, a2, ... hN], which grows with every turn; the full buffer is sent to the LLM to produce each reply]
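
Conceptually the assembly step is just list concatenation. Here is a minimal framework-free sketch; the message dicts and helper name are illustrative:

def build_prompt(system_prompt, history, user_msg):
    """Prepend the system prompt, replay all prior turns, append the new message."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history                                    # all prior (role, content) pairs
        + [{"role": "user", "content": user_msg}]    # latest turn
    )

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = build_prompt("You are a helpful assistant.", history, "What is RLHF?")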

Window Variant

To avoid blowing past the context window, a windowed buffer keeps only the last k turns. Older messages are silently dropped. This is cheap and predictable but the agent loses all context beyond the window boundary.

from langchain.memory import ConversationBufferWindowMemory

# Keep the last 8 human+AI turn pairs
memory = ConversationBufferWindowMemory(k=8, return_messages=True)

# Add a turn
memory.save_context(
    {"input": "What is RLHF?"},
    {"output": "RLHF stands for Reinforcement Learning from Human Feedback..."}
)

# Retrieve the current buffer
messages = memory.load_memory_variables({})
print(messages["history"])  # Last 8 turns
Pitfall: A conversation buffer with no window will eventually exceed the model's context length. Even with GPT-4 Turbo's 128 k-token window, a verbose, code-heavy conversation can exhaust it within ~50–80 turns.

Token Cost Analysis

Every turn re-sends the full buffer in the prompt, so token consumption is O(n²) across a conversation of n turns. If each turn averages t tokens, total input tokens ≈ t · n · (n+1) / 2. For a 50-turn conversation at 200 tokens per turn, that is roughly 255 k input tokens — an important cost consideration at scale.
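
That estimate is easy to sanity-check in code:

# Cumulative input tokens for an unwindowed buffer: turn i re-sends
# all prior turns plus its own, giving t * n * (n + 1) / 2 in total
def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    return tokens_per_turn * turns * (turns + 1) // 2

print(total_input_tokens(50, 200))  # 255000, matching the estimate above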

Summary Memory

Summary memory tackles the O(n²) problem by maintaining a running natural-language summary of the conversation rather than the raw messages. After each turn the agent calls the LLM to update the summary, then discards the original messages.

The Compression Loop

At each step the framework executes a secondary LLM call:

  1. Input: Previous summary + new human/AI messages
  2. Prompt: "Progressively summarize the conversation, adding to the previous summary."
  3. Output: Updated summary (replaces the old one)
[Diagram: old summary (~200 tokens) + new messages → summarizer LLM → new summary (~200 tokens); one extra LLM call per turn]

from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

memory = ConversationSummaryMemory(
    llm=llm,
    return_messages=True,
    human_prefix="User",
    ai_prefix="Agent",
)

# After many turns the memory holds a ~200-token summary
# instead of thousands of tokens of raw conversation
memory.save_context(
    {"input": "Explain the difference between PPO and DPO."},
    {"output": "PPO is an online RL algorithm that optimizes a clipped ..."}
)

summary = memory.load_memory_variables({})
print(summary["history"])  # Compressed summary of all turns

✅ Advantages

  • Constant memory footprint (~200–400 tokens)
  • Can handle arbitrarily long conversations
  • Keeps the "gist" of earlier exchanges

⚠️ Drawbacks

  • Extra LLM call each turn → latency + cost
  • Lossy — specific numbers, names, or code may vanish
  • Summary quality depends on the summarizer model

Hybrid: Summary + Buffer

LangChain's ConversationSummaryBufferMemory blends both approaches. It keeps the last k messages in raw form and summarizes everything older. This gives the agent precise recall for recent turns and a compressed overview of the full history — the best of both worlds.

from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=1500,   # Summarize once raw messages exceed this
    return_messages=True,
)
# Recent messages stay verbatim; older ones become a summary

Long-Term: Vector Store Memory

For agents that need to recall information across sessions — a customer's past tickets, a codebase agent's previous refactoring decisions, a research assistant's earlier literature reviews — we need persistent, searchable memory. Vector store memory embeds every message into a vector database and retrieves relevant fragments via semantic similarity at query time.

Architecture

[Diagram: the write path embeds each new message ("Deploy to k8s" → [0.12, ...]) into the vector DB (FAISS / Chroma / Pinecone / Weaviate); the read path embeds the query ([0.08, ...]) and runs a top-k ANN search to return the most similar memories]

The write path and read path are decoupled:

  1. Write: Every message is embedded and upserted into the vector store with metadata (timestamp, session ID, role).
  2. Read: Before each LLM call, the current query is embedded and the top-k most similar past messages are retrieved and injected into the prompt.
from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Create a FAISS index backed by OpenAI embeddings.
# from_texts needs at least one text, so seed with a placeholder document.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_texts([" "], embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

memory = VectorStoreRetrieverMemory(retriever=retriever)

# Save a memory fragment
memory.save_context(
    {"input": "Our Kubernetes cluster runs on GKE with 3 node pools."},
    {"output": "Got it. I'll remember that for deployment planning."}
)

# Weeks later — retrieve relevant memories
relevant = memory.load_memory_variables(
    {"prompt": "How should we deploy this new service?"}
)
print(relevant["history"])
# → Returns the Kubernetes context from the earlier session
Embedding model choice matters. For memory retrieval, prefer models optimized for semantic similarity (e.g., text-embedding-3-small) over general-purpose models. The embedding dimensionality directly affects storage cost and retrieval latency in production.

Metadata Filtering

Raw vector similarity alone is often insufficient. Production systems add metadata filters — restricting retrieval to a specific user, session, time range, or topic tag. This prevents the agent from accidentally surfacing another user's conversation in a multi-tenant system.

# Metadata-filtered retrieval with Chroma
from langchain_community.vectorstores import Chroma

vectorstore = Chroma(
    collection_name="agent_memory",
    embedding_function=embeddings,
    persist_directory="./chroma_db",
)

# Store with metadata
vectorstore.add_texts(
    texts=["User prefers Python over TypeScript for backends."],
    metadatas=[{"user_id": "u-42", "topic": "preferences"}],
)

# Retrieve only for this user
results = vectorstore.similarity_search(
    "What language should we use?",
    k=3,
    filter={"user_id": "u-42"},
)

Episodic Memory

While the previous memory types store what was said, episodic memory stores what happened. It captures full task trajectories — the sequence of states, actions, tool calls, and outcomes that an agent went through to accomplish (or fail at) a goal. This allows agents to learn from experience, reuse successful strategies, and avoid repeating mistakes.

Trajectory Structure

Each episode is a structured record:

Episode Record

  • Goal: "Fix bug #321"
  • Actions[]: search, edit, test
  • Observations[]: errors, outputs
  • Outcome: ✓ success / ✗ fail
  • Reflection: "Used git bisect to narrow the commit range — much faster than manual search."
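
As a data structure, the record maps naturally onto a small dataclass. A sketch, with illustrative field names:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    """One complete task trajectory, recorded after the agent finishes a goal."""
    goal: str                                               # e.g. "Fix bug #321"
    actions: List[str] = field(default_factory=list)        # tool calls, in order
    observations: List[str] = field(default_factory=list)   # errors and outputs seen
    outcome: str = "unknown"                                # "success" or "failure"
    reflection: str = ""                                    # post-hoc lesson learned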

Retrieval by Task Similarity

When the agent faces a new task, it embeds the task description and retrieves the most similar past episodes. Successful episodes are injected as few-shot demonstrations; failed episodes can be injected as negative examples with a "do not repeat this mistake" framing.

class EpisodicMemory:
    """Stores and retrieves full task trajectories."""

    def __init__(self, vectorstore):
        # The vector store embeds texts internally, so a separate
        # embeddings handle is not needed here
        self.vectorstore = vectorstore

    def store_episode(self, episode: dict):
        # episode = {goal, actions, observations, outcome, reflection}
        text = f"Goal: {episode['goal']}\n" \
             + f"Outcome: {episode['outcome']}\n" \
             + f"Reflection: {episode['reflection']}"
        self.vectorstore.add_texts(
            texts=[text],
            metadatas=[{
                "outcome": episode["outcome"],
                "num_steps": len(episode["actions"]),
            }],
        )

    def recall(self, task_description: str, k: int = 3):
        # Retrieve the k most similar past episodes
        return self.vectorstore.similarity_search(
            task_description, k=k
        )
Reflexion pattern: After completing a task, the agent generates a reflection — a natural-language summary of what went well, what went wrong, and what it would do differently. This reflection is stored as part of the episode and becomes the highest-signal retrieval target for future similar tasks.
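
A minimal sketch of that reflection step, assuming a secondary gpt-4o-mini call; the prompt wording here is illustrative, not the original Reflexion prompt:

from langchain_openai import ChatOpenAI

reflector = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def generate_reflection(goal: str, actions: list, outcome: str) -> str:
    """One extra LLM call after the task; the result is stored with the episode."""
    prompt = (
        f"You attempted this task: {goal}\n"
        f"Actions taken: {', '.join(actions)}\n"
        f"Outcome: {outcome}\n"
        "In two or three sentences, state what went well, what went wrong, "
        "and what you would do differently next time."
    )
    return reflector.invoke(prompt).content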

Step-Through Example

Consider a coding agent that previously debugged a memory leak:

Step 1 — Goal: "Investigate and fix the OOM crash in the data pipeline service."
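
Storing and later recalling this episode with the EpisodicMemory class above might look like the following; every value except the stated goal is hypothetical:

# Hypothetical trace for the episode above (only the goal comes from the post)
episode = {
    "goal": "Investigate and fix the OOM crash in the data pipeline service",
    "actions": ["inspect pod logs", "profile memory usage", "bound the cache", "redeploy"],
    "observations": ["RSS grows steadily under load", "unbounded result cache found"],
    "outcome": "success",
    "reflection": "Profiling before editing localized the leak quickly; "
                  "default to bounded caches in long-running services.",
}

episodic = EpisodicMemory(vectorstore)  # reuses a vector store from the earlier examples
episodic.store_episode(episode)

# Later, when a similar task arrives:
matches = episodic.recall("Service is crashing with out-of-memory errors", k=1)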

Python Implementation

Let's build a complete agent that combines all three memory types — conversation buffer for the current session, summary memory for compression, and vector store for long-term recall.

Unified Memory Manager

from dataclasses import dataclass
from typing import Dict, Optional

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.memory import (
    ConversationBufferWindowMemory,
    ConversationSummaryMemory,
    VectorStoreRetrieverMemory,
    CombinedMemory,
)
from langchain_community.vectorstores import FAISS


@dataclass
class MemoryConfig:
    buffer_k: int = 6                # Recent turns to keep verbatim
    vector_top_k: int = 4            # Long-term memories to retrieve
    embedding_model: str = "text-embedding-3-small"
    summary_llm: str = "gpt-4o-mini"


class AgentMemoryManager:
    """Orchestrates short-term, summary, and long-term memory."""

    def __init__(self, config: Optional[MemoryConfig] = None):
        config = config or MemoryConfig()   # fresh default per instance
        self.config = config
        self.llm = ChatOpenAI(model=config.summary_llm, temperature=0)
        self.embeddings = OpenAIEmbeddings(model=config.embedding_model)

        # 1. Short-term buffer (last k turns)
        self.buffer = ConversationBufferWindowMemory(
            k=config.buffer_k,
            memory_key="recent_history",
            return_messages=True,
        )

        # 2. Summary memory (compressed older history).
        # return_messages=False so the summary renders as plain text in
        # the system prompt's {summary} slot instead of message objects.
        self.summary = ConversationSummaryMemory(
            llm=self.llm,
            memory_key="summary",
            return_messages=False,
        )

        # 3. Long-term vector store
        vs = FAISS.from_texts([" "], embedding=self.embeddings)
        retriever = vs.as_retriever(
            search_kwargs={"k": config.vector_top_k}
        )
        self.longterm = VectorStoreRetrieverMemory(
            retriever=retriever,
            memory_key="longterm_context",
        )

    def save(self, user_input: str, ai_output: str):
        ctx_in  = {"input": user_input}
        ctx_out = {"output": ai_output}
        self.buffer.save_context(ctx_in, ctx_out)
        self.summary.save_context(ctx_in, ctx_out)
        self.longterm.save_context(ctx_in, ctx_out)

    def load(self, query: str) -> Dict:
        return {
            "recent":   self.buffer.load_memory_variables({}),
            "summary":  self.summary.load_memory_variables({}),
            "longterm": self.longterm.load_memory_variables(
                {"prompt": query}
            ),
        }

Wiring into an Agent

from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a helpful coding assistant.
Use the following context from your memory:

CONVERSATION SUMMARY:
{summary}

RELEVANT LONG-TERM MEMORIES:
{longterm_context}"""),
    MessagesPlaceholder("recent_history"),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
tools = []  # supply at least one real tool; the OpenAI API rejects an empty tools list
agent = create_openai_tools_agent(llm, tools=tools, prompt=prompt)

mem = AgentMemoryManager()

executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=CombinedMemory(memories=[
        mem.buffer, mem.summary, mem.longterm
    ]),
    verbose=True,
)
Serialization: If your agent runs across multiple processes or restarts, you must persist the vector store to disk (FAISS.save_local()) and the summary to a database. In-memory state is lost on process termination.
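
A sketch of the FAISS persistence round-trip; the directory path is illustrative:

from langchain_community.vectorstores import FAISS

# Before shutdown: write the index and docstore to disk
vectorstore.save_local("./faiss_memory")

# On restart: reload with the same embedding model used to build the index
vectorstore = FAISS.load_local(
    "./faiss_memory",
    embeddings,
    allow_dangerous_deserialization=True,  # the docstore is pickle-backed
)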

Choosing a Memory Strategy

The right memory architecture depends on your agent's use case, conversation length, latency budget, and infrastructure constraints. Use the decision matrix below as a starting point.

Chatbot (≤ 20 turns)

Recommendation: Buffer only

Simple, fast, zero extra LLM calls. The context window is large enough to hold the full conversation.

Long Session Agent

Recommendation: Summary + Buffer

Keep recent turns verbatim, summarize older history. Good balance of accuracy and token efficiency.

Knowledge Agent

Recommendation: Vector Store + Buffer

Cross-session recall via embeddings. Essential when the agent must remember facts from days or weeks ago.

Self-Improving Agent

Recommendation: All four types

Episodic memory for learning from past trajectories, plus buffer + summary + vector store for operational context.

Decision Flowchart

  1. How many turns? ≤ 20 → Buffer Memory. More than 20 → continue.
  2. Need cross-session recall? No → Summary + Buffer. Yes → continue.
  3. Should the agent learn from past tasks? No → Vector Store + Buffer. Yes → All Four Types.

Pro tip: Start simple, add layers when you hit limits. Premature memory optimization is a real pitfall.
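
If you prefer the same branching logic in code, here is a tiny helper mirroring the flowchart (names are illustrative):

def choose_memory_strategy(turns: int, cross_session: bool, learns_from_past: bool) -> str:
    """Encode the decision flowchart above."""
    if turns <= 20:
        return "buffer only"
    if not cross_session:
        return "summary + buffer"
    if not learns_from_past:
        return "vector store + buffer"
    return "all four types"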

Performance Comparison

Strategy            Tokens/Turn      Latency             Cross-Session Fidelity
Buffer              O(n)             ⚡ Lowest            Perfect (verbatim)
Window Buffer       O(k)             ⚡ Lowest            Recent only
Summary             O(1)             🔶 +1 LLM call       Lossy (gist)
Summary + Buffer    O(k) + summary   🔶 +1 LLM call       Good blend
Vector Store        O(k)             🔶 +embed+search     Top-k relevant
Episodic            O(k)             🔶 +embed+search     Trajectory-level
Rule of thumb: Start with a simple buffer. When conversations exceed ~20 turns or 4 k tokens, add summary compression. When you need cross-session recall, add a vector store. When you want the agent to improve over time, add episodic memory. Each layer adds complexity and latency — only add what your use case requires.

Common Anti-Patterns

  1. Stuffing the entire vector store into the prompt — Always limit retrieval to top-k results. Injecting too many memories dilutes the signal and wastes tokens.
  2. No metadata filtering — Without user/session scoping, a multi-tenant agent will leak context between users. Always filter by user_id at minimum.
  3. Stale summaries — If the summarizer LLM hallucinates or drops important facts, the error compounds over time. Periodically validate summaries against ground truth.
  4. Ignoring embedding drift — If you change your embedding model, old vectors become incompatible. Re-embed all stored memories or use a versioned index.