
Unstructured Memory — LLM-Managed Agent Memory

How Qwestly remembers things about users across conversations without replaying full chat history. A synthesized, distilled memory layer between raw transcripts and structured profile data.


1. The Problem

After 10 conversations with a user, the agent has learned a lot: their writing style preferences, their career goals, decisions they've made, feedback they've given. But if you just stuff raw chat history into the context window, you hit token limits fast — and most of that history is noise (pleasantries, clarifications, dead ends).

The solution: An LLM extracts discrete, structured "memories" from conversations. These memories are stored, retrieved, and injected into the agent's context — giving it persistent knowledge of the user without burning tokens on raw transcripts.


2. What Is a Memory?

A memory is a single, discrete piece of knowledge about a user. It's not a chat message. It's not a profile field. It's a synthesized fact, preference, goal, or decision extracted by an LLM.

Memory Types

| Type | Example | Lifespan |
|---|---|---|
| preference | "User dislikes corporate buzzwords. Prefers direct, punchy writing style for career documents." | Long (until contradicted) |
| goal | "User is targeting VP of Engineering roles at Series B companies (50-200 people)." | Medium (until achieved/changed) |
| fact | "User previously worked at Stripe as a Director of Engineering from 2019-2023." | Long (factual) |
| decision | "User chose the 'executive' card template over 'detailed' on 2026-03-15." | Permanent (historical record) |
| context | "User is actively interviewing and has 3 pending offers. Wants to move quickly." | Short (situational) |
| feedback | "User said the last card draft was 'too formal' and asked for a more conversational tone." | Medium (informs future interactions) |
| personal | "User mentioned they have a dog named Max. Uses humor in conversations." | Medium (rapport-building) |

Memory Schema

{
  "_id": "ObjectId",
  "user_id": "auth0|xxx",
  "type": "preference",
  "content": "User dislikes corporate buzzwords. Prefers direct, punchy writing style for career documents.",
  "importance": 0.8,
  "confidence": "high",
  "source_type": "conversation",
  "source_conversation_id": "conv_abc123",
  "source_turn_ids": ["turn_5", "turn_7"],
  "tags": ["writing_style", "tone", "card_generation"],
  "contradicts": null,
  "contradicted_by": null,
  "access_count": 3,
  "last_accessed_at": "2026-05-17T10:30:00Z",
  "created_at": "2026-05-10T14:22:00Z",
  "updated_at": "2026-05-10T14:22:00Z",
  "expires_at": null
}

3. How Memory Works (The Full Lifecycle)

Step 1: Extraction (Async, After Conversation)

After a conversation ends (or every N turns), an LLM call reviews the transcript and extracts new or updated memories:

Raw conversation transcript (last N turns)
        │
        ▼
Extraction LLM call (budget model — this is a classification/extraction task)
        │
        ▼
New memory documents + updates to existing memories

Extraction prompt structure:

You are a memory extraction system. Review the conversation below and extract 
any new information about the user that would be useful for future conversations.

CONVERSATION:
[transcript]

Extract memories of the following types:
- PREFERENCE: How the user likes things done (writing style, communication, format)
- GOAL: Career objectives, job targets, aspirations
- FACT: Factual information the user has shared about themselves
- DECISION: Choices the user has made (templates, options, directions)
- CONTEXT: Situational information (actively interviewing, timeline constraints)
- FEEDBACK: User's reactions to previous suggestions or generated content
- PERSONAL: Non-career details that build rapport

For each memory, output:
- type: one of the above
- content: A clear, self-contained sentence describing the memory
- importance: 0.0 to 1.0 (how critical is this for future interactions?)
- confidence: "high" | "medium" | "low" (how clearly did the user state this?)

Rules:
- Only extract NEW information not already captured in existing memories (listed below)
- If the user contradicts a prior memory, flag it
- Don't extract transient conversation details ("user said hello")
- Write memories so they're understandable without the conversation context

EXISTING MEMORIES FOR THIS USER:
[list of current memories]

Return as JSON: { "new_memories": [...], "updated_memories": [...], "contradicted_memories": [...] }

When extraction runs:

  • Trigger: Conversation ends (user closes chat, or N minutes of inactivity)
  • Model: Budget-tier (GPT-4o-mini or similar). Extraction is a classification task, not generation.
  • Cost: ~$0.002 per extraction, matching the Section 8 estimate (~500 input and ~200 output tokens on a budget model)
  • Latency: Not user-facing. Runs in background.
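
Because the extraction model returns free-form JSON, it helps to validate the reply before anything reaches storage. The sketch below is one possible way to do that with the standard library only; `ExtractedMemory` and `parse_extraction` are hypothetical names, and the key `new_memories` follows the return format shown in the prompt above.

```python
import json
from dataclasses import dataclass

VALID_TYPES = {"preference", "goal", "fact", "decision", "context", "feedback", "personal"}
VALID_CONFIDENCE = {"high", "medium", "low"}


@dataclass
class ExtractedMemory:
    """Hypothetical container for one validated memory from the LLM reply."""
    type: str
    content: str
    importance: float
    confidence: str


def parse_extraction(raw: str) -> list[ExtractedMemory]:
    """Parse the model's JSON reply and keep only well-formed new memories."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: store nothing rather than garbage

    items = payload.get("new_memories") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []

    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        # The prompt lists types in upper case; the schema stores lower case.
        mem_type = str(item.get("type", "")).lower()
        content = str(item.get("content", "")).strip()
        confidence = str(item.get("confidence", "")).lower()
        try:
            importance = float(item.get("importance"))
        except (TypeError, ValueError):
            continue
        if mem_type not in VALID_TYPES or confidence not in VALID_CONFIDENCE:
            continue
        if not content or not 0.0 <= importance <= 1.0:
            continue
        kept.append(ExtractedMemory(mem_type, content, importance, confidence))
    return kept
```

Dropping invalid items silently is a design choice for this sketch; a production job would probably log them so the prompt can be tuned.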

Step 2: Storage

New memories are inserted into user_memories. Existing memories that are updated get new versions. Contradicted memories are marked with contradicted_by pointing to the new memory.

Duplicate detection: Before inserting, the extraction step checks if a similar memory already exists (cosine similarity on embedded content). If similarity > 0.85, it updates the existing memory rather than creating a duplicate.
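
The duplicate check described above can be sketched as follows. This is a minimal in-memory illustration, not the service's implementation: `find_duplicate` is a hypothetical helper, embeddings are plain lists of floats, and a real deployment would more likely run the comparison through the vector index.

```python
import math
from typing import Optional

SIMILARITY_THRESHOLD = 0.85  # threshold quoted in the text above


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as not similar
    return dot / (norm_a * norm_b)


def find_duplicate(
    candidate: list[float],
    existing: dict[str, list[float]],
) -> Optional[str]:
    """Return the id of the most similar existing memory above the threshold."""
    best_id, best_score = None, SIMILARITY_THRESHOLD
    for memory_id, embedding in existing.items():
        score = cosine_similarity(candidate, embedding)
        if score > best_score:
            best_id, best_score = memory_id, score
    return best_id
```

If `find_duplicate` returns an id, the caller updates that memory; if it returns `None`, the candidate is inserted as new.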

Vector index: The content field is embedded with text-embedding-3-small (1536 dimensions, the same model the RAG pipeline uses) and stored in content_embedding for semantic retrieval. This enables search_user_memories(user_id, query), which finds memories relevant to the current conversation.

Step 3: Retrieval (At Session Start + During Conversation)

At session start (injected into system prompt):

# Load top memories for this user.
# find() returns a cursor, so materialize it with to_list() before iterating.
memories = await db.user_memories.find(
    {"user_id": user_id, "contradicted_by": None}  # only active memories
).sort(
    [("importance", -1), ("last_accessed_at", -1)]
).limit(20).to_list(length=20)

# Format for system prompt injection
memory_context = "\n".join(
    f"- [{m['type'].upper()}] {m['content']}" 
    for m in memories
)

system_prompt = f"""
You are Qwestly, a career agent assistant.

--- WHAT YOU KNOW ABOUT THIS USER ---
{memory_context}

--- YOUR CAPABILITIES ---
[rest of system prompt]
"""

During conversation (as a tool):

@orchestrator.tool
async def search_user_memories(
    ctx: RunContext[AgentDeps],
    query: str,
) -> list[dict]:
    """Semantically search the agent's memories about this user.
    
    Use this when you need to recall specific preferences, past decisions,
    or context the user has shared in previous conversations. This searches
    synthesized memories, not raw transcripts — it returns distilled facts,
    not noisy chat fragments.
    
    Args:
        query: Natural language description of what you're trying to remember
    """
    embedding = await embed(query)
    # aggregate() returns a cursor; to_list() materializes the projected documents
    results = await db.user_memories.aggregate([
        {"$vectorSearch": {
            "index": "user_memories_vector_index",
            "path": "content_embedding",
            "queryVector": embedding,
            "numCandidates": 50,
            "limit": 10,
            "filter": {"user_id": ctx.deps.user_id, "contradicted_by": None}
        }},
        {"$project": {"content": 1, "type": 1, "importance": 1, "score": {"$meta": "vectorSearchScore"}}}
    ]).to_list(length=10)
    return results

Step 4: Consolidation (Periodic)

When a user accumulates many memories of similar type, an LLM consolidates them:

5 memories about "writing style preference"
        │
        ▼
Consolidation LLM call
        │
        ▼
1 merged memory + 5 old memories marked as superseded

  • Trigger: When 5 or more active memories share a tag, queue them for consolidation.
  • Model: Budget-tier. Simple merge task.
  • Timing: Background job. No user impact.
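
A background job could find consolidation candidates with a simple tag count. The sketch below assumes memories are available as plain dicts and uses the five-or-more threshold that consolidate_memories in Section 6 and the summary table apply; `tags_to_consolidate` is a hypothetical helper name.

```python
from collections import Counter

CONSOLIDATION_THRESHOLD = 5  # assumed: "5+ memories share a tag"


def tags_to_consolidate(
    memories: list[dict],
    threshold: int = CONSOLIDATION_THRESHOLD,
) -> list[str]:
    """Return tags carried by at least `threshold` active memories."""
    counts: Counter = Counter()
    for memory in memories:
        if memory.get("contradicted_by") is not None:
            continue  # superseded memories no longer count toward the trigger
        # set() so a tag repeated on one memory is counted once
        counts.update(set(memory.get("tags", [])))
    return sorted(tag for tag, n in counts.items() if n >= threshold)
```

Each returned tag would then be queued for a consolidation call.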

Step 5: Decay and Cleanup

Not all memories are worth keeping forever:

| Condition | Action |
|---|---|
| importance < 0.3 and last accessed more than 30 days ago | Archive (set expires_at, exclude from retrieval) |
| type = "context" and created more than 90 days ago | Archive (context is short-lived by definition) |
| contradicted_by is set | Exclude from retrieval (kept for audit) |
| type = "goal" and user achieves goal (new memory marks it) | Mark as achieved, lower importance |

Important: Nothing is ever deleted. Memories are archived or superseded, preserving the audit trail.
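
The first three decay rules can be expressed as a pure function, which makes them easy to unit test apart from the database. This is a sketch under assumptions: `decay_action` is a hypothetical name, memories are plain dicts, a memory that was never accessed falls back to its created_at timestamp, and the goal-achieved rule is left out because it depends on a separate memory.

```python
from datetime import datetime, timedelta
from typing import Optional


def decay_action(memory: dict, now: datetime) -> Optional[str]:
    """Return "exclude", "archive", or None for one memory document."""
    if memory.get("contradicted_by") is not None:
        return "exclude"  # kept for audit, never retrieved

    if memory["type"] == "context" and now - memory["created_at"] > timedelta(days=90):
        return "archive"  # context is short-lived by definition

    # Assumption: treat never-accessed memories as last touched at creation
    last_touched = memory.get("last_accessed_at") or memory["created_at"]
    if memory["importance"] < 0.3 and now - last_touched > timedelta(days=30):
        return "archive"

    return None  # memory stays active
```

Note that neither outcome deletes anything, in line with the rule above.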


4. Memory vs. Other Data Tiers

| Aspect | Structured Data (Tier 1) | Unstructured Memory (Tier 2) | RAG (Tier 3) |
|---|---|---|---|
| Source | User input, LinkedIn API, form fields | LLM extraction from conversations | Uploaded documents, transcripts |
| What it stores | Profile fields, employment, education, cards | Facts, preferences, goals, decisions, context | Raw text chunks (resume text, interview transcripts) |
| Structure | Strongly typed (Mongoose schemas) | Semi-structured (type + content + metadata) | Unstructured (text chunks with metadata) |
| Query | Exact field lookup | Semantic search + importance/recency ranking | Semantic search (vector similarity) |
| Accuracy | Deterministic, exact | High-signal (distilled), may have gaps | Noisy (raw text), comprehensive |
| Freshness | Updated on user action | Updated after each conversation | Updated on document upload |
| Use case | "What's my current job title?" | "What writing style do I prefer?" | "What did my resume say about my Google role?" |
| Token cost to use | ~50-200 tokens (structured fields) | ~200-500 tokens (20 memories in prompt) | ~1000-3000 tokens (top-5 chunks) |

5. Memory in the Orchestrator's System Prompt

The system prompt is where memories get injected. This is the highest-leverage part of the memory system — a good memory injection makes the agent feel like it "knows" the user.

Example system prompt with memory

You are Qwestly, a career agent assistant. You help professionals manage their 
career profiles, generate career documents, and prepare for job searches.

--- WHAT YOU KNOW ABOUT THIS USER ---
- [PREFERENCE] Dislikes corporate buzzwords. Prefers direct, punchy writing.
- [PREFERENCE] Wants cards in "executive" style with 2-page max length.
- [GOAL] Targeting VP of Engineering at Series B startups (50-200 people).
- [GOAL] Also open to "Head of Platform" roles at larger companies if comp > $300k.
- [FACT] Previously Director at Stripe (2019-2023), Senior EM at Google (2015-2019).
- [DECISION] Chose "executive" card template over "detailed" on 2026-03-15.
- [CONTEXT] Actively interviewing. Has 2 final rounds scheduled this week.
- [FEEDBACK] Said the last About section draft was "too formal." Prefers conversational.
- [PERSONAL] Has a dog named Max. Appreciates humor in professional settings.

--- YOUR CAPABILITIES ---
[rest of system prompt — tools, rules, output style]

This ~200 token injection replaces potentially 10,000+ tokens of raw chat history.


6. Implementation

MongoDB Setup

Collection: user_memories

Indexes:

// Primary query: load top memories for a user
{"user_id": 1, "importance": -1, "last_accessed_at": -1}

// Semantic search: vector index on content_embedding
// Atlas Vector Search index: "user_memories_vector_index"
//   field: content_embedding
//   dimensions: 1536
//   similarity: cosine

// Lookup by source conversation
{"source_conversation_id": 1}

// Active memories only (compound index; queries filter on contradicted_by: null)
{"user_id": 1, "contradicted_by": 1}
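
The vector index comments above leave out one detail: Atlas Vector Search only accepts pre-filters on fields that the index declares as filter fields. The sketch below shows what such a definition might look like as a Python dict, plus a small check that a query filter uses only declared fields. `undeclared_filter_fields` is a hypothetical helper, and whether a null match on contradicted_by is accepted as a pre-filter should be verified against the current Atlas documentation.

```python
VECTOR_INDEX_NAME = "user_memories_vector_index"

# Assumed index definition; paths and dimensions come from the comments above.
VECTOR_INDEX_DEFINITION = {
    "fields": [
        {
            "type": "vector",
            "path": "content_embedding",
            "numDimensions": 1536,
            "similarity": "cosine",
        },
        # Fields used in the $vectorSearch "filter" must be declared here
        {"type": "filter", "path": "user_id"},
        {"type": "filter", "path": "contradicted_by"},
    ]
}


def undeclared_filter_fields(definition: dict, query_filter: dict) -> set[str]:
    """Return top-level filter keys that the index does not declare."""
    declared = {f["path"] for f in definition["fields"] if f["type"] == "filter"}
    return set(query_filter) - declared
```

Running the check against the filter used by search_user_memories should return an empty set; a non-empty result points at a field that still needs a filter entry in the index.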

Python Service

from datetime import datetime, timedelta
from typing import Optional

from pydantic import BaseModel, Field


class MemoryType:
    PREFERENCE = "preference"
    GOAL = "goal"
    FACT = "fact"
    DECISION = "decision"
    CONTEXT = "context"
    FEEDBACK = "feedback"
    PERSONAL = "personal"


class Memory(BaseModel):
    user_id: str
    type: str  # one of the MemoryType values
    content: str
    content_embedding: Optional[list[float]] = None  # filled in on insert
    importance: float
    confidence: str  # "high" | "medium" | "low"
    source_type: str  # "conversation" | "explicit" | "inference" | "consolidation"
    source_conversation_id: Optional[str] = None
    source_turn_ids: list[str] = []
    tags: list[str] = []
    contradicts: Optional[str] = None  # memory_id of memory this contradicts
    contradicted_by: Optional[str] = None  # memory_id of memory that supersedes this
    access_count: int = 0
    last_accessed_at: Optional[datetime] = None
    # default_factory runs per instance; a bare datetime.now() default would be
    # evaluated once at import time and shared by every memory
    created_at: datetime = Field(default_factory=datetime.now)
    updated_at: datetime = Field(default_factory=datetime.now)
    expires_at: Optional[datetime] = None


class MemoryService:
    """Manages the full memory lifecycle: extraction, retrieval, consolidation."""

    def __init__(self, db, embedding_service):
        self.db = db
        self.embed = embedding_service

    async def get_context_memories(self, user_id: str, limit: int = 20) -> list[Memory]:
        """Load top memories for system prompt injection at session start."""
        now = datetime.now()
        docs = await self.db.user_memories.find(
            {
                "user_id": user_id,
                "contradicted_by": None,
                # Skip archived memories (expires_at already in the past)
                "$or": [{"expires_at": None}, {"expires_at": {"$gt": now}}],
            }
        ).sort([("importance", -1), ("last_accessed_at", -1)]).limit(limit).to_list(length=limit)

        # Record the access in one round trip instead of one update per memory
        if docs:
            await self.db.user_memories.update_many(
                {"_id": {"$in": [doc["_id"] for doc in docs]}},
                {"$inc": {"access_count": 1}, "$set": {"last_accessed_at": now}},
            )
        return [Memory(**doc) for doc in docs]

    async def search_memories(self, user_id: str, query: str, limit: int = 10) -> list[dict]:
        """Semantic search over a user's memories. Exposed as a tool."""
        embedding = await self.embed(query)

        results = await self.db.user_memories.aggregate([
            {"$vectorSearch": {
                "index": "user_memories_vector_index",
                "path": "content_embedding",
                "queryVector": embedding,
                "numCandidates": limit * 10,
                "limit": limit,
                "filter": {"user_id": user_id, "contradicted_by": None}
            }},
            {"$project": {
                "content": 1, "type": 1, "importance": 1,
                "confidence": 1, "score": {"$meta": "vectorSearchScore"}
            }}
        ]).to_list(length=limit)

        return results

    async def extract_memories(
        self,
        user_id: str,
        conversation_transcript: str,
        existing_memories: list[Memory],
        llm,
    ) -> dict:
        """Extract new/updated memories from a conversation transcript.
        
        Runs asynchronously after conversation ends. Not user-facing.
        Returns: {"new_memories": [...], "updated_memories": [...], "contradicted_memories": [...]}
        """
        existing_text = "\n".join(
            f"- [{m.type}] {m.content}" for m in existing_memories[:50]
        )

        prompt = f"""You are a memory extraction system. Review the conversation 
below and extract any new information about the user.

CONVERSATION:
{conversation_transcript}

Extract memories of these types:
- preference: How the user likes things done
- goal: Career objectives, job targets
- fact: Factual information shared about themselves
- decision: Choices the user has made
- context: Situational information (timeline, constraints)
- feedback: Reactions to suggestions or generated content
- personal: Non-career details that build rapport

For each memory: type, content (clear self-contained sentence), importance (0.0-1.0), 
confidence (high/medium/low).

Only extract NEW information. If contradicted, flag it.

EXISTING MEMORIES:
{existing_text}

Return JSON: {{"new_memories": [{{"type": "...", "content": "...", "importance": 0.8, "confidence": "high"}}], "updated_memories": [], "contradicted_memories": []}}"""

        result = await llm.complete(prompt, response_format={"type": "json_object"})
        return result.data  # parsed JSON from the model; validate each item before storing

    async def insert_memories(self, memories: list[Memory]) -> list[str]:
        """Insert new memories, generating embeddings for each."""
        ids = []
        for memory in memories:
            # Attach the embedding to the outgoing document rather than
            # mutating the model instance
            doc = memory.model_dump()
            doc["content_embedding"] = await self.embed(memory.content)
            result = await self.db.user_memories.insert_one(doc)
            ids.append(result.inserted_id)
        return ids

    async def consolidate_memories(self, user_id: str, tag: str, llm) -> Memory | None:
        """Merge multiple memories with the same tag into one consolidated memory."""
        similar = await self.db.user_memories.find(
            {"user_id": user_id, "tags": tag, "contradicted_by": None}
        ).to_list(length=20)

        if len(similar) < 5:
            return None  # Not enough to consolidate

        prompt = f"""Merge these related memories into a single consolidated one:

{chr(10).join(f'- {m["content"]}' for m in similar)}

Return a single memory: {{"type": "...", "content": "...", "importance": 0.8}}"""

        result = await llm.complete(prompt, response_format={"type": "json_object"})
        consolidated = Memory(
            user_id=user_id,
            type=similar[0]["type"],
            content=result.data["content"],
            importance=result.data["importance"],
            confidence="high",
            source_type="consolidation",
            source_conversation_id=None,
            tags=[tag],
        )

        # Insert the merged memory, then mark the originals as superseded by it
        consolidated_ids = await self.insert_memories([consolidated])
        await self.db.user_memories.update_many(
            {"_id": {"$in": [m["_id"] for m in similar]}},
            {"$set": {"contradicted_by": consolidated_ids[0]}}
        )

        return consolidated

    async def cleanup_expired(self, user_id: str):
        """Archive low-importance, stale memories."""
        now = datetime.now()
        thirty_days_ago = now - timedelta(days=30)

        # Archive stale low-importance memories. The expires_at: None guard
        # keeps repeated runs from pushing an existing expiry date forward.
        await self.db.user_memories.update_many(
            {
                "user_id": user_id,
                "importance": {"$lt": 0.3},
                "last_accessed_at": {"$lt": thirty_days_ago},
                "contradicted_by": None,
                "expires_at": None,
            },
            {"$set": {"expires_at": now + timedelta(days=90)}}
        )

        # Archive old context memories (expire immediately)
        ninety_days_ago = now - timedelta(days=90)
        await self.db.user_memories.update_many(
            {
                "user_id": user_id,
                "type": "context",
                "created_at": {"$lt": ninety_days_ago},
                "contradicted_by": None,
                "expires_at": None,
            },
            {"$set": {"expires_at": now}}
        )

7. Design Decisions

Why LLM extraction instead of heuristics

You could try to extract memories with keyword matching or regex patterns. This fails for the same reason RAG over transcripts is noisy: natural language is ambiguous. "I guess I prefer shorter paragraphs" is a preference, but no keyword signals it clearly. An LLM understands the intent.

Why async + background, not inline

Extraction takes 1-3 seconds. Adding that to every conversation turn would be a terrible UX. By running it after the conversation, it's invisible to the user.

Why discrete memories, not a summary blob

Some systems store one big "user profile summary" that gets updated each session. This is simpler but worse:

  • Because it's updated by appending, it drifts or gets stale
  • You can't search it semantically ("what's their writing preference?")
  • You can't track importance or decay individually
  • You can't see the provenance of a specific piece of knowledge

Discrete memories are a small schema investment that pays back in retrieval quality.

Why importance scoring matters

Not all memories are equally valuable. "User is targeting VP roles" matters more than "User has a dog named Max." The importance score lets you:

  • Prioritize context window allocation (important memories first)
  • Decide which memories to archive (low importance + stale = safe to drop)
  • Surface the most relevant memories in search results
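
The first point, prioritizing context window allocation, could be implemented as a greedy selection under a token budget. The sketch below is illustrative only: `select_for_prompt` and `estimate_tokens` are hypothetical helpers, and the four-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def select_for_prompt(memories: list[dict], token_budget: int) -> list[dict]:
    """Pick memories in descending importance until the budget is spent."""
    ranked = sorted(memories, key=lambda m: m["importance"], reverse=True)
    chosen, used = [], 0
    for memory in ranked:
        # Cost is measured on the line as it would appear in the system prompt
        line = f"- [{memory['type'].upper()}] {memory['content']}"
        cost = estimate_tokens(line)
        if used + cost > token_budget:
            continue  # skip it; a shorter, lower-ranked memory may still fit
        chosen.append(memory)
        used += cost
    return chosen
```

With a tight budget the career goal is kept and the dog is dropped, which is the behavior the importance score is meant to produce.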

Why consolidation is important

A user who has 12 conversations can easily accumulate 40+ memories. Many will say similar things ("I prefer casual tone", "I don't like formal language", "keep it conversational", "too stiff last time"). Consolidation merges these into one clear, high-confidence memory, reducing noise and token waste.


8. Cost Estimates

| Operation | Model | Tokens (est.) | Cost |
|---|---|---|---|
| Extraction (per conversation) | GPT-4o-mini | ~500 in, ~200 out | ~$0.002 |
| Embedding (per memory) | text-embedding-3-small | ~50 tokens | ~$0.00001 |
| Consolidation (per merge, rare) | GPT-4o-mini | ~300 in, ~100 out | ~$0.001 |
| Retrieval (per session start) | text-embedding-3-small | ~200 tokens | ~$0.00004 |
| Prompt injection (per conversation turn) | N/A (input tokens) | ~200 tokens | ~$0.0005 (flagship model) |

Monthly cost for 1,000 conversations: ~$2-5. Memory is cheap.
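
To see how the per-operation figures add up to the monthly range, the arithmetic can be written out. The unit costs below are copied from the table; the volumes (3 new memories and 5 turns per conversation) are assumptions chosen for illustration, not measurements.

```python
# Per-operation costs in USD, taken from the table above
COST_PER_EXTRACTION = 0.002
COST_PER_EMBEDDING = 0.00001
COST_PER_RETRIEVAL = 0.00004
COST_PER_INJECTED_TURN = 0.0005
COST_PER_CONSOLIDATION = 0.001


def monthly_cost(
    conversations: int,
    memories_per_conversation: int = 3,  # assumed volume
    turns_per_conversation: int = 5,     # assumed volume
    consolidations: int = 0,
) -> float:
    """Estimate monthly memory-system spend in USD."""
    per_conversation = (
        COST_PER_EXTRACTION
        + memories_per_conversation * COST_PER_EMBEDDING
        + COST_PER_RETRIEVAL
        + turns_per_conversation * COST_PER_INJECTED_TURN
    )
    return conversations * per_conversation + consolidations * COST_PER_CONSOLIDATION
```

With these assumed volumes, `monthly_cost(1000)` comes to about $4.57, inside the quoted range. The per-turn prompt injection is the largest term, so longer conversations move the total more than extra memories do.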


9. Testing Memory Quality

Memory extraction is non-deterministic, so test it like you test agent outputs:

Eval dataset

[
  {
    "transcript": "User: I really hate when my resume sounds like everyone else's. Keep it real, no fluff.\nAgent: Got it, I'll keep it authentic and direct.",
    "expected_type": "preference",
    "expected_contains": ["authentic", "direct", "no fluff"],
    "expected_importance_min": 0.6
  },
  {
    "transcript": "User: I'm looking for a CTO role at an early-stage fintech startup. Seed or Series A, ideally under 30 people.",
    "expected_type": "goal",
    "expected_contains": ["CTO", "fintech", "early-stage"],
    "expected_importance_min": 0.8
  }
]

Quality checks

  • Precision: Of the memories extracted, what % are actually useful?
  • Recall: Of the test transcripts, what % of known memories were extracted?
  • Hallucination rate: What % of extracted memories contain information not in the transcript?
  • Duplicate rate: What % of extracted memories are duplicates of existing ones?

Run these checks on every extraction for the first 100 conversations. Tune the extraction prompt based on the results.
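
A minimal scorer for the eval dataset might look like the sketch below. It assumes extracted memories are dicts with the type, content, and importance fields from the schema, and it treats a case as passed when any one extracted memory satisfies every expectation. `case_passes` and `recall` are hypothetical names; substring matching is a deliberately crude stand-in for a semantic comparison.

```python
def case_passes(extracted: list[dict], case: dict) -> bool:
    """True if at least one extracted memory meets all expectations of a case."""
    for memory in extracted:
        content = memory["content"].lower()
        if (
            memory["type"] == case["expected_type"]
            and memory["importance"] >= case["expected_importance_min"]
            and all(term.lower() in content for term in case["expected_contains"])
        ):
            return True
    return False


def recall(results: list[tuple[list[dict], dict]]) -> float:
    """Share of eval cases whose expected memory was extracted."""
    if not results:
        return 0.0
    return sum(case_passes(extracted, case) for extracted, case in results) / len(results)
```

Precision, hallucination rate, and duplicate rate need a human or LLM judgment per memory, so they are not covered by this sketch.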


10. Interaction with Other Systems

With the Orchestrator

The orchestrator has four ways to access user knowledge:

  1. System prompt injection (automatic): top memories loaded at session start
  2. search_user_memories(query) tool: semantic search during conversation
  3. get_user_profile(user_id) tool: structured profile data from MongoDB
  4. query_user_knowledge tool: raw transcript and document search (RAG, Tier 3)

The orchestrator's system prompt should tell it which to use when:

When the user asks about their preferences, goals, or past decisions:
  → Use search_user_memories to find relevant synthesized memories.

When the user asks for specific profile data (name, title, employment):
  → Use get_user_profile for exact, structured data.

When the user asks about detailed history or specific past content:
  → Use query_user_knowledge for raw transcript/document search.

With the Candidate App

The candidate app doesn't know about memories. It owns the structured profile data and the RAG pipeline (document ingestion → embedding). qwestly-agent owns the memory system independently.

With RAG (Tier 3)

Memory and RAG serve different retrieval needs:

  • Memory: "What does the user prefer?" — high-signal, distilled, based on conversations
  • RAG: "What did the user's resume say about Google?" — raw text, comprehensive, based on documents

They complement each other. A good answer to "help me write my card" might use:

  • Structured data for the scaffold (roles, companies, dates)
  • Memories for style preferences and past decisions
  • RAG for specific achievements and metrics from the resume

11. Summary

| Aspect | Decision |
|---|---|
| Storage | MongoDB user_memories collection with Atlas Vector Search index |
| Embedding model | OpenAI text-embedding-3-small (1536d), same as RAG pipeline |
| Extraction model | Budget-tier (GPT-4o-mini): classification task, not generation |
| Extraction timing | Async, after conversation ends. Not user-facing. |
| Retrieval | Top 20 memories by importance + recency injected into system prompt + tool for semantic search |
| Consolidation | When 5+ memories share a tag, merge via LLM |
| Decay | Low-importance + 30 days stale → archive. Context memories → 90 day expiry. |
| Cost | ~$2-5/month for 1,000 conversations. Negligible. |
Ownership qwestly-agent owns this entirely. Candidate app is unaware of it.