Qwestly Orchestration Plan — Concrete Proposal

A specific, actionable architecture for Qwestly's agentic system. Updated with confirmed decisions: separate Python app, Pydantic AI, three-tier data architecture (structured data, unstructured memory, RAG), direct MongoDB access.

Executive Summary

Build a new, separate Python app (qwestly-agent) using Pydantic AI as the agent framework. It receives user requests via a chat endpoint, routes intent to tool calls, and streams responses. The existing candidate app (Next.js/TS) serves as the frontend and data entry surface; the agent system is a completely separate concern. Data lives in MongoDB: operational data, unstructured memory, and vector search (RAG) all sit in one place via Atlas. Start with a single orchestrator plus tools, and add specialist sub-agents only once the prompt grows too complex.

System Architecture

graph TD
    subgraph candidate["candidate app<br/>(Next.js / TypeScript)"]
        direction TB
        C1[Frontend - chat UI, etc.]
        C2[Data entry / management]
        C3[Auth - Auth0]
        C4[Owns schema - Mongoose]
        C5[Exposes write APIs]
    end

    subgraph agent["qwestly-agent app<br/>(FastAPI / Python)"]
        direction TB
        A1[POST /api/chat - SSE]
        A2[Pydantic AI Orchestrator]
        A3[Tools + sub-agents]
        A4[Direct MongoDB reads]
        A5[Memory extraction - async]
    end

    subgraph mongo["MongoDB (Atlas)"]
        direction TB
        M1[Structured — users, cards, preferences]
        M2[Memory — user_memories collection]
        M3[RAG/Vectors — knowledge_chunk + index]
    end

    candidate -->|"API calls"| agent
    candidate --> mongo
    agent --> mongo

This is a new app, not a modification of candidate. The agent system has its own deploy cadence, its own dependencies, and its own concerns. The candidate app exposes a few API endpoints for data writes (saving cards, updating profiles) — everything else happens in qwestly-agent.

Phase 1: Foundation (v0)

Core Architecture

graph TD
    chat["Chat UI<br/>(candidate app)"] -->|"POST /api/chat"| fastapi["FastAPI on Vercel<br/>(qwestly-agent)"]
    fastapi --> orch["Pydantic AI Orchestrator"]

    orch --> profile["get_user_profile"]
    orch --> card["generate_qwestly_card"]
    orch --> knowledge["query_user_knowledge"]

    profile --> mongo["MongoDB + Atlas Vector Search<br/>(structured + memory + RAG)"]
    card --> mongo
    knowledge --> mongo

Framework Choice: Pydantic AI (CONFIRMED)

Why	Detail
Python-native	FastAPI + Pydantic AI share the Pydantic schema DNA. Your existing request/response models plug right in.
Type safety	Every tool input, tool output, and agent result is validated by Pydantic. No runtime surprises.
Multi-provider	DeepSeek v4 Flash for routing/basic tasks, DeepSeek v4 Pro for complex generation — one-line config change per agent.
Structured outputs	Card generation returns a typed `CardResult` model. The LLM can't return malformed data.
Testable	Built-in test model for deterministic agent tests without calling a real LLM.
No new infra	Pure Python, no native deps, runs fine on Vercel's Python runtime.

Model Routing (CONFIRMED)

Two models, assigned by task complexity. This is implemented up front, not deferred.

Model	Tier	What it handles	Why
DeepSeek v4 Flash	Default (chat)	Orchestrator routing, intent classification, memory extraction, `list_capabilities`, simple Q&A	Fast, cheap, sufficient for classification/routing
DeepSeek v4 Pro	Premium (pro)	Card generation, About section writing, multi-step reasoning, content that users will read and judge	Higher quality output for user-facing content

Routing logic: The model is assigned per-agent or per-tool, not chosen by the orchestrator at runtime:

from pydantic_ai import Agent

# Orchestrator itself: Flash for fast routing.
# Tools are registered with the decorators below, not via tools=[...].
orchestrator = Agent(
    "deepseek:deepseek-v4-flash",
    system_prompt=ORCHESTRATOR_PROMPT,
)

# Content-generation agents use Pro. They are defined once at module level
# so they are not rebuilt on every tool call.
card_agent = Agent(
    "deepseek:deepseek-v4-pro",  # ← Pro for content
    system_prompt=CARD_GENERATION_PROMPT,
    result_type=CardResult,  # named output_type in newer Pydantic AI releases
)
about_agent = Agent(
    "deepseek:deepseek-v4-pro",  # ← Pro for content
    system_prompt=ABOUT_GENERATION_PROMPT,
)

# tool_plain: these tools take no RunContext argument
@orchestrator.tool_plain
async def generate_qwestly_card(user_id: str, style: str = "standard") -> CardResult:
    """Generate a Qwestly Card. This tool internally uses DeepSeek v4 Pro
    for high-quality content generation."""
    result = await card_agent.run(...)
    return result.data  # the run result wraps the CardResult (result.output in newer releases)

@orchestrator.tool_plain
async def suggest_linkedin_about(user_id: str) -> str:
    """Generate an improved LinkedIn About section. Uses Pro for quality."""
    result = await about_agent.run(...)
    return result.data

Token threshold upgrade: When a conversation exceeds ~4000 tokens of accumulated context, the orchestrator itself can be swapped to Pro for the remainder of the session. The increased context length and complexity warrant the more capable model. This is a v0.3 optimization — start with static assignment per tool, add threshold-based upgrade once you have latency/cost data.

Cost impact: Flash is ~10x cheaper than Pro. Routing with Flash + generating with Pro keeps the common case (simple questions, memory extraction) cheap while spending on what users actually see (cards, content).
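
The static assignment and the threshold upgrade described above can be sketched as a single lookup. This is a minimal illustration, not the planned implementation: `pick_model`, `estimate_tokens`, the task labels in `PRO_TASKS`, and the 4-characters-per-token estimate are all assumptions. Only the two model identifiers and the ~4000-token threshold come from this plan.

```python
FLASH = "deepseek:deepseek-v4-flash"
PRO = "deepseek:deepseek-v4-pro"
UPGRADE_THRESHOLD_TOKENS = 4000  # from the plan; tune once latency/cost data exists

# Hypothetical task labels: content that users read and judge goes to Pro.
PRO_TASKS = {"card_generation", "about_section"}


def estimate_tokens(messages: list[str]) -> int:
    """Rough size estimate (assumption: about 4 characters per token)."""
    return sum(len(m) for m in messages) // 4


def pick_model(task: str, messages: list[str]) -> str:
    """Static per-task assignment, with a Pro upgrade for long sessions."""
    if task in PRO_TASKS:
        return PRO
    if estimate_tokens(messages) > UPGRADE_THRESHOLD_TOKENS:
        return PRO
    return FLASH
```

A real version would likely read the token count from the provider's usage metadata rather than estimating it from character length.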

Tools (v0)

Tool	What it does	Backed by
`get_user_profile(user_id)`	Returns structured profile data from MongoDB (reads from `candidates_enhanced`, `CandidateProfileCache`)	Direct MongoDB query
`ingest_linkedin_profile(linkedin_url)`	Fetches LinkedIn data for a user. Calls the candidate app's `/api/linkedin/profile` endpoint — which handles fetching, retries, caching, and force refresh. qwestly-agent does NOT call the third-party API directly.	Candidate app API
`suggest_linkedin_about(user_id)`	Generates an improved LinkedIn About section draft	New — LLM call with profile context
`generate_qwestly_card(user_id, style?)`	Creates a Qwestly Card (structured career document)	New LLM pipeline — likely agentic internally
`query_user_knowledge(user_id, question)`	Semantic search across everything about a user (notes, history, resumes, interviews)	Atlas Vector Search + memory search
`query_qwestly_docs(question)`	Answers "how does Qwestly work?" questions	Atlas Vector Search over internal docs
`search_user_memories(user_id, query)`	Semantic search over synthesized memories (preferences, goals, facts, decisions)	`user_memories` collection + vector search
`list_capabilities()`	Returns what Qwestly can do	Static list or config

Data Architecture (Three Tiers)

This is the key architectural decision. Qwestly's data falls into three distinct tiers, each with different storage strategies and query patterns.

Tier 1: Structured Data

What it is: Database fields — user profiles, employment history, education, preferences, Qwestly Cards. Deterministic, exact queries.

Storage: MongoDB collections (candidates_enhanced, CandidateProfileCache, employment_stints_enhanced, preferences_enhanced, qc_sections, etc.)

Owned by: Candidate app (Mongoose schemas are source of truth)

Accessed by qwestly-agent via: Direct MongoDB reads (read-only). Tool calls for exact lookups.

Query pattern: get_user_profile(user_id) — direct document lookup. Deterministic, exact, fast.

Tier 2: Unstructured Memory (NEW — see `unstructured-memory.md`)

What it is: LLM-extracted, synthesized memories from conversations. Discrete facts, preferences, goals, and decisions the agent learns about a user over time. Not raw transcripts — distilled, deduplicated, structured knowledge.

Storage: MongoDB user_memories collection + vector index for semantic retrieval

Owned by: qwestly-agent (extracts and manages memories)

Query pattern: Loaded at session start (top memories by importance + recency) and searchable via tool during conversation.

Key properties:

Each memory is a discrete document: {type, content, importance, source, user_id, created_at}
Extracted asynchronously after conversations (LLM reviews transcript → extracts memories)
Consolidated periodically (merge similar memories)
Injected into system prompt at session start

This is the gap the candidate app doesn't fill. See unstructured-memory.md for the full design.
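
To make the memory document and the "importance + recency" ranking concrete, here is a minimal sketch. The field names follow the document shape listed above; the 0-to-1 importance scale, the 30-day half-life, and the 0.7/0.3 weights are assumptions chosen for illustration, and `unstructured-memory.md` remains the source of truth for the real design.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UserMemory:
    type: str          # e.g. "preference", "goal", "fact", "decision"
    content: str
    importance: float  # assumed to lie in [0, 1]
    source: str
    user_id: str
    created_at: datetime


def memory_score(m: UserMemory, now: datetime, half_life_days: float = 30.0) -> float:
    """Blend importance with an exponential recency decay (weights are assumptions)."""
    age_days = max((now - m.created_at).total_seconds() / 86400, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return 0.7 * m.importance + 0.3 * recency


def top_memories(memories: list[UserMemory], now: datetime, limit: int = 20) -> list[UserMemory]:
    """Return the highest-scoring memories, e.g. for injection at session start."""
    return sorted(memories, key=lambda m: memory_score(m, now), reverse=True)[:limit]
```

With these weights, a very important old memory still outranks a fresh but unimportant one, which matches the intent of keeping durable preferences and goals in context.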

Tier 3: RAG (Vector Search over Raw Content)

What it is: Semantic search over unstructured raw content — resumes, interview transcripts, uploaded documents, full chat transcripts. Not synthesized, not distilled — the original text.

Storage: MongoDB knowledge_chunk collection + Atlas Vector Search index (already exists in candidate app)

Owned by: Candidate app (document ingestion pipeline) — qwestly-agent queries the same index

Query pattern: query_user_knowledge(user_id, question) — embed question → $vectorSearch → return top-k chunks
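
The "embed question → $vectorSearch → return top-k chunks" step could build its aggregation pipeline roughly as follows. This is a sketch under assumptions: the field names `embedding` and `text`, and the presence of `user_id` as a filter field in the index definition, are not confirmed by this plan. Only the index name `knowledge_chunk_vector_index` comes from the plan.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    user_id: str,
    limit: int = 5,
    num_candidates: int = 100,
) -> list[dict]:
    """Build an aggregation pipeline for Atlas $vectorSearch, scoped to one user."""
    return [
        {
            "$vectorSearch": {
                "index": "knowledge_chunk_vector_index",  # name from this plan
                "path": "embedding",              # assumption: field holding the vector
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # must be >= limit
                "limit": limit,
                # assumption: user_id is declared as a filter field in the index
                "filter": {"user_id": user_id},
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,  # assumption: field holding the chunk text
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

The resulting list would be passed to the collection's `aggregate` call. Filtering inside `$vectorSearch` rather than in a later `$match` keeps another user's chunks out of the candidate set entirely.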

Already implemented in candidate app: Document upload → Python parsing API → chunking (1500 chars, 200 overlap) → OpenAI text-embedding-3-small (1536d) → Atlas $vectorSearch with cosine similarity.
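
For reference, the chunking parameters above (1500 characters, 200 overlap) behave like the fixed-window splitter below. This is an illustrative sketch only: the candidate app's actual splitter may break on separators or sentence boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of embedding some text twice.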

Tier Comparison

Aspect	Structured Data	Unstructured Memory	RAG
What it stores	Profile fields, employment, education, cards	Synthesized facts, preferences, goals, decisions	Raw documents, transcripts, resumes
Processing	Direct from user input / LinkedIn API	LLM extraction + consolidation after conversations	Chunking + embedding pipeline
Query	Exact lookup by field/key	Semantic search + recency/importance ranking	Semantic search (vector similarity)
Accuracy	Deterministic, exact	High-signal (distilled), may have gaps	Noisy (raw text), comprehensive
Updates	On user action / LinkedIn ingest	Async after each conversation	On document upload
Use case	"What's my current title?"	"What writing style do I prefer?"	"What did my interview transcript say about leadership?"
Collection	`candidates_enhanced`, `CandidateProfileCache`, etc.	`user_memories` (NEW)	`knowledge_chunk` (exists)

Data Flow: Full Query Lifecycle

When a user asks "What do we know about me?":

graph TD
    Q["User asks: What do we know about me?"] --> O["Orchestrator fuses all three in LLM context"]

    O --> T1["1. get_user_profile<br/>Structured data from<br/>candidates_enhanced, CandidateProfileCache<br/>→ Deterministic, exact<br/>→ Scaffold: name, roles, companies"]
    O --> T2["2. search_user_memories<br/>Synthesized memories from user_memories<br/>→ High-signal distilled knowledge<br/>→ Preferences, goals, past decisions"]
    O --> T3["3. query_user_knowledge<br/>Raw text chunks from knowledge_chunk via vectorSearch<br/>→ Comprehensive but noisy<br/>→ Specific quotes, detailed history"]

    T1 --> R["4. LLM synthesizes final answer"]
    T2 --> R
    T3 --> R
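
For broad questions like this one, the three lookups are independent, so they can run concurrently before the LLM synthesizes an answer. The sketch below shows that fan-out with plain `asyncio`. It is an assumption about one possible implementation: in the plan the orchestrator LLM decides which tools to call, and `gather_tiers` plus the section labels are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

# Each lookup takes a user_id and returns text ready to place in the LLM context.
Lookup = Callable[[str], Awaitable[str]]


async def gather_tiers(user_id: str, profile: Lookup, memories: Lookup, knowledge: Lookup) -> str:
    """Run the three tier lookups concurrently and label each result for the LLM."""
    results = await asyncio.gather(
        profile(user_id),
        memories(user_id),
        knowledge(user_id),
        return_exceptions=True,  # one failing tier should not sink the other two
    )
    labels = ["Structured profile", "Synthesized memories", "Raw document chunks"]
    sections = []
    for label, result in zip(labels, results):
        body = "(unavailable)" if isinstance(result, Exception) else result
        sections.append(f"## {label}\n{body}")
    return "\n\n".join(sections)
```

Labeling each section lets the model weigh deterministic profile fields above noisy raw chunks when the tiers disagree.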

Data Layer Summary

What	Where	Notes
User profiles	MongoDB (`candidates_enhanced`, `CandidateProfileCache`, etc.)	Already exists. Read by qwestly-agent directly.
LinkedIn data	MongoDB (`linkedin_profiles`, `linkedin_summaries`)	Already exists. Flexible schema handles nested LinkedIn JSON naturally.
Generated artifacts	MongoDB (`qc_sections`, `linkedin_profile_suggestions`, etc.)	Already exists.
Chat history	MongoDB (`chatbot_sessions`, `chatbot_messages`)	Already exists in candidate app. New sessions will be logged by qwestly-agent into a new collection or the same one.
Unstructured memory	MongoDB (`user_memories`) — NEW	LLM-extracted facts, preferences, goals. Owned by qwestly-agent. Needs Atlas Vector Search index (`user_memories_vector_index`, 1536d, cosine) on M10+. For local dev: `db.user_memories.createIndex({content: "text"})` as fallback.
RAG/vector search	MongoDB Atlas Vector Search on `knowledge_chunk`	Already exists. Index name: `knowledge_chunk_vector_index`, 1536d, cosine similarity.
Embeddings	OpenAI `text-embedding-3-small` (1536d)	Already in use by candidate app. qwestly-agent uses same model/dimensions.

Chat Interface (v0)

Chat widget in the candidate app (React/Next.js)
Calls POST /api/chat on the qwestly-agent FastAPI app
Streams responses via Server-Sent Events (SSE)
Start simple: plain text responses. Upgrade to structured action cards later.

Streaming on FastAPI + Vercel

from fastapi import APIRouter
from fastapi.responses import StreamingResponse

router = APIRouter()

@router.post("/chat")
async def chat(request: ChatRequest):
    agent = QwestlyOrchestrator(user_id=request.user_id)

    async def event_stream():
        async for chunk in agent.run_stream(request.message):
            yield f"data: {chunk.json()}\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "X-Accel-Buffering": "no",
            "Cache-Control": "no-cache",
        }
    )

Timeout: Set maxDuration: 300 in Vercel config. A typical agent loop runs ~15-45s.
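
Each streamed chunk has to follow the Server-Sent Events wire format: an optional `event:` line, a `data:` line, then a blank line. A small helper keeps that framing in one place. This is a sketch; `format_sse` and the `tool_call` event name are hypothetical, not part of the existing code.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Serialize one SSE frame: optional event name, one JSON data line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps without indent emits a single line, which SSE data lines require
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```

Named events would let the chat widget tell progress updates such as "Searching your profile..." apart from answer text without inspecting the payload.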

Auth & Session Management

The candidate app authenticates users via Auth0. qwestly-agent needs to know who is talking to it without re-implementing authentication. The pattern: candidate app creates a short-lived JWT with user context, qwestly-agent verifies it using a shared secret.

Flow

sequenceDiagram
    participant U as User
    participant C as Candidate App<br/>(Auth0-authenticated)
    participant Q as qwestly-agent<br/>POST /api/chat
    participant O as Orchestrator

    U->>C: Authenticated request
    C->>C: Create JWT with<br/>{user_id, email, name}<br/>signed with QWESTLY_AGENT_SHARED_SECRET
    C->>Q: Forward request + JWT
    Q->>Q: Verify JWT using shared secret<br/>Extract user_id → RunContext
    Q->>O: Run with user context

Implementation

Candidate app side (creates JWT before calling qwestly-agent):

// In the chat API route or server component
import jwt from 'jsonwebtoken';

const agentToken = jwt.sign(
  {
    user_id: session.user.sub,       // Auth0 user ID
    email: session.user.email,
    name: session.user.name,
  },
  process.env.QWESTLY_AGENT_SHARED_SECRET,
  { algorithm: 'HS256', expiresIn: '5m' }  // Short-lived; mint a fresh token per request, since qwestly-agent verifies every request
);

qwestly-agent side (verifies JWT on every request):

import os
import jwt  # PyJWT
from fastapi import Depends, Header, HTTPException

AGENT_SHARED_SECRET = os.environ["QWESTLY_AGENT_SHARED_SECRET"]

async def verify_agent_token(authorization: str = Header(...)) -> dict:
    """Verify the JWT from the candidate app. Returns user context."""
    # Expect exactly "Bearer <token>"; reject anything else up front
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise HTTPException(status_code=401, detail="Invalid authorization header")
    try:
        # Pin the algorithm so a token signed any other way is rejected
        payload = jwt.decode(token, AGENT_SHARED_SECRET, algorithms=["HS256"])
        return payload  # { user_id, email, name }
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

# Usage in route
@router.post("/chat")
async def chat(request: ChatRequest, user: dict = Depends(verify_agent_token)):
    agent = QwestlyOrchestrator(user_id=user["user_id"])
    ...

Why this over alternatives

Approach	Why not
Pass user_id as plain header	No integrity check — any caller could impersonate any user
Candidate proxies all LLM calls	Defeats the purpose of a separate agent app. Tight coupling.
qwestly-agent validates Auth0 tokens directly	Requires Auth0 SDK + same client secret in two apps. A shared-secret JWT between the two apps is simpler.
API key only (no user context)	Doesn't tell the agent who the user is — can't scope queries.

Session startup

On first message in a session, qwestly-agent:

Verifies the JWT → gets user_id
Loads the user's profile from MongoDB (structured data)
Loads top 20 memories (memory tier)
Injects both into the orchestrator's system prompt
Begins the conversation

Subsequent messages in the same session reuse the loaded context (no need to re-fetch profile/memories unless the session is long — see conversation management below).
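
The injection step in the startup sequence above amounts to string assembly. A minimal sketch, assuming the profile arrives as a flat dict and memories as plain strings; `build_system_prompt` and the section headings are hypothetical:

```python
def build_system_prompt(
    base_prompt: str,
    profile: dict,
    memories: list[str],
    max_memories: int = 20,  # the plan loads the top 20 memories
) -> str:
    """Append structured profile fields and top memories to the base system prompt."""
    profile_lines = [f"- {key}: {value}" for key, value in profile.items()]
    memory_lines = [f"- {m}" for m in memories[:max_memories]]
    return "\n".join(
        [
            base_prompt,
            "",
            "User profile:",
            *profile_lines,
            "",
            "Known about this user:",
            *memory_lines,
        ]
    )
```

Because the result is built once per session, its length is a fixed cost on every turn, which is one reason to cap the memory count.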

Within-Session Conversation Management

Each turn adds tokens to the context window. After 20+ turns, the context can overflow the LLM's limit or degrade performance. Cross-session context is handled by the memory system. Within-session, we need a strategy.

v0: Sliding Window

Keep the last 15 messages in context. Drop older ones. Simple, works for most sessions.

MAX_SESSION_MESSAGES = 15

def trim_context(messages: list) -> list:
    """Keep only the most recent messages within the limit."""
    if len(messages) > MAX_SESSION_MESSAGES:
        # Always keep the system prompt (index 0)
        return [messages[0]] + messages[-(MAX_SESSION_MESSAGES - 1):]
    return messages

Why this is safe: The memory extraction pipeline (running async after the session) captures important facts, preferences, and decisions into the memory tier. So even if within-session context is trimmed, the agent still "remembers" what matters across sessions.

v1: Summarization

When sessions consistently exceed 20 turns, add periodic summarization: compress messages 1-15 into a single summary message, keep messages 16-20 live. An LLM call generates the summary. Keeps the best of both — compact history + recent detail.
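
A sketch of that compaction step, with the LLM call abstracted behind a `summarize` callable so the list handling can be tested on its own. The function name, the summary prefix, and the assumption that `messages[0]` is the system prompt are illustrative:

```python
from typing import Callable


def compact_history(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 5,  # messages 16-20 stay live in the plan's example
    trigger: int = 20,
) -> list[str]:
    """Replace older messages with one summary once the session passes `trigger`.

    messages[0] is assumed to be the system prompt and is always kept.
    """
    if len(messages) <= trigger:
        return messages
    system, history = messages[0], messages[1:]
    older, recent = history[:-keep_recent], history[-keep_recent:]
    summary = f"Summary of earlier conversation: {summarize(older)}"
    return [system, summary, *recent]
```

In production `summarize` would wrap a Flash call; passing it in as a parameter keeps the trimming logic deterministic under test.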

v2: RAG over History

Store every turn in MongoDB. When the agent needs historical context from earlier in the session, it calls query_user_knowledge — same RAG tool it uses for documents. The agent decides what history it needs, when it needs it.

Error Handling

Agents are multi-step, multi-system operations. Things will fail. The plan for each failure mode:

Tool failures

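Pydantic AI surfaces tool errors to the agent when the tool raises its `ModelRetry` exception: the message goes back to the model as a retry prompt. Any other exception propagates and ends the run, so tools should catch expected failures (timeouts, missing documents) and re-raise them as `ModelRetry`. If get_user_profile fails this way, the agent sees the error message and can: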

Retry (if it looks transient, like a timeout)
Try an alternative (e.g., search memories if profile data is unavailable)
Apologize + explain (if the data is genuinely unavailable)

No special infrastructure needed — this is built into the Pydantic AI agent loop.

LLM call failures

If the LLM API returns an error (rate limit, timeout, provider outage), the agent loop catches it and returns a graceful response:

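from pydantic_ai.exceptions import AgentRunError

async def respond(message: str, deps: AgentDeps, conversation_id: str):
    try:
        result = await orchestrator.run(message, deps=deps)
        yield result.data
    except AgentRunError as e:  # base class of ModelHTTPError, UnexpectedModelBehavior, UsageLimitExceeded
        # Log the full error for debugging
        await log_error(conversation_id, e)
        # Return a user-friendly message (not raw error)
        yield "I'm having trouble processing that right now. Can you try again?"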

Infinite loops

Set a hard limit on tool calls per request: max 10 tool calls. The orchestrator's system prompt also instructs it to give a final answer within a reasonable number of steps.

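from pydantic_ai.usage import UsageLimits

result = await orchestrator.run(
    message,
    deps=deps,
    # Hard cap, enforced by Pydantic AI (raises UsageLimitExceeded).
    # request_limit counts model requests; each round of tool calls costs one.
    usage_limits=UsageLimits(request_limit=10),
)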

Vercel timeout

The 300s ceiling is the ultimate safety net. If the agent hasn't finished by then, Vercel kills the function and the user sees an error. Mitigated by: monitoring P95 latency, triggering alerts if consistently >200s, moving to a dedicated backend if needed.
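
The P95 check can be computed directly from logged request latencies. A minimal sketch using the nearest-rank method; the function names are hypothetical, and the 200s threshold is the one stated above:

```python
ALERT_THRESHOLD_S = 200.0  # from the plan: alert if P95 is consistently above 200s


def p95(latencies_s: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    # ceil(0.95 * n) in integer arithmetic, to avoid float rounding surprises
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(latencies_s: list[float]) -> bool:
    """True when the window's P95 latency exceeds the alert threshold."""
    return p95(latencies_s) > ALERT_THRESHOLD_S
```

A single slow request in a window of twenty does not trip the alert under nearest-rank P95, which keeps it from firing on one-off outliers.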

Summary

Failure mode	v0 approach
Tool failure	Agent sees error → retries or apologizes (Pydantic AI built-in)
LLM call failure	Catch the Pydantic AI run error → log full error → return user-friendly message
Infinite loop	Max 10 tool calls per request (Pydantic AI enforces)
Vercel timeout	300s ceiling. Monitor P95. Move to dedicated backend if >200s.

Key Design Decisions

Separate Python app, not added to candidate. The agent system has its own deploy cadence, dependencies, and concerns. Candidate app is the frontend + data entry surface.
JWT-based auth between apps. Candidate app creates a short-lived JWT (shared secret) with user context. qwestly-agent verifies it. No Auth0 SDK duplication, no plaintext user IDs.
Direct MongoDB reads, write through candidate app APIs. qwestly-agent reads structured data directly from MongoDB. For writes (saving a card, updating a profile), it calls candidate app API endpoints so validation/business logic stays in one place.
One orchestrator, no sub-agents (for v0). Every capability is a tool the orchestrator can call. Resist splitting into sub-agents until the orchestrator's prompt exceeds ~100 lines or routing logic gets genuinely complex.
Tools that internally use LLMs are fine. suggest_linkedin_about and generate_qwestly_card will be agentic internally. The orchestrator doesn't know or care — it sees a tool call.
Three-tier data access. Structured profile data (tool calls) + synthesized memory (session start injection + tool) + raw RAG (tool). Each serves a different purpose. The orchestrator fuses results.
Memory extraction is async. After each conversation, an LLM reviews the transcript and extracts memories. This doesn't block the user — it runs as a background process.
Sliding window for within-session context. Keep last 15 messages. Memory tier preserves important details across sessions. Upgrade to summarization when sessions grow long.
Stream everything, including tool calls. The user should see the agent's reasoning: "Searching your profile...", "Generating card...", then the final answer streaming in.
Log every interaction to MongoDB. User message, every tool call (name + args + result), final response, latency, token count. This data is your debugging lifeline and your eval dataset.
Graceful error handling. Tool errors exposed to agent for retry. LLM failures caught and returned as user-friendly messages. Hard cap of 10 tool calls per request.

Phase 2: Sub-Agents (v1 — When the Prompt Gets Too Long)

Don't build this until you need it. You'll know when:

The orchestrator's system prompt exceeds ~100 lines
Tool descriptions are paragraphs long and still getting confused
You find yourself writing "if this, delegate to that" logic in the prompt

New Topology

graph TD
    U["User"] --> O["Orchestrator<br/>(routes intent, no longer does the work)"]
    O --> LA["LinkedIn Agent"]
    O --> QA["QA Agent"]
    O --> CG["Card Gen Agent"]
    O --> KA["Knowledge Agent"]

Sub-Agent Pattern (Agent-as-Tool)

Each sub-agent is a Pydantic AI Agent wrapped behind a tool. The orchestrator still calls tools — some tools just happen to run an LLM loop internally:

@orchestrator.tool
async def delegate_card_generation(
    ctx: RunContext[AgentDeps],
    user_id: str,
    style: str = "standard",
) -> CardResult:
    """Generate a Qwestly Card. Handles the full multi-step pipeline:
    profile analysis -> section drafting -> formatting.

    This is a complex operation that runs its own agent loop internally.
    The orchestrator just gets back the final CardResult.
    """
    card_agent = Agent(
        "deepseek:deepseek-v4-pro",  # Pro for content generation, per Model Routing
        system_prompt=CARD_AGENT_SYSTEM_PROMPT,
        deps_type=AgentDeps,
        result_type=CardResult,
        tools=[get_user_profile, format_card_text],
    )
    result = await card_agent.run(
        f"Generate a {style} Qwestly Card for user {user_id}",
        deps=ctx.deps,
        usage=ctx.usage,  # count the sub-agent's tokens against the parent run
    )
    return result.data

Which Sub-Agents, and When

Sub-agent	Split trigger	Key responsibility
LinkedIn Agent	LinkedIn ingestion grows to handle multiple sources, merge conflicts, partial updates	Fetch, parse, transform LinkedIn data; handle merge with existing profile data
Card Generator Agent	Card generation has 3+ templates, multi-section writing, formatting pipeline	Profile analysis -> template selection -> section writing -> formatting -> review
Profile QA Agent	Users ask complex analytical questions about their data that need multi-hop reasoning	Gets profile + RAG results + synthesizes answers that require reasoning across data sources
Knowledge Agent	Qwestly's internal documentation gets large enough to need its own RAG pipeline	RAG over internal docs, answers "how does X work" questions

Phase 3: HITL + MCP (v2)

Human-in-the-Loop for Write / Public-Facing Actions

By default, the agent generates drafts freely. But any action that writes data, changes visible state, or affects something public-facing requires user confirmation.

The pattern is a class of actions, not a specific tool. Any tool that mutates state follows the same gating:

@agent.tool
async def save_card_as_active(
    ctx: RunContext[AgentDeps],
    user_id: str,
    card_id: str,
) -> ActionResult:
    """Save this Qwestly Card as the user's active/public card.

    CRITICAL: Never call this unless the user has explicitly confirmed.
    First show the card draft, let the user review it, and only call
    this after they say 'save it' or 'make it active'.
    """
    if not _user_confirmed_for_card(ctx, card_id):
        return ActionResult(
            status="needs_confirmation",
            preview=card_id,
            message="Ready to save this as your active card. Confirm?",
        )
    # ... write through candidate app API

Gating classification:

Action class	HITL?	Examples
Read anything	No	Profile lookup, knowledge search, memory search
Generate drafts	No	Card draft, About section draft, headline suggestions
Save / activate	Yes (tool-level)	Save card as active, update profile field, change preferences
Delete / destructive	Yes (tool-level minimum)	Delete card, remove data
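
The classification table can be encoded as data so every mutating tool shares one gate rather than re-implementing the check. A minimal sketch; `ActionClass` and `gate` are hypothetical names, and the `needs_confirmation` status mirrors the one used in `save_card_as_active`:

```python
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    GENERATE_DRAFT = "generate_draft"
    SAVE = "save"
    DELETE = "delete"


# Classes that write data or change visible state need explicit user confirmation.
REQUIRES_CONFIRMATION = {ActionClass.SAVE, ActionClass.DELETE}


def gate(action: ActionClass, confirmed: bool) -> str:
    """Return 'needs_confirmation' for unconfirmed mutating actions, else 'proceed'."""
    if action in REQUIRES_CONFIRMATION and not confirmed:
        return "needs_confirmation"
    return "proceed"
```

A tool would call `gate` first and return its confirmation prompt whenever the result is `needs_confirmation`.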

v0 UX: Inline chat confirmation. "Ready to save this as your active card. Confirm?" → user says yes/no.

v1 upgrade: Orchestrator-enforced gating (LangGraph interrupt_before) for higher-stakes actions.

See human-in-the-loop.md for the full HITL design.

MCP-ify Tools for Independent Deployment

When a tool needs to deploy independently, wrap it as an MCP server. Don't do this until you have 2+ services that genuinely need independent deploys. For v0, direct function calls are simpler and faster.

Technology Choices Summary

Layer	Choice	Why it fits
Language	Python 3.12+	Pydantic AI + FastAPI + MongoDB driver.
Agent framework	Pydantic AI (v0) -> maybe LangGraph sub-graphs later	Type safety, structured outputs, multi-provider, built-in test model.
LLM provider	DeepSeek v4 Flash (default, routing, extraction) + DeepSeek v4 Pro (content generation, complex reasoning)	Two-tier model routing assigned per-agent/per-tool. Pydantic AI makes model swaps a one-line change.
API layer	FastAPI (new app, separate from candidate)	Python-native, SSE streaming, shared schema DNA with Pydantic AI.
Database	MongoDB (shared with candidate app)	Structured data + memory + vector search — all in one DB.
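Vector search	MongoDB Atlas Vector Search (already implemented for resumes in candidate app)	Reuse the existing `knowledge_chunk` index for raw content; `user_memories` gets its own index (`user_memories_vector_index`), since each vector index covers one collection.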
Embeddings	OpenAI text-embedding-3-small (1536d) via `openai` SDK	Already in use by candidate app. Same model, same dimensions. No langchain dependency.
Tool protocol	Direct Python functions (v0) -> MCP (v2)	Start simple. MCP-ify when a tool needs independent deployment.
Streaming	FastAPI StreamingResponse + SSE	Works on Vercel Python runtime. No WebSockets needed.
Deployment	Vercel Pro Team (separate app from candidate)	300s timeout. Two apps on same team, independent deploys.
Frontend	Candidate app (Next.js/React)	Chat widget calls qwestly-agent's /api/chat. No new frontend.
Observability	Structured MongoDB logs (v0) -> Logfire (v1)	Agent debugging without traces is guesswork.

Implementation Roadmap

v0.1: Skeleton

Scaffold qwestly-agent app — follow the existing api-python Vercel pattern: api/index.py entry point, @vercel/python builder, vercel.json with routes. Same MongoDB connection pattern (motor/pymongo).
Add Pydantic AI + agent dependencies — pydantic-ai plus existing stack (fastapi, motor, openai for embeddings)
Define tool interfaces — write the function signatures and descriptions for all 7-8 v0 tools
Implement get_user_profile and list_capabilities — direct MongoDB reads, static config
Wire up a minimal orchestrator agent — Pydantic AI Agent with those 2 tools, a basic system prompt
Add POST /api/chat route to FastAPI with SSE streaming
Verify end-to-end on Vercel: POST a message -> orchestrator calls tool -> response streams back
Create user_memories collection + vector index in MongoDB Atlas
- Collection: user_memories in the candidate_portal database
- Vector index (Atlas UI → Atlas Search → Create Search Index → JSON Editor):
```
{
  "name": "user_memories_vector_index",
  "type": "vectorSearch",
  "definition": {
    "fields": [
      {
        "type": "vector",
        "path": "content_embedding",
        "numDimensions": 1536,
        "similarity": "cosine"
      }
    ]
  }
}
```
- Text index (fallback for local dev without Atlas vector search):
```
db.user_memories.createIndex({ content: "text" });
```
- ⚠️ Requires M10+ Atlas cluster. Won't work on M0/M2 free tier. For local dev, use the text index fallback (see README).

v0.2: Core Capabilities

Implement ingest_linkedin_profile — calls candidate app's /api/linkedin/profile endpoint via httpx
Implement suggest_linkedin_about — DeepSeek Pro generates an improved About section draft
Implement generate_qwestly_card — reads card status and available data sources from MongoDB; delegates to candidate app card endpoints
Implement search_user_memories — semantic search over user_memories with text fallback
Implement query_user_knowledge — $vectorSearch over user-scoped knowledge_chunk + text fallback
Implement query_qwestly_docs — $vectorSearch over internal docs + text fallback
Build memory extraction pipeline — MemoryService with extraction, retrieval, insertion (with embeddings), consolidation, cleanup
Add conversation logging — ConversationLogger writes to agent_conversations collection
Add session API — GET /api/conversations (list) and GET /api/conversations/{id} (history) with auto-naming from first message

v0.3: Hardening

Write unit tests — 48 tests: schemas, auth, config, tools, orchestrator, logging (pytest + pytest-asyncio)
Write integration tests with mocked LLM — system prompt coverage, tool registration, SSE streaming
Write 3-5 E2E tests with real LLM for critical paths — deferred to v1 (needs CI API keys)
Add observability — ConversationLogger captures tool calls, response text, latency to agent_conversations
Add cost tracking — lib/cost.py with DeepSeek Flash/Pro pricing
Add tool-level HITL — lib/hitl.py confirmation tracking, save_card_as_active example tool, HITL rules in system prompt
Version prompts — system prompts live as module constants; file extraction deferred
Deploy to Vercel — ready but not yet deployed
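
The cost-tracking item above reduces to a per-tier price table and one multiplication per call. The sketch below is not the contents of `lib/cost.py`: the prices are placeholders that only encode the plan's assumption that Flash is roughly 10x cheaper than Pro, and they are not real DeepSeek prices.

```python
# Placeholder prices in USD per million tokens. NOT real DeepSeek pricing;
# they only reflect the "Flash is ~10x cheaper than Pro" assumption.
PRICE_PER_MTOK = {
    "flash": {"input": 0.10, "output": 0.20},
    "pro": {"input": 1.00, "output": 2.00},
}


def estimate_cost_usd(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one LLM call from its token counts."""
    price = PRICE_PER_MTOK[tier]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
```

Summing these estimates per `conversation_id` gives the cost-per-conversation figure that the risk table calls for.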

v1: Learn & Iterate

Collect real user interactions — what do they ask? Which tools get used?
Audit tool choice accuracy — is the orchestrator picking the right tool every time?
Improve tool descriptions based on actual mis-routings
Build an eval dataset from real conversations for regression testing
A/B test different system prompts to improve routing accuracy
Tune memory extraction — is the LLM extracting useful memories? Are they improving conversations?

v2+: Evolve

Evaluate sub-agent split — is the orchestrator prompt too long? Are tools getting confused?
Add token-threshold model upgrade — auto-swap orchestrator to Pro when conversation context exceeds ~4000 tokens
Build eval suite — nightly LLM-as-judge runs with trend tracking
MCP-ify tools that need independent deployment cadence

What Stays the Same

Aspect	v0	v0+	Why no change
Candidate app	Frontend + data entry	Frontend + data entry	Separate concerns. Agent logic lives in qwestly-agent.
Deployment	Vercel Pro Team (two apps)	Vercel Pro Team (two apps)	Same platform, independent deploys.
Database	MongoDB	MongoDB	Handles everything — structured data, memory, vector search.
API framework	FastAPI (new app)	FastAPI	Python-native, pairs with Pydantic AI.
Agent framework	Pydantic AI	Pydantic AI + maybe LangGraph sub-graphs for HITL	LangGraph is additive, not a replacement.
Frontend	Candidate app's existing UI + chat widget	Same	Chat widget calls qwestly-agent API.

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
Vercel 300s timeout hit on complex multi-step agents	Low	Medium	Monitor agent runtimes. If consistently >200s, extract the agent loop to a dedicated backend (Railway/Fly, ~$15/mo).
Cold starts (3-8s) frustrate users on first message	Medium	Low	Show "Loading Qwestly..." with a spinner. Use keepalive ping if needed.
Orchestrator picks the wrong tool	Medium	High	Invest heavily in tool descriptions. Add a confirmation step for ambiguous intents. Build eval dataset.
LLM costs scale unexpectedly	High	Medium	Set max 10 tool calls per request. Cache profile data. Log cost per conversation. Use model tiering.
LLM hallucinates user data in generated cards/sections	Medium	High	Provide full profile context in the prompt. Validate output against known facts. Add a "facts check" step in generation tools.
Memory extraction produces low-quality memories	Medium	Medium	Start with high-confidence extraction only. Log memory quality manually for first 100 extractions. Tune the extraction prompt.
Atlas Vector Search latency at scale	Low (for v0)	Medium	Keep index size manageable. Add `limit` + `numCandidates` tuning. $vectorSearch is fast for <1M vectors.

Open Questions (Answer These When You Build)

~~Multi-tenancy~~ → Resolved: Start with single user. The candidate app's provisioned accounts are a different concept (admin-created accounts, not team data partitioning). If multi-tenancy is needed later, add team_id to every query — but that's a schema migration across both apps, not a v0 concern.
Candidate app API auth for external calls: Currently, qwestly-agent reads structured data directly from MongoDB. For writes and LinkedIn ingestion, it calls candidate app APIs. In v0, the chat UI lives in the candidate app so cross-app auth isn't a concern. If these endpoints need to be called from elsewhere in the future, they'll need auth (likely the same JWT shared-secret pattern). Flagged for follow-up.

Resolved Questions

Question	Decision
Auth between apps?	Candidate app creates short-lived JWT (shared secret `QWESTLY_AGENT_SHARED_SECRET`) with `user_id`, `email`, `name`. qwestly-agent verifies it on every request.
Chat persistence?	Yes — qwestly-agent stores conversations in its own `agent_conversations` collection in MongoDB. Persists across browser sessions via `conversation_id`.
Memory ownership?	qwestly-agent owns `user_memories` and `agent_conversations`. Does not reuse candidate app's `chatbot_sessions`/`chatbot_messages`.
Model routing?	Two-tier: DeepSeek v4 Flash for orchestrator + basic tasks, DeepSeek v4 Pro for content generation (cards, About sections). Assigned per-agent/per-tool. Threshold-based upgrade to Pro for long-context sessions (v0.3).
LinkedIn API?	qwestly-agent calls candidate app's `/api/linkedin/profile` endpoint. Candidate app owns the third-party integration (fetching, retries, caching, force refresh). qwestly-agent never calls the third-party API directly.
Card format / data?	Qwestly Cards are already structured in candidate app's models (`QwestlyCardSection`, `QCNoInterviewSection`, etc.). qwestly-agent fetches card data via candidate app APIs or direct MongoDB reads. The format is a solved problem.
Within-session context?	Sliding window (15 messages) for v0. Memory tier preserves cross-session context.
Error handling?	Tool errors → agent sees + retries. LLM failures → graceful user message. Hard cap: 10 tool calls.
Multi-tenancy?	Start with single user. No `team_id` partitioning for v0. Revisit if a team/recruiter plan is introduced.
