_private/qwestly-docs/Features/qwestly-agent/qwestly-agent-orchestration-plan.md

Qwestly Orchestration Plan — Concrete Proposal

A specific, actionable architecture for Qwestly's agentic system. Updated with confirmed decisions: separate Python app, Pydantic AI, three-tier data architecture (structured data, unstructured memory, RAG), direct MongoDB access.


Executive Summary

Build a new, separate Python app (qwestly-agent) using Pydantic AI as the agent framework. It receives user requests via a chat endpoint, routes intent to tool calls, and streams responses. The existing candidate app (Next.js/TS) serves as the frontend and data entry surface — the agent system is a completely separate concern. Data lives in MongoDB — operational data, unstructured memory, and vector search (RAG) all in one place via Atlas. Start with a single orchestrator + tools, add specialist sub-agents as the prompt grows too complex.


System Architecture

graph TD
    subgraph candidate["candidate app<br/>(Next.js / TypeScript)"]
        direction TB
        C1[Frontend - chat UI, etc.]
        C2[Data entry / management]
        C3[Auth - Auth0]
        C4[Owns schema - Mongoose]
        C5[Exposes write APIs]
    end
    subgraph agent["qwestly-agent app<br/>(FastAPI / Python)"]
        direction TB
        A1[POST /api/chat - SSE]
        A2[Pydantic AI Orchestrator]
        A3[Tools + sub-agents]
        A4[Direct MongoDB reads]
        A5[Memory extraction - async]
    end
    subgraph mongo["MongoDB (Atlas)"]
        direction TB
        M1[Structured - users, cards, preferences]
        M2[Memory - user_memories collection]
        M3[RAG/Vectors - knowledge_chunk + index]
    end
    candidate -->|"API calls"| agent
    candidate --> mongo
    agent --> mongo

This is a new app, not a modification of candidate. The agent system has its own deploy cadence, its own dependencies, and its own concerns. The candidate app exposes a few API endpoints for data writes (saving cards, updating profiles) — everything else happens in qwestly-agent.


Phase 1: Foundation (v0)

Core Architecture

graph TD
    chat["Chat UI<br/>(candidate app)"] -->|"POST /api/chat"| fastapi["FastAPI on Vercel<br/>(qwestly-agent)"]
    fastapi --> orch["Pydantic AI Orchestrator"]
    orch --> profile["get_user_profile"]
    orch --> card["generate_qwestly_card"]
    orch --> knowledge["query_user_knowledge"]
    profile --> mongo["MongoDB + Atlas Vector Search<br/>(structured + memory + RAG)"]
    card --> mongo
    knowledge --> mongo

Framework Choice: Pydantic AI (CONFIRMED)

| Why | Detail |
| --- | --- |
| Python-native | FastAPI + Pydantic AI share the Pydantic schema DNA. Your existing request/response models plug right in. |
| Type safety | Every tool input, tool output, and agent result is validated by Pydantic, so type mismatches surface at the boundary instead of deep in a request. |
| Multi-provider | DeepSeek v4 Flash for routing/basic tasks, DeepSeek v4 Pro for complex generation. Switching is a one-line config change per agent. |
| Structured outputs | Card generation returns a typed CardResult model. Malformed LLM output fails validation instead of reaching the caller. |
| Testable | Built-in test model for deterministic agent tests without calling a real LLM. |
| No new infra | Pure Python, no native deps, runs fine on Vercel's Python runtime. |

Model Routing (CONFIRMED)

Two models, assigned by task complexity. This is implemented up front, not deferred.

| Model | Tier | What it handles | Why |
| --- | --- | --- | --- |
| DeepSeek v4 Flash | Default (chat) | Orchestrator routing, intent classification, memory extraction, list_capabilities, simple Q&A | Fast, cheap, sufficient for classification/routing |
| DeepSeek v4 Pro | Premium (pro) | Card generation, About section writing, multi-step reasoning, content that users will read and judge | Higher quality output for user-facing content |

Routing logic: The model is assigned per-agent or per-tool, not chosen by the orchestrator at runtime:

from pydantic_ai import Agent

# Orchestrator itself: Flash for fast routing.
# Tools are registered by the decorators below, so no tools=[...] list is passed here.
orchestrator = Agent(
    "deepseek:deepseek-v4-flash",
    system_prompt=ORCHESTRATOR_PROMPT,
)

# Content generation tools use Pro internally.
# tool_plain is used because these tools take no RunContext argument.
@orchestrator.tool_plain
async def generate_qwestly_card(user_id: str, style: str = "standard") -> CardResult:
    """Generate a Qwestly Card. This tool internally uses DeepSeek v4 Pro
    for high-quality content generation."""
    card_agent = Agent(
        "deepseek:deepseek-v4-pro",  # Pro for content
        system_prompt=CARD_GENERATION_PROMPT,
        result_type=CardResult,  # named output_type in newer Pydantic AI releases
    )
    result = await card_agent.run(...)
    # run() returns a result wrapper; .data (.output in newer releases) holds the CardResult
    return result.data

@orchestrator.tool_plain
async def suggest_linkedin_about(user_id: str) -> str:
    """Generate an improved LinkedIn About section. Uses Pro for quality."""
    about_agent = Agent(
        "deepseek:deepseek-v4-pro",  # Pro for content
        system_prompt=ABOUT_GENERATION_PROMPT,
    )
    result = await about_agent.run(...)
    return result.data  # plain text when no result type is set

Token threshold upgrade: When a conversation exceeds ~4000 tokens of accumulated context, the orchestrator itself can be swapped to Pro for the remainder of the session. The increased context length and complexity warrant the more capable model. This is a v0.3 optimization — start with static assignment per tool, add threshold-based upgrade once you have latency/cost data.
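The threshold-based upgrade could look roughly like the sketch below. The helper names and the 4-characters-per-token estimate are assumptions for illustration only; a real implementation would more likely read token counts from the provider's usage data.

```python
FLASH_MODEL = "deepseek:deepseek-v4-flash"
PRO_MODEL = "deepseek:deepseek-v4-pro"
UPGRADE_THRESHOLD_TOKENS = 4000  # the ~4000-token guideline from the text


def estimate_tokens(messages: list[str]) -> int:
    """Rough estimate at ~4 characters per token (an assumption, not a tokenizer)."""
    return sum(len(m) for m in messages) // 4


def pick_orchestrator_model(messages: list[str], already_upgraded: bool = False) -> str:
    """Return Pro once the session crosses the threshold and keep it afterwards."""
    if already_upgraded or estimate_tokens(messages) > UPGRADE_THRESHOLD_TOKENS:
        return PRO_MODEL
    return FLASH_MODEL
```

Passing `already_upgraded=True` on later turns keeps the session on Pro even if trimming shrinks the context again, which matches "for the remainder of the session".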

Cost impact: Flash is ~10x cheaper than Pro. Routing with Flash + generating with Pro keeps the common case (simple questions, memory extraction) cheap while spending on what users actually see (cards, content).
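To make the trade-off concrete, here is a small arithmetic sketch. Only the ~10x ratio comes from the text; the absolute prices and token counts are placeholders, not real DeepSeek pricing.

```python
# Relative price per 1K tokens. Placeholder units that preserve the ~10x ratio.
PRICE_PER_1K = {"flash": 1.0, "pro": 10.0}


def conversation_cost(usage: dict[str, int]) -> float:
    """Sum the cost of a conversation given tokens used per model tier."""
    return sum(PRICE_PER_1K[tier] * tokens / 1000 for tier, tokens in usage.items())


# Routing and chat on Flash, one card generated on Pro
tiered = conversation_cost({"flash": 6000, "pro": 2000})
# The same conversation run entirely on Pro
all_pro = conversation_cost({"pro": 8000})
```

With these placeholder numbers the tiered conversation costs 26 units against 80 for an all-Pro run.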

Tools (v0)

| Tool | What it does | Backed by |
| --- | --- | --- |
| get_user_profile(user_id) | Returns structured profile data from MongoDB (reads from candidates_enhanced, CandidateProfileCache) | Direct MongoDB query |
| ingest_linkedin_profile(linkedin_url) | Fetches LinkedIn data for a user. Calls the candidate app's /api/linkedin/profile endpoint, which handles fetching, retries, caching, and force refresh. qwestly-agent does NOT call the third-party API directly. | Candidate app API |
| suggest_linkedin_about(user_id) | Generates an improved LinkedIn About section draft | New: LLM call with profile context |
| generate_qwestly_card(user_id, style?) | Creates a Qwestly Card (structured career document) | New LLM pipeline, likely agentic internally |
| query_user_knowledge(user_id, question) | Semantic search across everything about a user (notes, history, resumes, interviews) | Atlas Vector Search + memory search |
| query_qwestly_docs(question) | Answers "how does Qwestly work?" questions | Atlas Vector Search over internal docs |
| search_user_memories(user_id, query) | Semantic search over synthesized memories (preferences, goals, facts, decisions) | user_memories collection + vector search |
| list_capabilities() | Returns what Qwestly can do | Static list or config |

Data Architecture (Three Tiers)

This is the key architectural decision. Qwestly's data falls into three distinct tiers, each with different storage strategies and query patterns.

Tier 1: Structured Data

What it is: Database fields — user profiles, employment history, education, preferences, Qwestly Cards. Deterministic, exact queries.

Storage: MongoDB collections (candidates_enhanced, CandidateProfileCache, employment_stints_enhanced, preferences_enhanced, qc_sections, etc.)

Owned by: Candidate app (Mongoose schemas are source of truth)

Accessed by qwestly-agent via: Direct MongoDB reads (read-only). Tool calls for exact lookups.

Query pattern: get_user_profile(user_id) — direct document lookup. Deterministic, exact, fast.

Tier 2: Unstructured Memory (NEW — see unstructured-memory.md)

What it is: LLM-extracted, synthesized memories from conversations. Discrete facts, preferences, goals, and decisions the agent learns about a user over time. Not raw transcripts — distilled, deduplicated, structured knowledge.

Storage: MongoDB user_memories collection + vector index for semantic retrieval

Owned by: qwestly-agent (extracts and manages memories)

Query pattern: Loaded at session start (top memories by importance + recency) and searchable via tool during conversation.

Key properties:

  • Each memory is a discrete document: {type, content, importance, source, user_id, created_at}
  • Extracted asynchronously after conversations (LLM reviews transcript → extracts memories)
  • Consolidated periodically (merge similar memories)
  • Injected into system prompt at session start

This is the gap the candidate app doesn't fill. See unstructured-memory.md for the full design.
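A minimal sketch of the memory document and the "top memories by importance + recency" selection follows. The field list mirrors the shape given above; the scoring formula, the 30-day half-life, and the 0..1 importance range are assumptions, and unstructured-memory.md remains the authoritative design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    type: str          # e.g. "preference", "goal", "fact", "decision"
    content: str
    importance: float  # assumed to lie in 0..1
    source: str
    user_id: str
    created_at: datetime


def memory_score(m: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Importance weighted by an exponential recency decay (hypothetical formula)."""
    age_days = (now - m.created_at).total_seconds() / 86400
    return m.importance * 0.5 ** (age_days / half_life_days)


def top_memories(memories: list[Memory], now: datetime, k: int = 20) -> list[Memory]:
    """Return the k highest-scoring memories for injection at session start."""
    return sorted(memories, key=lambda m: memory_score(m, now), reverse=True)[:k]


# Usage: an important but stale fact versus a fresher, less important goal
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
old_fact = Memory("fact", "Worked in fintech", 0.9, "chat", "u1", now - timedelta(days=90))
new_goal = Memory("goal", "Wants a staff role", 0.5, "chat", "u1", now - timedelta(days=1))
best = top_memories([old_fact, new_goal], now, k=1)
```

Under this decay the 90-day-old fact scores lower than the day-old goal, so the goal is selected first.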

Tier 3: RAG (Vector Search over Raw Content)

What it is: Semantic search over unstructured raw content — resumes, interview transcripts, uploaded documents, full chat transcripts. Not synthesized, not distilled — the original text.

Storage: MongoDB knowledge_chunk collection + Atlas Vector Search index (already exists in candidate app)

Owned by: Candidate app (document ingestion pipeline) — qwestly-agent queries the same index

Query pattern: query_user_knowledge(user_id, question) — embed question → $vectorSearch → return top-k chunks
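The query step could be expressed as an aggregation pipeline like the sketch below. The index name comes from this document; the `embedding` path, the `text` field, and the `user_id` filter field are hypothetical names and would need to match the real knowledge_chunk schema and index definition.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    user_id: str,
    k: int = 5,
    index_name: str = "knowledge_chunk_vector_index",
    path: str = "embedding",  # hypothetical field holding the chunk embedding
) -> list[dict]:
    """Build an aggregation pipeline for Atlas $vectorSearch scoped to one user."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": k * 20,  # oversample candidates, then keep the top k
                "limit": k,
                # Only works if user_id is declared as a filter field in the index
                "filter": {"user_id": {"$eq": user_id}},
            }
        },
        # "text" is a hypothetical field name for the chunk body
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

The returned list can be passed to a collection's `aggregate` call; the embedding of the question has to come from the same model used at ingestion time.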

Already implemented in candidate app: Document upload → Python parsing API → chunking (1500 chars, 200 overlap) → OpenAI text-embedding-3-small (1536d) → Atlas $vectorSearch with cosine similarity.
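For reference, fixed-size chunking with overlap can be sketched as follows. This is an illustration of the 1500/200 parameters only; the candidate app's actual chunker may split differently (for example on sentence boundaries).

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Usage: a 3000-character document yields three chunks
sample = "".join(str(i % 10) for i in range(3000))
sample_chunks = chunk_text(sample)
```

Each chunk starts 1300 characters after the previous one, so the last 200 characters of one chunk repeat at the start of the next.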

Tier Comparison

| Aspect | Structured Data | Unstructured Memory | RAG |
| --- | --- | --- | --- |
| What it stores | Profile fields, employment, education, cards | Synthesized facts, preferences, goals, decisions | Raw documents, transcripts, resumes |
| Processing | Direct from user input / LinkedIn API | LLM extraction + consolidation after conversations | Chunking + embedding pipeline |
| Query | Exact lookup by field/key | Semantic search + recency/importance ranking | Semantic search (vector similarity) |
| Accuracy | Deterministic, exact | High-signal (distilled), may have gaps | Noisy (raw text), comprehensive |
| Updates | On user action / LinkedIn ingest | Async after each conversation | On document upload |
| Use case | "What's my current title?" | "What writing style do I prefer?" | "What did my interview transcript say about leadership?" |
| Collection | candidates_enhanced, CandidateProfileCache, etc. | user_memories (NEW) | knowledge_chunk (exists) |

Data Flow: Full Query Lifecycle

When a user asks "What do we know about me?":

graph TD
    Q["User asks: What do we know about me?"] --> O["Orchestrator fuses all three in LLM context"]
    O --> T1["1. get_user_profile<br/>Structured data from<br/>candidates_enhanced, CandidateProfileCache<br/>→ Deterministic, exact<br/>→ Scaffold: name, roles, companies"]
    O --> T2["2. search_user_memories<br/>Synthesized memories from user_memories<br/>→ High-signal distilled knowledge<br/>→ Preferences, goals, past decisions"]
    O --> T3["3. query_user_knowledge<br/>Raw text chunks from knowledge_chunk via vectorSearch<br/>→ Comprehensive but noisy<br/>→ Specific quotes, detailed history"]
    T1 --> R["4. LLM synthesizes final answer"]
    T2 --> R
    T3 --> R
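The fan-out in this lifecycle can be illustrated with a direct concurrent fetch. In the planned design the orchestrator issues these as tool calls; the sketch below is a simplification with stubbed tool bodies and a hypothetical `gather_user_context` helper.

```python
import asyncio


async def get_user_profile(user_id: str) -> dict:
    return {"name": "Example User", "title": "Engineer"}  # stub data


async def search_user_memories(user_id: str, query: str) -> list[str]:
    return ["Prefers concise writing"]  # stub data


async def query_user_knowledge(user_id: str, question: str) -> list[str]:
    return ["Resume excerpt: led a team of five"]  # stub data


async def gather_user_context(user_id: str, question: str) -> dict:
    """Fetch all three tiers concurrently and label each result by its source."""
    profile, memories, chunks = await asyncio.gather(
        get_user_profile(user_id),
        search_user_memories(user_id, question),
        query_user_knowledge(user_id, question),
    )
    return {"structured": profile, "memory": memories, "rag": chunks}


context = asyncio.run(gather_user_context("u1", "What do we know about me?"))
```

Keeping the source label on each result lets the final synthesis step weigh exact structured data above distilled memories and noisy raw chunks.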

Data Layer Summary

| What | Where | Notes |
| --- | --- | --- |
| User profiles | MongoDB (candidates_enhanced, CandidateProfileCache, etc.) | Already exists. Read by qwestly-agent directly. |
| LinkedIn data | MongoDB (linkedin_profiles, linkedin_summaries) | Already exists. Flexible schema handles nested LinkedIn JSON naturally. |
| Generated artifacts | MongoDB (qc_sections, linkedin_profile_suggestions, etc.) | Already exists. |
| Chat history | MongoDB (chatbot_sessions, chatbot_messages) | Already exists in candidate app. New sessions are logged by qwestly-agent into its own agent_conversations collection. |
| Unstructured memory | MongoDB (user_memories), NEW | LLM-extracted facts, preferences, goals. Owned by qwestly-agent. Needs Atlas Vector Search index (user_memories_vector_index, 1536d, cosine) on M10+. For local dev: db.user_memories.createIndex({content: "text"}) as fallback. |
| RAG/vector search | MongoDB Atlas Vector Search on knowledge_chunk | Already exists. Index name: knowledge_chunk_vector_index, 1536d, cosine similarity. |
| Embeddings | OpenAI text-embedding-3-small (1536d) | Already in use by candidate app. qwestly-agent uses same model/dimensions. |

Chat Interface (v0)

  • Chat widget in the candidate app (React/Next.js)
  • Calls POST /api/chat on the qwestly-agent FastAPI app
  • Streams responses via Server-Sent Events (SSE)
  • Start simple: plain text responses. Upgrade to structured action cards later.

Streaming on FastAPI + Vercel

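from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

router = APIRouter()  # assumed to be mounted under the /api prefix


class ChatRequest(BaseModel):
    message: str
    user_id: str


@router.post("/chat")
async def chat(request: ChatRequest):
    # QwestlyOrchestrator is the app's own wrapper around the Pydantic AI agent
    agent = QwestlyOrchestrator(user_id=request.user_id)

    async def event_stream():
        async for chunk in agent.run_stream(request.message):
            # Each chunk is assumed to be a Pydantic model; emit one SSE frame per chunk
            yield f"data: {chunk.model_dump_json()}\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "X-Accel-Buffering": "no",   # ask proxies not to buffer the stream
            "Cache-Control": "no-cache",
        },
    )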

Timeout: Set maxDuration: 300 in Vercel config. A typical agent loop runs ~15-45s.


Auth & Session Management

The candidate app authenticates users via Auth0. qwestly-agent needs to know who is talking to it without re-implementing authentication. The pattern: the candidate app creates a short-lived JWT carrying the user context, and qwestly-agent verifies it with a shared secret.

Flow

sequenceDiagram
    participant U as User
    participant C as Candidate App<br/>(Auth0-authenticated)
    participant Q as qwestly-agent<br/>POST /api/chat
    participant O as Orchestrator
    U->>C: Authenticated request
    C->>C: Create JWT with<br/>{user_id, email, name}<br/>signed with QWESTLY_AGENT_SHARED_SECRET
    C->>Q: Forward request + JWT
    Q->>Q: Verify JWT using shared secret<br/>Extract user_id → RunContext
    Q->>O: Run with user context

Implementation

Candidate app side (creates JWT before calling qwestly-agent):

// In the chat API route or server component
import jwt from 'jsonwebtoken';

const agentToken = jwt.sign(
  {
    user_id: session.user.sub,       // Auth0 user ID
    email: session.user.email,
    name: session.user.name,
  },
  process.env.QWESTLY_AGENT_SHARED_SECRET,
  // Short-lived. Mint a fresh token for each forwarded request,
  // because qwestly-agent verifies the token on every call.
  { algorithm: 'HS256', expiresIn: '5m' }
);

qwestly-agent side (verifies JWT on every request):

import os

import jwt  # PyJWT
from fastapi import APIRouter, Depends, Header, HTTPException

router = APIRouter()
AGENT_SHARED_SECRET = os.environ["QWESTLY_AGENT_SHARED_SECRET"]

async def verify_agent_token(authorization: str = Header(...)) -> dict:
    """Verify the JWT from the candidate app. Returns user context."""
    # Strip the "Bearer " scheme prefix if present
    token = authorization.removeprefix("Bearer ")
    try:
        # Pinning the algorithm list rejects tokens signed any other way
        payload = jwt.decode(token, AGENT_SHARED_SECRET, algorithms=["HS256"])
        return payload  # { user_id, email, name } plus standard claims such as exp
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")

# Usage in route
@router.post("/chat")
async def chat(request: ChatRequest, user: dict = Depends(verify_agent_token)):
    agent = QwestlyOrchestrator(user_id=user["user_id"])
    ...

Why this over alternatives

| Approach | Why not |
| --- | --- |
| Pass user_id as plain header | No integrity check: any caller could impersonate any user |
| Candidate proxies all LLM calls | Defeats the purpose of a separate agent app. Tight coupling. |
| qwestly-agent validates Auth0 tokens directly | Requires Auth0 SDK + same client secret in two apps. A shared-secret JWT is simpler. |
| API key only (no user context) | Doesn't tell the agent who the user is, so it can't scope queries. |

Session startup

On first message in a session, qwestly-agent:

  1. Verifies the JWT → gets user_id
  2. Loads the user's profile from MongoDB (structured data)
  3. Loads top 20 memories (memory tier)
  4. Injects both into the orchestrator's system prompt
  5. Begins the conversation

Subsequent messages in the same session reuse the loaded context (no need to re-fetch profile/memories unless the session is long — see conversation management below).
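Steps 2 to 4 of the startup sequence amount to assembling one system prompt. A minimal sketch, assuming the profile arrives as a flat dict and memories as plain strings (both simplifications), with a hypothetical `build_session_prompt` helper:

```python
def build_session_prompt(
    base_prompt: str,
    profile: dict,
    memories: list[str],
    max_memories: int = 20,
) -> str:
    """Append the user's profile and top memories to the base system prompt."""
    profile_lines = [f"- {key}: {value}" for key, value in profile.items()]
    memory_lines = [f"- {m}" for m in memories[:max_memories]]
    return "\n".join(
        [base_prompt, "", "User profile:", *profile_lines, "", "Known memories:", *memory_lines]
    )


prompt = build_session_prompt(
    "You are Qwestly.",
    {"name": "Ada", "title": "Engineer"},
    [f"memory {i}" for i in range(30)],
)
```

The result is built once per session and reused on later turns, which is why the memory cap matters: everything injected here is paid for on every request.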


Within-Session Conversation Management

Each turn adds tokens to the context window. After 20+ turns, the context can overflow the LLM's limit or degrade performance. Cross-session context is handled by the memory system. Within-session, we need a strategy.

v0: Sliding Window

Keep the last 15 messages in context. Drop older ones. Simple, works for most sessions.

MAX_SESSION_MESSAGES = 15

def trim_context(messages: list) -> list:
    """Keep only the most recent messages within the limit."""
    if len(messages) > MAX_SESSION_MESSAGES:
        # Always keep the system prompt (index 0)
        return [messages[0]] + messages[-(MAX_SESSION_MESSAGES - 1):]
    return messages

Why this is safe: The memory extraction pipeline (running async after the session) captures important facts, preferences, and decisions into the memory tier. So even if within-session context is trimmed, the agent still "remembers" what matters across sessions.

v1: Summarization

When sessions consistently exceed 20 turns, add periodic summarization: compress messages 1-15 into a single summary message, keep messages 16-20 live. An LLM call generates the summary. Keeps the best of both — compact history + recent detail.
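The compression step might look like the sketch below. The `summarize` callable stands in for the LLM call, the message list is assumed to exclude the system prompt, and `compress_history` is a hypothetical name.

```python
from typing import Callable


def compress_history(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 5,
    trigger: int = 20,
) -> list[str]:
    """Once history reaches `trigger` messages, replace all but the last `keep_recent` with one summary."""
    if len(messages) < trigger:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[Summary of earlier conversation] {summarize(older)}"] + recent


# Usage with a stand-in summarizer instead of a real LLM call
history = [f"m{i}" for i in range(1, 21)]
compressed = compress_history(history, lambda old: f"{len(old)} messages condensed")
```

With 20 messages this collapses messages 1-15 into one summary entry and keeps messages 16-20 verbatim, matching the split described above.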

v2: RAG over History

Store every turn in MongoDB. When the agent needs historical context from earlier in the session, it calls query_user_knowledge — same RAG tool it uses for documents. The agent decides what history it needs, when it needs it.


Error Handling

Agents are multi-step, multi-system operations. Things will fail. The plan for each failure mode:

Tool failures

Pydantic AI surfaces tool errors to the agent automatically. If get_user_profile throws, the agent sees the error message and can:

  • Retry (if it looks transient, like a timeout)
  • Try an alternative (e.g., search memories if profile data is unavailable)
  • Apologize + explain (if the data is genuinely unavailable)

No special infrastructure needed — this is built into the Pydantic AI agent loop.

LLM call failures

If the LLM API returns an error (rate limit, timeout, provider outage), the agent loop catches it and returns a graceful response:

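from pydantic_ai.exceptions import ModelHTTPError, UnexpectedModelBehavior

async def respond(message, deps, conversation_id):
    try:
        result = await orchestrator.run(message, deps=deps)
        yield result.data  # .output in newer Pydantic AI releases
    except (ModelHTTPError, UnexpectedModelBehavior) as e:
        # Log the full error for debugging
        await log_error(conversation_id, e)
        # Return a user-friendly message (not raw error)
        yield "I'm having trouble processing that right now. Can you try again?"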

Infinite loops

Set a hard limit on tool calls per request: max 10 tool calls. The orchestrator's system prompt also instructs it to give a final answer within a reasonable number of steps.

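from pydantic_ai.usage import UsageLimits

result = await orchestrator.run(
    message,
    deps=deps,
    # Hard cap enforced by Pydantic AI: at most 10 model requests per run,
    # which bounds the tool-call loop. Raises UsageLimitExceeded when hit.
    usage_limits=UsageLimits(request_limit=10),
)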

Vercel timeout

The 300s ceiling is the ultimate safety net. If the agent hasn't finished by then, Vercel kills the function and the user sees an error. Mitigated by: monitoring P95 latency, triggering alerts if consistently >200s, moving to a dedicated backend if needed.
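The P95 check can be computed from logged latencies with a nearest-rank percentile. A sketch, assuming latencies are collected in seconds; the helper names are hypothetical and the 200s threshold is the one stated above.

```python
def p95(latencies_s: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies in seconds."""
    ordered = sorted(latencies_s)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def should_alert(latencies_s: list[float], threshold_s: float = 200.0) -> bool:
    """True when the P95 latency exceeds the alert threshold."""
    return p95(latencies_s) > threshold_s


# One slow request out of twenty does not move the P95; two do
one_slow = should_alert([30.0] * 19 + [250.0])
two_slow = should_alert([30.0] * 18 + [250.0, 250.0])
```

Alerting on P95 instead of the maximum avoids paging on a single outlier while still catching a sustained drift toward the 300s ceiling.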

Summary

| Failure mode | v0 approach |
| --- | --- |
| Tool failure | Agent sees error → retries or apologizes (Pydantic AI built-in) |
| LLM call failure | Catch model errors → log full error → return user-friendly message |
| Infinite loop | Hard cap of 10 per request (Pydantic AI enforces) |
| Vercel timeout | 300s ceiling. Monitor P95. Move to dedicated backend if >200s. |

Key Design Decisions

  1. Separate Python app, not added to candidate. The agent system has its own deploy cadence, dependencies, and concerns. Candidate app is the frontend + data entry surface.

  2. JWT-based auth between apps. Candidate app creates a short-lived JWT (shared secret) with user context. qwestly-agent verifies it. No Auth0 SDK duplication, no plaintext user IDs.

  3. Direct MongoDB reads, write through candidate app APIs. qwestly-agent reads structured data directly from MongoDB. For writes (saving a card, updating a profile), it calls candidate app API endpoints so validation/business logic stays in one place.

  4. One orchestrator, no sub-agents (for v0). Every capability is a tool the orchestrator can call. Resist splitting into sub-agents until the orchestrator's prompt exceeds ~100 lines or routing logic gets genuinely complex.

  5. Tools that internally use LLMs are fine. suggest_linkedin_about and generate_qwestly_card will be agentic internally. The orchestrator doesn't know or care — it sees a tool call.

  6. Three-tier data access. Structured profile data (tool calls) + synthesized memory (session start injection + tool) + raw RAG (tool). Each serves a different purpose. The orchestrator fuses results.

  7. Memory extraction is async. After each conversation, an LLM reviews the transcript and extracts memories. This doesn't block the user — it runs as a background process.

  8. Sliding window for within-session context. Keep last 15 messages. Memory tier preserves important details across sessions. Upgrade to summarization when sessions grow long.

  9. Stream everything, including tool calls. The user should see the agent's reasoning: "Searching your profile...", "Generating card...", then the final answer streaming in.

  10. Log every interaction to MongoDB. User message, every tool call (name + args + result), final response, latency, token count. This data is your debugging lifeline and your eval dataset.

  11. Graceful error handling. Tool errors exposed to agent for retry. LLM failures caught and returned as user-friendly messages. Hard cap of 10 tool calls per request.
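Decision 10 implies a log record roughly like the one sketched here. All field names are assumptions for illustration; only the list of things to capture (user message, tool calls with name, args and result, final response, latency, token count) comes from the text.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCallLog:
    name: str
    args: dict
    result: str


@dataclass
class InteractionLog:
    conversation_id: str
    user_message: str
    final_response: str
    latency_ms: int
    token_count: int
    tool_calls: list[ToolCallLog] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def to_document(log: InteractionLog) -> dict:
    """Convert the log, including nested tool calls, to a plain dict for a MongoDB insert."""
    return asdict(log)


doc = to_document(
    InteractionLog(
        conversation_id="c1",
        user_message="What's my current title?",
        final_response="You are a Staff Engineer.",
        latency_ms=1200,
        token_count=350,
        tool_calls=[ToolCallLog("get_user_profile", {"user_id": "u1"}, "ok")],
    )
)
```

Storing tool calls as structured entries instead of free text is what makes the later eval dataset and mis-routing audits straightforward to query.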


Phase 2: Sub-Agents (v1 — When the Prompt Gets Too Long)

Don't build this until you need it. You'll know when:

  • The orchestrator's system prompt exceeds ~100 lines
  • Tool descriptions are paragraphs long and still getting confused
  • You find yourself writing "if this, delegate to that" logic in the prompt

New Topology

graph TD
    U["User"] --> O["Orchestrator<br/>(routes intent, no longer does the work)"]
    O --> LA["LinkedIn Agent"]
    O --> QA["QA Agent"]
    O --> CG["Card Gen Agent"]
    O --> KA["Knowledge Agent"]

Sub-Agent Pattern (Agent-as-Tool)

Each sub-agent is a Pydantic AI Agent wrapped behind a tool. The orchestrator still calls tools — some tools just happen to run an LLM loop internally:

@orchestrator.tool
async def delegate_card_generation(
    ctx: RunContext[AgentDeps],
    user_id: str,
    style: str = "standard",
) -> CardResult:
    """Generate a Qwestly Card. Handles the full multi-step pipeline:
    profile analysis -> section drafting -> formatting.

    This is a complex operation that runs its own agent loop internally.
    The orchestrator just gets back the final CardResult.
    """
    card_agent = Agent[AgentDeps](
        model="deepseek:deepseek-v4-pro",  # Pro tier, per the confirmed model routing
        system_prompt=CARD_AGENT_SYSTEM_PROMPT,
        deps_type=AgentDeps,
        result_type=CardResult,
        tools=[get_user_profile, format_card_text],
    )
    result = await card_agent.run(
        f"Generate a {style} Qwestly Card for user {user_id}",
        deps=ctx.deps,
    )
    return result.data

Which Sub-Agents, and When

| Sub-agent | Split trigger | Key responsibility |
| --- | --- | --- |
| LinkedIn Agent | LinkedIn ingestion grows to handle multiple sources, merge conflicts, partial updates | Fetch, parse, transform LinkedIn data; handle merge with existing profile data |
| Card Generator Agent | Card generation has 3+ templates, multi-section writing, formatting pipeline | Profile analysis -> template selection -> section writing -> formatting -> review |
| Profile QA Agent | Users ask complex analytical questions about their data that need multi-hop reasoning | Gets profile + RAG results + synthesizes answers that require reasoning across data sources |
| Knowledge Agent | Qwestly's internal documentation gets large enough to need its own RAG pipeline | RAG over internal docs, answers "how does X work" questions |

Phase 3: HITL + MCP (v2)

Human-in-the-Loop for Write / Public-Facing Actions

By default, the agent generates drafts freely. But any action that writes data, changes visible state, or affects something public-facing requires user confirmation.

The pattern is a class of actions, not a specific tool. Any tool that mutates state follows the same gating:

@agent.tool
async def save_card_as_active(
    ctx: RunContext[AgentDeps],
    user_id: str,
    card_id: str,
) -> ActionResult:
    """Save this Qwestly Card as the user's active/public card.

    CRITICAL: Never call this unless the user has explicitly confirmed.
    First show the card draft, let the user review it, and only call
    this after they say 'save it' or 'make it active'.
    """
    if not _user_confirmed_for_card(ctx, card_id):
        return ActionResult(
            status="needs_confirmation",
            preview=card_id,
            message="Ready to save this as your active card. Confirm?",
        )
    # ... write through candidate app API

Gating classification:

| Action class | HITL? | Examples |
| --- | --- | --- |
| Read anything | No | Profile lookup, knowledge search, memory search |
| Generate drafts | No | Card draft, About section draft, headline suggestions |
| Save / activate | Yes (tool-level) | Save card as active, update profile field, change preferences |
| Delete / destructive | Yes (tool-level minimum) | Delete card, remove data |

v0 UX: Inline chat confirmation. "Ready to save this as your active card. Confirm?" → user says yes/no.
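The gating classification and tool-level confirmation tracking could be modeled as below. This is a sketch under assumptions: `ActionClass`, `REQUIRES_CONFIRMATION`, and `ConfirmationTracker` are hypothetical names, and human-in-the-loop.md holds the actual design.

```python
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    GENERATE_DRAFT = "generate_draft"
    SAVE = "save"
    DELETE = "delete"


# Mirrors the gating table: only state-changing classes need confirmation
REQUIRES_CONFIRMATION = {ActionClass.SAVE, ActionClass.DELETE}


class ConfirmationTracker:
    """Tracks which (action, target) pairs the user has explicitly confirmed."""

    def __init__(self) -> None:
        self._confirmed: set[tuple[str, str]] = set()

    def confirm(self, action: str, target_id: str) -> None:
        self._confirmed.add((action, target_id))

    def may_execute(self, action_class: ActionClass, action: str, target_id: str) -> bool:
        """Reads and drafts always pass; writes pass only after confirmation of this exact target."""
        if action_class not in REQUIRES_CONFIRMATION:
            return True
        return (action, target_id) in self._confirmed
```

Keying confirmations on the target as well as the action means a "yes" for one card cannot be reused to activate a different card later in the session.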

v1 upgrade: Orchestrator-enforced gating (LangGraph interrupt_before) for higher-stakes actions.

See human-in-the-loop.md for the full HITL design.

MCP-ify Tools for Independent Deployment

When a tool needs to deploy independently, wrap it as an MCP server. Don't do this until you have 2+ services that genuinely need independent deploys. For v0, direct function calls are simpler and faster.


Technology Choices Summary

| Layer | Choice | Why it fits |
| --- | --- | --- |
| Language | Python 3.12+ | Pydantic AI + FastAPI + MongoDB driver. |
| Agent framework | Pydantic AI (v0) -> maybe LangGraph sub-graphs later | Type safety, structured outputs, multi-provider, built-in test model. |
| LLM provider | DeepSeek v4 Flash (default, routing, extraction) + DeepSeek v4 Pro (content generation, complex reasoning) | Two-tier model routing assigned per-agent/per-tool. Pydantic AI makes model swaps a one-line change. |
| API layer | FastAPI (new app, separate from candidate) | Python-native, SSE streaming, shared schema DNA with Pydantic AI. |
| Database | MongoDB (shared with candidate app) | Structured data + memory + vector search, all in one DB. |
| Vector search | MongoDB Atlas Vector Search (already implemented for resumes in candidate app) | Extend existing index to cover memory + chat logs + docs. |
| Embeddings | OpenAI text-embedding-3-small (1536d) via openai SDK | Already in use by candidate app. Same model, same dimensions. No langchain dependency. |
| Tool protocol | Direct Python functions (v0) -> MCP (v2) | Start simple. MCP-ify when a tool needs independent deployment. |
| Streaming | FastAPI StreamingResponse + SSE | Works on Vercel Python runtime. No WebSockets needed. |
| Deployment | Vercel Pro Team (separate app from candidate) | 300s timeout. Two apps on same team, independent deploys. |
| Frontend | Candidate app (Next.js/React) | Chat widget calls qwestly-agent's /api/chat. No new frontend. |
| Observability | Structured MongoDB logs (v0) -> Logfire (v1) | Agent debugging without traces is guesswork. |

Implementation Roadmap

v0.1: Skeleton

  • Scaffold qwestly-agent app — follow the existing api-python Vercel pattern: api/index.py entry point, @vercel/python builder, vercel.json with routes. Same MongoDB connection pattern (motor/pymongo).
  • Add Pydantic AI + agent dependencies: pydantic-ai plus existing stack (fastapi, motor, openai for embeddings)
  • Define tool interfaces: write the function signatures and descriptions for all 8 v0 tools
  • Implement get_user_profile and list_capabilities — direct MongoDB reads, static config
  • Wire up a minimal orchestrator agent — Pydantic AI Agent with those 2 tools, a basic system prompt
  • Add POST /api/chat route to FastAPI with SSE streaming
  • Verify end-to-end on Vercel: POST a message -> orchestrator calls tool -> response streams back
  • Create user_memories collection + vector index in MongoDB Atlas
    • Collection: user_memories in the candidate_portal database
    • Vector index (Atlas UI → Atlas Search → Create Search Index → JSON Editor):
      {
        "name": "user_memories_vector_index",
        "type": "vectorSearch",
        "definition": {
          "fields": [
            {
              "type": "vector",
              "path": "content_embedding",
              "numDimensions": 1536,
              "similarity": "cosine"
            }
          ]
        }
      }
      
    • Text index (fallback for local dev without Atlas vector search):
      db.user_memories.createIndex({ content: "text" });
      
    • ⚠️ Requires M10+ Atlas cluster. Won't work on M0/M2 free tier. For local dev, use the text index fallback (see README).

v0.2: Core Capabilities

  • Implement ingest_linkedin_profile — calls candidate app's /api/linkedin/profile endpoint via httpx
  • Implement suggest_linkedin_about — DeepSeek Pro generates an improved About section draft
  • Implement generate_qwestly_card — reads card status and available data sources from MongoDB; delegates to candidate app card endpoints
  • Implement search_user_memories — semantic search over user_memories with text fallback
  • Implement query_user_knowledge: $vectorSearch over user-scoped knowledge_chunk + text fallback
  • Implement query_qwestly_docs: $vectorSearch over internal docs + text fallback
  • Build memory extraction pipeline — MemoryService with extraction, retrieval, insertion (with embeddings), consolidation, cleanup
  • Add conversation logging — ConversationLogger writes to agent_conversations collection
  • Add session API: GET /api/conversations (list) and GET /api/conversations/{id} (history) with auto-naming from first message

v0.3: Hardening

  • Write unit tests — 48 tests: schemas, auth, config, tools, orchestrator, logging (pytest + pytest-asyncio)
  • Write integration tests with mocked LLM — system prompt coverage, tool registration, SSE streaming
  • Write 3-5 E2E tests with real LLM for critical paths — deferred to v1 (needs CI API keys)
  • Add observability — ConversationLogger captures tool calls, response text, latency to agent_conversations
  • Add cost tracking: lib/cost.py with DeepSeek Flash/Pro pricing
  • Add tool-level HITL: lib/hitl.py confirmation tracking, save_card_as_active example tool, HITL rules in system prompt
  • Version prompts — system prompts live as module constants; file extraction deferred
  • Deploy to Vercel — ready but not yet deployed

v1: Learn & Iterate

  • Collect real user interactions — what do they ask? Which tools get used?
  • Audit tool choice accuracy — is the orchestrator picking the right tool every time?
  • Improve tool descriptions based on actual mis-routings
  • Build an eval dataset from real conversations for regression testing
  • A/B test different system prompts to improve routing accuracy
  • Tune memory extraction — is the LLM extracting useful memories? Are they improving conversations?

v2+: Evolve

  • Evaluate sub-agent split — is the orchestrator prompt too long? Are tools getting confused?
  • Add token-threshold model upgrade — auto-swap orchestrator to Pro when conversation context exceeds ~4000 tokens
  • Build eval suite — nightly LLM-as-judge runs with trend tracking
  • MCP-ify tools that need independent deployment cadence

What Stays the Same

| Aspect | v0 | v0+ | Why no change |
| --- | --- | --- | --- |
| Candidate app | Frontend + data entry | Frontend + data entry | Separate concerns. Agent logic lives in qwestly-agent. |
| Deployment | Vercel Pro Team (two apps) | Vercel Pro Team (two apps) | Same platform, independent deploys. |
| Database | MongoDB | MongoDB | Handles everything: structured data, memory, vector search. |
| API framework | FastAPI (new app) | FastAPI | Python-native, pairs with Pydantic AI. |
| Agent framework | Pydantic AI | Pydantic AI + maybe LangGraph sub-graphs for HITL | LangGraph is additive, not a replacement. |
| Frontend | Candidate app's existing UI + chat widget | Same | Chat widget calls qwestly-agent API. |

Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Vercel 300s timeout hit on complex multi-step agents | Low | Medium | Monitor agent runtimes. If consistently >200s, extract the agent loop to a dedicated backend (Railway/Fly, ~$15/mo). |
| Cold starts (3-8s) frustrate users on first message | Medium | Low | Show "Loading Qwestly..." with a spinner. Use keepalive ping if needed. |
| Orchestrator picks the wrong tool | Medium | High | Invest heavily in tool descriptions. Add a confirmation step for ambiguous intents. Build eval dataset. |
| LLM costs scale unexpectedly | High | Medium | Set max 10 tool calls per request. Cache profile data. Log cost per conversation. Use model tiering. |
| LLM hallucinates user data in generated cards/sections | Medium | High | Provide full profile context in the prompt. Validate output against known facts. Add a "facts check" step in generation tools. |
| Memory extraction produces low-quality memories | Medium | Medium | Start with high-confidence extraction only. Log memory quality manually for first 100 extractions. Tune the extraction prompt. |
| Atlas Vector Search latency at scale | Low (for v0) | Medium | Keep index size manageable. Add limit + numCandidates tuning. $vectorSearch is fast for <1M vectors. |

Open Questions (Answer These When You Build)

  1. Multi-tenancy (resolved): Start with single user. The candidate app's provisioned accounts are a different concept (admin-created accounts, not team data partitioning). If multi-tenancy is needed later, add team_id to every query, but that's a schema migration across both apps, not a v0 concern.

  2. Candidate app API auth for external calls: Currently, qwestly-agent reads structured data directly from MongoDB. For writes and LinkedIn ingestion, it calls candidate app APIs. In v0, the chat UI lives in the candidate app so cross-app auth isn't a concern. If these endpoints need to be called from elsewhere in the future, they'll need auth (likely the same JWT shared-secret pattern). Flagged for follow-up.

Resolved Questions

| Question | Decision |
| --- | --- |
| Auth between apps? | Candidate app creates short-lived JWT (shared secret QWESTLY_AGENT_SHARED_SECRET) with user_id, email, name. qwestly-agent verifies it on every request. |
| Chat persistence? | Yes. qwestly-agent stores conversations in its own agent_conversations collection in MongoDB. Persists across browser sessions via conversation_id. |
| Memory ownership? | qwestly-agent owns user_memories and agent_conversations. Does not reuse candidate app's chatbot_sessions/chatbot_messages. |
| Model routing? | Two-tier: DeepSeek v4 Flash for orchestrator + basic tasks, DeepSeek v4 Pro for content generation (cards, About sections). Assigned per-agent/per-tool. Threshold-based upgrade to Pro for long-context sessions (v0.3). |
| LinkedIn API? | qwestly-agent calls candidate app's /api/linkedin/profile endpoint. Candidate app owns the third-party integration (fetching, retries, caching, force refresh). qwestly-agent never calls the third-party API directly. |
| Card format / data? | Qwestly Cards are already structured in candidate app's models (QwestlyCardSection, QCNoInterviewSection, etc.). qwestly-agent fetches card data via candidate app APIs or direct MongoDB reads. The format is a solved problem. |
| Within-session context? | Sliding window (15 messages) for v0. Memory tier preserves cross-session context. |
| Error handling? | Tool errors → agent sees + retries. LLM failures → graceful user message. Hard cap: 10 tool calls. |
| Multi-tenancy? | Start with single user. No team_id partitioning for v0. Revisit if a team/recruiter plan is introduced. |