Qwestly Orchestration Plan — Concrete Proposal
A specific, actionable architecture for Qwestly's agentic system. Updated with confirmed decisions: separate Python app, Pydantic AI, three-tier data architecture (structured data, unstructured memory, RAG), direct MongoDB access.
Executive Summary
Build a new, separate Python app (qwestly-agent) using Pydantic AI as the agent framework. It receives user requests via a chat endpoint, routes intent to tool calls, and streams responses. The existing candidate app (Next.js/TS) serves as the frontend and data entry surface — the agent system is a completely separate concern. Data lives in MongoDB — operational data, unstructured memory, and vector search (RAG) all in one place via Atlas. Start with a single orchestrator + tools, add specialist sub-agents as the prompt grows too complex.
System Architecture
    graph TD
        subgraph candidate["candidate app<br/>(Next.js / TypeScript)"]
            direction TB
            C1[Frontend - chat UI, etc.]
            C2[Data entry / management]
            C3[Auth - Auth0]
            C4[Owns schema - Mongoose]
            C5[Exposes write APIs]
        end
        subgraph agent["qwestly-agent app<br/>(FastAPI / Python)"]
            direction TB
            A1[POST /api/chat - SSE]
            A2[Pydantic AI Orchestrator]
            A3[Tools + sub-agents]
            A4[Direct MongoDB reads]
            A5[Memory extraction - async]
        end
        subgraph mongo["MongoDB (Atlas)"]
            direction TB
            M1["Structured: users, cards, preferences"]
            M2["Memory: user_memories collection"]
            M3["RAG/Vectors: knowledge_chunk + index"]
        end
        candidate -->|"API calls"| agent
        candidate --> mongo
        agent --> mongo
This is a new app, not a modification of candidate. The agent system has its own deploy cadence, its own dependencies, and its own concerns. The candidate app exposes a few API endpoints for data writes (saving cards, updating profiles) — everything else happens in qwestly-agent.
Phase 1: Foundation (v0)
Core Architecture
    graph TD
        chat["Chat UI<br/>(candidate app)"] -->|"POST /api/chat"| fastapi["FastAPI on Vercel<br/>(qwestly-agent)"]
        fastapi --> orch["Pydantic AI Orchestrator"]
        orch --> profile["get_user_profile"]
        orch --> card["generate_card"]
        orch --> knowledge["query_user_knowledge"]
        profile --> mongo["MongoDB + Atlas Vector Search<br/>(structured + memory + RAG)"]
        card --> mongo
        knowledge --> mongo
Framework Choice: Pydantic AI (CONFIRMED)
| Why | Detail |
|---|---|
| Python-native | FastAPI + Pydantic AI share the Pydantic schema DNA. Your existing request/response models plug right in. |
| Type safety | Every tool input, tool output, and agent result is validated by Pydantic. No runtime surprises. |
| Multi-provider | DeepSeek v4 Flash for routing/basic tasks, DeepSeek v4 Pro for complex generation — one-line config change per agent. |
| Structured outputs | Card generation returns a typed CardResult model. Output that fails validation is rejected and retried instead of reaching the caller. |
| Testable | Built-in test model for deterministic agent tests without calling a real LLM. |
| No new infra | Pure Python, no native deps, runs fine on Vercel's Python runtime. |
Model Routing (CONFIRMED)
Two models, assigned by task complexity. This is implemented up front, not deferred.
| Model | Tier | What it handles | Why |
|---|---|---|---|
| DeepSeek v4 Flash | Default (chat) | Orchestrator routing, intent classification, memory extraction, list_capabilities, simple Q&A | Fast, cheap, sufficient for classification/routing |
| DeepSeek v4 Pro | Premium (pro) | Card generation, About section writing, multi-step reasoning, content that users will read and judge | Higher quality output for user-facing content |
Routing logic: The model is assigned per-agent or per-tool, not chosen by the orchestrator at runtime:
    from pydantic_ai import Agent

    # Orchestrator itself: Flash for fast routing
    orchestrator = Agent(
        "deepseek:deepseek-v4-flash",
        system_prompt=ORCHESTRATOR_PROMPT,
        tools=[get_user_profile, ...],  # the LLM-backed tools below register via decorator
    )

    # Content generation tools use Pro internally.
    # tool_plain is used because these tools take no RunContext argument.
    @orchestrator.tool_plain
    async def generate_qwestly_card(user_id: str, style: str = "standard") -> CardResult:
        """Generate a Qwestly Card. This tool internally uses DeepSeek v4 Pro
        for high-quality content generation."""
        card_agent = Agent(
            "deepseek:deepseek-v4-pro",  # Pro for content
            system_prompt=CARD_GENERATION_PROMPT,
            result_type=CardResult,  # named output_type in newer Pydantic AI releases
        )
        result = await card_agent.run(...)
        return result.data  # the validated CardResult (result.output in newer releases)

    @orchestrator.tool_plain
    async def suggest_linkedin_about(user_id: str) -> str:
        """Generate an improved LinkedIn About section. Uses Pro for quality."""
        about_agent = Agent(
            "deepseek:deepseek-v4-pro",  # Pro for content
            system_prompt=ABOUT_GENERATION_PROMPT,
        )
        result = await about_agent.run(...)
        return result.data  # unwrap the text from the run result
Token threshold upgrade: When a conversation exceeds ~4000 tokens of accumulated context, the orchestrator itself can be swapped to Pro for the remainder of the session. The increased context length and complexity warrant the more capable model. This is a v0.3 optimization — start with static assignment per tool, add threshold-based upgrade once you have latency/cost data.
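The threshold rule above can be sketched as a small selection function. This is a minimal illustration, not the planned implementation: `estimate_tokens`, `pick_orchestrator_model`, and the 4-characters-per-token heuristic are assumptions introduced here; only the model names and the ~4000-token figure come from this plan.

```python
FLASH_MODEL = "deepseek:deepseek-v4-flash"
PRO_MODEL = "deepseek:deepseek-v4-pro"
UPGRADE_THRESHOLD_TOKENS = 4000  # the "~4000 tokens" figure from the text


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pick_orchestrator_model(messages: list[str]) -> str:
    """Return Pro once the accumulated context passes the threshold, else Flash."""
    total = sum(estimate_tokens(m) for m in messages)
    return PRO_MODEL if total > UPGRADE_THRESHOLD_TOKENS else FLASH_MODEL
```

Because the upgrade is meant to last for the remainder of the session, a caller would likely store the chosen model on the session once Pro is selected rather than re-evaluating after the context is trimmed.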
Cost impact: Flash is ~10x cheaper than Pro. Routing with Flash + generating with Pro keeps the common case (simple questions, memory extraction) cheap while spending on what users actually see (cards, content).
Tools (v0)
| Tool | What it does | Backed by |
|---|---|---|
| get_user_profile(user_id) | Returns structured profile data from MongoDB (reads from candidates_enhanced, CandidateProfileCache) | Direct MongoDB query |
| ingest_linkedin_profile(linkedin_url) | Fetches LinkedIn data for a user. Calls the candidate app's /api/linkedin/profile endpoint, which handles fetching, retries, caching, and force refresh. qwestly-agent does NOT call the third-party API directly. | Candidate app API |
| suggest_linkedin_about(user_id) | Generates an improved LinkedIn About section draft | New: LLM call with profile context |
| generate_qwestly_card(user_id, style?) | Creates a Qwestly Card (structured career document) | New LLM pipeline, likely agentic internally |
| query_user_knowledge(user_id, question) | Semantic search across everything about a user (notes, history, resumes, interviews) | Atlas Vector Search + memory search |
| query_qwestly_docs(question) | Answers "how does Qwestly work?" questions | Atlas Vector Search over internal docs |
| search_user_memories(user_id, query) | Semantic search over synthesized memories (preferences, goals, facts, decisions) | user_memories collection + vector search |
| list_capabilities() | Returns what Qwestly can do | Static list or config |
Data Architecture (Three Tiers)
This is the key architectural decision. Qwestly's data falls into three distinct tiers, each with different storage strategies and query patterns.
Tier 1: Structured Data
What it is: Database fields — user profiles, employment history, education, preferences, Qwestly Cards. Deterministic, exact queries.
Storage: MongoDB collections (candidates_enhanced, CandidateProfileCache, employment_stints_enhanced, preferences_enhanced, qc_sections, etc.)
Owned by: Candidate app (Mongoose schemas are source of truth)
Accessed by qwestly-agent via: Direct MongoDB reads (read-only). Tool calls for exact lookups.
Query pattern: get_user_profile(user_id) — direct document lookup. Deterministic, exact, fast.
Tier 2: Unstructured Memory (NEW — see unstructured-memory.md)
What it is: LLM-extracted, synthesized memories from conversations. Discrete facts, preferences, goals, and decisions the agent learns about a user over time. Not raw transcripts — distilled, deduplicated, structured knowledge.
Storage: MongoDB user_memories collection + vector index for semantic retrieval
Owned by: qwestly-agent (extracts and manages memories)
Query pattern: Loaded at session start (top memories by importance + recency) and searchable via tool during conversation.
Key properties:
- Each memory is a discrete document: {type, content, importance, source, user_id, created_at}
- Extracted asynchronously after conversations (LLM reviews transcript → extracts memories)
- Consolidated periodically (merge similar memories)
- Injected into system prompt at session start
This is the gap the candidate app doesn't fill. See unstructured-memory.md for the full design.
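As a rough illustration of the "top memories by importance + recency" query pattern, the sketch below ranks memory documents in plain Python. The field names follow the document shape listed above; the 0..1 importance scale, the exponential decay, and the 30-day half-life are assumptions made for this example, not decisions from unstructured-memory.md.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    # Field names follow the memory document shape described above.
    type: str
    content: str
    importance: float  # assumed to be a 0..1 score
    source: str
    user_id: str
    created_at: datetime


def score(memory: Memory, now: datetime, half_life_days: float = 30.0) -> float:
    """Importance weighted by an exponential recency decay (hypothetical formula)."""
    age_days = max(0.0, (now - memory.created_at).total_seconds() / 86400)
    return memory.importance * 0.5 ** (age_days / half_life_days)


def top_memories(memories: list[Memory], now: datetime, k: int = 20) -> list[Memory]:
    """Return the k highest-scoring memories, best first."""
    return sorted(memories, key=lambda m: score(m, now), reverse=True)[:k]
```

With this weighting, a moderately important memory from yesterday outranks a very important one from three months ago; whether that trade-off is right is something the memory-quality review would need to confirm.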
Tier 3: RAG (Vector Search over Raw Content)
What it is: Semantic search over unstructured raw content — resumes, interview transcripts, uploaded documents, full chat transcripts. Not synthesized, not distilled — the original text.
Storage: MongoDB knowledge_chunk collection + Atlas Vector Search index (already exists in candidate app)
Owned by: Candidate app (document ingestion pipeline) — qwestly-agent queries the same index
Query pattern: query_user_knowledge(user_id, question) — embed question → $vectorSearch → return top-k chunks
Already implemented in candidate app: Document upload → Python parsing API → chunking (1500 chars, 200 overlap) → OpenAI text-embedding-3-small (1536d) → Atlas $vectorSearch with cosine similarity.
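To make the chunking parameters concrete, here is a minimal character-window chunker using the 1500/200 figures quoted above. It is a sketch under the assumption that chunks are plain character windows; the candidate app's actual splitter may respect sentence or paragraph boundaries, and `chunk_text` is a name chosen for this example.

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each window starts 1300 chars after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable as a whole from at least one chunk.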
Tier Comparison
| Aspect | Structured Data | Unstructured Memory | RAG |
|---|---|---|---|
| What it stores | Profile fields, employment, education, cards | Synthesized facts, preferences, goals, decisions | Raw documents, transcripts, resumes |
| Processing | Direct from user input / LinkedIn API | LLM extraction + consolidation after conversations | Chunking + embedding pipeline |
| Query | Exact lookup by field/key | Semantic search + recency/importance ranking | Semantic search (vector similarity) |
| Accuracy | Deterministic, exact | High-signal (distilled), may have gaps | Noisy (raw text), comprehensive |
| Updates | On user action / LinkedIn ingest | Async after each conversation | On document upload |
| Use case | "What's my current title?" | "What writing style do I prefer?" | "What did my interview transcript say about leadership?" |
| Collection | candidates_enhanced, CandidateProfileCache, etc. | user_memories (NEW) | knowledge_chunk (exists) |
Data Flow: Full Query Lifecycle
When a user asks "What do we know about me?":
    graph TD
        Q["User asks: What do we know about me?"] --> O["Orchestrator fuses all three in LLM context"]
        O --> T1["1. get_user_profile<br/>Structured data from<br/>candidates_enhanced, CandidateProfileCache<br/>→ Deterministic, exact<br/>→ Scaffold: name, roles, companies"]
        O --> T2["2. search_user_memories<br/>Synthesized memories from user_memories<br/>→ High-signal distilled knowledge<br/>→ Preferences, goals, past decisions"]
        O --> T3["3. query_user_knowledge<br/>Raw text chunks from knowledge_chunk via vectorSearch<br/>→ Comprehensive but noisy<br/>→ Specific quotes, detailed history"]
        T1 --> R["4. LLM synthesizes final answer"]
        T2 --> R
        T3 --> R
Data Layer Summary
| What | Where | Notes |
|---|---|---|
| User profiles | MongoDB (candidates_enhanced, CandidateProfileCache, etc.) | Already exists. Read by qwestly-agent directly. |
| LinkedIn data | MongoDB (linkedin_profiles, linkedin_summaries) | Already exists. Flexible schema handles nested LinkedIn JSON naturally. |
| Generated artifacts | MongoDB (qc_sections, linkedin_profile_suggestions, etc.) | Already exists. |
| Chat history | MongoDB (chatbot_sessions, chatbot_messages) | Already exists in candidate app. qwestly-agent logs its own sessions to a separate agent_conversations collection. |
| Unstructured memory | MongoDB (user_memories), NEW | LLM-extracted facts, preferences, goals. Owned by qwestly-agent. Needs Atlas Vector Search index (user_memories_vector_index, 1536d, cosine) on M10+. For local dev: db.user_memories.createIndex({content: "text"}) as fallback. |
| RAG/vector search | MongoDB Atlas Vector Search on knowledge_chunk | Already exists. Index name: knowledge_chunk_vector_index, 1536d, cosine similarity. |
| Embeddings | OpenAI text-embedding-3-small (1536d) | Already in use by candidate app. qwestly-agent uses same model/dimensions. |
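For reference, a user-scoped query against knowledge_chunk_vector_index could be built roughly as follows. This is a sketch: the `embedding` and `text` field names are assumptions (the actual paths live in the candidate app's schema), and the `filter` clause only works if `user_id` is declared as a filter field in the index definition.

```python
EMBEDDING_DIMENSIONS = 1536  # text-embedding-3-small, as listed above


def build_vector_search_pipeline(
    query_vector: list[float],
    user_id: str,
    limit: int = 5,
    num_candidates: int = 100,
) -> list[dict]:
    """Build an aggregation pipeline for a user-scoped $vectorSearch."""
    if len(query_vector) != EMBEDDING_DIMENSIONS:
        raise ValueError(f"expected a {EMBEDDING_DIMENSIONS}-dimensional query vector")
    return [
        {
            "$vectorSearch": {
                "index": "knowledge_chunk_vector_index",
                "path": "embedding",  # assumed name of the vector field
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # ANN candidates considered before ranking
                "limit": limit,  # top-k chunks returned
                "filter": {"user_id": {"$eq": user_id}},  # assumes user_id is a filter field
            }
        },
        # Return the chunk text (assumed field name) with its similarity score.
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]
```

The resulting list can be passed to a collection's `aggregate` call; keeping the builder separate from the database call makes the user scoping easy to unit test.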
Chat Interface (v0)
- Chat widget in the candidate app (React/Next.js)
- Calls POST /api/chat on the qwestly-agent FastAPI app
- Streams responses via Server-Sent Events (SSE)
- Start simple: plain text responses. Upgrade to structured action cards later.
Streaming on FastAPI + Vercel
    from fastapi import APIRouter
    from fastapi.responses import StreamingResponse

    router = APIRouter()

    @router.post("/chat")
    async def chat(request: ChatRequest):
        agent = QwestlyOrchestrator(user_id=request.user_id)

        async def event_stream():
            # One SSE frame per chunk; the blank line terminates the frame
            async for chunk in agent.run_stream(request.message):
                yield f"data: {chunk.json()}\n\n"

        return StreamingResponse(
            event_stream(),
            media_type="text/event-stream",
            headers={
                "X-Accel-Buffering": "no",  # ask reverse proxies not to buffer the stream
                "Cache-Control": "no-cache",
            },
        )
Timeout: Set maxDuration: 300 in Vercel config. A typical agent loop runs ~15-45s.
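The SSE wire format used by the endpoint is simple enough to isolate in a helper. The sketch below is an illustration only; `sse_event` and the `tool_call` event name are hypothetical, chosen to show how tool progress and text chunks could share one stream.

```python
import json
from typing import Optional


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events frame; a blank line terminates the frame."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps (without indent) never emits raw newlines, so the payload
    # always fits on a single "data:" line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"
```

A named event lets the chat widget tell status updates ("Searching your profile...") apart from answer text without inspecting the payload.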
Auth & Session Management
The candidate app authenticates users via Auth0. qwestly-agent needs to know who is talking to it without re-implementing authentication. The pattern: candidate app creates a short-lived JWT with user context, qwestly-agent verifies it using a shared secret.
Flow
    sequenceDiagram
        participant U as User
        participant C as Candidate App<br/>(Auth0-authenticated)
        participant Q as qwestly-agent<br/>POST /api/chat
        participant O as Orchestrator
        U->>C: Authenticated request
        C->>C: Create JWT with<br/>{user_id, email, name}<br/>signed with QWESTLY_AGENT_SHARED_SECRET
        C->>Q: Forward request + JWT
        Q->>Q: Verify JWT using shared secret<br/>Extract user_id → RunContext
        Q->>O: Run with user context
Implementation
Candidate app side (creates JWT before calling qwestly-agent):
    // In the chat API route or server component
    import jwt from 'jsonwebtoken';

    const agentToken = jwt.sign(
      {
        user_id: session.user.sub, // Auth0 user ID
        email: session.user.email,
        name: session.user.name,
      },
      process.env.QWESTLY_AGENT_SHARED_SECRET,
      // Short-lived: sign a fresh token for each forwarded request,
      // since qwestly-agent verifies the token on every request
      { algorithm: 'HS256', expiresIn: '5m' }
    );
qwestly-agent side (verifies JWT on every request):
    import os

    import jwt  # PyJWT
    from fastapi import Depends, Header, HTTPException

    AGENT_SHARED_SECRET = os.environ["QWESTLY_AGENT_SHARED_SECRET"]

    async def verify_agent_token(authorization: str = Header(...)) -> dict:
        """Verify the JWT from the candidate app. Returns user context."""
        # Strip only a leading "Bearer " scheme, not occurrences inside the token
        token = authorization.removeprefix("Bearer ")
        try:
            payload = jwt.decode(token, AGENT_SHARED_SECRET, algorithms=["HS256"])
            return payload  # { user_id, email, name }
        except jwt.ExpiredSignatureError:
            raise HTTPException(status_code=401, detail="Token expired")
        except jwt.InvalidTokenError:
            raise HTTPException(status_code=401, detail="Invalid token")

    # Usage in route
    @router.post("/chat")
    async def chat(request: ChatRequest, user: dict = Depends(verify_agent_token)):
        agent = QwestlyOrchestrator(user_id=user["user_id"])
        ...
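To show what the shared-secret check actually does, here is a standard-library sketch of HS256 signing and verification. It is for illustration only: the real apps should keep using jsonwebtoken and PyJWT as shown above, and this version deliberately skips checks a library performs (for example, validating the header's `alg`). The function names are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode())
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: str, now: Optional[float] = None) -> dict:
    """Return the payload if the signature matches and the token is unexpired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = hmac.new(
        secret.encode(), f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if "exp" in payload and payload["exp"] < (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload
```

Only a holder of QWESTLY_AGENT_SHARED_SECRET can produce a signature that verifies, which is what lets qwestly-agent trust the `user_id` claim without talking to Auth0.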
Why this over alternatives
| Approach | Why not |
|---|---|
| Pass user_id as plain header | No integrity check — any caller could impersonate any user |
| Candidate proxies all LLM calls | Defeats the purpose of a separate agent app. Tight coupling. |
| qwestly-agent validates Auth0 tokens directly | Couples qwestly-agent to the Auth0 configuration (tenant, audience, key lookup) in a second app. A shared-secret JWT is simpler. |
| API key only (no user context) | Doesn't tell the agent who the user is — can't scope queries. |
Session startup
On first message in a session, qwestly-agent:
- Verifies the JWT → gets user_id
- Loads the user's profile from MongoDB (structured data)
- Loads top 20 memories (memory tier)
- Injects both into the orchestrator's system prompt
- Begins the conversation
Subsequent messages in the same session reuse the loaded context (no need to re-fetch profile/memories unless the session is long — see conversation management below).
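The injection step at session start might look like the following. This is a hedged sketch: `build_system_prompt`, the section headings, and the flat profile dict are assumptions for illustration; only the "top 20 memories" cap comes from the steps above.

```python
def build_system_prompt(
    base_prompt: str,
    profile: dict,
    memories: list[str],
    max_memories: int = 20,
) -> str:
    """Append profile fields and the top memories to the orchestrator prompt."""
    profile_lines = [f"- {key}: {value}" for key, value in profile.items()]
    memory_lines = [f"- {m}" for m in memories[:max_memories]]
    sections = [base_prompt.strip(), "## User profile", *profile_lines]
    if memory_lines:
        sections += ["## What we remember about this user", *memory_lines]
    return "\n".join(sections)
```

Because the result is computed once per session, later turns can reuse it as-is, which matches the "no need to re-fetch" behaviour described above.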
Within-Session Conversation Management
Each turn adds tokens to the context window. After 20+ turns, the context can overflow the LLM's limit or degrade performance. Cross-session context is handled by the memory system. Within-session, we need a strategy.
v0: Sliding Window
Keep the last 15 messages in context. Drop older ones. Simple, works for most sessions.
    MAX_SESSION_MESSAGES = 15

    def trim_context(messages: list) -> list:
        """Keep only the most recent messages within the limit."""
        if len(messages) > MAX_SESSION_MESSAGES:
            # Always keep the system prompt (index 0), then the newest 14 messages
            return [messages[0]] + messages[-(MAX_SESSION_MESSAGES - 1):]
        return messages
Why this is safe: The memory extraction pipeline (running async after the session) captures important facts, preferences, and decisions into the memory tier. So even if within-session context is trimmed, the agent still "remembers" what matters across sessions.
v1: Summarization
When sessions consistently exceed 20 turns, add periodic summarization: compress messages 1-15 into a single summary message, keep messages 16-20 live. An LLM call generates the summary. Keeps the best of both — compact history + recent detail.
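A minimal sketch of that compaction step, assuming the history is a list of message strings that excludes the system prompt. The `summarize` callable stands in for the LLM call; `compact_history` and the `[summary]` prefix are hypothetical names for this example.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 5,
    trigger: int = 20,
) -> list[str]:
    """Once `trigger` messages accumulate, fold all but the newest
    `keep_recent` into a single summary message."""
    if len(messages) < trigger:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary] {summarize(older)}", *recent]
```

At 20 messages this compresses messages 1-15 and keeps 16-20 live, as described above. Injecting the summarizer as a parameter keeps the function testable without a real model.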
v2: RAG over History
Store every turn in MongoDB. When the agent needs historical context from earlier in the session, it calls query_user_knowledge — same RAG tool it uses for documents. The agent decides what history it needs, when it needs it.
Error Handling
Agents are multi-step, multi-system operations. Things will fail. The plan for each failure mode:
Tool failures
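Pydantic AI feeds a tool failure back to the model when the tool raises ModelRetry (invalid tool arguments are reported back the same way); any other exception propagates and ends the run. So get_user_profile should catch expected failures and raise ModelRetry with a useful message. The agent then sees that message and can: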
- Retry (if it looks transient, like a timeout)
- Try an alternative (e.g., search memories if profile data is unavailable)
- Apologize + explain (if the data is genuinely unavailable)
No special infrastructure is needed beyond that: the retry loop is built into Pydantic AI, provided tools convert expected failures into ModelRetry.
LLM call failures
If the LLM API returns an error (rate limit, timeout, provider outage), the agent loop catches it and returns a graceful response:
    from pydantic_ai.exceptions import ModelHTTPError, UnexpectedModelBehavior

    try:
        result = await orchestrator.run(message, deps=deps)
    except (ModelHTTPError, UnexpectedModelBehavior) as e:
        # Log the full error for debugging
        await log_error(conversation_id, e)
        # Return a user-friendly message (not the raw error)
        yield "I'm having trouble processing that right now. Can you try again?"
Infinite loops
Set a hard limit on tool calls per request: max 10 tool calls. The orchestrator's system prompt also instructs it to give a final answer within a reasonable number of steps.
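    from pydantic_ai.usage import UsageLimits

    result = await orchestrator.run(
        message,
        deps=deps,
        # Hard cap: request_limit bounds model round-trips, which bounds the
        # tool-calling loop; exceeding it raises UsageLimitExceeded
        usage_limits=UsageLimits(request_limit=10),
    )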
Vercel timeout
The 300s ceiling is the ultimate safety net. If the agent hasn't finished by then, Vercel kills the function and the user sees an error. Mitigated by: monitoring P95 latency, triggering alerts if consistently >200s, moving to a dedicated backend if needed.
Summary
| Failure mode | v0 approach |
|---|---|
| Tool failure | Tool raises ModelRetry → agent sees error → retries or apologizes |
| LLM call failure | Catch the model exception (e.g. ModelHTTPError) → log full error → return user-friendly message |
| Infinite loop | Max 10 tool calls per request (Pydantic AI enforces) |
| Vercel timeout | 300s ceiling. Monitor P95. Move to dedicated backend if >200s. |
Key Design Decisions
- Separate Python app, not added to candidate. The agent system has its own deploy cadence, dependencies, and concerns. Candidate app is the frontend + data entry surface.
- JWT-based auth between apps. Candidate app creates a short-lived JWT (shared secret) with user context. qwestly-agent verifies it. No Auth0 SDK duplication, no plaintext user IDs.
- Direct MongoDB reads, write through candidate app APIs. qwestly-agent reads structured data directly from MongoDB. For writes (saving a card, updating a profile), it calls candidate app API endpoints so validation/business logic stays in one place.
- One orchestrator, no sub-agents (for v0). Every capability is a tool the orchestrator can call. Resist splitting into sub-agents until the orchestrator's prompt exceeds ~100 lines or routing logic gets genuinely complex.
- Tools that internally use LLMs are fine. suggest_linkedin_about and generate_qwestly_card will be agentic internally. The orchestrator doesn't know or care; it sees a tool call.
- Three-tier data access. Structured profile data (tool calls) + synthesized memory (session start injection + tool) + raw RAG (tool). Each serves a different purpose. The orchestrator fuses results.
- Memory extraction is async. After each conversation, an LLM reviews the transcript and extracts memories. This doesn't block the user; it runs as a background process.
- Sliding window for within-session context. Keep last 15 messages. Memory tier preserves important details across sessions. Upgrade to summarization when sessions grow long.
- Stream everything, including tool calls. The user should see the agent's reasoning: "Searching your profile...", "Generating card...", then the final answer streaming in.
- Log every interaction to MongoDB. User message, every tool call (name + args + result), final response, latency, token count. This data is your debugging lifeline and your eval dataset.
- Graceful error handling. Tool errors exposed to agent for retry. LLM failures caught and returned as user-friendly messages. Hard cap of 10 tool calls per request.
Phase 2: Sub-Agents (v1 — When the Prompt Gets Too Long)
Don't build this until you need it. You'll know when:
- The orchestrator's system prompt exceeds ~100 lines
- Tool descriptions are paragraphs long and the orchestrator still gets confused
- You find yourself writing "if this, delegate to that" logic in the prompt
New Topology
    graph TD
        U["User"] --> O["Orchestrator<br/>(routes intent, no longer does the work)"]
        O --> LA["LinkedIn Agent"]
        O --> QA["QA Agent"]
        O --> CG["Card Gen Agent"]
        O --> KA["Knowledge Agent"]
Sub-Agent Pattern (Agent-as-Tool)
Each sub-agent is a Pydantic AI Agent wrapped behind a tool. The orchestrator still calls tools — some tools just happen to run an LLM loop internally:
    @orchestrator.tool
    async def delegate_card_generation(
        ctx: RunContext[AgentDeps],
        user_id: str,
        style: str = "standard",
    ) -> CardResult:
        """Generate a Qwestly Card. Handles the full multi-step pipeline:
        profile analysis -> section drafting -> formatting.

        This is a complex operation that runs its own agent loop internally.
        The orchestrator just gets back the final CardResult.
        """
        card_agent = Agent(
            "deepseek:deepseek-v4-pro",  # Pro for content, per the model routing decision
            system_prompt=CARD_AGENT_SYSTEM_PROMPT,
            deps_type=AgentDeps,
            result_type=CardResult,
            tools=[get_user_profile, format_card_text],
        )
        result = await card_agent.run(
            f"Generate a {style} Qwestly Card for user {user_id}",
            deps=ctx.deps,  # share the orchestrator's dependencies with the sub-agent
        )
        return result.data
Which Sub-Agents, and When
| Sub-agent | Split trigger | Key responsibility |
|---|---|---|
| LinkedIn Agent | LinkedIn ingestion grows to handle multiple sources, merge conflicts, partial updates | Fetch, parse, transform LinkedIn data; handle merge with existing profile data |
| Card Generator Agent | Card generation has 3+ templates, multi-section writing, formatting pipeline | Profile analysis -> template selection -> section writing -> formatting -> review |
| Profile QA Agent | Users ask complex analytical questions about their data that need multi-hop reasoning | Gets profile + RAG results + synthesizes answers that require reasoning across data sources |
| Knowledge Agent | Qwestly's internal documentation gets large enough to need its own RAG pipeline | RAG over internal docs, answers "how does X work" questions |
Phase 3: HITL + MCP (v2)
Human-in-the-Loop for Write / Public-Facing Actions
By default, the agent generates drafts freely. But any action that writes data, changes visible state, or affects something public-facing requires user confirmation.
The pattern is a class of actions, not a specific tool. Any tool that mutates state follows the same gating:
    @agent.tool
    async def save_card_as_active(
        ctx: RunContext[AgentDeps],
        user_id: str,
        card_id: str,
    ) -> ActionResult:
        """Save this Qwestly Card as the user's active/public card.

        CRITICAL: Never call this unless the user has explicitly confirmed.
        First show the card draft, let the user review it, and only call
        this after they say 'save it' or 'make it active'.
        """
        if not _user_confirmed_for_card(ctx, card_id):
            return ActionResult(
                status="needs_confirmation",
                preview=card_id,
                message="Ready to save this as your active card. Confirm?",
            )
        # ... write through candidate app API
Gating classification:
| Action class | HITL? | Examples |
|---|---|---|
| Read anything | No | Profile lookup, knowledge search, memory search |
| Generate drafts | No | Card draft, About section draft, headline suggestions |
| Save / activate | Yes (tool-level) | Save card as active, update profile field, change preferences |
| Delete / destructive | Yes (tool-level minimum) | Delete card, remove data |
v0 UX: Inline chat confirmation. "Ready to save this as your active card. Confirm?" → user says yes/no.
v1 upgrade: Orchestrator-enforced gating (LangGraph interrupt_before) for higher-stakes actions.
See human-in-the-loop.md for the full HITL design.
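The confirmation check behind a gated tool can be sketched as a small tracker. This is an illustrative in-memory version under stated assumptions: `ConfirmationTracker` and its method names are hypothetical, and a real implementation would persist state per conversation rather than in process memory.

```python
class ConfirmationTracker:
    """Tracks which (action, target) pairs the user has explicitly confirmed."""

    def __init__(self) -> None:
        self._pending: set[tuple[str, str]] = set()
        self._confirmed: set[tuple[str, str]] = set()

    def request(self, action: str, target_id: str) -> None:
        """Record that the user was shown a preview and asked to confirm."""
        self._pending.add((action, target_id))

    def confirm(self, action: str, target_id: str) -> bool:
        """Mark a pending action as confirmed; False if it was never requested."""
        key = (action, target_id)
        if key not in self._pending:
            return False
        self._pending.discard(key)
        self._confirmed.add(key)
        return True

    def consume(self, action: str, target_id: str) -> bool:
        """Return True once per confirmation, so a confirmation cannot be replayed."""
        key = (action, target_id)
        if key in self._confirmed:
            self._confirmed.discard(key)
            return True
        return False
```

Requiring a prior `request` before `confirm` means a stray "yes" in the chat cannot authorize an action the user was never shown, and `consume` makes each confirmation single-use.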
MCP-ify Tools for Independent Deployment
When a tool needs to deploy independently, wrap it as an MCP server. Don't do this until you have 2+ services that genuinely need independent deploys. For v0, direct function calls are simpler and faster.
Technology Choices Summary
| Layer | Choice | Why it fits |
|---|---|---|
| Language | Python 3.12+ | Pydantic AI + FastAPI + MongoDB driver. |
| Agent framework | Pydantic AI (v0) -> maybe LangGraph sub-graphs later | Type safety, structured outputs, multi-provider, built-in test model. |
| LLM provider | DeepSeek v4 Flash (default, routing, extraction) + DeepSeek v4 Pro (content generation, complex reasoning) | Two-tier model routing assigned per-agent/per-tool. Pydantic AI makes model swaps a one-line change. |
| API layer | FastAPI (new app, separate from candidate) | Python-native, SSE streaming, shared schema DNA with Pydantic AI. |
| Database | MongoDB (shared with candidate app) | Structured data + memory + vector search — all in one DB. |
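| Vector search | MongoDB Atlas Vector Search (already implemented for resumes in candidate app) | Reuse the existing setup for memory, chat logs, and docs. An Atlas vector index covers a single collection, so each new collection (e.g. user_memories) gets its own index. |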
| Embeddings | OpenAI text-embedding-3-small (1536d) via openai SDK | Already in use by candidate app. Same model, same dimensions. No langchain dependency. |
| Tool protocol | Direct Python functions (v0) -> MCP (v2) | Start simple. MCP-ify when a tool needs independent deployment. |
| Streaming | FastAPI StreamingResponse + SSE | Works on Vercel Python runtime. No WebSockets needed. |
| Deployment | Vercel Pro Team (separate app from candidate) | 300s timeout. Two apps on same team, independent deploys. |
| Frontend | Candidate app (Next.js/React) | Chat widget calls qwestly-agent's /api/chat. No new frontend. |
| Observability | Structured MongoDB logs (v0) -> Logfire (v1) | Agent debugging without traces is guesswork. |
Implementation Roadmap
v0.1: Skeleton
- Scaffold qwestly-agent app: follow the existing api-python Vercel pattern (api/index.py entry point, @vercel/python builder, vercel.json with routes). Same MongoDB connection pattern (motor/pymongo).
- Add Pydantic AI + agent dependencies: pydantic-ai plus existing stack (fastapi, motor, openai for embeddings)
- Define tool interfaces: write the function signatures and descriptions for all 7-8 v0 tools
- Implement get_user_profile and list_capabilities: direct MongoDB reads, static config
- Wire up a minimal orchestrator agent: Pydantic AI Agent with those 2 tools, a basic system prompt
- Add POST /api/chat route to FastAPI with SSE streaming
- Verify end-to-end on Vercel: POST a message -> orchestrator calls tool -> response streams back
- Create user_memories collection + vector index in MongoDB Atlas
  - Collection: user_memories in the candidate_portal database
  - Vector index (Atlas UI → Atlas Search → Create Search Index → JSON Editor): { "name": "user_memories_vector_index", "type": "vectorSearch", "definition": { "fields": [ { "type": "vector", "path": "content_embedding", "numDimensions": 1536, "similarity": "cosine" } ] } }
  - Text index (fallback for local dev without Atlas vector search): db.user_memories.createIndex({ content: "text" });
  - ⚠️ Requires M10+ Atlas cluster. Won't work on M0/M2 free tier. For local dev, use the text index fallback (see README).
v0.2: Core Capabilities
- Implement ingest_linkedin_profile: calls candidate app's /api/linkedin/profile endpoint via httpx
- Implement suggest_linkedin_about: DeepSeek Pro generates an improved About section draft
- Implement generate_qwestly_card: reads card status and available data sources from MongoDB; delegates to candidate app card endpoints
- Implement search_user_memories: semantic search over user_memories with text fallback
- Implement query_user_knowledge: $vectorSearch over user-scoped knowledge_chunk + text fallback
- Implement query_qwestly_docs: $vectorSearch over internal docs + text fallback
- Build memory extraction pipeline: MemoryService with extraction, retrieval, insertion (with embeddings), consolidation, cleanup
- Add conversation logging: ConversationLogger writes to agent_conversations collection
- Add session API: GET /api/conversations (list) and GET /api/conversations/{id} (history) with auto-naming from first message
v0.3: Hardening
- Write unit tests — 48 tests: schemas, auth, config, tools, orchestrator, logging (pytest + pytest-asyncio)
- Write integration tests with mocked LLM — system prompt coverage, tool registration, SSE streaming
- Write 3-5 E2E tests with real LLM for critical paths — deferred to v1 (needs CI API keys)
- Add observability: ConversationLogger captures tool calls, response text, latency to agent_conversations
- Add cost tracking: lib/cost.py with DeepSeek Flash/Pro pricing
- Add tool-level HITL: lib/hitl.py confirmation tracking, save_card_as_active example tool, HITL rules in system prompt
- Version prompts: system prompts live as module constants; file extraction deferred
- Deploy to Vercel: ready but not yet deployed
v1: Learn & Iterate
- Collect real user interactions — what do they ask? Which tools get used?
- Audit tool choice accuracy — is the orchestrator picking the right tool every time?
- Improve tool descriptions based on actual mis-routings
- Build an eval dataset from real conversations for regression testing
- A/B test different system prompts to improve routing accuracy
- Tune memory extraction — is the LLM extracting useful memories? Are they improving conversations?
v2+: Evolve
- Evaluate sub-agent split — is the orchestrator prompt too long? Are tools getting confused?
- Add token-threshold model upgrade — auto-swap orchestrator to Pro when conversation context exceeds ~4000 tokens
- Build eval suite — nightly LLM-as-judge runs with trend tracking
- MCP-ify tools that need independent deployment cadence
What Stays the Same
| Aspect | v0 | v0+ | Why no change |
|---|---|---|---|
| Candidate app | Frontend + data entry | Frontend + data entry | Separate concerns. Agent logic lives in qwestly-agent. |
| Deployment | Vercel Pro Team (two apps) | Vercel Pro Team (two apps) | Same platform, independent deploys. |
| Database | MongoDB | MongoDB | Handles everything — structured data, memory, vector search. |
| API framework | FastAPI (new app) | FastAPI | Python-native, pairs with Pydantic AI. |
| Agent framework | Pydantic AI | Pydantic AI + maybe LangGraph sub-graphs for HITL | LangGraph is additive, not a replacement. |
| Frontend | Candidate app's existing UI + chat widget | Same | Chat widget calls qwestly-agent API. |
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Vercel 300s timeout hit on complex multi-step agents | Low | Medium | Monitor agent runtimes. If consistently >200s, extract the agent loop to a dedicated backend (Railway/Fly, ~$15/mo). |
| Cold starts (3-8s) frustrate users on first message | Medium | Low | Show "Loading Qwestly..." with a spinner. Use keepalive ping if needed. |
| Orchestrator picks the wrong tool | Medium | High | Invest heavily in tool descriptions. Add a confirmation step for ambiguous intents. Build eval dataset. |
| LLM costs scale unexpectedly | High | Medium | Set max 10 tool calls per request. Cache profile data. Log cost per conversation. Use model tiering. |
| LLM hallucinates user data in generated cards/sections | Medium | High | Provide full profile context in the prompt. Validate output against known facts. Add a "facts check" step in generation tools. |
| Memory extraction produces low-quality memories | Medium | Medium | Start with high-confidence extraction only. Log memory quality manually for first 100 extractions. Tune the extraction prompt. |
| Atlas Vector Search latency at scale | Low (for v0) | Medium | Keep index size manageable. Add limit + numCandidates tuning. $vectorSearch is fast for <1M vectors. |
Open Questions (Answer These When You Build)
- Multi-tenancy → Resolved: Start with single user. The candidate app's provisioned accounts are a different concept (admin-created accounts, not team data partitioning). If multi-tenancy is needed later, add team_id to every query, but that's a schema migration across both apps, not a v0 concern.
- Candidate app API auth for external calls: Currently, qwestly-agent reads structured data directly from MongoDB. For writes and LinkedIn ingestion, it calls candidate app APIs. In v0, the chat UI lives in the candidate app so cross-app auth isn't a concern. If these endpoints need to be called from elsewhere in the future, they'll need auth (likely the same JWT shared-secret pattern). Flagged for follow-up.
Resolved Questions
| Question | Decision |
|---|---|
| Auth between apps? | Candidate app creates short-lived JWT (shared secret QWESTLY_AGENT_SHARED_SECRET) with user_id, email, name. qwestly-agent verifies it on every request. |
| Chat persistence? | Yes — qwestly-agent stores conversations in its own agent_conversations collection in MongoDB. Persists across browser sessions via conversation_id. |
| Memory ownership? | qwestly-agent owns user_memories and agent_conversations. Does not reuse candidate app's chatbot_sessions/chatbot_messages. |
| Model routing? | Two-tier: DeepSeek v4 Flash for orchestrator + basic tasks, DeepSeek v4 Pro for content generation (cards, About sections). Assigned per-agent/per-tool. Threshold-based upgrade to Pro for long-context sessions (v0.3). |
| LinkedIn API? | qwestly-agent calls candidate app's /api/linkedin/profile endpoint. Candidate app owns the third-party integration (fetching, retries, caching, force refresh). qwestly-agent never calls the third-party API directly. |
| Card format / data? | Qwestly Cards are already structured in candidate app's models (QwestlyCardSection, QCNoInterviewSection, etc.). qwestly-agent fetches card data via candidate app APIs or direct MongoDB reads. The format is a solved problem. |
| Within-session context? | Sliding window (15 messages) for v0. Memory tier preserves cross-session context. |
| Error handling? | Tool errors → agent sees + retries. LLM failures → graceful user message. Hard cap: 10 tool calls. |
| Multi-tenancy? | Start with single user. No team_id partitioning for v0. Revisit if a team/recruiter plan is introduced. |