RAG & Tools Patterns for Qwestly
How to wire up retrieval-augmented generation and tool-calling so the orchestrator can answer questions about users and generate personalized content. Includes architectural patterns, embedding strategy, and the case for MCP.
1. The Data Landscape
Qwestly needs to handle several kinds of data:
| Data type | Source | Storage | Query pattern |
|---|---|---|---|
| Structured user profile | LinkedIn API, direct input | PostgreSQL (or similar) | get_user_profile(user_id) → direct DB/API query |
| LinkedIn raw data | LinkedIn API (scraped/fetched) | JSON blob in DB or object store | Retrieved with profile |
| Generated artifacts | Qwestly services | PostgreSQL + file store | get_user_cards(user_id), get_user_suggestions(user_id) |
| Unstructured notes | User uploads, freeform input | Vector DB (RAG) | Semantic search: "what does the user care about?" |
| Qwestly documentation | Internal docs | Vector DB (RAG) | "How does the card generator work?" |
| Historical interactions | Chat logs | DB (for audit) + Vector DB (for "what have we discussed") | Semantic search + metadata filtering |
2. When to Use a Tool vs. RAG
This is a critical design decision. Mixing them up leads to unreliable agents.
Use a TOOL when:
- The data is structured. User profile fields (name, email, headline, experience) are rows in a DB. A SQL query or API call is deterministic and exact.
- You need exact answers. For "What is the user's current job title?", RAG might miss the field or retrieve a stale chunk. A tool reads the current value straight from the source of record.
- The operation writes data. "Save this card draft." Tools execute side effects; RAG is read-only.
- The query is operational. "Ingest this LinkedIn profile." That's a workflow, not a search.
Use RAG when:
- The question is semantic. No SQL query answers "What themes appear in the user's career story?". You need similarity search over notes, past conversations, or generated content.
- The data is unstructured. PDFs, freeform notes, chat history, open-ended text.
- You're augmenting generation. "Write a personalized summary based on what we know." RAG retrieves relevant context the LLM wouldn't otherwise have.
- You need to cite sources. RAG naturally supports returning source chunks for attribution.
The Hybrid Pattern for Qwestly
For the "what do we know about the user?" query, use tool-first, RAG-second:
1. Tool: get_user_profile(user_id) → structured data from DB
2. Tool: get_user_artifacts(user_id) → list of past Qwestly outputs
3. RAG: semantically search user's notes + chat history for context
4. Combine all three in LLM context → generate answer
The agent doesn't need to choose between tool and RAG: it calls multiple tools and the orchestrator fuses the results.
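The four steps above can be sketched as a single context-building function. This is a minimal illustration, not Qwestly's implementation: the three lookup functions are stubs with made-up return values, and `search_user_notes`, `build_context`, and the section labels in the prompt are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


def get_user_profile(user_id: str) -> dict:
    # Stub for the structured DB/API lookup (hypothetical data).
    return {"user_id": user_id, "headline": "Staff Engineer"}


def get_user_artifacts(user_id: str) -> list[str]:
    # Stub for listing past Qwestly outputs (hypothetical data).
    return ["card:standard", "linkedin_about:v2"]


def search_user_notes(user_id: str, question: str) -> list[Chunk]:
    # Stub for the semantic search over notes and chat history.
    return [Chunk(text="Wants to move into management.", source="note:17")]


def build_context(user_id: str, question: str) -> str:
    """Tool-first, RAG-second: exact facts first, then semantic context."""
    profile = get_user_profile(user_id)
    artifacts = get_user_artifacts(user_id)
    chunks = search_user_notes(user_id, question)

    lines = ["Profile (authoritative):"]
    lines += [f"{key}: {value}" for key, value in profile.items()]
    lines.append("Past artifacts:")
    lines += [f"- {name}" for name in artifacts]
    lines.append("Retrieved context (may be stale):")
    lines += [f"- {c.text} [source: {c.source}]" for c in chunks]
    lines.append("Question:")
    lines.append(question)
    return "\n".join(lines)
```

Labeling the profile block as authoritative and the retrieved chunks as possibly stale gives the LLM a tie-breaker when the two disagree.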
3. RAG Architecture for Qwestly
High-level flow
User asks: "What have we done for my profile so far?"
1. Embed the query ("what have we done for my profile so far")
2. Search vector DB with embedding → top-k chunks
3. Also call tool: get_recent_activity(user_id)
4. Combine: [RAG results] + [tool results] → prompt
5. LLM synthesizes final answer
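Step 2 is the part a vector DB does for you. As a rough mental model, here is a brute-force version in plain Python: cosine similarity between the query embedding and every stored chunk, then keep the best k. The three-dimensional vectors and chunk texts are toy values invented for the example; real embeddings have hundreds or thousands of dimensions and a real index uses approximate nearest-neighbor search rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, embedding). Values are made up for illustration.
INDEXED = [
    ("Generated standard card", [1.0, 0.0, 0.0]),
    ("Rewrote LinkedIn About", [0.6, 0.8, 0.0]),
    ("Discussed salary negotiation", [0.0, 0.0, 1.0]),
]

results = top_k([1.0, 0.05, 0.0], INDEXED, k=2)
```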
Chunking strategy
| Document type | Chunk size | Overlap | Strategy |
|---|---|---|---|
| Chat history | 512 tokens | 64 | Per-message chunks, with conversation_id metadata |
| User notes | 512-1024 tokens | 128 | Semantic chunks (break at paragraph boundaries) |
| Generated cards | Full document | N/A | Store as single chunk with section metadata |
| LinkedIn data | Per-section | N/A | Chunk by section (experience, education, etc.) |
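For the "User notes" row, a paragraph-boundary chunker with overlap might look like the sketch below. It makes two simplifying assumptions that are not in the table: tokens are approximated by whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer), and a single paragraph longer than `max_tokens` is kept whole rather than split. The function name `chunk_paragraphs` is hypothetical.

```python
def chunk_paragraphs(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    The last `overlap` words of each finished chunk are carried into the
    next one so that context spanning a boundary is not lost.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    for words in paragraphs:
        # Close the current chunk if this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            current = current[-overlap:] if overlap > 0 else []
        current = current + words
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]
```

Each chunk would then be stored with metadata such as the user ID and document type so searches can filter on it.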
Embedding model recommendations (2026)
| Model | Dimensions | Quality | Cost | Verdict |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | Best | Moderate | Production choice if on OpenAI |
| OpenAI text-embedding-3-small | 1536 | Very good | Cheap | Good cost/quality balance |
| Cohere embed-english-v3.0 | 1024 | Very good | Moderate | Good alternative, 1024 dims |
| voyage-3-lite | 512 | Good | Cheap | Fast, good for high volume |
| SentenceTransformers local | 768-1024 | Good | Free | Offline capable, run on your own infra |
Recommendation: Start with text-embedding-3-small (1536 dims, great quality, low cost). Upgrade to large if retrieval quality isn't sufficient.
Vector DB options
You already use MongoDB with Atlas Vector Search for resume uploads. This simplifies the decision:
| DB | Self-hosted | Managed | Pros | Cons |
|---|---|---|---|---|
| MongoDB Atlas Vector Search ⬅️ You already have this | ❌ | ✅ Atlas | Same DB as your data. Already deployed for resume uploads. $vectorSearch with metadata filtering in a single aggregation pipeline. | No native hybrid search (BM25+vector), but you can combine $vectorSearch + $text in one pipeline as a workaround. |
| Qdrant | ✅ | ✅ | Fast, written in Rust. Great hybrid search. | Another system to operate alongside MongoDB. |
| Pinecone | ❌ | ✅ | Zero-ops. Fast. Serverless option. | Vendor lock-in. Separate DB from MongoDB. Cost at scale. |
| pgvector (Postgres extension) | ✅ | ✅ | Good for Postgres users, < 1M vectors. | You're not on Postgres. Would mean migrating data or running two DBs. |
| Chroma | ✅ | Cloud (beta) | Simple API, Python-native. | Not as performant at scale. Separate system. |
| Weaviate | ✅ | ✅ | Built-in hybrid search. | Heavier to self-host. Another system to operate. |
Recommendation: Stick with MongoDB Atlas Vector Search, since you already have it working for resumes. Extend the same index to cover chat history, documentation, and notes. One DB, one connection pool, one $vectorSearch pipeline. Add a dedicated vector DB only if you need native hybrid search or outgrow Atlas's scale.
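A sketch of what that single pipeline could look like, built as plain Python data so it can be passed to a driver's `aggregate()` call. The stage fields (`index`, `path`, `queryVector`, `numCandidates`, `limit`, `filter`) and the `vectorSearchScore` metadata are part of `$vectorSearch`; the index name, field names, and the 20x candidate multiplier are assumptions for illustration. Note that any field used in `filter` has to be declared as a filter field in the vector index definition.

```python
def build_vector_search_pipeline(query_vector: list[float], user_id: str, limit: int = 5) -> list[dict]:
    """Build an aggregation pipeline that searches one user's chunks."""
    return [
        {
            "$vectorSearch": {
                "index": "knowledge_vector_index",  # hypothetical index name
                "path": "embedding",                # hypothetical vector field
                "queryVector": query_vector,
                "numCandidates": limit * 20,        # assumed oversampling factor
                "limit": limit,
                # Metadata filter: restrict the search to this user's documents.
                "filter": {"user_id": {"$eq": user_id}},
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source_type": 1,  # hypothetical field, e.g. "note", "chat", "resume"
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

Filtering on `user_id` inside one shared collection is an alternative to keeping one collection per user; which is better depends on tenant count and isolation requirements.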
4. Hybrid Search
Pure semantic search (embeddings) can miss exact keyword matches. Hybrid search combines embedding similarity with keyword (BM25) search:
score = α * semantic_score + (1 - α) * keyword_score
Where α is a tunable weight (typically 0.5-0.7 for semantic-heavy use cases).
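The formula only makes sense when both scores are on the same scale, and raw cosine similarities and BM25 scores are not. The sketch below min-max normalizes each result set before applying the weighted sum. The normalization choice, the function names, and the sample scores are assumptions for illustration; rank-based fusion is a common alternative.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0..1 range so the two signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(semantic: dict[str, float], keyword: dict[str, float], alpha: float = 0.6) -> list[tuple[str, float]]:
    """score = alpha * semantic_score + (1 - alpha) * keyword_score."""
    sem, kw = min_max(semantic), min_max(keyword)
    fused = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(sem) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: "b" is second semantically but the only strong keyword hit.
ranked = hybrid_rank(
    semantic={"a": 0.9, "b": 0.8, "c": 0.2},
    keyword={"b": 12.0, "c": 3.0},
    alpha=0.6,
)
```

In this toy data the keyword match lifts "b" above "a", which is the behavior you want for proper-noun queries.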
When hybrid matters for Qwestly:
- Searching for specific companies, schools, or titles ("Adobe", "Stanford", "VP of Engineering"), where keyword signals are strong.
- Searching user notes with proper nouns, where embeddings alone may not capture exact matches.
Options:
- MongoDB Atlas: Run $vectorSearch alongside a $search (or $text) query and merge the two result sets in an aggregation (for example with $unionWith) for an approximate hybrid. Not native BM25+vector, but close enough for most use cases.
- Qdrant: Built-in hybrid search with BM25.
- Weaviate: Built-in hybrid search with BM25.
5. The MCP (Model Context Protocol) Angle
MCP is an open protocol (by Anthropic) that standardizes how LLMs interact with tools and data sources. Think of it as "USB-C for AI": a common interface for connecting LLMs to external systems.
How MCP works
Host (your app) → MCP Client → MCP Server → External system (DB, API, etc.)
Each MCP Server exposes:
- Tools: Functions the LLM can call (e.g., get_user_profile)
- Resources: Data the LLM can read (e.g., user://42/profile)
- Prompts: Reusable prompt templates (e.g., "analyze career progression")
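Under the hood, client and server exchange JSON-RPC 2.0 messages. The sketch below is a toy in-process handler for the two tool-related methods, `tools/list` and `tools/call`, to show the message shapes. It is not the MCP SDK and skips initialization, transports, resources, and prompts; the handler body and its return data are hypothetical.

```python
import json

TOOLS = {
    "get_user_profile": {
        "description": "Get a user's structured profile data.",
        "inputSchema": {
            "type": "object",
            "properties": {"user_id": {"type": "string"}},
            "required": ["user_id"],
        },
        # Hypothetical handler; a real server would query the database.
        "handler": lambda args: {"user_id": args["user_id"], "headline": "Staff Engineer"},
    }
}


def handle(request: dict) -> dict:
    """Answer a JSON-RPC 2.0 request the way an MCP server does for tools."""
    method = request["method"]
    if method == "tools/list":
        result = {
            "tools": [
                {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
                for name, t in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        tool = TOOLS[request["params"]["name"]]
        payload = tool["handler"](request["params"].get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(payload)}], "isError": False}
    else:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": -32601, "message": "Method not found"},
        }
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}
```

The `tools/list` response is how an orchestrator discovers tools at runtime instead of having them hard-coded.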
Why MCP matters for Qwestly
- Standardized tool interface. Instead of writing custom tool-wiring code for each framework, tools become MCP servers. LangGraph, Pydantic AI, and the OpenAI Agents SDK all support MCP clients now.
- Your existing services become pluggable. Have a LinkedIn ingestion service? Wrap it as an MCP server. A card generation API? MCP server. The orchestrator discovers them dynamically.
- Decoupling. The orchestrator framework and the tool implementations are independent. You can swap out LangGraph for Pydantic AI without rewriting your tools.
- Ecosystem reuse. There are already MCP servers for Postgres, filesystem, web search, Puppeteer, GitHub, and many more. You get integrations for free.
MCP adoption reality check (2026)
- MCP is widely supported across frameworks (LangGraph, Pydantic AI, OpenAI Agents SDK, Claude desktop app, VS Code extensions).
- It's not mandatory. You can wire tools directly without MCP. But if you want a future-proof, decoupled architecture, MCP is the direction the ecosystem is moving.
Recommendation: Design your tools as MCP servers from day one. Even if the orchestrator uses direct tool wiring initially, the MCP interface means you can evolve independently.
6. Tool Design Patterns
Pattern 1: Simple Query Tool
@tool
def get_user_profile(user_id: str) -> UserProfile:
    """Get a user's structured profile data. Use this for questions about
    name, email, headline, experience, education, and other profile fields.
    Args:
        user_id: The Qwestly user ID (looks like 'usr_xxx')
    """
    return db.query("SELECT * FROM users WHERE id = ?", user_id)
Pattern 2: Agentic Tool (Tool that's also an Agent)
@tool
def generate_qwestly_card(user_id: str, style: str = "standard") -> CardResult:
    """Generate a Qwestly Card for a user. This is a multi-step process
    that analyzes profile data and creates a structured career document.
    Args:
        user_id: The Qwestly user ID
        style: Card style ('standard', 'detailed', 'executive')
    """
    # Internally this calls an LLM loop to analyze and generate.
    # The orchestrator sees it as a single tool call.
    profile = get_user_profile(user_id)
    card_agent = Agent(system_prompt=..., tools=[...])
    return card_agent.run(profile, style)
Pattern 3: RAG Query Tool
@tool
def query_user_knowledge_base(user_id: str, question: str) -> list[Chunk]:
    """Semantically search everything we know about a user: notes, past
    conversations, generated artifacts. Use for questions like 'what themes
    emerge in their career?' or 'what have we suggested before?'
    Args:
        user_id: The Qwestly user ID
        question: The natural language question to search for
    """
    embedding = embed(question)
    results = vector_db.search(
        collection=f"user_{user_id}",
        embedding=embedding,
        top_k=5
    )
    return results
Pattern 4: Decision Tool (routes to the right sub-system)
@tool
def analyze_intent(user_message: str) -> IntentResult:
    """Classify the user's intent and return the appropriate handler.
    Internal use by the orchestrator.
    Returns:
        Intent result with handler name and extracted parameters
    """
    # This could be a simple classifier or a small LLM call
    ...
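As one possible reading of the "simple classifier" option, here is a keyword-rule sketch. Everything in it is an assumption made for illustration: the shape of `IntentResult`, the rule list, and the fallback to the RAG tool. Substring matching is crude (for example, "important" contains "import"), so treat it as a baseline to replace with a small LLM call or a trained classifier once you have real traffic.

```python
from dataclasses import dataclass, field


@dataclass
class IntentResult:
    handler: str
    params: dict = field(default_factory=dict)


# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("ingest_linkedin_profile", ("import", "linkedin.com")),
    ("generate_qwestly_card", ("card", "resume", "profile doc")),
    ("suggest_linkedin_about", ("about section",)),
]


def classify_intent(user_message: str) -> IntentResult:
    """Map a user message to a handler name using substring rules."""
    text = user_message.lower()
    for handler, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return IntentResult(handler=handler)
    # Nothing matched: fall back to semantic search over what we know.
    return IntentResult(
        handler="query_user_knowledge_base",
        params={"question": user_message},
    )
```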
7. Prompt Engineering for Tool Choice
The orchestrator's system prompt is where you teach it which tool to use when. This is the single most important prompt in your system.
Example: Qwestly orchestrator system prompt (abbreviated)
You are Qwestly's career agent assistant. You help users manage their career
profile, generate Qwestly Cards, and answer questions about their career data.
Available tools:
- get_user_profile(user_id): Fetch structured profile data. FIRST tool to call
  for any user-specific question.
- ingest_linkedin_profile(linkedin_url): Import LinkedIn data for a new user.
  Use only when the user explicitly asks to import or update LinkedIn data.
- query_user_knowledge_base(user_id, question): Semantic search across notes,
  history, and profile context. Use for "what do you know about X" or "what themes..."
- suggest_linkedin_about(user_id): Generate an improved LinkedIn About section.
- generate_qwestly_card(user_id, style): Create a Qwestly Card (structured
  career document). The user may call this "resume", "card", or "profile doc".
- list_my_tools(): List what Qwestly can do. Use when the user seems unsure
  what's available.
Rules:
1. Always call get_user_profile first to establish context about the user.
2. If user data is missing or incomplete, ask before ingesting LinkedIn data.
3. For content generation (about section, card), first gather the user's
   preferences: style, tone, what to emphasize.
4. Never save or publish without explicit user confirmation.
5. If you're unsure which tool to use, ask the user rather than guessing.
The "tool description" is the real prompt
When you define a tool, the description field is what the LLM reads to decide whether to use it. Write descriptions as instructions:
Bad: "Fetch user profile"
Good: "Get a user's full profile data including name, headline, experience, education, skills, and career preferences. Call this first for any question about a specific user. Requires a valid user_id."
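Most frameworks derive the description from the function's docstring, so the docstring is effectively prompt text. The sketch below shows the idea with the standard library only; `tool_spec` is a hypothetical helper, the output dictionary is a generic shape rather than any specific framework's schema, and every parameter is typed as a string to keep the example short.

```python
import inspect


def tool_spec(func) -> dict:
    """Build a tool definition whose description is the function's docstring."""
    params = inspect.signature(func).parameters
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            # Simplification: treat every parameter as a string.
            "properties": {name: {"type": "string"} for name in params},
            "required": [
                name for name, p in params.items()
                if p.default is inspect.Parameter.empty
            ],
        },
    }


def get_user_profile(user_id: str) -> dict:
    """Get a user's full profile data including name, headline, experience,
    education, skills, and career preferences. Call this first for any
    question about a specific user. Requires a valid user_id."""
    return {"user_id": user_id}


spec = tool_spec(get_user_profile)
```

Printing `spec["description"]` during development is a quick way to review exactly what the model will read.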
8. Observability & Evaluation
Once your agent is calling tools and running RAG, you need to know:
- Which tools were called, in what order, with what arguments?
- Was the RAG retrieval relevant or garbage?
- Did the agent make the right tool choice?
- How long did each step take?
- How many tokens did it cost?
Tools for that
| Tool | What it does | Framework integration |
|---|---|---|
| LangSmith | Full tracing, evaluation, datasets | LangGraph native, others via API |
| Logfire (by Pydantic) | Tracing + structured logging | Pydantic AI native |
| Arize / Phoenix | LLM observability, embeddings | Framework-agnostic |
| OpenAI Dashboard | Built-in tracing | OpenAI-only |
| Custom (OpenTelemetry) | Standard traces/metrics | Any framework |
Don't skip this. Agent systems are non-deterministic; without tracing, debugging is guesswork.
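Before adopting one of the platforms above, a decorator is enough to start answering the first and fourth questions (which tools ran, with what arguments, and how long they took). This is a minimal sketch under stated assumptions: `TRACE` is an in-memory list standing in for a real backend, and `traced` is a hypothetical name, not part of any of the listed products.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory stand-in for a tracing backend


def traced(func):
    """Record name, arguments, duration, and any error for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so every call leaves a record.
            TRACE.append({
                "tool": func.__name__,
                "args": args,
                "kwargs": kwargs,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })
    return wrapper


@traced
def get_user_profile(user_id: str) -> dict:
    # Hypothetical tool body for the example.
    return {"user_id": user_id}
```

Because the records are plain dictionaries, they can later be forwarded to OpenTelemetry or another backend without changing the tools themselves.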
9. Recommended Stack for Qwestly's Data Layer
┌─────────────────────────────────────────────────┐
│               Orchestrator Agent                │
│     (Pydantic AI / LangGraph / OpenAI SDK)      │
├────────┬──────────┬──────────┬──────────────────┤
│ Tool:  │ Tool:    │ Tool:    │ Tool:            │
│ User   │ LinkedIn │ Card     │ Knowledge        │
│ Profile│ Ingest   │ Generate │ Query (RAG)      │
├────────┴──────────┴──────────┴──────────────────┤
│   Tool implementations (your services + MCP)    │
├─────────────────────────────────────────────────┤
│  MongoDB (everything in one place)              │
│  ├── users collection (profile data)            │
│  ├── cards collection (generated artifacts)     │
│  ├── chat_logs collection (conversation history)│
│  └── Atlas Vector Search index on any field     │
│      (resumes, notes, docs, history)            │
└─────────────────────────────────────────────────┘
Data flow summary:
- Orchestrator receives user message
- Tool call → get_user_profile(user_id) → structured data from MongoDB
- Tool call → query_user_knowledge_base(user_id, question) → semantic results from Vector DB
- Both results injected into LLM context
- LLM synthesizes answer (or delegates to a generation tool)
- All steps traced in LangSmith/Logfire