RAG & Tools Patterns for Qwestly

How to wire up retrieval-augmented generation and tool-calling so the orchestrator can answer questions about users and generate personalized content. Includes architectural patterns, embedding strategy, and the case for MCP.

1. The Data Landscape

Qwestly needs to handle several kinds of data:

Data type	Source	Storage	Query pattern
Structured user profile	LinkedIn API, direct input	MongoDB (`users` collection)	`get_user_profile(user_id)` (direct DB/API query)
LinkedIn raw data	LinkedIn API (scraped/fetched)	JSON blob in DB or object store	Retrieved with profile
Generated artifacts	Qwestly services	MongoDB (`cards` collection) + file store	`get_user_cards(user_id)`, `get_user_suggestions(user_id)`
Unstructured notes	User uploads, freeform input	Vector DB (RAG)	Semantic search: "what does the user care about?"
Qwestly documentation	Internal docs	Vector DB (RAG)	"How does the card generator work?"
Historical interactions	Chat logs	DB (for audit) + Vector DB (for "what have we discussed")	Semantic search + metadata filtering

2. When to Use a Tool vs. RAG

This is a critical design decision. Mixing them up leads to unreliable agents.

Use a TOOL when:

The data is structured. User profile fields (name, email, headline, experience) are stored as fields in the database. A database query or API call is deterministic and exact.
You need exact answers. "What is the user's current job title?" RAG might miss this or retrieve a stale chunk; a tool returns exactly what the database currently holds.
The operation writes data. "Save this card draft." Tools execute side effects; RAG is read-only.
The query is operational. "Ingest this LinkedIn profile." That's a workflow, not a search.

Use RAG when:

The question is semantic. "What themes appear in the user's career story?" — no SQL query answers this. You need similarity search over notes, past conversations, or generated content.
The data is unstructured. PDFs, freeform notes, chat history, open-ended text.
You're augmenting generation. "Write a personalized summary based on what we know." RAG retrieves relevant context the LLM wouldn't otherwise have.
You need to cite sources. RAG naturally supports returning source chunks for attribution.

The Hybrid Pattern for Qwestly

For the "what do we know about the user?" query, use tool-first, RAG-second:

1. Tool: get_user_profile(user_id) → structured data from DB
2. Tool: get_user_artifacts(user_id) → list of past Qwestly outputs
3. RAG: semantically search user's notes + chat history for context
4. Combine all three in LLM context → generate answer

The agent doesn't need to choose between tool and RAG — it calls multiple tools and the orchestrator fuses the results.
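A minimal sketch of the tool-first, RAG-second flow above. The three fetch functions are hypothetical stubs standing in for the real tools and the vector search. The sketch only illustrates the fusion step, with each source kept in its own labelled section of the prompt.

```python
from dataclasses import dataclass


@dataclass
class UserContext:
    profile: dict
    artifacts: list
    snippets: list

    def to_prompt(self, question: str) -> str:
        """Render each source under its own label so the LLM can tell them apart."""
        lines = ["## Profile (from tool)"]
        lines += [f"{key}: {value}" for key, value in self.profile.items()]
        lines.append("## Past artifacts (from tool)")
        lines += [f"- {item}" for item in self.artifacts]
        lines.append("## Retrieved notes (from RAG)")
        lines += [f"- {item}" for item in self.snippets]
        lines.append("## Question")
        lines.append(question)
        return "\n".join(lines)


def gather_user_context(user_id, question, fetch_profile, fetch_artifacts, search_notes):
    # Structured lookups come first; semantic search only adds supporting context.
    return UserContext(
        profile=fetch_profile(user_id),
        artifacts=fetch_artifacts(user_id),
        snippets=search_notes(user_id, question),
    )


# Hypothetical stubs with made-up data, for illustration only.
def fetch_profile(user_id):
    return {"name": "Ada", "headline": "VP of Engineering"}


def fetch_artifacts(user_id):
    return ["Qwestly Card (standard)"]


def search_notes(user_id, question):
    return ["Wants to move toward platform leadership roles."]


question = "What do we know about the user?"
context = gather_user_context("usr_123", question, fetch_profile, fetch_artifacts, search_notes)
prompt = context.to_prompt(question)
```

Labelling the sections also makes it easier to tell the model which source wins on conflict, for example treating the profile as authoritative and retrieved notes as supporting detail.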

3. RAG Architecture for Qwestly

High-level flow

User asks: "What have we done for my profile so far?"

  1. Embed the query ("what have we done for my profile so far")
  2. Search vector DB with embedding → top-k chunks
  3. Also call tool: get_recent_activity(user_id)
  4. Combine: [RAG results] + [tool results] → prompt
  5. LLM synthesizes final answer

Chunking strategy

Document type	Chunk size	Overlap	Strategy
Chat history	512 tokens	64	Per-message chunks, with conversation_id metadata
User notes	512-1024 tokens	128	Semantic chunks (break at paragraph boundaries)
Generated cards	Full document	N/A	Store as single chunk with section metadata
LinkedIn data	Per-section	N/A	Chunk by section (experience, education, etc.)
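As a rough illustration of the "break at paragraph boundaries" strategy for user notes, the sketch below packs whole paragraphs into chunks and carries a fixed overlap into the next chunk. It assumes whitespace-separated words as a stand-in for tokens; a real pipeline would count tokens with the embedding model's tokenizer. The function name is hypothetical.

```python
def chunk_paragraphs(text: str, max_tokens: int = 512, overlap: int = 64) -> list:
    """Pack whole paragraphs into chunks of roughly max_tokens words.

    The last `overlap` words of a finished chunk are repeated at the start of
    the next one. A single paragraph longer than max_tokens is kept whole, so
    chunk sizes are a target rather than a hard limit.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    for words in paragraphs:
        # Close the current chunk if adding this paragraph would overflow it.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(current)
            current = current[-overlap:] if overlap else []
        current = current + words
    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


notes = "a1 a2 a3 a4 a5\n\nb1 b2 b3 b4 b5\n\nc1 c2 c3 c4 c5"
chunks = chunk_paragraphs(notes, max_tokens=10, overlap=2)
```

With the toy limits above, the first two paragraphs share a chunk and the third starts a new chunk prefixed with the last two words of the second paragraph.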

Embedding model recommendations (2026)

Model	Dimensions	Quality	Cost	Verdict
OpenAI text-embedding-3-large	3072	Best	Moderate	Production choice if on OpenAI
OpenAI text-embedding-3-small	1536	Very good	Cheap	Good cost/quality balance
Cohere embed-english-v3.0	1024	Very good	Moderate	Good alternative, 1024 dims
voyage-3-lite	512	Good	Cheap	Fast, good for high volume
SentenceTransformers local	768-1024	Good	Free	Offline capable, run on your own infra

Recommendation: Start with text-embedding-3-small (1536 dims, great quality, low cost). Upgrade to large if retrieval quality isn't sufficient.
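One practical consequence of the dimension column is storage. The back-of-the-envelope helper below estimates the raw size of the stored vectors, assuming 4-byte float32 values and ignoring index overhead and metadata. The one-million-chunk figure is purely illustrative, not a Qwestly number.

```python
def raw_vector_size_mib(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size of the embedding vectors in MiB (float32 by default)."""
    return num_vectors * dims * bytes_per_value / (1024 ** 2)


small_mib = raw_vector_size_mib(1_000_000, 1536)  # text-embedding-3-small
large_mib = raw_vector_size_mib(1_000_000, 3072)  # text-embedding-3-large
```

Doubling the dimensions doubles the raw vector storage, which is one more reason to start with the smaller model and upgrade only if retrieval quality demands it.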

Vector DB options

You already use MongoDB with Atlas Vector Search for resume uploads. This simplifies the decision:

DB	Self-hosted	Managed	Pros	Cons
MongoDB Atlas Vector Search ⬅️ You already have this	✅	✅ Atlas	Same DB as your data. Already deployed for resume uploads. `$vectorSearch` with metadata filtering in a single aggregation pipeline.	Hybrid search (BM25 + vector) takes extra work: `$vectorSearch` and `$search` each have to be the first stage of their own pipeline, so you run both (for example with `$unionWith`) and fuse the rankings.
Qdrant	✅	✅	Fast, written in Rust. Great hybrid search.	Another system to operate alongside MongoDB.
Pinecone	❌	✅	Zero-ops. Fast. Serverless option.	Vendor lock-in. Separate DB from MongoDB. Cost at scale.
pgvector (Postgres extension)	✅	✅	Good for Postgres users, < 1M vectors.	You're not on Postgres. Would mean migrating data or running two DBs.
Chroma	✅	Cloud (beta)	Simple API, Python-native.	Not as performant at scale. Separate system.
Weaviate	✅	✅	Built-in hybrid search.	Heavier to self-host. Another system to operate.

Recommendation: Stick with MongoDB Atlas Vector Search, since you already have it working for resumes. Extend the same setup to cover chat history, documentation, and notes. One DB, one connection pool, one `$vectorSearch` query pattern. Add a dedicated vector DB only if you need native hybrid search or outgrow Atlas's scale.
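For reference, a `$vectorSearch` stage with metadata filtering might be built as follows. This is a sketch under assumptions: the index name `knowledge_index` and the fields `embedding`, `user_id`, `source_type`, and `text` are hypothetical, and any field used in `filter` has to be declared as a filter field in the vector index definition.

```python
def build_vector_search_pipeline(query_vector, user_id, source_types, limit=5):
    """Return an aggregation pipeline: filtered vector search, then a projection."""
    return [
        {
            "$vectorSearch": {
                "index": "knowledge_index",   # hypothetical index name
                "path": "embedding",          # hypothetical field holding the vector
                "queryVector": query_vector,
                "numCandidates": limit * 20,  # oversample candidates before the final cut
                "limit": limit,
                "filter": {
                    "$and": [
                        {"user_id": {"$eq": user_id}},
                        {"source_type": {"$in": source_types}},
                    ]
                },
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source_type": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], "usr_123", ["note", "chat"])
```

With pymongo, the pipeline would be passed to `collection.aggregate(pipeline)`. Filtering on `user_id` inside the search stage keeps one user's chunks from leaking into another user's results.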

4. Hybrid Search

Pure semantic search (embeddings) can miss exact keyword matches. Hybrid search combines embedding similarity with keyword (BM25) search:

score = α * semantic_score + (1-α) * keyword_score

Where α is a tunable weight (typically 0.5-0.7 for semantic-heavy use cases).
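The formula only makes sense when both scores are on the same scale, and raw cosine similarities and BM25 scores are not. The sketch below min-max normalizes each result set before applying the weighted sum. The function names and sample scores are made up for illustration.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1]; if all scores are equal, treat them all as 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(semantic: dict, keyword: dict, alpha: float = 0.6) -> list:
    """score = alpha * semantic + (1 - alpha) * keyword, on normalized scores."""
    sem, kw = min_max(semantic), min_max(keyword)
    fused = {
        doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in sem.keys() | kw.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


semantic = {"note_a": 0.82, "note_b": 0.78, "note_c": 0.60}  # cosine similarities
keyword = {"note_b": 7.1, "note_d": 3.0}                     # BM25-style scores
ranked = hybrid_scores(semantic, keyword, alpha=0.6)
```

Here `note_b` ranks first because it scores well on both signals, even though `note_a` has the highest semantic score.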

When hybrid matters for Qwestly:

Searching for specific companies, schools, or titles ("Adobe", "Stanford", "VP of Engineering") — keyword signals are strong.
Searching user notes with proper nouns — embeddings alone may not capture exact matches.

Options:

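MongoDB Atlas: Run `$vectorSearch` and a keyword query (`$search` or `$text`) as separate pipelines, since each has to be the first stage of its pipeline, then merge the two result lists (for example with `$unionWith` and rank fusion). This is an approximate hybrid rather than a built-in one, but close enough for most use cases.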
Qdrant: Built-in hybrid search with BM25.
Weaviate: Built-in hybrid search with BM25.
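One way to merge separate vector and keyword result lists is reciprocal rank fusion (RRF). It is a rank-based alternative to the weighted formula above, not something this document prescribes. It needs only the order of the results, so score scales do not matter. The constant `k = 60` is a commonly used default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["note_a", "note_b", "note_c"]  # best match first
keyword_hits = ["note_b", "note_d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists accumulate two contributions, so `note_b` ends up ahead of `note_a` despite ranking second in the vector results.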

5. The MCP (Model Context Protocol) Angle

MCP is an open protocol (by Anthropic) that standardizes how LLMs interact with tools and data sources. Think of it as "USB-C for AI" — a common interface for connecting LLMs to external systems.

How MCP works

Host (your app) ↔ MCP Client ↔ MCP Server ↔ External system (DB, API, etc.)

Each MCP Server exposes:

Tools: Functions the LLM can call (e.g., get_user_profile)
Resources: Data the LLM can read (e.g., user://42/profile)
Prompts: Reusable prompt templates (e.g., "analyze career progression")
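On the wire, MCP messages are JSON-RPC 2.0. The sketch below hand-builds a `tools/list` request, a `tools/call` request, and the shape of a successful tool result, purely to show what crosses the client/server boundary. In practice an MCP SDK produces these for you. The user ID and profile payload are invented.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC request IDs


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name with JSON arguments.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "get_user_profile", "arguments": {"user_id": "usr_123"}},
)

# Shape of a successful response to call_tool: content blocks plus an error flag.
tool_result = {
    "jsonrpc": "2.0",
    "id": call_tool["id"],
    "result": {
        "content": [{"type": "text", "text": json.dumps({"headline": "VP of Engineering"})}],
        "isError": False,
    },
}
```

Because the orchestrator only sees tool names, JSON arguments, and content blocks, the service behind `get_user_profile` can change without touching the agent framework.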

Why MCP matters for Qwestly

Standardized tool interface. Instead of writing custom tool-wiring code for each framework, tools become MCP servers. LangGraph, Pydantic AI, and the OpenAI Agents SDK all support MCP clients now.
Your existing services become pluggable. Have a LinkedIn ingestion service? Wrap it as an MCP server. A card generation API? MCP server. The orchestrator discovers them dynamically.
Decoupling. The orchestrator framework and the tool implementations are independent. You can swap out LangGraph for Pydantic AI without rewriting your tools.
Ecosystem reuse. There are already MCP servers for Postgres, filesystem, web search, Puppeteer, GitHub, and many more. You can reuse those integrations instead of writing your own.

MCP adoption reality check (2026)

MCP is widely supported across frameworks (LangGraph, Pydantic AI, OpenAI Agents SDK, Claude desktop app, VS Code extensions).
It's not mandatory. You can wire tools directly without MCP. But if you want a future-proof, decoupled architecture, MCP is the direction the ecosystem is moving.

Recommendation: Design your tools as MCP servers from day one. Even if the orchestrator uses direct tool wiring initially, the MCP interface means you can evolve independently.

6. Tool Design Patterns

Pattern 1: Simple Query Tool

@tool
def get_user_profile(user_id: str) -> UserProfile:
    """Get a user's structured profile data. Use this for questions about
    name, email, headline, experience, education, and other profile fields.
    
    Args:
        user_id: The Qwestly user ID (looks like 'usr_xxx')
    """
    # Assumes `db` is a pymongo Database handle and user IDs are stored as `_id`.
    return db.users.find_one({"_id": user_id})

Pattern 2: Agentic Tool (Tool that's also an Agent)

@tool
def generate_qwestly_card(user_id: str, style: str = "standard") -> CardResult:
    """Generate a Qwestly Card for a user. This is a multi-step process
    that analyzes profile data and creates a structured career document.
    
    Args:
        user_id: The Qwestly user ID
        style: Card style ('standard', 'detailed', 'executive')
    """
    # Internally this calls an LLM loop to analyze and generate
    # The orchestrator sees it as a single tool call
    profile = get_user_profile(user_id)
    card_agent = Agent(system_prompt=..., tools=[...])
    return card_agent.run(profile, style)

Pattern 3: RAG Query Tool

@tool
def query_user_knowledge_base(user_id: str, question: str) -> list[Chunk]:
    """Semantically search everything we know about a user — notes, past
    conversations, generated artifacts. Use for questions like 'what themes
    emerge in their career?' or 'what have we suggested before?'
    
    Args:
        user_id: The Qwestly user ID
        question: The natural language question to search for
    """
    embedding = embed(question)
    results = vector_db.search(
        collection=f"user_{user_id}",
        embedding=embedding,
        top_k=5
    )
    return results

Pattern 4: Decision Tool (routes to the right sub-system)

@tool
def analyze_intent(user_message: str) -> IntentResult:
    """Classify the user's intent and return the appropriate handler.
    Internal use by the orchestrator.
    
    Returns:
        Intent result with handler name and extracted parameters
    """
    # This could be a simple classifier or a small LLM call
    ...
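The body of `analyze_intent` is left open above. As one hypothetical starting point, the "simple classifier" option could be an ordered list of keyword rules. The keywords, the `clarify_with_user` fallback, and this `IntentResult` shape are all assumptions made for illustration; handler names reuse tool names from this document.

```python
from dataclasses import dataclass, field


@dataclass
class IntentResult:
    handler: str
    params: dict = field(default_factory=dict)


# Ordered rules: the first rule with a matching keyword wins.
_RULES = [
    (("linkedin.com/in/", "import my linkedin"), "ingest_linkedin_profile"),
    (("about section",), "suggest_linkedin_about"),
    (("card", "resume", "profile doc"), "generate_qwestly_card"),
    (("theme", "what do you know", "what have we"), "query_user_knowledge_base"),
]


def classify_intent(user_message: str) -> IntentResult:
    """Map a message to a handler using plain substring rules."""
    text = user_message.lower()
    for keywords, handler in _RULES:
        if any(keyword in text for keyword in keywords):
            return IntentResult(handler=handler)
    # No rule matched: ask the user rather than guessing.
    return IntentResult(handler="clarify_with_user")


resume_intent = classify_intent("Can you make me a resume?")
themes_intent = classify_intent("What themes show up in my notes?")
unknown_intent = classify_intent("hello")
```

Substring rules are brittle (for example, "discard" contains "card"), so this is best treated as a cheap first pass in front of a small LLM call.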

7. Prompt Engineering for Tool Choice

The orchestrator's system prompt is where you teach it which tool to use when. This is the single most important prompt in your system.

Example: Qwestly orchestrator system prompt (abbreviated)

You are Qwestly's career agent assistant. You help users manage their career
profile, generate Qwestly Cards, and answer questions about their career data.

Available tools:
- get_user_profile(user_id): Fetch structured profile data. FIRST tool to call
  for any user-specific question.
- ingest_linkedin_profile(linkedin_url): Import LinkedIn data for a new user.
  Use only when the user explicitly asks to import or update LinkedIn data.
- query_user_knowledge_base(user_id, question): Semantic search across notes,
  history, and profile context. Use for "what do you know about X" or "what themes..."
- suggest_linkedin_about(user_id): Generate an improved LinkedIn About section.
- generate_qwestly_card(user_id, style): Create a Qwestly Card (structured
  career document). The user may call this "resume", "card", or "profile doc".
- list_my_tools(): List what Qwestly can do. Use when the user seems unsure
  what's available.

Rules:
1. Always call get_user_profile first to establish context about the user.
2. If user data is missing or incomplete, ask before ingesting LinkedIn data.
3. For content generation (about section, card), first gather the user's
   preferences — style, tone, what to emphasize.
4. Never save or publish without explicit user confirmation.
5. If you're unsure which tool to use, ask the user rather than guessing.

The "tool description" is the real prompt

When you define a tool, the description field is what the LLM reads to decide whether to use it. Write descriptions as instructions:

Bad: "Fetch user profile"

Good: "Get a user's full profile data including name, headline, experience, education, skills, and career preferences. Call this first for any question about a specific user. Requires a valid user_id."
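To make the point concrete, the sketch below derives a tool definition from a Python function, using the docstring as the description the model will read. The output follows the JSON Schema style that function-calling APIs generally use; it is not tied to a specific vendor's format, and typing every parameter as a string is a deliberate simplification.

```python
import inspect


def tool_schema(fn) -> dict:
    """Build a tool definition from a function's name, docstring, and signature."""
    params = inspect.signature(fn).parameters
    return {
        "name": fn.__name__,
        # The docstring is the text the LLM uses to decide whether to call the tool.
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": {name: {"type": "string"} for name in params},
            "required": [
                name for name, p in params.items()
                if p.default is inspect.Parameter.empty
            ],
        },
    }


def get_user_profile(user_id):
    """Get a user's full profile data including name, headline, experience,
    education, skills, and career preferences. Call this first for any
    question about a specific user. Requires a valid user_id."""
    return {"id": user_id}  # stub body, for illustration only


schema = tool_schema(get_user_profile)
```

Since the description is generated from the docstring, improving the docstring is the same act as improving the prompt.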

8. Observability & Evaluation

Once your agent is calling tools and running RAG, you need to know:

Which tools were called, in what order, with what arguments?
Was the RAG retrieval relevant or garbage?
Did the agent make the right tool choice?
How long did each step take?
How many tokens did it cost?

Tools for that

Tool	What it does	Framework integration
LangSmith	Full tracing, evaluation, datasets	LangGraph native, others via API
Logfire (by Pydantic)	Tracing + structured logging	Pydantic AI native
Arize / Phoenix	LLM observability, embeddings	Framework-agnostic
OpenAI Dashboard	Built-in tracing	OpenAI-only
Custom (OpenTelemetry)	Standard traces/metrics	Any framework

Don't skip this. Agent systems are non-deterministic — without tracing, debugging is guesswork.
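Before adopting one of the platforms above, the first few questions (which tools, what arguments, how long) can be answered with a small decorator. This is a minimal sketch: the in-memory `TRACE` list is a stand-in for a real tracing backend, and the decorated tool returns stub data.

```python
import functools
import time

TRACE = []  # in-memory stand-in for LangSmith, Logfire, or an OpenTelemetry exporter


def traced(fn):
    """Record tool name, arguments, duration, and any error for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Runs on success and failure, so failed calls are traced too.
            TRACE.append({
                "tool": fn.__name__,
                "args": args,
                "kwargs": kwargs,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "error": error,
            })
    return wrapper


@traced
def get_user_profile(user_id):
    return {"id": user_id, "headline": "VP of Engineering"}  # stub data


profile = get_user_profile("usr_123")
```

The call order falls out of the list order, and the same wrapper is a natural place to attach token counts once the LLM client reports them.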

9. Recommended Stack for Qwestly's Data Layer

┌─────────────────────────────────────────────────┐
│              Orchestrator Agent                 │
│  (Pydantic AI / LangGraph / OpenAI SDK)         │
├────────┬──────────┬──────────┬──────────────────┤
│  Tool: │ Tool:    │ Tool:    │  Tool:           │
│  User  │ LinkedIn │ Card     │  Knowledge       │
│ Profile│ Ingest   │ Generate │  Query (RAG)     │
├────────┴──────────┴──────────┴──────────────────┤
│  Tool implementations (your services + MCP)     │
├─────────────────────────────────────────────────┤
│  MongoDB (everything in one place)              │
│  ├── users collection (profile data)            │
│  ├── cards collection (generated artifacts)     │
│  ├── chat_logs collection (conversation history)│
│  └── Atlas Vector Search index on any field     │
│      (resumes, notes, docs, history)            │
└─────────────────────────────────────────────────┘

Data flow summary:

Orchestrator receives user message
Tool call → get_user_profile(user_id) → structured data from MongoDB
Tool call → query_user_knowledge_base(user_id, question) → semantic results from Vector DB
Both results injected into LLM context
LLM synthesizes answer (or delegates to a generation tool)
All steps traced in LangSmith/Logfire
