
Deployment Options: Agentic Systems on Vercel

Can you run an agent orchestrator on Vercel? Yes, with caveats. Here's what you need to know about timeouts, memory, streaming, cold starts, and the patterns that work in practice.


1. The Core Tension

Agentic systems have a fundamental conflict with serverless architectures:

Serverless wants                     | Agentic systems want
-------------------------------------|-------------------------------------------
Short-lived functions (< 10-30s)     | Long-running loops (30-120s)
Stateless, fire-and-forget           | Stateful multi-step conversations
Quick cold starts (no heavy imports) | Heavy dependencies (LLM SDKs, vector libs)

This isn't a dealbreaker; it's a design constraint. You just need to understand where the limits are.


2. Vercel's Limits at a Glance

Resource                    | Hobby          | Pro                                      | Enterprise
----------------------------|----------------|------------------------------------------|----------------
Serverless function timeout | 10s            | 60s (default), up to 300s (configurable) | Up to 900s
Edge function timeout       | 30s            | 30s                                      | 30s
Serverless function memory  | 128-1024 MB    | 128-1024 MB                              | 128-1024 MB
Edge function memory        | 128 MB         | 128 MB                                   | 128 MB
Response streaming          | ✅ Yes         | ✅ Yes                                   | ✅ Yes
Max function size           | 50 MB (zipped) | 50 MB (zipped)                           | 250 MB (zipped)
Concurrent invocations      | 10             | 100 (soft)                               | Custom

The hard constraint for agents is function timeout.

An agent loop with 2-3 tool calls and 2-3 LLM calls can easily consume 30-60 seconds. On Hobby you're limited to 10s: that's one quick LLM call and you're done. On Pro you can go up to 300s, which covers most agent workflows with room to spare.


3. Estimating Your Runtime

For Qwestly's typical request:

Step                    | Time (typical) | Time (worst case)
------------------------|----------------|-------------------
User message arrives    | ~5ms           | ~5ms
Orchestrator LLM call 1 | 3s             | 10s (slow model, long context)
Tool: get_user_profile  | 50ms           | 200ms (DB cold query)
Tool: query_history     | 200ms          | 1s (RAG search)
Orchestrator LLM call 2 | 3s             | 10s (synthesizing results)
Tool: generate_card     | 8s             | 20s (multi-step agent)
Response streaming      | 2s             | 5s (streaming back)
------------------------|----------------|-------------------
Total                   | ~16s           | ~46s

On Vercel Pro with 60s default timeout: most requests fit. A few worst-case complex card generations could miss the limit.

On Vercel Pro with 300s extended timeout: all requests fit comfortably. This is the sweet spot.

On Vercel Hobby (10s): you can do exactly one quick LLM call and one tool call. No multi-step agents.
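
As a quick sanity check on the arithmetic above, here is a minimal sketch that sums the per-step timings from the table and compares the worst case against each timeout ceiling. The dictionary keys and function names are illustrative only and do not come from the Qwestly codebase.

```python
# Step timings in seconds, copied from the table above: (typical, worst case).
STEPS = {
    "user_message_arrives": (0.005, 0.005),
    "orchestrator_llm_call_1": (3.0, 10.0),
    "tool_get_user_profile": (0.05, 0.2),
    "tool_query_history": (0.2, 1.0),
    "orchestrator_llm_call_2": (3.0, 10.0),
    "tool_generate_card": (8.0, 20.0),
    "response_streaming": (2.0, 5.0),
}

# Timeout ceilings in seconds, from the limits table in section 2.
TIMEOUTS = {"hobby": 10, "pro_default": 60, "pro_extended": 300}


def totals(steps):
    """Return (typical_total, worst_case_total) in seconds."""
    typical = sum(t for t, _ in steps.values())
    worst = sum(w for _, w in steps.values())
    return typical, worst


def headroom(steps, timeouts):
    """Seconds left under each timeout in the worst case (negative = over)."""
    _, worst = totals(steps)
    return {plan: limit - worst for plan, limit in timeouts.items()}


if __name__ == "__main__":
    typical, worst = totals(STEPS)
    print(f"typical ~{typical:.0f}s, worst case ~{worst:.0f}s")
    for plan, spare in headroom(STEPS, TIMEOUTS).items():
        print(f"{plan}: {spare:+.1f}s headroom in the worst case")
```

Re-running this after adding a tool or swapping in a slower model shows how much headroom the change costs before it reaches production.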


4. Five Deployment Patterns (Ranked Best to Worst)

Pattern A: Vercel Frontend + Dedicated Backend (Recommended)

User ←→ Vercel (Next.js)
            │
            │ API call (your domain/private network)
            ▼
        Dedicated backend (Fly.io / Railway / Modal / VPS)
            │
            ├── Agent loop (orchestrator + tools)
            ├── MongoDB + Atlas Vector Search (one DB)
            └── LLM API calls

Vercel handles: Chat UI, authentication, static assets, lightweight API routes that proxy to the backend.

Backend handles: The actual agent loop (streaming responses, long-running tool chains, RAG pipelines, LLM calls).

Why this is the recommendation: Vercel is excellent at what it does (frontend, CDN, lightweight serverless APIs). But it wasn't designed for stateful agent loops. Putting the agent runtime on a purpose-fit platform removes all timeout anxiety and gives you:

  • No timeout limits (run 10-minute agent chains if needed)
  • Full control over memory and CPU
  • Persistent connections (WebSockets, SSE streams)
  • Background task processing

What to use for the backend:

Platform | Why it fits | Cost | Notes
---------|-------------|------|------
Fly.io | Global, any runtime, persistent connections, fast deploy | ~$20-50/mo for a small VM | Best balance of DX + capability. Any Docker image.
Railway | Dead simple, good for Python backends | ~$10-20/mo | Less control than Fly, but easier.
Modal | Purpose-built for AI workloads. GPU access, fast cold starts. | Pay-per-use (~$10-50/mo) | Best for heavy compute. Overkill for a simple orchestrator.
Render | Simple, managed, good for FastAPI | ~$7-25/mo | Web Service type has no timeout limit.
A VPS (DigitalOcean, etc.) | Full control | ~$6-24/mo | You manage everything. Fine if you're comfortable with it.

Pattern B: Vercel-Only with Extended Timeout (Simple, Tight)

User ←→ Vercel (Next.js API Route with maxDuration: 300)
            │
            ├─ Agent loop runs inside function
            ├─ Tool calls to external services
            └─ LLM API calls

Vercel handles: Everything. The agent loop runs inside a serverless function.

Constraints:

  • Must set maxDuration: 300 in your function config (Pro plan)
  • 300s (5 minutes) is the hard cap on Pro (the limits table in section 2 lists up to 900s on Enterprise)
  • 1024 MB memory limit: fine for most agents, but tight if you load large models or process big datasets
  • No background processing โ€” the function dies when the response ends
  • Cold starts add 1-3s to the first request (less if you keep the runtime warm)

Code setup:

// app/api/chat/route.ts (Vercel Serverless function; Edge is capped at 30s)

export const maxDuration = 300;  // Up to 300s on Pro

export async function POST(req: Request) {
  // Agent loop runs here
  // Streaming via ReadableStream or Vercel AI SDK
}

Can it work for Qwestly? Yes, for v0/MVP. Most requests will complete within 300s. But you're living at the ceiling: any change that adds latency (new tool, slower model, larger context) risks timeout errors.
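
One way to live with a hard ceiling is to make the loop deadline-aware, so it stops cleanly instead of being killed mid-step. The sketch below is a hedged illustration in Python (the language of the FastAPI backend discussed later in this document); `Deadline`, `run_agent_loop`, and the 10s safety margin are hypothetical names and values, not part of any Vercel or framework API.

```python
import time


class Deadline:
    """Tracks how much of a fixed time budget is left."""

    def __init__(self, budget_s, safety_margin_s=10.0, clock=time.monotonic):
        self._clock = clock
        # Stop early by safety_margin_s so there is time left to send a reply.
        self._expires_at = clock() + budget_s - safety_margin_s

    def remaining(self):
        """Seconds left before the margin-adjusted deadline."""
        return self._expires_at - self._clock()


def run_agent_loop(steps, deadline):
    """Run (name, expected_seconds, fn) steps until the budget runs out.

    Returns a list of (name, result) pairs. A step that is not expected to
    fit in the remaining time is recorded as skipped and the loop stops.
    """
    results = []
    for name, expected_s, fn in steps:
        if deadline.remaining() < expected_s:
            results.append((name, "skipped: out of time budget"))
            break
        results.append((name, fn()))
    return results
```

With `Deadline(budget_s=300)` created at the top of the request handler, the loop can return a partial answer ("card generation took too long, try again") rather than a platform timeout error.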

Pattern C: Vercel Edge Functions

User ←→ Vercel Edge (global, fast)

Edge functions run on V8 isolates at Vercel's edge nodes. They're great for:

  • Fast, globally distributed responses
  • Authentication checks
  • URL rewrites/redirects

They're bad for agents because:

  • 30s hard timeout (non-negotiable)
  • 128 MB memory limit
  • No Node.js native APIs (no fs, no child_process, no raw TCP)
  • Limited runtime (only a subset of Node.js APIs)

Verdict: Don't run your agent loop here. Use edge functions only for the chat UI proxy or auth layer.

Pattern D: Background Task + Webhook Polling (Complex, Unnecessary)

User → Vercel → Returns immediately with "processing" → Background task runs → Webhook callback
User polls for result → Gets result when done

How it works: Vercel function accepts the request, kicks off a background task (on another service), returns a token. The user polls for the result. When the task finishes, a webhook updates the result.

Why you shouldn't do this: It's complex, requires a polling UI, breaks the streaming UX, and adds latency. The only reason to use this is if you're stuck on the Hobby plan (10s limit). Upgrade to Pro instead.
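
For completeness, the token/poll/webhook flow described above can be sketched as follows. This is an illustration of the moving parts only: `JobStore` is a hypothetical in-memory class, and on a serverless platform the state would have to live in a shared store (a database or cache), because each invocation may run in a different instance.

```python
import uuid


class JobStore:
    """In-memory stand-in for the job table behind a polling endpoint."""

    def __init__(self):
        self._jobs = {}

    def submit(self):
        """Called by the request handler: register a job, return its token."""
        token = uuid.uuid4().hex
        self._jobs[token] = {"status": "processing", "result": None}
        return token

    def complete(self, token, result):
        """Called by the webhook when the background task finishes."""
        if token not in self._jobs:
            raise KeyError(f"unknown job token: {token}")
        self._jobs[token] = {"status": "done", "result": result}

    def poll(self, token):
        """Called by the client until status becomes 'done'."""
        return self._jobs.get(token, {"status": "unknown", "result": None})
```

Even this toy version shows the extra surface area: three entry points and a shared state store, compared with a single streaming endpoint.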

Pattern E: Hybrid (Vercel Runs Simple Agents, Backend Runs Complex Ones)

User → Vercel API route
         │
    Intent classifier
         │
         ├─ "Simple" → handled inline by Vercel
         │   (e.g., list capabilities, what's my name)
         │
         └─ "Complex" → forwarded to backend
             (e.g., generate a card, ingest LinkedIn profile)

This is a practical middle-ground. Most chats are quick queries. Only the heavy artifacts (card generation, LinkedIn ingestion) need the backend.

Pros: You minimize backend usage for the common case. Cheap infra.

Cons: Two code paths to maintain. Intent classification latency adds overhead.
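
A minimal sketch of the routing step, assuming a keyword-based classifier to keep classification latency near zero. The hint list and function names are hypothetical; a real classifier might be a small LLM call or a trained model, at the cost of the overhead noted above.

```python
# Phrases that suggest a heavy request (assumed for illustration only).
COMPLEX_HINTS = ("generate a card", "generate card", "ingest", "linkedin")


def classify_intent(message):
    """Return 'complex' if the message matches a heavy-work hint, else 'simple'."""
    text = message.lower()
    return "complex" if any(hint in text for hint in COMPLEX_HINTS) else "simple"


def route(message, handle_inline, forward_to_backend):
    """Dispatch to the inline handler or the backend based on intent."""
    if classify_intent(message) == "complex":
        return forward_to_backend(message)
    return handle_inline(message)
```

Misclassifying a complex request as simple is the expensive failure mode here, since it sends long-running work down the path with the tighter timeout, so the hints should err toward "complex".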


5. Streaming on Vercel

Streaming is where Vercel actually shines. The AI SDK + Vercel Edge/Serverless streaming is best-in-class.

How streaming works with an agent loop

Time ────────────────────────────────────────────────────────────→

User sends message
  │
  ├─ LLM produces first reasoning tokens → streamed immediately
  ├─ LLM decides to call tool → tool call info streamed
  ├─ Tool executes (user sees "Searching your profile...")
  ├─ Tool result fed back to LLM
  ├─ LLM produces final answer → streamed token by token
  │
Response complete

The Vercel AI SDK handles this natively:

import { streamText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = streamText({
  model: openai('gpt-4o'),
  system: orchestratorPrompt,
  messages: history,
  tools: {
    get_user_profile: tool({ ... }),
    generate_card: tool({ ... }),
  },
});

// Streams to client automatically
return result.toDataStreamResponse();

What's actually happening: The LLM streams tokens → when a tool is called, execution pauses the stream → tool runs → tool result is fed back → stream resumes. The user sees a continuous experience.

Timeouts during streaming: The function stays alive for the entire stream. The timeout clock starts when the function starts, not per-tool-call. So a 300s timeout gives you 300s total for the full agent loop.
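
The same staged stream can be framed as Server-Sent Events from a Python backend. The sketch below is a simplified illustration: the token lists stand in for real LLM output, and `agent_stream`, `sse_event`, and the event names are assumptions rather than any SDK's wire format.

```python
import asyncio
import json


def sse_event(event, data):
    """Format one Server-Sent Events frame (event name plus JSON payload)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def agent_stream(tokens_before, tool_name, run_tool, tokens_after):
    """Yield SSE frames for: reasoning tokens, tool call, tool result, answer."""
    for tok in tokens_before:
        yield sse_event("token", {"text": tok})
    # Tell the client a tool is running so the UI can show a status line.
    yield sse_event("tool_call", {"name": tool_name, "status": "running"})
    result = await run_tool()  # token output pauses while the tool executes
    yield sse_event("tool_result", {"name": tool_name, "result": result})
    for tok in tokens_after:
        yield sse_event("token", {"text": tok})
    yield sse_event("done", {})
```

An async generator like this can be handed to FastAPI's `StreamingResponse` with `media_type="text/event-stream"`; the whole generator runs inside one function invocation, so it shares the single timeout budget described above.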


6. Memory Considerations

What uses memory            | Typical    | Peak
----------------------------|------------|----------------------------------
Python runtime              | ~30-50 MB  | ~50 MB
LLM SDK (openai, anthropic) | ~10 MB     | ~10 MB
Pydantic AI / framework     | ~10 MB     | ~20 MB
Postgres connection pool    | ~5 MB      | ~10 MB
Prompt / context payloads   | ~1-5 MB    | ~10 MB (very long conversations)
Response buffer             | ~0.5 MB    | ~2 MB
Total (typical agent)       | ~60-80 MB  | ~100-150 MB

You're unlikely to hit Vercel's 1024 MB limit with a pure logic agent. The risks are:

  • Loading large models locally (don't; use API calls)
  • Processing large files (LinkedIn PDF exports, bulk data): stream instead of loading into memory
  • Memory leaks in long-running loops: less of an issue on serverless (process dies after request)

Verdict: Memory is not a constraint for Qwestly's use case on Vercel Pro.
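
To illustrate the "stream instead of loading into memory" advice, here is a minimal sketch that processes a file-like object in fixed-size chunks. Hashing is only a placeholder for the real per-chunk work (parsing, uploading to object storage), and the 64 KiB chunk size is an arbitrary assumption.

```python
import hashlib
import io


def process_in_chunks(stream, chunk_size=64 * 1024):
    """Consume a binary stream chunk by chunk.

    Returns (total_bytes, sha256_hex). Peak memory stays near chunk_size
    regardless of how large the input is.
    """
    digest = hashlib.sha256()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
        total += len(chunk)
    return total, digest.hexdigest()


if __name__ == "__main__":
    size, checksum = process_in_chunks(io.BytesIO(b"example upload " * 10_000))
    print(size, checksum[:12])
```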


7. Cold Starts

Cold starts happen when Vercel spins up a new instance of your function. They add 1-3 seconds of latency.

Framework                   | Cold start (typical) | Mitigation
----------------------------|----------------------|-----------
Next.js (Edge)              | ~0.1-0.5s            | Already fast. Hard to notice.
Next.js (Serverless, Node)  | ~1-3s                | Noticeable. Use "Pro" plan with 0-sleep (pays to keep warm).
Python (via Custom Runtime) | ~3-10s               | Painful. Not recommended on Vercel without a warm strategy.

If you use Python on Vercel: You'd need Vercel's Python runtime or Docker. Cold starts are rough (~3-10s). This is another reason to put the Python agent on a dedicated backend and let Vercel handle the frontend.

If you use TypeScript on Vercel: Cold starts are manageable (1-3s). The Vercel AI SDK is TypeScript-native, so you can build the agent loop in TS without Python.


7b. Vercel Python Runtime: The Reality Check

Since you already have FastAPI on Vercel's Python runtime, here's the unfiltered truth about what that means for running an agent orchestrator:

What works well

Capability           | Status     | Notes
---------------------|------------|------
FastAPI routes       | ✅ Perfect | Your existing API structure works as-is
ASGI streaming       | ✅ Works   | FastAPI + StreamingResponse via ASGI. Vercel supports it.
LLM API calls        | ✅ Works   | Outbound HTTPS to OpenAI/Anthropic, no issues
Tool execution       | ✅ Works   | Any pure-Python tool (DB queries, API calls, data transforms)
Pydantic AI          | ✅ Works   | Pure Python, no native deps, runs fine
Postgres connections | ✅ Works   | Standard asyncpg/psycopg2; just manage connection pool sizing

What has sharp edges

Edge | The problem | Mitigation
-----|-------------|-----------
Cold starts | 3-10s on first request after idle. Vercel spins down Python functions aggressively. | Use Vercel Pro's "0-sleep" or set up a cron ping every 5 min to keep warm. Or accept it: a 5s cold start once per user session is tolerable.
Timeout ceiling | 300s max, even on Pro Team. Hard limit, no exceptions. | Monitor actual agent runtimes. If you consistently hit 200s+ on complex requests, you'll need a dedicated backend for those paths.
No background tasks | You can't asyncio.create_task and return immediately; the function dies when the response ends. | Use a separate queue service (or defer batch work to a cron/background job on a different platform).
No WebSockets | Vercel Serverless doesn't support persistent WebSocket connections. | Use SSE (Server-Sent Events) for streaming instead. Same UX, simpler infra, works on Serverless.
Ephemeral filesystem | Can't write to disk and expect it to persist. /tmp exists but is per-instance. | Never write files. Stream PDF generation output directly to the response. Use object storage (S3/R2) for any persistent files.
Large dependency size | Python deps (numpy, pandas, etc.) can push your function size past limits. | Keep the agent lightweight. Don't bundle ML frameworks; call them as APIs. Pydantic AI + httpx + a DB driver is ~15MB zipped. Fine.

Cold start impact on UX

COLD:  |──cold start (~3-8s)──|──agent loop──|──streaming──|
WARM:                         |──agent loop──|──streaming──|
                               (no cold start)

The cold start penalty hits the first request after a period of inactivity. For a Qwestly chatbot:

  • User opens the chat → cold start (3-8s) → agent processes their first message → they see a response
  • User types a follow-up → warm (instant) → fast response
  • Chat sits idle for 15+ min → next message gets a cold start again

Is this acceptable? For a startup MVP, yes. 5s of loading on the first message is within normal web app behavior. Just show a "Loading Qwestly..." indicator. It's only a problem if you have high-traffic, always-on expectations.

Can you go all-in on Vercel Python for v0?

Yes, absolutely. Your existing FastAPI on Vercel can host the agent orchestrator. The constraints (300s timeout, cold starts, no background tasks) are manageable for an MVP. You'll know you've outgrown it when:

  1. Agent runtimes regularly exceed 200s (timeout anxiety)
  2. Cold start latency becomes unacceptable for your traffic pattern
  3. You need background processing (batch LinkedIn imports, scheduled card generation)
  4. You want WebSocket-based real-time features

At that point, you move the agent loop off Vercel to a dedicated service, but your FastAPI code is portable. The agent logic (tools, prompts, Pydantic AI agents) moves with you. You're not rewriting.


8. Concurrency & Queueing

What happens when 10 users chat simultaneously?

Vercel Pro: 100 concurrent function executions (soft limit). For an agent loop that takes 30s per user, 10 concurrent users = 10 functions running for 30s. No problem.

But: Each function holds a DB connection, an LLM API call, and memory. If 50 users hit at once, you'll have 50 concurrent outbound connections. Your DB and LLM API need to handle that.

Practical concern: LLM APIs (OpenAI) rate-limit by tier. On the default Tier 1, you get ~500 RPM and ~10,000 TPM. If 10 users all hit the agent at once, that's 10 simultaneous LLM calls, which is fine. If 100 users hit at once, that's 100 simultaneous calls, and you'll hit rate limits.

Mitigation:

  • Implement a simple queue for heavy operations (card generation)
  • Use token-based rate limiting per user
  • OpenAI's tier system: request a higher tier when you need it
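
Per-user rate limiting is commonly done with a token bucket. The sketch below is one hedged way to do it: `TokenBucket` and `PerUserLimiter` are hypothetical names, and the state is held in process memory, so on serverless (where each invocation may run in a separate instance) the counters would need to live in a shared store instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_s` per second."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return whether the call may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerUserLimiter:
    """Keeps one bucket per user id."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self._capacity = capacity
        self._refill_per_s = refill_per_s
        self._clock = clock
        self._buckets = {}

    def allow(self, user_id, cost=1.0):
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(
                self._capacity, self._refill_per_s, self._clock
            )
        return self._buckets[user_id].allow(cost)
```

The `cost` argument can be 1 per request or an estimate of the LLM tokens a request will consume, which maps more directly onto TPM limits. A rejected call is a natural point to queue heavy operations such as card generation rather than fail them.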

9. Concrete Recommendations for Qwestly (Given Your Existing Setup)

You already have:

  • Vercel Pro Team (paid): good for 300s function timeout
  • FastAPI on Vercel Python runtime: agent code runs alongside existing API

This changes the calculus significantly. You don't need a separate backend platform for v0; you already have one.

Recommendation A (v0): Same FastAPI, Add Agent Routes ⭐⭐⭐⭐⭐

"Use what you have, add agent endpoints to your existing FastAPI on Vercel."

Your existing Vercel Pro Team
  ├── Current FastAPI app (does whatever Qwestly does today)
  └── New routes: /api/chat, /api/agent
      ├── Pydantic AI orchestrator agent
      ├── Tool implementations
      ├── SSE streaming to client
      └── Connects to existing MongoDB (+ Atlas Vector Search for RAG)

What you do:

  1. Add a POST /api/chat endpoint to your existing FastAPI app
  2. Wire up Pydantic AI agent inside it
  3. Stream responses via FastAPI's StreamingResponse
  4. Deploy through the same pipeline, same Vercel project, same domain

Why this wins for v0:

  • Zero new infrastructure. Same repo, same deploy, same team account.
  • Python-native. You were leaning Python anyway. FastAPI + Pydantic AI are a natural pair.
  • Minimal operational change. No new platform to learn, no new bills, no cross-service latency.
  • 300s timeout covers the vast majority of agent workflows.
  • Fast startup: Pydantic AI has minimal imports. Your cold start is dominated by FastAPI + DB driver, which you already have.

When you'd outgrow this (and how you'd know):

Trigger | Signal | Next step
--------|--------|----------
Agent runs hit 200s+ consistently | Timeout errors in logs | Move agent loop to a dedicated service
Cold start > 8s and users complain | Analytics show high drop on first message | Add keepalive ping, or pre-warm with a "splash" endpoint
Need background processing | "Generate cards for all users" as a batch job | Defer to a queue worker on Railway/Fly (still cheap, ~$10/mo)
Scaling to 100s of concurrent users | Latency spikes under load | Agent loop offload + horizontal scaling
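
The first two triggers can be checked mechanically from logged timings. The sketch below is an assumption-laden illustration: it interprets "consistently" as the 95th percentile, uses the 200s and 8s thresholds from the table, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def outgrow_signals(agent_runtimes_s, cold_starts_s,
                    runtime_threshold_s=200.0, cold_start_threshold_s=8.0):
    """Return the next steps suggested by the observed timings (may be empty)."""
    signals = []
    if percentile(agent_runtimes_s, 95) >= runtime_threshold_s:
        signals.append("move agent loop to a dedicated service")
    if percentile(cold_starts_s, 95) > cold_start_threshold_s:
        signals.append("add keepalive ping or pre-warm endpoint")
    return signals
```

Running a check like this on a week of request logs turns the table into an alert rather than a judgment call.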

But don't solve problems you don't have yet. Start here.

Recommendation B (v0, if you want to decouple): Same as A, agent on separate FastAPI

If you prefer to keep the agent code isolated from your main API (different deploy cadence, different scaling), deploy a second FastAPI app on the same Vercel team:

Vercel Pro Team
  ├── App 1: Main Qwestly API (your existing FastAPI)
  └── App 2: Agent API (new FastAPI + Pydantic AI)
         └── Chat UI calls App 2's /api/chat directly

Same platform, separate codebases, independent deploys. Still zero new infrastructure. Still Python. Still 300s timeout. Same trade-offs.

Recommendation C (when you need it): Extract Agent to Dedicated Backend

When you hit the "outgrow" signals above, the agent loop moves to a stateless FastAPI process on Railway/Fly.io:

Vercel Pro Team (frontend + main API)
  └── Railway/Fly.io (Python agent backend)
       └── No timeout limits, background workers, persistent connections

This is the same "extract" pattern teams use everywhere: start monolithic, extract when the boundaries are clear. Your agent code (tools, prompts, Pydantic AI agents) is already a well-defined module. Moving it is a deployment change, not a rewrite.


10. Key Takeaways

Question | Answer
---------|-------
Can I use Vercel? | Yes, but use it for what it's good at: frontend, auth, lightweight APIs.
Can the agent loop run on Vercel? | Yes, on Pro (300s max) with TypeScript. Not on Hobby. Python on Vercel has rough cold starts.
Do timeouts matter? | Yes. A multi-step agent + streaming easily hits 30-60s. Pro's 60s default is tight. Configure maxDuration: 300.
Is memory a concern? | No. 1024 MB is plenty for a pure logic agent.
What about Python? | Python on Vercel is possible (custom runtime) but cold starts are painful. Better to put Python on a dedicated backend.
What about streaming? | Vercel's streaming (especially via the AI SDK) is excellent. Works on both serverless and edge.
What about scaling? | Vercel handles frontend scaling automatically. The backend (wherever it lives) needs to handle concurrent LLM API calls, so watch OpenAI rate limits.
Recommended setup for v0? | In general: Vercel frontend + Railway/Fly Python backend. Simple, fast, no timeout worries. ~$40/mo total. If you already run FastAPI on Vercel Pro (section 9), start there and extract later.
Can I switch later? | Yes. The agent logic (Pydantic AI agents, tools, prompts) is portable. The deployment platform is the easiest thing to change.

The short version

Vercel is fine for the frontend and as a streaming proxy. Run the actual agent loop on something that doesn't have a 5-minute execution cap. Railway and Fly.io are the simplest Python-friendly options. You're looking at ~$40/mo total for a production-grade setup.