_private/qwestly-docs/architecture-reviews/unauthenticated-agent-onboarding.md

Architecture Review: Unauthenticated Agent Onboarding

Docs reviewed: _docs/unauthenticated-agent-onboarding-tech-design.md, _docs/linkedin-and-resume-grader-tech-design.md
Date: 2026-06-09


1. Summary

The plan is solid at the macro level: a clean boundary between pre-auth and post-auth, a sensible proxy architecture, and pragmatic v1 scoping. The two biggest risks are (1) the claim migration touching 25 collections with no transactional safety discussed, and (2) zero observability defined for a system where the LLM drives critical state changes (email capture, claim generation). The pre_auth boolean on ChatRequest is a minor code smell that's acceptable for v1 given the bounded scope, but should have a deprecation plan.


2. Critical Issues

2.1 Claim migration has no transactional boundary (§12.6, §5, §9)

The migrateClaimedAccount() function updates 25 collections. If it fails partway through, the account is left in an inconsistent state. The doc says "using the same pattern as the existing migrateClaimedAccount()" but doesn't validate whether that existing pattern is safe.

Risk: Corrupted claim state requiring manual DB intervention. Silent failures where some collections are updated and others aren't.

Mitigation: Document the existing migration's failure behavior. If it's not atomic, add compensating rollback logic or an idempotent re-run mechanism. Consider breaking the migration into prioritized tiers (critical vs nice-to-have) so the account is usable even if non-critical collections fail.
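
A minimal sketch of the tiered, idempotent re-run idea, written in Python for illustration only. The tier assignments, collection names, and the MigrationState record are assumptions and do not come from the design doc. The point is that progress is recorded per collection, so a re-run after a partial failure retries only what is still outstanding.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

# Hypothetical tiers: which collections count as critical is an assumption,
# not something the design doc specifies.
CRITICAL = ["users", "candidates", "agent_conversations"]
NON_CRITICAL = ["notifications", "activity_log"]


@dataclass
class MigrationState:
    """Per-claim progress record; would be persisted alongside the provision."""
    done: Set[str] = field(default_factory=set)
    failed: Dict[str, str] = field(default_factory=dict)


def migrate_claim(state: MigrationState, migrate_one: Callable[[str], None]) -> bool:
    """Run (or re-run) the migration; return True once the account is usable.

    Collections already in state.done are skipped, so repeating the call
    after a partial failure only touches what has not yet succeeded.
    """
    for name in CRITICAL + NON_CRITICAL:
        if name in state.done:
            continue
        try:
            migrate_one(name)  # each per-collection update must be safe to repeat
        except Exception as exc:
            state.failed[name] = str(exc)
            if name in CRITICAL:
                return False  # stop early: the account is not usable yet
        else:
            state.done.add(name)
            state.failed.pop(name, None)
    return all(name in state.done for name in CRITICAL)
```

With this shape, a failure in a non-critical collection is recorded in state.failed but does not block the claim, while a failure in a critical collection stops the run and leaves a record that a later retry can pick up.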

2.2 LLM-driven state changes have no guardrails (§6.2, §6.4)

The capture_email tool is called by the LLM. The agent decides when to capture email and generate a claim. If the LLM hallucinates or is prompt-injected, it could:

  • Capture emails at the wrong time (before showing value)
  • Capture nonsense emails
  • Generate claims for users who haven't consented

Risk: Bad user experience, spam claims, wasted SendGrid sends.

Mitigation: Add server-side validation in the tool itself: reject emails that don't pass basic validation. Add a confirmation pattern (agent says "I'll send to X, is that right?" before calling the tool). Consider a rate cap on capture_email per session (already partially addressed in §11.1 "30 agent turns per session").
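
The three guardrails can be sketched as a server-side wrapper around the tool. This is an illustrative Python sketch, not the design doc's implementation: SessionState, the cap of 3, the error codes, and the confirmed_email field are all assumptions. The key assumption is that confirmed_email is set by the server when the user explicitly confirms, never by the LLM.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Deliberately simple shape check; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
MAX_CAPTURES_PER_SESSION = 3  # hypothetical cap, not from the design doc


@dataclass
class SessionState:
    capture_attempts: int = 0
    # Assumed to be set by the server on explicit user confirmation.
    confirmed_email: Optional[str] = None


def capture_email(session: SessionState, email: str) -> dict:
    """Validate a capture_email tool call before any claim is generated."""
    session.capture_attempts += 1
    if session.capture_attempts > MAX_CAPTURES_PER_SESSION:
        return {"ok": False, "error": "capture_limit_reached"}

    email = email.strip().lower()
    if len(email) > 254 or not EMAIL_RE.match(email):
        return {"ok": False, "error": "invalid_email"}

    # Refuse addresses the user has not confirmed, whatever the LLM says.
    if session.confirmed_email != email:
        return {"ok": False, "error": "email_not_confirmed"}

    return {"ok": True, "email": email}
```

Returning a structured error rather than raising lets the agent recover in conversation, for example by asking the user to re-enter the address.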

2.3 No retry or timeout strategy for external dependencies (§6, §7)

The flow depends on: LinkedIn API (ingest), DeepSeek LLM (grade), SendGrid (claim email). None of these have timeout or retry behavior defined. A slow LinkedIn API call or LLM timeout leaves the user staring at "..." in the chat.

Risk: Hung sessions, user abandonment during onboarding.

Mitigation: Add hard timeouts: LinkedIn fetch 15s, LLM grade 30s, SendGrid 10s. Add graceful degradation messages in the preauth prompt ("LinkedIn is taking a while. Want to upload your resume as a PDF instead?"). The doc already mentions the resume upload fallback path; make this a timeout-triggered offer, not just a user-initiated alternative.
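
One way to express these budgets, assuming the agent backend is async Python, is a small wrapper around asyncio.wait_for. The operation names, the call_with_timeout helper, and the generic fallback text are hypothetical; only the three timeout values and the LinkedIn message come from the mitigation above.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple, TypeVar

T = TypeVar("T")

# Budgets from the mitigation above, in seconds. Keys are illustrative names.
TIMEOUTS = {"linkedin_fetch": 15.0, "llm_grade": 30.0, "sendgrid_send": 10.0}

FALLBACK_MESSAGES = {
    "linkedin_fetch": (
        "LinkedIn is taking a while. "
        "Want to upload your resume as a PDF instead?"
    ),
}
DEFAULT_FALLBACK = "That is taking longer than expected. Please try again."


async def call_with_timeout(
    name: str,
    make_call: Callable[[], Awaitable[T]],
    timeout: Optional[float] = None,
) -> Tuple[Optional[T], Optional[str]]:
    """Return (result, None) on success or (None, fallback_message) on timeout."""
    budget = TIMEOUTS[name] if timeout is None else timeout
    try:
        return await asyncio.wait_for(make_call(), timeout=budget), None
    except asyncio.TimeoutError:
        return None, FALLBACK_MESSAGES.get(name, DEFAULT_FALLBACK)
```

The caller can surface the fallback message in the chat stream instead of leaving the user on a pending indicator. Retries are left out on purpose; whether a retry is safe depends on the dependency (a SendGrid resend is not obviously idempotent).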


3. Significant Concerns

3.1 pre_auth boolean flag on ChatRequest (§6.1)

The prompt explicitly calls out boolean flags that change endpoint behavior as an anti-pattern. pre_auth: bool switches prompt, tool set, and capability scope. For v1 with exactly two modes this is tolerable, but if a third mode appears it will degrade into nested conditionals.

Suggestion: For v1, keep the flag but isolate the branching to a single factory/strategy at the orchestrator boundary. The doc already does this in §6.1 with the PREAUTH_PROMPTS dict; ensure tool registration follows the same pattern. Add a comment marking this for refactor if a third variant is added.
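
A sketch of what isolating the branch could look like, assuming a Python orchestrator. AgentMode, MODES, and resolve_mode are hypothetical names, the prompt strings are placeholders, and the post-auth tool list is a guess for illustration; only the tool names themselves appear in the reviewed docs.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AgentMode:
    system_prompt: str
    tool_names: Tuple[str, ...]


# Prompt text and the post-auth tool list are placeholders for illustration.
MODES: Dict[str, AgentMode] = {
    "pre_auth": AgentMode(
        system_prompt="<PREAUTH_SYSTEM_PROMPT>",
        tool_names=(
            "ingest_linkedin_profile",
            "grade_linkedin_profile",
            "capture_email",
        ),
    ),
    "post_auth": AgentMode(
        system_prompt="<existing authenticated prompt>",
        tool_names=(
            "grade_linkedin_profile",
            "generate_linkedin_about",
            "generate_linkedin_experience",
        ),
    ),
}


def resolve_mode(pre_auth: bool) -> AgentMode:
    """The single place where the v1 boolean is translated into a mode.

    REFACTOR: if a third variant appears, replace the boolean on ChatRequest
    with an explicit mode key and look it up in MODES directly.
    """
    return MODES["pre_auth" if pre_auth else "post_auth"]
```

Because prompt and tool set travel together in one record, downstream code never needs to test pre_auth again, which keeps a future third mode from turning into nested conditionals.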

3.2 No indexes defined for query patterns (§2, §3, §5)

The doc describes query patterns:

  • provisioned_accounts by session_id (GET /api/provision/session/{id})
  • provisioned_accounts by _id (PATCH, POST claim)
  • agent_conversations by user_id (post-claim resume)
  • candidates by user_id (provision → candidate lookup)

No indexes are specified. MongoDB will collection-scan without them.

Suggestion: Add index definitions to the data model sections. At minimum: an index on provisioned_accounts.session_id, an index on agent_conversations.user_id, an index on candidates.user_id, and an index on provisioned_accounts.email (for admin lookup). provisioned_accounts._id is already covered by MongoDB's default unique index. Document the expected query patterns explicitly.
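
As an illustration, the index definitions could live in one declarative list that a startup or migration script applies. This sketch assumes a pymongo Database object is passed in and relies only on Collection.create_index(keys, **options). The index names, the sparse option on email, and the decision to leave session_id non-unique are assumptions that the data model sections would need to confirm.

```python
from typing import Any, Dict, List, Tuple

ASCENDING = 1  # same value as pymongo.ASCENDING

# (collection, key spec, options). _id is indexed by MongoDB automatically.
INDEXES: List[Tuple[str, List[Tuple[str, int]], Dict[str, Any]]] = [
    ("provisioned_accounts", [("session_id", ASCENDING)], {"name": "by_session_id"}),
    # sparse is an assumption: email is only present after capture.
    ("provisioned_accounts", [("email", ASCENDING)], {"name": "by_email", "sparse": True}),
    ("agent_conversations", [("user_id", ASCENDING)], {"name": "by_user_id"}),
    ("candidates", [("user_id", ASCENDING)], {"name": "by_user_id"}),
]


def ensure_indexes(db: Any) -> List[str]:
    """Create each index if it is missing and return the index names.

    db is expected to behave like a pymongo Database: db[name] yields a
    collection whose create_index call does nothing if the index exists.
    """
    created = []
    for collection, keys, options in INDEXES:
        created.append(db[collection].create_index(keys, **options))
    return created
```

Keeping the list next to the data model makes each documented query pattern traceable to the index that serves it.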

3.3 No observability for scheduled/background jobs (§11.2)

Session cleanup (7-day expiry) and claim token expiry are mentioned but there's no way to know if they ran successfully. If the cleanup cron silently fails, abandoned provisions accumulate indefinitely.

Suggestion: Add a cleaned_up_at timestamp on provisions that were cleaned up, so the absence of recent timestamps triggers an alert. For Vercel cron, document that Vercel provides execution logs. Add a simple health check endpoint that returns the count of provisions older than 7 days that should have been cleaned.
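
The health check can be sketched as a pure function over provision records, which keeps it easy to test. The field names created_at and claimed_at and the response shape are assumptions; cleaned_up_at and the 7-day window come from the text above. In production the loop would more likely be a count query against provisioned_accounts.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, Mapping, Optional

EXPIRY = timedelta(days=7)  # 7-day session expiry from the design doc


def cleanup_health(
    provisions: Iterable[Mapping[str, Any]],
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Summarize whether the cleanup job appears to be keeping up."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - EXPIRY
    overdue = 0
    last_cleanup = None
    for p in provisions:
        cleaned = p.get("cleaned_up_at")
        if cleaned is not None:
            # Track the most recent cleanup to detect a silently dead cron.
            if last_cleanup is None or cleaned > last_cleanup:
                last_cleanup = cleaned
            continue
        # Unclaimed, never cleaned, and past the expiry window: overdue.
        if p.get("claimed_at") is None and p["created_at"] < cutoff:
            overdue += 1
    return {
        "status": "ok" if overdue == 0 else "degraded",
        "overdue_provisions": overdue,
        "last_cleanup_at": last_cleanup,
    }
```

An alert can then key off either signal: a non-zero overdue_provisions count, or a last_cleanup_at that is older than the cron interval.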

3.4 Two-tab anonymous session race (§4.4)

A user opens two tabs to questly.com/agent. Each tab creates a separate anonymous session with a different provisional_user_id. They chat in one tab, then switch to the other. Data (LinkedIn grade, email capture) is split across two sessions with no merging path.

Risk: Confused users, split data, poor experience.

Suggestion: For v1, accept this as a known limitation. If it becomes a support issue, mitigate by: (a) storing the session_id in sessionStorage and detecting tab reuse, (b) showing a "You have an existing session" banner if a second provision is detected for the same browser fingerprint.

3.5 Chat UI shared component extraction plan is fuzzy (§10)

The doc lists components to extract but doesn't decide between the two approaches: copy+adapt, or publish to @qwestly/ui. For FileUploadWidget, HitlMessage, and MessageBubble it says "Copy + adapt." For the SSE reader it says "Extract core SSE parsing to public-site/src/lib/sse-reader.ts" (inside public-site, not shared). This creates a future deduplication burden.

Suggestion: Be explicit: v1 copies these components (faster, no cross-package churn). v2 (post-validation) should extract them to @qwestly/ui or a shared packages/agent-chat/. Document this as technical debt with an acceptance gate.


4. Minor Observations

4.1 Missing item #23 in Phase 4 (§13)

The checklist jumps from item 22 to 24. No item 23 exists. Intentional gap or numbering error?

4.2 Prompt duplication in system prompt (§6.2)

The PREAUTH_SYSTEM_PROMPT has two "## Available Tools" sections, at lines 436 and 457. The first lists grade_linkedin_profile() with 3 criteria and no generation tools. The second (line 457 onward) adds generate_linkedin_about and generate_linkedin_experience. The two lists contradict each other, and the doc does not say which is correct. Per the decisions in §12.9 and gap §3, the pre-auth prompt should NOT include generation tools, so the second copy (line 457+) appears to be the stale version and should be removed.

4.3 ingest_linkedin_profile response shape undefined (§6, grader doc §5.3)

The LinkedIn API ingestion tool is referenced but its response shape is never specified. The grader tools (grade_linkedin_profile, generate_linkedin_about) depend on this data structure but don't define their input contract.

4.4 Grades as numbers without calibration (§4, grader doc)

The grader rubric uses 0-10 with ±1 acceptable variance. The grader doc §4.1 says "ship first with the existing prompts, tune later based on real usage." The build note's recommendation to hand-grade 12-20 profiles first was explicitly deferred. This means the first real users will see potentially inconsistent grades. Acceptable if users understand "this is a demo," but risky if grades are presented as authoritative.

4.5 No mention of PostHog analytics (§7)

The doc doesn't address product analytics for the onboarding funnel. How will we know if users drop off at the grade, at the email gate, or after the claim? Without funnel analytics, the product team can't optimize the flow.


5. Severity Table

| # | Concern | Severity | Actionable Now? |
|---|---------|----------|-----------------|
| 1 | No transactional safety for 25-collection claim migration | High | Yes: document existing behavior, add idempotent re-run |
| 2 | LLM-driven state changes without server-side validation | High | Yes: add email validation in capture_email tool |
| 3 | No retry/timeout for external API/LLM/SendGrid calls | High | Yes: add hard timeouts + graceful degradation messages |
| 4 | Boolean flag (pre_auth) drives mode branching | Medium | No: acceptable for v1, add deprecation plan |
| 5 | No MongoDB indexes defined | Medium | Yes: add to data model sections |
| 6 | No observability for scheduled jobs | Medium | Yes: add health check endpoint |
| 7 | Two-tab session race condition | Low | No: accept as known limitation for v1 |
| 8 | Shared component extraction plan ambiguous | Low | No: document as v2 technical debt |
| 9 | Preauth system prompt has duplicated/contradictory tool lists | Medium | Yes: clean up the prompt to remove stale copy |
| 10 | Missing PostHog analytics for onboarding funnel | Medium | No: add tracking plan post-implementation |