- Per-end-user conversations — each of your users has their own thread
- Tool integration — the assistant calls your business logic (lookups, actions) via webhooks; qlaud handles the dispatch loop
- Semantic search — your end-user can search their own conversation history; you don’t run a vector DB
- Streaming UX — text appears word-by-word, like every modern chat
- Per-user billing — hard spend caps; you bill how you want at month-end
No qlaud SDK to pip install / npm install; everything below is plain HTTP calls.
## Prerequisites
- A qlaud account (sign up free, $5 starter credit)
- Your master key from /keys, exported as `QLAUD_MASTER_KEY`
- Python 3.9+ (using plain `requests`) or Node 18+ (using built-in `fetch`). No qlaud SDK required for any of this.
## Architecture in one paragraph
Each of your end-users gets:

- A qlaud per-user API key (`qlk_live_…`) with a hard spend cap, minted on signup using your master key.
- A qlaud thread tagged with their `end_user_id`.
## Step 1 — On signup, mint a per-user key + thread
Whenever a new user signs up in your app, run this once:
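Here is a minimal sketch of that signup hook in Python. The base URL, the `/v1/keys` and `/v1/threads` endpoints, and the request/response field names are assumptions made for illustration; only `QLAUD_MASTER_KEY`, the `qlk_live_…` key prefix, `end_user_id`, and `max_spend_usd` come from this guide.

```python
import os

import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL
MASTER_KEY = os.environ["QLAUD_MASTER_KEY"]

def onboard_user(end_user_id: str) -> dict:
    headers = {"Authorization": f"Bearer {MASTER_KEY}"}

    # Mint a per-user key with a hard spend cap (endpoint and payload assumed).
    key = requests.post(
        f"{QLAUD_API}/v1/keys",
        headers=headers,
        json={"end_user_id": end_user_id, "max_spend_usd": 5.00},
        timeout=10,
    ).json()  # e.g. {"id": "key_...", "secret": "qlk_live_..."}

    # Create a thread tagged with the same end_user_id (endpoint and payload assumed).
    thread = requests.post(
        f"{QLAUD_API}/v1/threads",
        headers=headers,
        json={"end_user_id": end_user_id},
        timeout=10,
    ).json()  # e.g. {"id": "thr_..."}

    # Store both next to your own user record; you need them on every turn.
    return {"qlaud_secret": key["secret"], "qlaud_thread_id": thread["id"]}
```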
## Step 2 — Send a message in a conversation

Once you have a user's `qlaud_secret` and `qlaud_thread_id`, sending a turn is one call. qlaud loads the prior history server-side; you only send the new user content, never the full `messages` array:
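A sketch of that call, continuing the constants from the Step 1 snippet. The `/v1/threads/:id/messages` route appears in the comparison table at the end of this guide, but the `model` and `content` field names, the placeholder model name, and the response shape are assumptions.

```python
import requests

QLAUD_API = "https://api.qlaud.example"  # same hypothetical base URL as above

def send_turn(qlaud_secret: str, thread_id: str, text: str) -> str:
    resp = requests.post(
        f"{QLAUD_API}/v1/threads/{thread_id}/messages",
        headers={"Authorization": f"Bearer {qlaud_secret}"},
        # Model name is a placeholder; only the new user content is sent.
        json={"model": "claude-sonnet-4-5", "content": text},
        timeout=60,
    )
    resp.raise_for_status()
    # qlaud appended the turn, replayed the stored history to the model,
    # and persisted the assistant reply; we just read the text back.
    return resp.json()["content"]
```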
## Step 3 — Stream the response (token-by-token UX)
For a real chat UI you want text to appear word-by-word. Add `stream: true` and read the SSE stream:
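A streaming sketch with plain `requests`. The exact SSE event schema (`data:` frames, a terminal `[DONE]` sentinel) is an assumption, so treat the parsing as illustrative:

```python
import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL

def stream_turn(qlaud_secret: str, thread_id: str, text: str):
    with requests.post(
        f"{QLAUD_API}/v1/threads/{thread_id}/messages",
        headers={"Authorization": f"Bearer {qlaud_secret}"},
        json={"model": "claude-sonnet-4-5", "content": text, "stream": True},
        stream=True,   # keep the HTTP connection open and read incrementally
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            # SSE frames arrive as "data: {...}"; blank lines separate events.
            if not line or not line.startswith("data: "):
                continue
            payload = line[len("data: "):]
            if payload == "[DONE]":   # terminal sentinel (assumed)
                break
            yield payload  # forward each delta to your own websocket/SSE client
```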
## Step 4 — Add a tool
Let's give the assistant the ability to look up user account info. Two parts: register the tool with qlaud, then host the webhook.

### Register the tool (one-time)
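A registration sketch. The `/v1/tools` endpoint, the JSON-Schema `input_schema` field, and `webhook_url` are assumed names; `lookup_account` and the webhook flow match the example that follows.

```python
import os

import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL
MASTER_KEY = os.environ["QLAUD_MASTER_KEY"]

tool = requests.post(
    f"{QLAUD_API}/v1/tools",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "name": "lookup_account",
        "description": "Look up a customer's account (plan, renewal date) by email.",
        "input_schema": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
        # Where qlaud should POST when the model calls this tool.
        "webhook_url": "https://yourapp.example/hooks/qlaud/lookup_account",
    },
    timeout=10,
).json()

lookup_account_id = tool["id"]  # save this; you pass it later as tool_ids=[...]
```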
### Host the webhook (your backend)
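A minimal Flask handler. The guide says qlaud signs webhooks with HMAC-SHA256 and retries on 5xx; the `X-Qlaud-Signature` header name, the signing-secret env var, and the request/response shapes here are assumptions.

```python
import hashlib
import hmac
import os

from flask import Flask, abort, jsonify, request

app = Flask(__name__)
WEBHOOK_SECRET = os.environ["QLAUD_WEBHOOK_SECRET"]  # assumed: issued when you register the tool

def db_lookup_account(email: str) -> dict:
    # Your own business logic; stubbed here.
    return {"plan": "pro", "renews_on": "2025-07-01"}

@app.post("/hooks/qlaud/lookup_account")
def lookup_account():
    # Verify the HMAC-SHA256 signature over the raw body (header name assumed).
    sent = request.headers.get("X-Qlaud-Signature", "")
    expected = hmac.new(WEBHOOK_SECRET.encode(), request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sent, expected):
        abort(401)

    email = request.get_json()["input"]["email"]
    # Return the shape qlaud feeds back to the model as a tool_result.
    return jsonify({"output": db_lookup_account(email)})
```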
### Use the tool in a conversation

Send a turn exactly as in Step 2, but pass `tool_ids=[lookup_account_id]`:
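A sketch of that turn, reusing the assumed endpoint, the placeholder model name, and the `qlaud_secret`, `thread_id`, and `lookup_account_id` variables from the earlier snippets.

```python
import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL

answer = requests.post(
    f"{QLAUD_API}/v1/threads/{thread_id}/messages",
    headers={"Authorization": f"Bearer {qlaud_secret}"},
    json={
        "model": "claude-sonnet-4-5",
        "content": "What plan am I on?",
        "tool_ids": [lookup_account_id],  # from the registration step
    },
    timeout=120,  # the tool loop may take a few model round trips
).json()

print(answer["content"])  # e.g. "You're on the Pro plan..."
```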
Behind that single call, qlaud runs the whole loop:

- qlaud sends the question + tool definition to Claude
- Claude emits a `tool_use` block: `lookup_account({email: "user@example.com"})`
- qlaud POSTs to your webhook with the input
- Your handler queries your DB and returns `{output: {plan: "pro", ...}}`
- qlaud appends a `tool_result` to the conversation
- Claude reads the tool result and responds: “You’re on the Pro plan…”
- You get the final text response
## Step 5 — Search the user’s history
Your end-user wants to find a past conversation: “What did we discuss about refunds last week?” No vector DB to provision; semantic search is already indexed:
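A search sketch. The `/v1/search` route appears in the comparison table below, but the payload and response field names here (`query`, `limit`, `results`, `snippet`) are assumptions:

```python
import os

import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL
MASTER_KEY = os.environ["QLAUD_MASTER_KEY"]

hits = requests.post(
    f"{QLAUD_API}/v1/search",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    json={
        "query": "refunds",
        "end_user_id": "user_42",  # scope results to this one end-user
        "limit": 5,
    },
    timeout=10,
).json()

for hit in hits["results"]:
    print(hit["thread_id"], hit["score"], hit["snippet"])
```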
The `end_user_id` filter scopes results to ONE of your end-users — they only see their own past conversations, never any other customer’s. The underlying Vectorize index handles that filter at the metadata layer.
## Step 6 — Bill at month-end
End of month, pull per-key usage and invoice however you want (Stripe, Paddle, in-app credits, custom):
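A month-end sketch against the `/v1/usage?by_key` endpoint from the table below; the `period` parameter and the response shape are assumptions:

```python
import os

import requests

QLAUD_API = "https://api.qlaud.example"  # hypothetical base URL
MASTER_KEY = os.environ["QLAUD_MASTER_KEY"]

usage = requests.get(
    f"{QLAUD_API}/v1/usage",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    params={"by_key": "true", "period": "2025-06"},  # period param assumed
    timeout=10,
).json()

for row in usage["keys"]:
    upstream_usd = row["cost_micros"] / 1_000_000  # what qlaud charged you for this key
    # Add whatever margin you like, then hand it to Stripe/Paddle/your own ledger.
    print(row["end_user_id"], round(upstream_usd * 1.5, 2))
```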
`cost_micros` is what qlaud charged YOU (upstream cost × 1.07 markup). Whatever margin you put on top of that is yours.
## What you didn’t build
| If you didn’t have qlaud you’d have written | Where qlaud handles it |
|---|---|
| Postgres `conversations` + `messages` tables | `/v1/threads/:id/messages` auto-loads history |
| “Drop oldest message when context exceeds N tokens” | History capped automatically; token-aware truncation later |
| Tool-call state machine: parse `tool_use`, dispatch, append `tool_result`, re-call assistant | `runToolLoop` — loops up to 8 turns, dispatches in parallel, retries on 5xx |
| Embedding pipeline + Pinecone client | Auto-embed on store, Vectorize-backed `/v1/search` |
| Per-user cost attribution table | `/v1/usage?by_key` rolls up automatically |
| Webhook delivery: signing, retries, dedup | HMAC-SHA256 signing + 3 retries built in |
| Streaming SSE handler that ALSO persists the full message after the stream closes | Tee’d internally — you stream to user, we persist for search |
## Next steps
- Switch models per turn — change `model:` to `gpt-5.4` mid-conversation; history persists, qlaud translates the shape.
- Use `/v1/jobs` for long-running batch work that shouldn’t block your request thread.
- Parallel tool calls happen automatically — when the assistant emits multiple `tool_use` blocks, qlaud fans out via `Promise.all`. No code change needed.
- Per-user spend caps are already enforced gateway-side. Once a user hits their `max_spend_usd` cap, the next request returns 402 before the upstream model is ever called.