This tutorial walks through the full stack of a user-facing AI chat product on qlaud. By the end you’ll have a chat backend with:
  • Per-end-user conversations — each of your users has their own thread
  • Tool integration — the assistant calls your business logic (lookups, actions) via webhooks; qlaud handles the dispatch loop
  • Semantic search — your end-user can search their own conversation history; you don’t run a vector DB
  • Streaming UX — text appears word-by-word, like every modern chat
  • Per-user billing — hard spend caps; you bill how you want at month-end
What you DON’T build: Postgres tables, message store, context-window loader, tool-call state machine, embedding pipeline, vector store, conversation search, per-user cost attribution. Estimated time end-to-end: ~30 minutes, mostly waiting on pip install / npm install.

Prerequisites

  • A qlaud account (sign up free, $5 starter credit)
  • Your master key from /keys, exported as QLAUD_MASTER_KEY
  • Python 3.9+ (using plain requests) or Node 18+ (using built-in fetch). No qlaud SDK required for any of this.

Architecture in one paragraph

Each of your end-users gets:
  1. A qlaud per-user API key (qlk_live_…) with a hard spend cap, minted on signup using your master key.
  2. A qlaud thread tagged with their end_user_id.
That’s it. Their messages go to qlaud; qlaud calls the model, optionally fires tools, persists everything, exposes search, and meters cost — all keyed off their thread + their per-user key. Your backend only orchestrates.

Step 1 — On signup, mint a per-user key + thread

Whenever a new user signs up in your app, run this once:
import os, requests

QLAUD = "https://api.qlaud.ai"
MASTER = os.environ["QLAUD_MASTER_KEY"]

def onboard_qlaud(end_user_id: str, monthly_budget_usd: float = 5.0):
    """Provision a qlaud per-user key + an initial thread for a new user."""
    headers = {"x-api-key": MASTER, "content-type": "application/json"}

    # 1. Mint a standard-scoped key with a hard cap.
    key = requests.post(f"{QLAUD}/v1/keys", headers=headers, json={
        "name": f"end_user_{end_user_id}",
        "scope": "standard",
        "max_spend_usd": monthly_budget_usd,
    }).json()

    # 2. Create their first thread, tagged with your end_user_id.
    thread = requests.post(f"{QLAUD}/v1/threads", headers={
        "x-api-key": key["secret"],
        "content-type": "application/json",
    }, json={
        "end_user_id": end_user_id,
        "metadata": {"plan": "free"},
    }).json()

    # Store both in YOUR users table.
    return {
        "qlaud_key_id": key["id"],
        "qlaud_secret": key["secret"],   # only returned once — store it
        "qlaud_thread_id": thread["id"],
    }
You now have one place per user that holds their entire AI footprint. That’s all the per-user state you need to track on your side.

Step 2 — Send a message in a conversation

Once you have a user’s qlaud_secret and qlaud_thread_id, sending a turn is one call. qlaud loads the prior history server-side; you only send the new user content:
def chat(qlaud_secret: str, thread_id: str, user_msg: str) -> str:
    r = requests.post(
        f"{QLAUD}/v1/threads/{thread_id}/messages",
        headers={"x-api-key": qlaud_secret, "content-type": "application/json"},
        json={
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "content": user_msg,   # NOT a `messages` array
        },
    )
    body = r.json()
    return body["content"][0]["text"]
That’s a complete chat backend. No message store, no context loader, no “how do I keep history under N tokens” code. qlaud automatically caps history at the last 50 turns, and you never see the upstream messages array.
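Responses can contain more than one content block (for example, when tools fire later in this tutorial), so indexing content[0] blindly is fragile. A small helper, assuming the content-block shape shown above, joins every text block and skips the rest:

```python
def extract_text(body: dict) -> str:
    """Join all text blocks in a qlaud response body; skip tool_use etc.

    Assumes the {"content": [{"type": ..., ...}]} shape shown above.
    """
    return "".join(
        block.get("text", "")
        for block in body.get("content", [])
        if block.get("type") == "text"
    )
```

Swap `body["content"][0]["text"]` for `extract_text(body)` in chat() if you plan to add tools later.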

Step 3 — Stream the response (token-by-token UX)

For a real chat UI you want text to appear word-by-word. Add stream: true and read the SSE stream:
import json

def chat_stream(qlaud_secret: str, thread_id: str, user_msg: str):
    with requests.post(
        f"{QLAUD}/v1/threads/{thread_id}/messages",
        headers={"x-api-key": qlaud_secret, "content-type": "application/json"},
        json={
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "content": user_msg,
            "stream": True,
        },
        stream=True,
    ) as r:
        for line in r.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data: "):
                continue
            event = json.loads(line[6:])
            if event.get("type") == "content_block_delta":
                delta = event.get("delta", {})
                if delta.get("type") == "text_delta":
                    yield delta["text"]
Frontend: pipe each yielded chunk straight into your UI. After the stream closes, qlaud has already persisted the full assistant turn for you.
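The delta extraction inside chat_stream is easy to factor into a pure function you can unit-test without a live stream. This sketch mirrors the parsing above and assumes the same SSE event shape:

```python
import json

def text_delta_from_sse_line(line: str):
    """Return the text fragment carried by one SSE line, or None.

    Mirrors the parsing in chat_stream: only content_block_delta
    events with a text_delta payload carry user-visible text.
    """
    if not line or not line.startswith("data: "):
        return None
    event = json.loads(line[len("data: "):])
    if event.get("type") != "content_block_delta":
        return None
    delta = event.get("delta", {})
    if delta.get("type") != "text_delta":
        return None
    return delta["text"]
```

With this in place, the body of chat_stream reduces to yielding every non-None result.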

Step 4 — Add a tool

Let’s give the assistant the ability to look up user account info. Two parts: register the tool with qlaud, then host the webhook.

Register the tool (one-time)

def register_account_lookup_tool():
    r = requests.post(
        f"{QLAUD}/v1/tools",
        headers={"x-api-key": MASTER, "content-type": "application/json"},
        json={
            "name": "lookup_account",
            "description": "Look up a customer's account info by email",
            "input_schema": {
                "type": "object",
                "properties": {
                    "email": {"type": "string", "description": "Customer email"}
                },
                "required": ["email"],
            },
            "webhook_url": "https://my-app.example/qlaud/tools/account-lookup",
        },
    ).json()
    return r["id"], r["secret"]   # store both — secret is returned ONCE

Host the webhook (your backend)

import hmac, hashlib, os
from flask import Flask, request, jsonify

app = Flask(__name__)
TOOL_SECRET = os.environ["QLAUD_TOOL_SECRET_LOOKUP_ACCOUNT"]

def verify_signature(headers, body_bytes):
    ts = headers.get("X-Qlaud-Timestamp", "")
    sig = headers.get("X-Qlaud-Signature", "")
    expected = hmac.new(
        TOOL_SECRET.encode(),
        f"{ts}.{body_bytes.decode()}".encode(),
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(sig, expected)

@app.post("/qlaud/tools/account-lookup")
def account_lookup():
    if not verify_signature(request.headers, request.get_data()):
        return jsonify({"error": "bad signature"}), 401
    payload = request.get_json()
    email = payload["input"]["email"]

    # Your business logic — DB query, internal API, etc.
    account = my_db.find_account(email)
    if not account:
        return jsonify({"output": "no account found", "is_error": True})

    return jsonify({
        "output": {
            "plan": account.plan,
            "credits_remaining": account.credits,
            "joined_at": account.joined_at.isoformat(),
        }
    })

Use the tool in a conversation

def chat_with_tools(qlaud_secret, thread_id, user_msg, tool_ids):
    r = requests.post(
        f"{QLAUD}/v1/threads/{thread_id}/messages",
        headers={"x-api-key": qlaud_secret, "content-type": "application/json"},
        json={
            "model": "claude-sonnet-4-6",
            "max_tokens": 1024,
            "content": user_msg,
            "tools": tool_ids,
        },
    )
    return r.json()["content"]
What happens when the user asks “what plan am I on?” and you pass tool_ids=[lookup_account_id]:
  1. qlaud sends the question + tool definition to Claude
  2. Claude emits a tool_use block: lookup_account({email: "user@example.com"})
  3. qlaud POSTs to your webhook with the input
  4. Your handler queries your DB and returns {output: {plan: "pro", ...}}
  5. qlaud appends a tool_result to the conversation
  6. Claude reads the tool result and responds: “You’re on the Pro plan…”
  7. You get the final text response
You wrote ~20 lines (one webhook handler). qlaud orchestrated the rest — including signature verification, retries on transient failures, parallel dispatch when multiple tools fire at once, and persistence of the entire exchange for audit.
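The seven steps above can be sketched as a loop. This is a conceptual illustration, not qlaud’s actual implementation — call_model and call_webhook stand in for the upstream model and your HTTPS handler:

```python
# Conceptual sketch of the tool-dispatch loop qlaud runs for you.
def run_tool_loop(call_model, call_webhook, messages, max_turns=8):
    """Alternate model calls and tool dispatch until the model stops asking."""
    for _ in range(max_turns):
        content = call_model(messages)
        tool_uses = [b for b in content if b["type"] == "tool_use"]
        if not tool_uses:
            # No tool calls left: the text blocks are the final answer.
            return "".join(b["text"] for b in content if b["type"] == "text")
        messages.append({"role": "assistant", "content": content})
        results = [
            {"type": "tool_result", "tool_use_id": t["id"],
             "content": call_webhook(t["name"], t["input"])}
            for t in tool_uses  # qlaud dispatches these in parallel
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("tool loop exceeded max_turns")
```

The real loop also handles webhook retries and streaming; the control flow is the part worth internalizing.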

Step 5 — Search the user’s history

Your end-user wants to find a past conversation: “What did we discuss about refunds last week?” No vector DB to provision; semantic search is already indexed:
def search_user_history(end_user_id: str, query: str):
    r = requests.get(
        f"{QLAUD}/v1/search",
        headers={"x-api-key": MASTER},
        params={"q": query, "end_user_id": end_user_id, "limit": 10},
    ).json()
    return r["data"]   # list of {thread_id, seq, role, snippet, score, created_at}
The end_user_id filter scopes results to a single end-user — each customer sees only their own past conversations, never anyone else’s. The underlying Vectorize index applies that filter at the metadata layer.
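Search returns flat hits, but a “past conversations” UI usually wants one entry per thread. A sketch that groups hits by thread and keeps the best-scoring snippet, assuming the hit shape listed in the comment above:

```python
def group_hits_by_thread(hits):
    """Collapse flat search hits into one entry per thread, best score first.

    Assumes each hit looks like:
    {thread_id, seq, role, snippet, score, created_at}.
    """
    by_thread = {}
    for hit in hits:
        best = by_thread.get(hit["thread_id"])
        if best is None or hit["score"] > best["score"]:
            by_thread[hit["thread_id"]] = hit
    return sorted(by_thread.values(), key=lambda h: h["score"], reverse=True)
```

Feed search_user_history(...)["data"] through this and render each entry as a link back into the thread.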

Step 6 — Bill at month-end

End of month, pull per-key usage and invoice however you want (Stripe, Paddle, in-app credits, custom):
from datetime import datetime, timezone, timedelta

def monthly_bill_run():
    now = datetime.now(timezone.utc)
    from_ms = int((now - timedelta(days=30)).timestamp() * 1000)
    to_ms = int(now.timestamp() * 1000)

    rollup = requests.get(
        f"{QLAUD}/v1/usage",
        headers={"x-api-key": MASTER},
        params={"from_ms": from_ms, "to_ms": to_ms},
    ).json()

    for row in rollup["by_key"]:
        end_user = lookup_user_by_qlaud_key_id(row["key_id"])
        if not end_user:
            continue
        upstream_cost_usd = row["cost_micros"] / 1_000_000
        margin = upstream_cost_usd * 0.50          # whatever you charge
        bill_usd = round(upstream_cost_usd + margin, 4)
        my_billing_tool.charge(end_user, bill_usd)
cost_micros is what qlaud charged YOU (upstream cost × 1.07 markup). Whatever margin you put on top of that is yours.
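The micros-to-dollars arithmetic is easy to fumble, so it’s worth isolating. A pure helper mirroring the bill run above, with the 50% margin as an example rate (the function name is ours):

```python
def bill_for_row(cost_micros: int, margin_rate: float = 0.50) -> float:
    """Convert qlaud's cost_micros into a customer charge in USD.

    cost_micros is what qlaud charged you, in millionths of a dollar;
    margin_rate is whatever markup you choose on top.
    """
    upstream_usd = cost_micros / 1_000_000
    return round(upstream_usd * (1 + margin_rate), 4)
```

For example, 1,250,000 micros is $1.25 upstream, which a 50% margin turns into a $1.875 charge.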

What you didn’t build

What you’d have written without qlaud → where qlaud handles it:
  • Postgres conversations + messages tables → /v1/threads/:id/messages auto-loads history
  • “Drop oldest message when context exceeds N tokens” → history capped automatically; token-aware truncation later
  • Tool-call state machine (parse tool_use, dispatch, append tool_result, re-call the assistant) → runToolLoop loops up to 8 turns, dispatches in parallel, retries on 5xx
  • Embedding pipeline + Pinecone client → auto-embed on store, Vectorize-backed /v1/search
  • Per-user cost attribution table → /v1/usage?by_key rolls up automatically
  • Webhook delivery: signing, retries, dedup → HMAC-SHA256 signing + 3 retries built in
  • Streaming SSE handler that also persists the full message after the stream closes → tee’d internally: you stream to the user, qlaud persists for search
Roughly 300–500 lines of glue per AI app, deleted.

Next steps

  • Switch models per turn — change model: to gpt-5.4 mid-conversation; history persists, qlaud translates the shape.
  • Use /v1/jobs for long-running batch work that shouldn’t block your request thread.
  • Parallel tool calls happen automatically — when the assistant emits multiple tool_use blocks, qlaud fans out via Promise.all. No code change needed.
  • Per-user spend caps are already enforced gateway-side. Once a user hits their max_spend_usd cap, the next request returns 402 before the upstream model is ever called.
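Since the gateway returns 402 before the model is called, your request layer should turn that status into a user-facing message rather than a generic error. A sketch of one way to do it — the response object only needs .status_code and .json(), so it works with requests responses and test doubles alike; the messages and function name are ours:

```python
def handle_chat_response(resp):
    """Map a qlaud chat response to an app-level result dict.

    Assumes the 402 spend-cap behavior described above; anything else
    4xx/5xx is treated as a generic failure.
    """
    if resp.status_code == 402:
        # Gateway refused before calling the model: user hit max_spend_usd.
        return {"ok": False, "reason": "spend_cap",
                "message": "You've used up this month's AI budget."}
    if resp.status_code >= 400:
        return {"ok": False, "reason": "error",
                "message": f"Chat failed with HTTP {resp.status_code}."}
    body = resp.json()
    return {"ok": True, "text": body["content"][0]["text"]}
```

Wrap the requests.post calls from Step 2 with this, and raise the user’s cap (or upsell them) when you see "spend_cap".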
Need help wiring this into an existing codebase? Email hello@qlaud.ai.