# Step 4: Inference

Make LLM calls through the gateway. Two endpoints: Chat Completions (OpenAI-compatible, function calling) and Responses API (function calling + hosted MCP tools). Auth: end-user key (canonical) or platform key (for server-side testing).

---

## Discover Available Models First

**Do NOT hardcode a model slug like `"gpt-4o"`.** Each platform has different provider keys configured, so the set of available models varies. A slug that works on one platform may 404 on another.

Call `GET /v1/models` at startup (or on first inference) to get the list of models actually enabled on this platform, then pick a slug from the response:

```typescript
// Fetch once at startup and cache. Auth: end-user or platform key.
// `key` holds that API key.
const modelsRes = await fetch("https://api.assistiv.ai/v1/models", {
  headers: { Authorization: `Bearer ${key}` },
});
// Fail early on auth or network errors instead of reading `.data` from an error body.
if (!modelsRes.ok) throw new Error(`GET /v1/models failed: ${modelsRes.status}`);
const models = (await modelsRes.json()).data; // [{ id: "gpt-4o-mini", ... }, ...]
const MODEL_SLUG = models[0].id; // or pick by name/provider
```

```python
import requests

# `key` holds your end-user or platform key.
models_res = requests.get(
    "https://api.assistiv.ai/v1/models",
    headers={"Authorization": f"Bearer {key}"},
    timeout=10,
)
models_res.raise_for_status()  # fail early on auth or network errors
available = [m["id"] for m in models_res.json()["data"]]
MODEL_SLUG = available[0]  # or pick by preference
```

Use `MODEL_SLUG` in every inference call below. The examples in this doc use a literal slug for readability, but your code should always use the dynamically discovered value. See the full `GET /v1/models` reference at the bottom of this page.

---

## Chat Completions

### POST /v1/chat/completions

Request body (use the slug from `GET /v1/models`, not a hardcoded string):

```json
{
  "model": "",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 256
}
```

Required:

- `model` (string): A slug returned by `GET /v1/models` (e.g. `gpt-4o`, `claude-3-5-sonnet`, `gemini-pro`). Must have an active provider config on your platform.
  **Always discover via the models endpoint; never hardcode.**
- `messages` (array): OpenAI-format messages. Roles: `system`, `user`, `assistant`, `tool`.

Optional:

- `stream` (boolean, default false)
- `temperature` (number, 0 to 2)
- `max_tokens` (integer): For GPT-5 family models (which route to OpenAI's Responses API under the hood), the minimum is **16**. Values below that return `422 integer_below_min_value`. Set `>= 16` or omit.
- `top_p` (number)
- `tools` (array): Tool definitions for function calling
- `tool_choice` (`"auto" | "none" | "required"` or specific tool object)
- `response_format` (object): `{"type": "json_object"}` for JSON mode
- `frequency_penalty`, `presence_penalty` (-2.0 to 2.0)
- `stop` (string or string[]): Up to 4 stop sequences
- `stream_options`: `{"include_usage": true}` to get usage in the final SSE chunk

Response (non-streaming):

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1744113000,
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 25, "completion_tokens": 8, "total_tokens": 33 }
}
```

---

## Pre-flight Checks

Before calling the upstream LLM, the gateway runs these checks in order. If any fails, the call returns immediately with no upstream charge:

1. **Model validation**: model exists, provider config enabled, scope includes `inference`.
2. **Rate limit check**: user override → platform default → pass-through. 429 if exceeded.
3. **Suspension check**: if the user's budget has `is_suspended=true`, inference is blocked. Returns **`402 payment_required` with `code: budget_suspended`** (distinct from `budget_exhausted`). Manual topups and debits still work on a suspended budget; only inference is blocked. Flip `is_suspended=false` via `PATCH /budget` to resume.
4. **Budget check**: if the user has a budget, `remaining_usd` must be > 0.
   Returns **`402 payment_required` with `code: budget_exhausted`** if not. (Manual debits can push `remaining_usd` below zero for chargeback accounting; inference still blocks here.)
5. **Wallet check**: platform wallet balance must cover estimated cost. Returns **`402 payment_required` with `code: wallet_insufficient`** if not.

**Branching on 402.** Three distinct codes live under `402 payment_required`:

| `error.code` | Fix |
|---|---|
| `budget_suspended` | Admin action: `PATCH /budget { is_suspended: false }` |
| `budget_exhausted` | User action: top up, upgrade the plan, or wait for the period reset |
| `wallet_insufficient` | Platform action: top up the wallet (Stripe checkout on dashboard) |

Parse `error.code` in your inference client and show the right UX. Don't treat all 402s as the same error state.

On success: upstream call → measure actual tokens → atomic wallet + budget debit at actual cost (via Postgres `combined_debit` RPC with row lock). A `debit` row lands in the budget ledger and, if you're subscribed, fires the `budget.debited` webhook (and `budget.low_balance` if this debit crosses the threshold).

---

## Streaming

Set `stream: true`. Response is an SSE stream; each chunk is a single `data: {json}` line.
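
As a rough illustration of consuming such a stream, the sketch below accumulates the assistant text from `data:` lines and records whether the `[DONE]` sentinel arrived. The function name `collect_stream` is hypothetical and not part of any Assistiv SDK, and the sketch assumes the lines have already been decoded to strings.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, bool, Optional[dict]]:
    """Accumulate assistant text from chat-completion SSE lines.

    Returns (text, saw_done, usage). `saw_done` is True only if the
    `[DONE]` sentinel was received.
    """
    parts = []
    usage = None
    saw_done = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only when usage was requested
        for choice in chunk.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts), saw_done, usage


# Hypothetical sample input, trimmed to the fields the parser reads.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]
text, saw_done, usage = collect_stream(sample)
print(text, saw_done)
```

With the `requests` library, one way to feed this function is to pass `stream=True` to the POST call and hand it `response.iter_lines(decode_unicode=True)`. A `saw_done` value of `False` suggests the stream ended before the sentinel arrived.
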
Example chunks:

```
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1744113000,"model":"gpt-4o","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1744113000,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1744113000,"model":"gpt-4o","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1744113000,"model":"gpt-4o","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

If you set `stream_options: {"include_usage": true}`, the last data chunk before `[DONE]` contains a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`.

### Stream interruption semantics

If the upstream LLM connection drops, your client disconnects, or the network breaks mid-stream, the gateway treats the interruption as **`finish_reason: "stop"`**. Whatever tokens were already generated and streamed up to that point are billed normally: the wallet (and user budget, if any) are debited for `prompt_tokens` plus the `completion_tokens` actually emitted before the break. There is no partial rollback.

**Implication for retry logic:** if you re-send the request after a drop, you will be billed for both attempts. The log entry for the interrupted call will show `status: "success"` with the truncated token count, not a separate error status. Distinguish "completed" from "interrupted" on the client side by checking whether you received the final `[DONE]` sentinel.

---

## Tool Calling Round-trip

Send tool definitions in `tools`:

```json
{
  "model": "",
  "messages": [
    { "role": "user", "content": "What's the weather in Paris?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ]
}
```

The model may return `tool_calls` instead of `content`:

```json
{
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\"}" }
      }]
    },
    "finish_reason": "tool_calls"
  }]
}
```

Execute the tool in your code, then send the result back as a `role: "tool"` message and continue the loop:

```json
{
  "model": "",
  "messages": [
    { "role": "user", "content": "What's the weather in Paris?" },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": { "name": "get_weather", "arguments": "{\"city\":\"Paris\"}" }
      }]
    },
    { "role": "tool", "tool_call_id": "call_abc", "content": "72°F, sunny" }
  ],
  "tools": [...]
}
```

Loop until `finish_reason === "stop"`.

---

## Using with OpenAI SDKs

Point the base URL at Assistiv and use the end-user key as the API key.
Always pass a slug from `GET /v1/models`, not a hardcoded string:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-eu_your_end_user_key",
    base_url="https://api.assistiv.ai/v1",
)

# Discover which models are actually enabled on this platform
models = client.models.list()
model_slug = models.data[0].id  # or pick by preference

response = client.chat.completions.create(
    model=model_slug,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```typescript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-eu_your_end_user_key",
  baseURL: "https://api.assistiv.ai/v1",
});

// Discover which models are actually enabled on this platform
const models = await client.models.list();
const modelSlug = models.data[0].id; // or pick by preference

const response = await client.chat.completions.create({
  model: modelSlug,
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```

---

## Responses API (OpenAI-compatible)

Parallel surface to Chat Completions, matching OpenAI's newer Responses API shape (stateful threads, function calling, structured outputs). This is also the only endpoint that supports hosted MCP tools (see Step 5).

### POST /v1/responses

Request body (use the slug from `GET /v1/models`):

```json
{
  "model": "",
  "input": "Write a haiku about the sea.",
  "instructions": "You are a poet.",
  "stream": false,
  "temperature": 0.8,
  "max_output_tokens": 200
}
```

Required:

- `model` (string)
- `input` (string OR array of input items): Shorthand string for simple prompts, or structured array for multi-turn.

Structured input array:

```json
{
  "model": "",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [{ "type": "input_text", "text": "Hello" }]
    }
  ]
}
```

Optional:

- `instructions` (string): System prompt.
- `tools` (array): Tool definitions (same format as chat completions).
- `tool_choice` (string or object)
- `previous_response_id` (string): Continue from a prior response.
- `thread_id` (string): Agent thread identifier (for stateful agent models).
- `stream` (boolean)
- `temperature`, `top_p` (numeric)
- `max_output_tokens` (integer): **Minimum 16** per OpenAI's Responses API contract. Values below 16 return `422 integer_below_min_value`. Set `>= 16` or omit to use the model default.
- `response_format` (object)

Response (non-streaming):

```json
{
  "id": "resp_abc",
  "object": "response",
  "created_at": 1744113000,
  "status": "completed",
  "model": "gpt-4o",
  "output": [
    {
      "type": "message",
      "id": "msg_abc",
      "role": "assistant",
      "content": [{ "type": "output_text", "text": "Crashing waves at dawn..." }]
    }
  ],
  "usage": { "input_tokens": 15, "output_tokens": 30, "total_tokens": 45 }
}
```

Function-calling outputs appear as `{ "type": "function_call", "id": "...", "call_id": "...", "name": "...", "arguments": "..." }` items alongside or instead of `message` items.

Same pre-flight + billing pipeline as Chat Completions.

---

## Models

### GET /v1/models

List models available to the authenticated caller. Auth: end-user or platform key. Only models with an active provider config on your platform are returned.

Response (OpenAI format):

```json
{
  "object": "list",
  "data": [
    { "id": "gpt-4o", "object": "model", "created": 1711234567, "owned_by": "openai" },
    { "id": "claude-3-5-sonnet", "object": "model", "created": 1711234567, "owned_by": "anthropic" },
    { "id": "gemini-pro", "object": "model", "created": 1711234567, "owned_by": "google" }
  ]
}
```

To enable more models: add the corresponding provider key via the website LLM Configs page.

---

Next: [Step 5: MCP Tools](https://www.assistiv.ai/docs/integration/step-5-mcp-tools.txt)