# Responses API

OpenAI's `/v1/responses` is the default API for o-series and gpt-5-series models in 2026 and the replacement for the Assistants API, which sunsets on 2026-08-26. SwarmLLM exposes the full v1 surface plus follow-on features such as resumable streams, async background runs, MCP tool integration, and conversation chaining via `previous_response_id`.
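
A minimal chaining round trip looks like the sketch below. The base URL, model name, and `SWARMLLM_API_KEY` env var are illustrative assumptions, not documented defaults; it uses a blocking `reqwest` client for brevity.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:8080"; // assumed local SwarmLLM address
    let key = std::env::var("SWARMLLM_API_KEY")?; // hypothetical env var
    let client = reqwest::blocking::Client::new();

    // First turn: stored by default (store=true), so its id can anchor a follow-up.
    let first: serde_json::Value = client
        .post(format!("{base}/v1/responses"))
        .bearer_auth(&key)
        .json(&json!({ "model": "gpt-5", "input": "Pick a number from 1 to 10." }))
        .send()?
        .error_for_status()?
        .json()?;

    // Second turn: previous_response_id makes the server prepend the
    // prior turn's messages before the new input.
    let second: serde_json::Value = client
        .post(format!("{base}/v1/responses"))
        .bearer_auth(&key)
        .json(&json!({
            "model": "gpt-5",
            "input": "Double it.",
            "previous_response_id": first["id"],
        }))
        .send()?
        .error_for_status()?
        .json()?;
    println!("{}", second["status"]);
    Ok(())
}
```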
## Endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | `/v1/responses` | Create a response (streaming or not, foreground or background). |
| GET | `/v1/responses/{id}` | Fetch a stored response. With `?stream=true&starting_after=N`, resume the SSE stream from event N (V5). |
| DELETE | `/v1/responses/{id}` | Delete a stored response. |
| POST | `/v1/responses/{id}/cancel` | Cancel a background response (M9). The cancel flag is checked at completion time; per-token interruption is deferred. |
| GET | `/v1/responses/{id}/input_items` | Paginated input-item listing (V4) for chained `previous_response_id` flows. |
| GET | `/api/admin/responses` | Admin: list all stored response records (used by the dashboard). |
All endpoints take the same `Authorization: Bearer` header as the rest of the API.
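
To make the table concrete, here is a hedged sketch of the background lifecycle (create, poll via the `Location` header, cancel). The base URL and key handling are assumptions as before, and the sketch assumes OpenAI's standard `queued`/`in_progress` status values.

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:8080"; // assumed local SwarmLLM address
    let key = std::env::var("SWARMLLM_API_KEY")?; // hypothetical env var
    let client = reqwest::blocking::Client::new();

    // Background create: expect HTTP 202 plus a Location header.
    let accepted = client
        .post(format!("{base}/v1/responses"))
        .bearer_auth(&key)
        .json(&json!({
            "model": "gpt-5",
            "input": "Summarize the report.",
            "background": true,
        }))
        .send()?;
    assert_eq!(accepted.status(), 202);
    let location = accepted.headers()["location"].to_str()?.to_owned();

    // Poll the stored record until it leaves the in-progress states.
    loop {
        let rec: serde_json::Value = client
            .get(format!("{base}{location}"))
            .bearer_auth(&key)
            .send()?
            .json()?;
        match rec["status"].as_str() {
            Some("queued") | Some("in_progress") => {
                std::thread::sleep(std::time::Duration::from_secs(1));
            }
            _ => break,
        }
    }

    // POST /v1/responses/{id}/cancel would flag a still-running run:
    // client.post(format!("{base}{location}/cancel")).bearer_auth(&key).send()?;
    Ok(())
}
```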
## Routing

`POST /v1/responses` picks one of three execution paths, in this order (see the sketch after this list):
- Cloud proxy — when the requested `model` resolves to an OpenAI-routed provider, the request is serialized verbatim and forwarded to the upstream `/v1/responses` endpoint. Built-in tools, streaming, background, reasoning effort, `text.verbosity`, `include[]`, `previous_response_id`, and any future field round-trip via `#[serde(flatten)]` extras.
- Anthropic-Messages bridge (V3) — when the model resolves to an Anthropic provider (or the local `claude-subscription` subprocess), the Responses request is translated to an Anthropic Messages request, forwarded, and translated back. This lets Claude Code clients drive `/v1/responses` end-to-end without losing tool-call or streaming semantics.
- Local inference — translates to `/v1/chat/completions` and runs on the local model. Function tools and `tool_choice` translate through; built-in tools (`web_search`, `file_search`, `computer_use_preview`, `code_interpreter`, `image_generation`, `mcp`, `custom`) are rejected with HTTP 400 because they require backing infrastructure SwarmLLM does not run.
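
As a reading aid, the routing order can be summarized in a `match` sketch. The enum, function, and return values here are illustrative, not SwarmLLM's actual internals.

```rust
// Illustrative three-way routing order; all names are hypothetical.
enum ResolvedProvider {
    OpenAiRouted, // cloud proxy
    Anthropic,    // Messages bridge (V3)
    Local,        // local inference
}

fn route(provider: ResolvedProvider, has_builtin_tools: bool) -> Result<&'static str, u16> {
    match provider {
        // 1. Serialize verbatim; unknown fields round-trip via
        //    #[serde(flatten)] extras, so future fields survive the hop.
        ResolvedProvider::OpenAiRouted => Ok("forward to upstream /v1/responses"),
        // 2. Translate Responses <-> Anthropic Messages in both directions,
        //    preserving tool-call and streaming semantics.
        ResolvedProvider::Anthropic => Ok("bridge via Anthropic Messages"),
        // 3. Built-in tools need infrastructure the local path does not
        //    run, so they are rejected up front with HTTP 400.
        ResolvedProvider::Local if has_builtin_tools => Err(400),
        ResolvedProvider::Local => Ok("translate to /v1/chat/completions"),
    }
}

fn main() {
    assert_eq!(route(ResolvedProvider::Local, true), Err(400));
}
```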
## Capabilities
- Multimodal input (V2) — `input_image` and `input_file` (UTF-8 only) parts in the structured `input` array. Binary file payloads (PDF, docx, image bytes via `file_data`) are rejected with a clear hint pointing at `input_image`.
- Function tools — `tools` definitions and `tool_choice` translate to OpenAI Chat Completions tool semantics; assistant `tool_calls` map back to `function_call` output items.
- Streaming SSE (M6 + V1) — `stream=true` emits the full Responses event sequence (`response.created` → `response.in_progress` → `response.output_item.added` → `response.content_part.added` → per-delta `response.output_text.delta` → `response.output_text.done` → `response.content_part.done` → `response.output_item.done` → `response.completed`). The V1 fix shipped on 2026-04-25 cuts first-token latency by emitting `created` and `in_progress` before model warmup instead of after.
- Persistence (M7) — `store=true` (the OpenAI default) writes the full response object to redb with a 30-day TTL. `previous_response_id` (M8) chains follow-up requests by prepending the prior turn's messages before the new input.
- Background mode (M9 + V8) — `background=true` returns HTTP 202 with a `Location: /v1/responses/{id}` header; the client polls or, with `background=true && stream=true`, opens a resumable SSE connection at `GET /v1/responses/{id}?stream=true` that replays buffered events and then tails the live producer (resume flow sketched below).
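
Under that resumable-stream contract, a dropped client reconnects with the index of the last event it saw; a line reader over the SSE body is enough. A minimal sketch, with a placeholder response id and last-seen index:

```rust
use std::io::{BufRead, BufReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let base = "http://localhost:8080"; // assumed local SwarmLLM address
    let key = std::env::var("SWARMLLM_API_KEY")?; // hypothetical env var
    let id = "resp_0123456789abcdef0123456789abcdef"; // placeholder resp_<32-hex> id
    let last_seen = 42; // index of the last event received before the drop

    // GET /v1/responses/{id}?stream=true&starting_after=N replays the
    // buffered events after N, then tails the live producer.
    let resp = reqwest::blocking::Client::new()
        .get(format!(
            "{base}/v1/responses/{id}?stream=true&starting_after={last_seen}"
        ))
        .bearer_auth(&key)
        .send()?
        .error_for_status()?;

    // Each SSE frame carries one Responses event as a `data:` JSON line.
    for line in BufReader::new(resp).lines() {
        let line = line?;
        if let Some(payload) = line.strip_prefix("data: ") {
            println!("{payload}");
        }
    }
    Ok(())
}
```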
## Validation (ingress)

The handler runs `validate_responses_ingress` *before* any routing decision so the cloud-proxy and Anthropic-bridge paths can't forward attacker-sized strings to upstream providers (where they'd burn quota or land in log lines). Caps:
| Field | Limit |
|---|---|
| `model` | 1..=256 chars |
| `previous_response_id` | ≤64 ASCII alphanumeric (`_` / `-` allowed); generation format is `resp_<32-hex>` |
| `instructions` | ≤2 MB |
| `user` | ≤256 chars |
| `truncation`, `service_tier` | ≤64 chars each |
| `metadata` | ≤64 KB total (keys + values) |
`stop` / `temperature` / `top_p` / `max_tokens` are clamped or validated at the sampling-params layer.
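
An illustrative reconstruction of the caps above; the constant names, function names, and error shape are hypothetical, not the real `validate_responses_ingress`.

```rust
// Hypothetical mirror of the ingress caps table.
const MODEL_MIN: usize = 1;
const MODEL_MAX: usize = 256;
const PREV_ID_MAX: usize = 64;
const INSTRUCTIONS_MAX: usize = 2 * 1024 * 1024; // 2 MB
const USER_MAX: usize = 256;
const METADATA_MAX: usize = 64 * 1024; // 64 KB, keys + values combined

fn validate_model(model: &str) -> Result<(), String> {
    if (MODEL_MIN..=MODEL_MAX).contains(&model.len()) {
        Ok(())
    } else {
        Err("model: 1..=256 chars".to_string())
    }
}

fn validate_previous_response_id(id: &str) -> Result<(), String> {
    // ≤64 chars, ASCII alphanumeric plus '_' and '-'.
    let ok_len = id.len() <= PREV_ID_MAX;
    let ok_chars = id
        .bytes()
        .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-');
    if ok_len && ok_chars {
        Ok(())
    } else {
        Err("previous_response_id exceeds ingress caps".to_string())
    }
}
```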
## Dashboard

The admin dashboard exposes a Responses tab (`/admin/responses`) backed by `GET /api/admin/responses`. It shows the most recent stored response records with status, model, input snippet, and per-record cancel/delete actions.
## Deferred

- `POST /v1/responses/compact` (V9) — no concrete caller has asked for it.
- Token-level cancel for background inference — current cancel flips a flag checked at completion time; per-token interruption needs hooks in `chat_completions` that are out of v2 plan scope.
- Server-side `conversation` resource CRUD — OpenAI's `conversation` parameter forwards through cloud proxy verbatim today; a local conversation type with its own endpoints is a separate design.
- Built-in tools on the local path — see "Local inference" above.
- `custom` tools with Lark / regex grammars — rejected on local, forwarded on cloud. Local grammar-constrained generation is a candle-side project.
- Audio input on `/v1/responses` — `input_audio` returns 400; needs a Whisper-class transcription model SwarmLLM doesn't currently expose.
- Binary file inputs in `input_file{file_data}` — UTF-8 only; PDF/docx/image-bytes payloads are rejected with a clear hint pointing at `input_image` (for images) or server-side text extraction.